Two major issues complicate managing data and code in open scientific research: large datasets often cannot be stored alongside code in repository platforms such as GitHub, and iterative analysis can introduce unnoticed changes to data, raising the risk that analyses are predicated on older versions of the data.
Using a simple Data Manifest design and a fast, concurrent command-line tool, SciDataFlow makes it easier to track data changes, upload data to remote repositories, and retrieve all the data required to reproduce a computational analysis.
SciDataFlow is available at https://github.com/vsbuffalo/scidataflow.
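The idea of tracking data changes through a manifest can be illustrated with a minimal Python sketch. This is not SciDataFlow's implementation or file format: the JSON layout, the use of SHA-256, and the function names `file_digest`, `write_manifest`, and `changed_files` are all assumptions made for illustration. The sketch records a checksum for each data file and later reports any file whose contents no longer match.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(paths, manifest_path):
    """Record the current digest of each data file in a JSON manifest.

    The manifest layout (path -> digest) is hypothetical, not SciDataFlow's.
    """
    manifest = {str(p): file_digest(p) for p in paths}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def changed_files(manifest_path):
    """List tracked files that are missing or whose digest has changed."""
    manifest = json.loads(Path(manifest_path).read_text())
    changed = []
    for name, digest in manifest.items():
        p = Path(name)
        if not p.exists() or file_digest(p) != digest:
            changed.append(name)
    return sorted(changed)
```

Under these assumptions, calling `write_manifest` after an analysis and `changed_files` before rerunning it would flag data that was modified in between, which is the kind of unnoticed change described above.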
Reference:
Buffalo V. (2024) SciDataFlow: a tool for improving the flow of data through science. Bioinformatics 40(1): btad754.