How to store and share Nextflow/nf-core results files

I have a very general question. During analyses we create results files that go into the output directory and that frequently take a long time to generate. These results files are subsequently used for graphs and reports. Where/how do you store them so that they can later be used from the laptop or shared with a collaborator?

For example, I may run Sarek on the HPC file system and obtain alignments, RNA results, and variants in a directory there, but then produce the report, with graphs, summaries, tables, etc., on my laptop. The analysis and report code I can share between machines and with collaborators using git, but what about the actual results files? What do you do? I don’t have a good systematic method (I rely on rsync). In other words, in collaborations (or between my HPC and my laptop) it is easy to share the code using git, but how do you share intermediate results files?
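For concreteness, my current ad-hoc approach is essentially a hand-run copy like the following (the host and paths are illustrative, not my actual setup):

# Copy the results directory from the HPC to the laptop after each run
rsync -av user@hpc.example.org:/scratch/sarek-run/results/ ~/projects/sarek-run/results/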

Any suggestions appreciated.

If you use git such that you clone your GitHub repo to the HPC, and then clone the HPC copy to your laptop (or some similar setup), you can use git remote get-url to get the path to the HPC repo and fetch data relative to it. Something I’m working on is a Pixi environment for genome assembly. Since Pixi is a command runner as well as a package manager, I can add tasks like the one below that fetch subfolders locally.

# Fetch manual curation folders locally (assumes the HPC repo was added as a remote, named "hpc" by default)
fetch-curation = { cmd = "rsync -av $OPTS \"$(git remote get-url $ORIGIN)/data/outputs/01_ebp-assembly-workflow/08_rapid_curation\" \"data/outputs/01_ebp-assembly-workflow/\"", env = { OPTS = "", ORIGIN = "hpc" } }

Other collaborators would then only need to git clone the HPC repository and, on their local machine, run pixi run fetch-curation to get the curation files locally; a sketch of that setup follows.
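A rough sketch of the whole flow (the remote name hpc matches the task’s ORIGIN default; the repo name, host, and paths are placeholders):

# On the laptop: clone the shared repo, then add the HPC copy as a remote
git clone git@github.com:your-org/assembly-project.git
cd assembly-project
git remote add hpc user@hpc.example.org:/scratch/projects/assembly-project

# Run the pixi task; rsync pulls the curation folder from the HPC remote
pixi run fetch-curation

# Extra rsync flags can go in OPTS (assuming your pixi version treats task env values as overridable defaults)
OPTS="--dry-run" pixi run fetch-curation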

Also make sure your git repo ignores the data directories but lets you commit everything else, so you can safely work in a single structured project folder (i.e. only rsync commands fetch data, never git pull); a minimal ignore setup is sketched below. As long as you work in a structured way, it’ll be easier to share both directories and code. Then at the end you can put the final data in an appropriate data archive like ENA, etc., or a data repository like Figshare.
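A minimal sketch of that ignore setup, written as a shell snippet (the directory names mirror the example above; adjust to your own layout):

# Keep code and project structure in git, keep the data out of it
cat >> .gitignore <<'EOF'
data/outputs/*
data/inputs/*
# negations let you commit a placeholder so the folders survive a fresh clone
!data/outputs/.gitkeep
!data/inputs/.gitkeep
EOF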

Another option is to see if DVC works for you. Also take a look at Boehringer-Ingelheim/dso on GitHub, a Data Science Operations (dso) command-line tool that wraps DVC; there’s an nf-core bytesize talk on it too.
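If DVC fits, the core loop looks roughly like this (the remote name and SSH path are placeholders, and SSH remotes need DVC’s ssh extras installed):

# One-time setup inside the git repo
dvc init
dvc remote add -d hpcstore ssh://user@hpc.example.org/scratch/dvc-store

# Track heavy outputs with DVC; git only stores the small .dvc pointer file
dvc add data/outputs/01_ebp-assembly-workflow
git add data/outputs/01_ebp-assembly-workflow.dvc data/outputs/.gitignore
git commit -m "Track assembly outputs with DVC"
dvc push   # upload the data itself to the remote

# Collaborators: git clone/pull as usual, then
dvc pull   # fetch exactly the data version the repo points at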
