Resume workflow based on files in publishDir (or other external directory)

Hello – I am considering using Nextflow to process files and save them into a large complex directory structure – our “curated datasets”. Is there a way to have Nextflow’s resume functionality determine whether to run a process based on the presence (and size/modified date) of files in an external directory (not work)? I’d like my input, intermediate, and final output files to all be stored in a file structure that I specify outside of work, but to still use the same resume logic as though those files were in the work directory managed by Nextflow.

I think I could use the publishDir directive to duplicate (copy) the files from work into a subfolder structure that I specify. However, my files are very large and numerous, and I can’t reserve the space for multiple copies. I can’t use the symlink mode of publishDir either, because I need the external data folder to contain the actual data (so that it will persist even if I clear the Nextflow cache in work to free space).
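For concreteness, here is roughly what I was considering (the process name and paths are just illustrative):

process make_dataset {
    // 'copy' duplicates the output outside of work, which I can't afford;
    // 'symlink' would save space, but the real data would stay inside work
    publishDir 'curated/datasets/v1', mode: 'copy'

    output:
    path 'dataset.txt'

    script:
    """
    echo 'big result' > dataset.txt
    """
}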

Are there any Nextflow settings or tips to accomplish what I’m after? Thanks in advance!


Thinking more about this, it seems that using the hard link (link) mode of publishDir may roughly achieve what I’m after, because the file in publishDir (and its contents) should persist even if the original file in work is deleted.
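Concretely, something like this (same illustrative process as above):

process make_dataset {
    // a hard link shares the underlying data with the file in work, so the
    // published file keeps its contents even if work is cleaned, provided
    // publishDir is on the same filesystem as the work directory
    publishDir 'curated/datasets/v1', mode: 'link'

    output:
    path 'dataset.txt'

    script:
    """
    echo 'big result' > dataset.txt
    """
}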

I suspect what I would give up, though, is the ability to resume, because if the file in work is deleted, even if the rest of the cache remains, the workflow would think the file needs to be recreated.

Thus, I’m still interested to know if there is any way to have the canonical location of the output files (the files that are checked when determining how to resume) be in a custom directory structure outside of work.

In order for the resume feature to work, you need both work and .nextflow. If either of them is altered or corrupted by some change you made, resume won’t work properly. You can change the location of these folders, though, but no, you can’t make resume work based on an arbitrary folder structure that you created.
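For example, the location of the work directory can be set in your configuration (the path below is just an example):

// nextflow.config
workDir = '/scratch/myproject/work'

This is equivalent to running nextflow run main.nf -work-dir /scratch/myproject/work on the command line. The .nextflow metadata folder is created in the directory you launch the run from.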

I am interested in this feature as well, similar to what Snakemake or make do.
Does this mean that the only way to do it is “manually”, with ad hoc workflow code? Something like this, with example processes generate_the_file and consume_the_file:

workflow {
    starting = file('mytable.csv', checkIfExists: true)
    intermediate = file('output/intermediate.txt')

    // make-style freshness check: reuse the intermediate only if it
    // exists and is newer than the file it was derived from
    if( intermediate.exists() && intermediate.lastModified() > starting.lastModified() ) {
        consume_the_file(intermediate)
    } else {
        generate_the_file(starting) | consume_the_file
    }
}

Hi @Gullumluvl,

No, you can’t manually do that this way. Nextflow automatically handles caching and resuming of tasks. It already does what you’re describing, but in a more powerful way. No reason to reinvent the wheel.

May I ask what you mean by “you can’t”? The above code works (albeit ignoring the -resume flag). Did you mean “you shouldn’t”?

I agree it seems inappropriate, and I would like to avoid it, but here is my situation:

  • I develop the workflow while running the analysis, so the workflow is built iteratively as successive steps of the analysis are completed. This can make resuming difficult, given the strictness of the cache criteria (sure, once it’s finished I will test the full workflow…)
  • That one task to resume was extremely long (3 weeks, and I can imagine worse situations). I find it legitimate for someone to want to start an analysis by providing the intermediate data, even if it was obtained outside of the Nextflow workflow.
  • For the details: to speed up my computation, I figured out how to parallelize it efficiently halfway through, so I actually ran additional parts of it without Nextflow and then had to merge the results. I don’t think it would be a good idea to put “manual” outputs back into the work directory.

Are there better ways, then? Decomposing the workflow into subworkflows?

You can write a Nextflow pipeline that starts at a later stage, having never run the first steps, provided that you supply the input file for that later-stage step. Think of an RNAseq pipeline in which you want to provide the BAM files instead of a reference genome and the sequence files to align. The important bit here is that this is not the same as resuming a pipeline with Nextflow’s cache feature. It’s simply skipping some steps because you already have the files needed by a later-stage process, and your pipeline code says those steps can be skipped when that file is provided. You could do that manually for every process in your pipeline, but it’s not what I would recommend.
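A minimal sketch of that pattern, with made-up params and process names (ALIGN and QUANTIFY are hypothetical): the pipeline aligns the reads unless BAM files are provided directly.

params.reads = null
params.bam   = null

workflow {
    if( params.bam ) {
        // later-stage entry point: use the provided BAMs and skip alignment
        bam_ch = Channel.fromPath(params.bam, checkIfExists: true)
    } else {
        reads_ch = Channel.fromPath(params.reads, checkIfExists: true)
        bam_ch = ALIGN(reads_ch)    // hypothetical alignment process
    }
    QUANTIFY(bam_ch)                // hypothetical downstream process
}

Launching with something like nextflow run main.nf --bam 'results/*.bam' would then start directly from the BAM files.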

Thank you for your explanations! I’m starting to understand the differences in design, but I’ll need some more time to get a refined picture of caching.

Oh, but (sorry for bumping again) I am re-reading the docs, and the storeDir directive seems very relevant to the original question:

allows you to define a directory that is used as a permanent cache for your process results.
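If I read that right, the generate_the_file example from before could become something like this (the directory name is just an example):

process generate_the_file {
    // if output/intermediate.txt already exists in this directory, the
    // task is skipped and the stored file is used as the process output
    storeDir 'output'

    input:
    path starting

    output:
    path 'intermediate.txt'

    script:
    """
    sort ${starting} > intermediate.txt
    """
}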

It is relevant, but the warning conflicts with what the author intended to do with the directory, unless I misunderstood it. Besides, it’s not the same thing as the resume cache feature. For example, if you set storeDir, it will kick in regardless of whether you use the resume feature. Even if you don’t want to resume your pipeline, it will still refrain from regenerating a file that is already saved in your storeDir.

They’re slightly different things, but may be what you’re looking for.