Resume workflow based on files in publishDir (or other external directory)

Hello – I am considering using Nextflow to process files and save them into a large complex directory structure – our “curated datasets”. Is there a way to have Nextflow’s resume functionality determine whether to run a process based on the presence (and size/modified date) of files in an external directory (not work)? I’d like my input, intermediate, and final output files to all be stored in a file structure that I specify outside of work, but to still use the same resume logic as though those files were in the work directory managed by Nextflow.

I think I could use the publishDir directive to duplicate (copy) the files from work into a subfolder structure that I specify. However, my files are very large and numerous, and I can’t reserve the space for multiple copies. I can’t use the symlink mode of publishDir either, because I need the external data folder to contain the actual data (so that it will persist even if I clear the Nextflow cache in work to free space).
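For concreteness, here is roughly what I was considering (the process name and paths are just illustrative):

process make_dataset {
    // 'copy' duplicates the output outside of work, which I can't afford;
    // 'symlink' would save space, but the real data would stay inside work
    publishDir 'curated/datasets/v1', mode: 'copy'

    output:
    path 'dataset.txt'

    script:
    """
    echo 'big result' > dataset.txt
    """
}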

Are there any Nextflow settings or tips to accomplish what I’m after? Thanks in advance!


Thinking more about this, it seems that using the hard link (link) mode of publishDir may roughly achieve what I’m after, because the file in publishDir (and its contents) should persist even if the original file in work is deleted.
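Concretely, something like this (same illustrative process as above):

process make_dataset {
    // a hard link shares the underlying data with the file in work, so the
    // published file keeps its contents even if work is cleaned, provided
    // publishDir is on the same filesystem as the work directory
    publishDir 'curated/datasets/v1', mode: 'link'

    output:
    path 'dataset.txt'

    script:
    """
    echo 'big result' > dataset.txt
    """
}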

I suspect what I would give up, though, is the ability to resume, because if the file in work is deleted, even if the rest of the cache remains, the workflow would think the file needs to be recreated.

Thus, I’m still interested to know if there is any way to have the canonical location of the output files (the files that are checked when determining how to resume) be in a custom directory structure outside of work.

In order for the resume feature to work, you need both work and .nextflow. If either of them is altered or corrupted by some change you made, resume won’t work properly. You can change the location of these folders, though, but no, you can’t make resume work based on an arbitrary folder structure that you created.
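For example, the location of the work directory can be set in your configuration (the path below is just an example):

// nextflow.config
workDir = '/scratch/myproject/work'

This is equivalent to running nextflow run main.nf -work-dir /scratch/myproject/work on the command line. The .nextflow metadata folder is created in the directory you launch the run from.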

I am interested in this feature as well, similar to what Snakemake or make do.
Does this mean that the only way to do it is “manually”, with ad hoc workflow code? Something like this, with example processes generate_the_file and consume_the_file:

workflow {
    starting = file('mytable.csv', checkIfExists: true)
    intermediate = file('output/intermediate.txt')

    // make-style freshness check: reuse the intermediate only if it
    // exists and is newer than the file it was derived from
    if( intermediate.exists() && intermediate.lastModified() > starting.lastModified() ) {
        consume_the_file(intermediate)
    } else {
        generate_the_file(starting) | consume_the_file
    }
}

Hi @Gullumluvl,

No, you can’t manually do that this way. Nextflow automatically handles caching and resuming of tasks. It already does what you’re describing, but in a more powerful way. No reason to reinvent the wheel.

May I ask what you mean by “you can’t”? The above code works (albeit ignoring the -resume flag). Did you mean “you shouldn’t”?

I agree it seems inappropriate, and I would like to avoid it, but here is my situation:

  • I develop the workflow while running the analysis, so the workflow is built iteratively as successive steps of the analysis are completed. This can make resuming difficult, given the strictness of the cache criteria (sure, once it’s finished I will test the full workflow…)
  • That one task to resume was extremely long (3 weeks, and I can imagine worse situations). I find it legitimate for someone to want to start an analysis by providing the intermediate data, even if it was obtained outside of the Nextflow workflow.
  • For the details: to speed up my computation, I figured out how to parallelize it efficiently halfway through, so I actually ran additional parts of it without Nextflow and then had to merge the results. I don’t think it would be a good idea to put “manual” outputs back into the work directory.

Are there better ways, then? Decomposing the workflow into subworkflows?

You can write a Nextflow pipeline that starts at a later stage, having never run the first steps, provided that you supply the input file for that later-stage step. Think of an RNAseq pipeline in which you want to provide the BAM files instead of a reference genome and the sequence files to align. The important bit here is that this is not the same as resuming a pipeline with Nextflow’s cache feature. It’s simply skipping some steps because you already have the files needed by a later-stage process, and your pipeline code says those steps can be skipped when that file is provided. You could do that manually for every process in your pipeline, but it’s not what I would recommend.
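A minimal sketch of that pattern, with made-up params and process names (ALIGN and QUANTIFY are hypothetical): the pipeline aligns the reads unless BAM files are provided directly.

params.reads = null
params.bam   = null

workflow {
    if( params.bam ) {
        // later-stage entry point: use the provided BAMs and skip alignment
        bam_ch = Channel.fromPath(params.bam, checkIfExists: true)
    } else {
        reads_ch = Channel.fromPath(params.reads, checkIfExists: true)
        bam_ch = ALIGN(reads_ch)    // hypothetical alignment process
    }
    QUANTIFY(bam_ch)                // hypothetical downstream process
}

Launching with something like nextflow run main.nf --bam 'results/*.bam' would then start directly from the BAM files.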

Thank you for your explanations! I’m starting to understand the differences in design, but I’ll need some more time to get a refined picture of caching.

Oh, but (sorry for bumping again) I am re-reading the docs, and the storeDir directive seems very relevant to the original question:

allows you to define a directory that is used as a permanent cache for your process results.
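If I read that right, the generate_the_file example from before could become something like this (the directory name is just an example):

process generate_the_file {
    // if output/intermediate.txt already exists in this directory, the
    // task is skipped and the stored file is used as the process output
    storeDir 'output'

    input:
    path starting

    output:
    path 'intermediate.txt'

    script:
    """
    sort ${starting} > intermediate.txt
    """
}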

It is relevant, but the warning conflicts with what the author intended to do with the directory, unless I misunderstood it. Besides, it’s not the same thing as the resume cache feature. For example, if you set storeDir, it will kick in regardless of whether you use the resume feature. Even if you don’t want to resume your pipeline, it will still refrain from regenerating a file that is already saved in your storeDir.

They’re slightly different things, but may be what you’re looking for.