I’m running a Nextflow pipeline on 191 samples (trim → bismark align → deduplicate → methylation extraction) and have hit my HPC storage limit. The trimmed FASTQs in my work directory are taking up ~3-4TB.
125 of my samples have already completed bismark alignment. Since all downstream steps use the BAM files, the trimmed FASTQs are no longer needed as inputs to any future process.
If I delete the trimmed FASTQs for those 125 completed samples and resume with -resume, will Nextflow re-run trimming?
That’s just how Nextflow handles cache + resume. On every run it walks the entire pipeline, and for each task it checks whether a cached result exists. If it can’t find one, it re-runs that task, plus every downstream task.
I find it does cleanup more aggressively than the regular cleanup = true option. It cleans up while the pipeline is running: once it determines a file is no longer needed by any downstream process, it deletes it.
Yup, that’s exactly what it does. But it also breaks -resume, so I’m not sure it’ll help here.
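If the aggressive cleanup being discussed is the nf-boost plugin (an assumption on my part; the thread doesn’t name it), enabling it looks roughly like this in nextflow.config:

```groovy
// nextflow.config — sketch, assuming the nf-boost plugin is what's meant here
plugins {
    id 'nf-boost'
}

boost {
    cleanup = true   // delete intermediate task outputs once no pending task needs them
}
```

As noted above, deleting those intermediates means a later -resume can no longer find the cached outputs, so the affected tasks will re-run.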
Generally if you have more samples than you have space for with intermediate files, your options are:
Split into batches and run a separate Nextflow job for each, sequentially, cleaning up intermediate files after each run.
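A rough sketch of that batching loop, assuming a samplesheet-driven pipeline (the file names, `--input` parameter, and batch size here are all made up; adapt to your pipeline):

```
# Split the full samplesheet into batches, run each to completion,
# then delete that batch's work directory before starting the next.
head -n 1 samples.csv > header.csv
tail -n +2 samples.csv | split -l 40 - batch_
for b in batch_*; do
    cat header.csv "$b" > "samplesheet_${b}.csv"
    nextflow run main.nf --input "samplesheet_${b}.csv" -work-dir "work_${b}"
    rm -rf "work_${b}"   # reclaim space; this batch is fully done
done
```

Note that -resume only works within a batch here; once a batch’s work dir is deleted, that batch can’t be resumed.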
Run Nextflow with some variation of scratch usage
Just enabling it can help, since only the named output files get copied from the node back to the work directory.
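Enabling scratch is a one-line config change; a minimal sketch (the path, if you set one, is whatever your cluster provides):

```groovy
// nextflow.config
process {
    scratch = true   // run each task in the node's local scratch space
    // or point at a specific location, e.g. scratch = '/tmp'
}
```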
You can run independent batches on a single node using the local executor, submitting the entire Nextflow pipeline run as a single HPC job. Set the work directory to the node’s scratch dir and publish the final results to networked storage.
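For that single-node setup, the relevant settings might look like this (the paths are placeholders; whether node-local scratch is at /local/scratch or somewhere else depends on your cluster):

```groovy
// nextflow.config — sketch for running the whole pipeline inside one HPC job
process.executor = 'local'                 // all tasks run on the allocated node
workDir = '/local/scratch/nf-work'         // placeholder: node-local scratch for intermediates

// publish final results to networked storage from your processes, e.g.:
// publishDir '/project/shared/results', mode: 'copy'
```

The work dir can also be set per-run on the command line with -work-dir instead of in the config.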