It is not possible to run an orchestration process indefinitely on HPC systems. Wall time limits are typically 24 hours, sometimes up to a few days, which might not be enough to complete an entire workflow.
Enhancement request
I was wondering if there is scope for adding a wall time limit for the orchestration process. Once this limit is reached, the tool would no longer submit new jobs, even if there are more processes to run. Before the limit is reached, it would also not submit any job whose requested wall time would run past the limit at the point of submission.
Beyond not submitting jobs, the orchestration tool would also ensure that all the information needed to describe the state of the current workflow is stored (locally, or in a remote database or file), allowing for a quick restart.
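To make the requested rule concrete, here is a minimal sketch in Python. It is not Nextflow code and does not reflect any existing Nextflow feature; the function name `may_submit` and its parameters are hypothetical and only illustrate the two conditions described above.

```python
from datetime import datetime, timedelta


def may_submit(now, orchestrator_deadline, job_walltime):
    """Hypothetical check: may a job be submitted at time `now`?

    A job is rejected if the orchestrator's own wall time limit has passed,
    or if the job's requested wall time would run past that limit when
    counted from the moment of submission.
    """
    if now >= orchestrator_deadline:
        return False
    return now + job_walltime <= orchestrator_deadline


# Example: the orchestrator was started with a 24 hour limit.
start = datetime(2024, 1, 1, 0, 0)
deadline = start + timedelta(hours=24)

# 20 hours in, a 2 hour job still fits, a 6 hour job does not.
fits = may_submit(start + timedelta(hours=20), deadline, timedelta(hours=2))
too_long = may_submit(start + timedelta(hours=20), deadline, timedelta(hours=6))
print(fits, too_long)
```

Note that this counts the job's wall time from the point of submission, as in the request, so time spent waiting in the scheduler queue is not accounted for.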
Thanks for sharing this (and other) feedback, Pascal.
Why are you suggesting to kill the orchestrator process?
A couple of comments on this:
I know that a typical default is for this process to kill off its “child” processes upon termination. However, the run command has an option called -disable-jobs-cancellation, which allows you to cancel a run without killing all in-progress jobs. Nextflow will simply exit and leave those jobs to finish on their own. If you were to wait for those jobs to finish and then resume your run, I believe those jobs would be cached.
Also, there is a -resume option that restarts a pipeline from the previous checkpoint: any successfully completed tasks will not be re-run, provided the input data are the same.
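As a rough illustration of how the two options mentioned above might be combined in a wrapper script for an HPC allocation, here is a small Python sketch that only builds the command lines. The helper `nextflow_command` and the pipeline name `main.nf` are placeholders invented for this example; the only Nextflow-specific parts are the `-resume` and `-disable-jobs-cancellation` options on the run command.

```python
import shlex


def nextflow_command(pipeline, resume=False, keep_jobs_on_exit=False):
    """Build an argument list for `nextflow run` (hypothetical helper)."""
    cmd = ["nextflow", "run", pipeline]
    if resume:
        # Skip tasks that already completed successfully in a previous run.
        cmd.append("-resume")
    if keep_jobs_on_exit:
        # Leave already submitted jobs running if the orchestrator exits.
        cmd.append("-disable-jobs-cancellation")
    return cmd


# First allocation: start the run, leaving jobs alive if we get cut off.
first = nextflow_command("main.nf", keep_jobs_on_exit=True)
# Later allocation: pick up where the previous run stopped.
again = nextflow_command("main.nf", resume=True, keep_jobs_on_exit=True)

print(shlex.join(first))
print(shlex.join(again))
```

The lists can be passed to `subprocess.run` as they are; whether jobs left running are picked up from the cache on resume is as described above, not something this sketch verifies.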
The reason I ask is that disabling job cancellation is good practice on HPC, since the wall time limit could cause the orchestration process to be terminated. Still, I would think it cleaner for the process to know that it has limited time to run, so that it can prepare for cancellation.
Suppose the tool were given a limit, plus a time span before that limit in which to submit no more jobs (with the option for this span to be zero, so it keeps submitting right to the end). If it also generated a resume command for the workflow, recording which parts have already completed and which parts of the pipeline need to be checked to see whether they have completed, the restart would be faster. It would perhaps also be a bit more portable if files are moved.
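The kind of record I have in mind could look something like the following Python sketch. Everything here is hypothetical (the function `build_manifest`, the field names, the task names); it is not how Nextflow stores its state, only an illustration of a stop-submitting time derived from the limit plus lists of completed and still-to-check parts.

```python
import json
from datetime import datetime, timedelta


def build_manifest(deadline, no_submit_window, completed, in_flight):
    """Describe the workflow state at shutdown (hypothetical format).

    `no_submit_window` may be zero, in which case jobs are submitted
    right up to the deadline.
    """
    return {
        "deadline": deadline.isoformat(),
        "stop_submitting_at": (deadline - no_submit_window).isoformat(),
        # Parts known to be finished: no need to re-check on restart.
        "completed": sorted(completed),
        # Parts submitted but not confirmed: check these on restart.
        "check_on_resume": sorted(in_flight),
    }


deadline = datetime(2024, 1, 2, 0, 0)
manifest = build_manifest(
    deadline,
    timedelta(hours=1),
    completed={"qc", "align"},
    in_flight={"call_variants"},
)
print(json.dumps(manifest, indent=2))
```

On restart, only the entries under `check_on_resume` would need to be inspected, which is where the speed-up would come from.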
I think the last point would be key, but perhaps I do not know Nextflow well enough. If I run something and the workflow does not complete, and I then move all the data produced by the completed portion to an archive off the filesystem used for the run, how would Nextflow know what to restart?
Nextflow uses your .nextflow and work directories to resume your run. By default, .nextflow is in your $HOME and work is in the directory you ran your pipeline from. If you move these folders somewhere else correctly, you can resume your pipeline from there.