(I originally posted this message on the nf-core Slack channel, but I was advised that this post was more suited to the Nextflow channel, so I am re-posting here.)
I’ve been managing the excellent nf-core pipelines for several years now at my institute and generally they run very well. However, there is one problem we are encountering increasingly often: jobs fail to submit to our cluster.
To give more details, when we start an nf-core pipeline, many jobs are submitted – we use Slurm and we can see the jobs running using squeue.
However, after a period of time, no new jobs are submitted – i.e. squeue returns 0 running jobs – and we can wait for hours, or even days, without any new jobs appearing.
Consequently, we kill the background Java process, which terminates the pipeline. We then restart the pipeline using the -resume flag, after which more jobs are submitted to the Slurm queue.
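For context, the stop/restart cycle looks roughly like the sketch below (the pipeline name, profile and PID are placeholders rather than our actual setup):

```bash
# find the hung Nextflow launcher -- it shows up as a Java process on the head node
ps aux | grep "[n]extflow"

# terminate it (replace 12345 with the PID reported above)
kill 12345

# relaunch from the same launch directory; -resume reuses the cached task results
nextflow run nf-core/<pipeline> -profile <profile> -resume
```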
Sometimes we have to stop and restart a pipeline multiple times, but on other occasions the pipeline will run to completion without any restart needed. This seems to be non-deterministic, i.e. running the same pipeline with the same data may require a different number of restarts on different occasions.
Has anyone else encountered this?
I’m sorry if this has been resolved elsewhere, but does anyone know why the background Java process keeps running on the head node without submitting any more Slurm jobs?
Thanks for sharing your issue on the community forum!
This sounds like a tricky problem to troubleshoot since, as you mention, it’s non-deterministic, so it’s really difficult to know exactly where to start looking for potential issues. I have not observed this particular behaviour when working on Slurm HPCs. From what you describe, I would suspect this is more of an underlying HPC problem. Here are some things I can imagine could impact your job submissions on HPC:
Do your new jobs not get submitted because of a job submission limit on your cluster? Some HPC admins are quite strict about how many jobs can be submitted, so you may simply be hitting such a limit.
How big is the queue you are trying to submit jobs to? Maybe all of the available nodes are occupied by other users and Nextflow somehow can’t get its jobs scheduled, even once resources become available? Do you have another submission queue on your HPC you could try? (A few commands for checking both of these points are sketched below.)
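If it helps, a few commands along these lines could be a starting point for checking both of the above on the cluster side. The exact limit names differ between sites, so treat this as a sketch rather than a recipe:

```bash
# any per-user or per-association submission limits enforced by the scheduler?
sacctmgr show qos
sacctmgr show assoc user=$USER

# controller-wide limits
scontrol show config | grep -i -E "maxjob|limit"

# how busy is the partition you submit to, and are nodes actually free?
sinfo -p <partition>
squeue -p <partition> | wc -l
```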
From a Nextflow point of view, I can’t think of anything that could explain such long waiting times and non-deterministic behaviour. You could still check the queueSize and submitRateLimit settings in your configs to see whether they are set to some strange values that might influence your submission behaviour (Configuration — Nextflow documentation).
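One quick way to see whether anything in the configs in play already sets those values, assuming your configs live in the usual places (the paths here are just examples):

```bash
# search the typical config locations for the two settings mentioned above
grep -rnE "queueSize|submitRateLimit" ~/.nextflow/config nextflow.config custom.config 2>/dev/null
```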
Hope these suggestions help, and sorry I don’t have a more definitive answer.
The person who manages the job submission aspect of the HPC is out of the office at the moment, so I’ll need to wait for a reply.
(I’ve also attached the log files from the launch directory, generated after the pipeline had been stopped, restarted and run to completion. I’ve not been able to work out from these what the problem is with the HPC settings when running pipelines.)
Has your cluster undergone maintenance or an update recently? If it’s Slurm, I had a similar issue and it was indeed related to submitRateLimit and co. Slurm sometimes cannot keep up with Nextflow’s submission speed, and to protect itself a recent update introduced a rate limit that is extremely difficult to quantify, because it counts not only the jobs you submit but also any other Slurm commands issued in the background. It is a known issue in Slurm and hopefully it will be fixed at some point.
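In case it’s useful, here is a minimal sketch of the kind of throttling config I mean. The option names (queueSize, submitRateLimit, pollInterval, queueStatInterval) are standard Nextflow executor settings, but the values are only guesses to tune for your site, not recommendations:

```bash
# write a small extra config that slows down both job submission and the
# background status polling that the Slurm rate limit also counts
cat > throttle.config <<'EOF'
executor {
    queueSize         = 50          // cap on jobs Nextflow keeps queued/running at once
    submitRateLimit   = '10/1min'   // at most 10 job submissions per minute
    pollInterval      = '2 min'     // check task status less frequently
    queueStatInterval = '5 min'     // refresh the cluster queue status less often
}
EOF

# relaunch with the extra config layered on top of the usual profile
nextflow run nf-core/<pipeline> -profile <profile> -c throttle.config -resume
```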
Thanks for your message - I’ve been on annual leave, or I would have replied sooner.
I’ve updated my config file and will try this…fingers crossed! (Unless I am mistaken, I seem to remember we discussed this exact same issue over a burrito during lunch at a Nextflow workshop at the Sanger.)
Anyway, I hope this solves the problem. Thanks once again.