Slurm jobs not being submitted

(I originally posted this message on the nf-core Slack channel, but I was advised that this post was more suited to the Nextflow channel, so I am re-posting here.)

I’ve been managing the excellent nf-core pipelines at my institute for several years now and generally they run very well. However, there is one problem we are encountering increasingly often: jobs fail to be submitted to our cluster.

To give more details, when we start an nf-core pipeline, many jobs are submitted – we use Slurm and we can see the jobs running using squeue.

However, after a period of time, no new jobs will be submitted – i.e. squeue returns 0 running jobs. We can wait for hours, or even days, and no new jobs will be submitted.

Consequently, we kill the background Java process, which terminates the pipeline. We then restart the pipeline using the -resume flag, after which more jobs are submitted to the Slurm queue.

Sometimes we have to stop and restart a pipeline multiple times, but on other occasions the pipeline will run to completion without any restart needed. This seems to be non-deterministic, i.e. running the same pipeline with the same data may require a different number of restarts on different occasions.

Has anyone else encountered this?

I’m sorry if this has been resolved elsewhere, but does anyone know why the background Java process runs on the head node without submitting more Slurm jobs?

Many thanks for your help.
Steven

Hi @StevenWingett ,

thanks for sharing your issue on the community forum!

This sounds like a tricky problem to troubleshoot since, as you mention, it’s non-deterministic. It’s therefore really difficult to know exactly where to start looking for potential issues. I have not observed this particular behaviour myself when working on Slurm HPCs. From what you describe, I would suspect that this is more of an underlying HPC problem. Here are some things I can imagine could impact your job submissions on HPC:

  • Are your new jobs not being submitted because of a job submission limit on your cluster? Some HPC admins are quite strict about how many jobs each user can have queued at once, and hitting such a limit could block new submissions.
  • How busy is the queue you are trying to submit jobs to? Maybe all of the available nodes are occupied by other users and Nextflow somehow never gets around to scheduling its jobs, even once resources free up. Do you have another submission queue on your HPC you could try? (See the config sketch below.)
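
If you do want to test a different queue, pointing the pipeline at it only takes a couple of lines in a custom config. A minimal sketch, assuming your cluster has a partition called `short` (purely a placeholder name):

```groovy
// custom.config -- minimal sketch; 'short' is a hypothetical partition name
process {
    executor = 'slurm'
    queue    = 'short'   // passed to Slurm as the partition (sbatch -p short)
}
```

You could then launch the pipeline with `-c custom.config` so these settings override the defaults.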

From a Nextflow point of view, I can’t think of anything that would explain such long waiting times and non-deterministic behaviour. You could still check the queueSize and submitRateLimit settings in your configs to see whether they are set to some strange values that might influence your submission behaviour (Configuration — Nextflow documentation).
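
For reference, here is a minimal sketch of what conservative values for those settings could look like in a custom config; the numbers are illustrative assumptions only, not recommendations:

```groovy
// Illustrative values only -- tune to whatever your cluster/admins allow
executor {
    queueSize       = 50          // max number of jobs Nextflow keeps in the Slurm queue at once
    submitRateLimit = '10/1min'   // throttle submissions to at most 10 jobs per minute
}
```

If either setting already appears in your institutional or pipeline configs with an unexpectedly low value, that could at least explain slow submission, though not a complete stall.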

Hope these suggestions help; sorry I don’t have a more definitive answer.