(I originally posted this message on the nf-core Slack channel, but I was advised that this post was more suited to the Nextflow channel, so I am re-posting here.)
I’ve been managing the excellent nf-core pipelines for several years now at my institute and generally they run very well. However, there is one problem we are encountering increasingly often: jobs fail to submit to our cluster.
To give more details, when we start an nf-core pipeline, many jobs are submitted – we use Slurm and we can see the jobs running using squeue.
However, after a period of time, no new jobs are submitted – i.e. squeue returns 0 running jobs – and we can wait for hours, or even days, without any new jobs appearing.
Consequently, we kill the background Java process, which terminates the pipeline. We then restart the pipeline using the -resume flag, after which more jobs are submitted to the Slurm queue.
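For context, the stop/restart cycle looks roughly like the sketch below (the pipeline name, profile and PID are placeholders rather than our actual setup):

```bash
# find the hung Nextflow launcher -- it shows up as a Java process on the head node
ps aux | grep "[n]extflow"

# terminate it (replace 12345 with the PID reported above)
kill 12345

# relaunch from the same launch directory; -resume reuses the cached task results
nextflow run nf-core/<pipeline> -profile <profile> -resume
```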
Sometimes we have to stop and restart a pipeline multiple times, but on other occasions the pipeline will run to completion without any restart needed. This seems to be non-deterministic, i.e. running the same pipeline with the same data may require a different number of restarts on different occasions.
Has anyone else encountered this?
I’m sorry if this has been resolved elsewhere, but does anyone know why the background Java process keeps running on the head node without submitting any more Slurm jobs?
Thanks for sharing your issue on the community forum!
This sounds like a tricky problem to troubleshoot since, as you mention, it’s non-deterministic, so it’s really difficult to know exactly where to start looking for potential issues. I have not observed this particular behaviour when working on Slurm HPCs. From what you describe, I would suspect this is more of an underlying HPC problem. Here are some things I can imagine could impact your job submissions on HPC:
Do your new jobs not get submitted because of a job submission limit on your cluster? Some HPC admins are quite strict about how many jobs can be submitted, so you may simply be hitting such a limit.
How big is the queue you are trying to submit jobs to? Maybe all of the available nodes are occupied by other users and Nextflow somehow can’t get its jobs scheduled, even once resources become available? Do you have another submission queue on your HPC you could try? (A few commands for checking both of these points are sketched below.)
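If it helps, a few commands along these lines could be a starting point for checking both of the above on the cluster side. The exact limit names differ between sites, so treat this as a sketch rather than a recipe:

```bash
# any per-user or per-association submission limits enforced by the scheduler?
sacctmgr show qos
sacctmgr show assoc user=$USER

# controller-wide limits
scontrol show config | grep -i -E "maxjob|limit"

# how busy is the partition you submit to, and are nodes actually free?
sinfo -p <partition>
squeue -p <partition> | wc -l
```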
From a Nextflow point of view, I can’t think of anything that could explain such long waiting times and non-deterministic behaviour. You could still check the queueSize and submitRateLimit settings in your configs to see whether they are set to some strange values that might influence your submission behaviour (Configuration — Nextflow documentation).
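One quick way to see whether anything in the configs in play already sets those values, assuming your configs live in the usual places (the paths here are just examples):

```bash
# search the typical config locations for the two settings mentioned above
grep -rnE "queueSize|submitRateLimit" ~/.nextflow/config nextflow.config custom.config 2>/dev/null
```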
Hope these suggestions help, and sorry I don’t have a more definitive answer.
The person who manages the job submission aspect of the HPC is out of the office at the moment, so I’ll need to wait for a reply.
(I’ve also attached the log files from the launch directory, generated after the pipeline had been stopped, restarted and run to completion. I’ve not been able to work out from these what the problem is with the HPC settings when running pipelines.)
Has your cluster undergone maintenance or an update recently? If it’s Slurm, I had a similar issue and it was indeed related to submitRateLimit and co. Slurm sometimes cannot keep up with Nextflow’s submission speed, and to protect itself a recent update introduced a rate limit that is extremely difficult to quantify, because it counts not only the jobs you submit but also any other Slurm commands issued in the background. It is a known issue in Slurm and hopefully it will be fixed at some point.
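In case it’s useful, here is a minimal sketch of the kind of throttling config I mean. The option names (queueSize, submitRateLimit, pollInterval, queueStatInterval) are standard Nextflow executor settings, but the values are only guesses to tune for your site, not recommendations:

```bash
# write a small extra config that slows down both job submission and the
# background status polling that the Slurm rate limit also counts
cat > throttle.config <<'EOF'
executor {
    queueSize         = 50          // cap on jobs Nextflow keeps queued/running at once
    submitRateLimit   = '10/1min'   // at most 10 job submissions per minute
    pollInterval      = '2 min'     // check task status less frequently
    queueStatInterval = '5 min'     // refresh the cluster queue status less often
}
EOF

# relaunch with the extra config layered on top of the usual profile
nextflow run nf-core/<pipeline> -profile <profile> -c throttle.config -resume
```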
Thanks for your message - I’ve been on annual leave, or I would have replied sooner.
I’ve updated my config file and will try this…fingers crossed! (Unless I am mistaken, I seem to remember we discussed this exact same issue over a burrito during lunch at a Nextflow workshop at the Sanger.)
Anyway, I hope this solves the problem. Thanks once again.