Slurm killing platform jobs

I set up a compute environment to run jobs using our HPC. Our cluster uses Slurm. When I submit the head job, I can see that it is running (using squeue -u) and that eventually the cluster jobs start to run. However, after a few minutes, it looks like Slurm kills all of the jobs. The error messages are usually SIGTERM or

terminated for unknown reason -- Likely it has been terminated by the external system.

The pipelines run when I submit the head job from the command line, but not when I submit from Seqera Platform. This shouldn’t be a permissions issue, right? Platform submits the head job from my user account (via SSH key) and all of the input data is in my home directory. Maybe it is a Slurm configuration problem.

When I set up the compute environment, I set the resources using advanced options → head job options. For example, for the MSU_HPCC_short compute environment, I have the following:

--time=4:00 --mem=24GB --cpus-per-task=8

Is there a better way to set resources that allows me to use the same compute environment every time? And any idea why my jobs are getting killed by Slurm?

Thanks
John

Those options only apply to the Slurm submission that runs the main Nextflow process itself (the head job), not to the jobs that Nextflow then submits. If Slurm kills any of the compute jobs because they exceeded their requested resources, the pipeline run will crash and all jobs will exit, including the head job.

If Slurm is killing the compute jobs because they are requesting too many resources, you need to configure Nextflow itself, not Platform. You can do this by passing a Nextflow config in a few places, including at launch.

The Nextflow config should then set process.cpus, process.memory and process.time, the same as if you were running Nextflow manually on the command line. If you’re using nf-core pipelines (or pipelines built from the nf-core template), you can also use the max-resources parameters (--max_cpus, --max_memory and --max_time) to cap what any single task can request.
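For example, a config along these lines could be pasted into the Nextflow config field on the launch form, or passed with -c on the command line. The process name BIG_TASK and the numbers here are just placeholders for illustration - adjust them to your pipeline and your cluster’s limits:

process {
    cpus   = 4
    memory = 16.GB
    time   = 2.h

    // Override the defaults for a specific (hypothetical) process by name
    withName: 'BIG_TASK' {
        cpus   = 8
        memory = 24.GB
        time   = 4.h
    }
}

Nextflow then translates these values into the corresponding Slurm directives when it submits each task.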

You can check that your configuration is definitely getting through by looking in the task work directories that are created. The .command.run files should have #SBATCH header lines with the Slurm arguments.
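With settings like the sketch above, you would expect to see directives roughly like the following near the top of .command.run (the exact flags and formatting Nextflow writes can differ between versions, so treat these as illustrative):

#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=02:00:00

If those lines still show the default values, your config is not reaching the run.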

Phil

Hi Phil,

We figured out that the problem is that Seqera Platform is launching Slurm jobs from our gateway node. Is there a way to add a pre-run script that SSHes into a development node so jobs are launched from there?

Thanks,
John

Ah nice, good find!

Yes, you can run pre- and post-run scripts: Advanced options | Seqera Docs

You may also find it easier to run using the Tower Agent - you can run that process wherever you like (for example, on your development node) and it will reach out to Platform and submit jobs from wherever it is running.

Phil