I set up a compute environment to run jobs on our HPC, which uses Slurm. When I submit the head job, I can see that it is running (using squeue -u) and that the cluster jobs eventually start to run. However, after a few minutes, Slurm appears to kill all of the jobs. The error messages are usually SIGTERM or:
terminated for unknown reason -- Likely it has been terminated by the external system.
The pipelines run when I submit the head job from the command line, but not when I launch from Seqera Platform. This shouldn’t be a permissions issue, right? Platform submits the head job from my own user account (via SSH key), and all of the input data is in my home directory. Maybe it is a Slurm configuration problem.
When I set up the compute environment, I set the resources using advanced options → head job options. For example, for the MSU_HPCC_short compute environment, I have the following:
--time=4:00 --mem=24GB --cpus-per-task=8
Is there a better way to set resources that allows me to use the same compute environment every time? And any idea why my jobs are getting killed by Slurm?
Those head job options only apply to the Slurm submission that runs the main Nextflow process itself, not to the compute jobs that Nextflow then submits. If Slurm kills any of the compute jobs because they go over their requested resources, the pipeline run will crash and all jobs will exit, including the head job.
If Slurm is killing the compute jobs because they are requesting too many resources, you need to configure Nextflow itself, not Platform. You can do this by passing a Nextflow config in a few places, including at launch.
The Nextflow config should then set process.cpus, process.memory and process.time, the same as if you were running Nextflow manually on the command line. If you’re using nf-core pipelines (or pipelines made with the nf-core template), you can also use the max-resources parameters to cap what any single task requests.
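As a rough sketch only (the file name resources.config, the resource values, and nf-core/rnaseq as the example pipeline are placeholders, not recommendations for your cluster), the config could look like this, written to a file and passed at launch with -c, or pasted into the Nextflow config field in the Platform launch form:

# Write a small config that sets per-process resources for the Slurm executor.
# All values are illustrative; set them to what your tasks actually need.
cat > resources.config <<'EOF'
process {
    executor = 'slurm'
    cpus     = 4
    memory   = '16 GB'
    time     = '2h'

    // Bump heavier steps by label, e.g. tasks labelled process_high in nf-core pipelines:
    withLabel: 'process_high' {
        cpus   = 8
        memory = '24 GB'
        time   = '4h'
    }
}
EOF

# Command-line equivalent of a Platform launch (nf-core/rnaseq is just an example pipeline).
# The --max_* parameters cap requests on pipelines built from older nf-core templates;
# newer templates use process.resourceLimits in the config instead.
nextflow run nf-core/rnaseq -c resources.config --max_cpus 8 --max_memory '24.GB' --max_time '4.h'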
You can check that your configuration is definitely getting through by looking in the task work directories that are created. The .command.run files should have #SBATCH header lines with the Slurm arguments.
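For example, something like this run from the launch directory (the work directory location is an assumption; the task details page in Platform shows the actual work dir for each task):

# Inspect the generated Slurm job script for one task; the #SBATCH header
# lines should reflect the cpus / memory / time you configured.
grep '^#SBATCH' "$(find work -type f -name .command.run | head -n 1)"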
We figured out that the problem is that Seqera Platform is launching Slurm jobs from our gateway node. Is there a way to add a pre-run script that SSHes into a development node so that jobs are launched from there?
You may also find it easier to run using the Tower Agent - you can run the agent process wherever you like (a development node, for example) and it will reach out to Platform and submit jobs from wherever it is running.
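If you want to try that, here is a loose sketch. It assumes you have created a Tower Agent credential in Platform and have a personal access token; the download URL and the --work-dir flag are from memory, so check the Agent documentation and ./tw-agent --help for your version:

# Download the Tower Agent binary onto a node that is allowed to submit Slurm
# jobs (your development node, for instance), then start it with the connection
# ID and token from Platform. Anything in <...> is a placeholder to fill in.
curl -fSL https://github.com/seqeralabs/tower-agent/releases/latest/download/tw-agent-linux-x86_64 -o tw-agent
chmod +x tw-agent
export TOWER_ACCESS_TOKEN=<your personal access token>
./tw-agent <agent connection ID> --work-dir=$HOME/nf-agent-work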