Slurm requiring multiple resumes for pipeline advancement

I am running a Nextflow pipeline with Slurm and the runs fail at different points. Running with -resume advances the pipeline further, but then it fails again. Here are the config settings (full config attached: nextflow.config (6.7 KB), resume.nextflow.log (1.1 MB)):

process {
    executor = 'slurm'
    cpus = { max_cpu(4) }
    memory = { max_mem(6.GB * task.attempt) }
    time = params.max_time
    container = "public.ecr.aws/o5l3p3e4/scale_methyl:v1.6"
}

The majority of the failures exit with code 137 (128 + SIGKILL, which on Slurm usually means the job was killed for exceeding its memory request), so memory is the issue.

How do I assign resource allocations correctly? I currently have cpus and memory set in nextflow.config in the process scope, in withName selectors, and via the params scope.
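For reference, this is a minimal sketch of how those three layers typically interact (the process name BIG_STEP and all values here are placeholders, not taken from the attached config): withName selectors override the generic process scope, and params values only take effect where a directive actually references them.

process {
    // defaults applied to every process
    cpus   = 2
    memory = 4.GB

    // overrides the defaults for the BIG_STEP process only
    withName: 'BIG_STEP' {
        cpus   = 8
        memory = { 16.GB * task.attempt }
    }

    // params values do nothing on their own; a directive must reference them
    time = params.max_time
}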

Whenever you're writing a process for a tool, first check whether there is already an nf-core module for it. If there is, use it, as it is curated by experts. Even if you ultimately want to do things differently, you can at least learn from the resources requested in the module. If there is no module for your tool, look for modules for similar tools and estimate the resource allocation from those.
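For context, here is a sketch of how nf-core pipelines typically wire this up: each module declares a resource label (such as process_low or process_medium), and the pipeline's conf/base.config maps those labels to concrete values that scale with task.attempt. The numbers below are illustrative, not from any specific pipeline:

process {
    // each nf-core module declares one of these labels;
    // the pipeline maps the label to actual resources here
    withLabel: 'process_low' {
        cpus   = { 2 * task.attempt }
        memory = { 12.GB * task.attempt }
        time   = { 4.h * task.attempt }
    }
    withLabel: 'process_medium' {
        cpus   = { 6 * task.attempt }
        memory = { 36.GB * task.attempt }
        time   = { 8.h * task.attempt }
    }
}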

You can also configure automatic retries with increasing resources. See the example below (and read here, here and here for more information):

process FOO {
    cpus 4
    // memory and time scale with the attempt number: 2 GB / 1 h on the
    // first attempt, 4 GB / 2 h on the second, and so on
    memory { 2.GB * task.attempt }
    time { 1.hour * task.attempt }
    // retry only when the task was killed with exit status 140
    errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }
    maxRetries 3

    script:
    """
    your_command --cpus $task.cpus --mem $task.memory
    """
}

In the snippet above, if the exit status of the task is 140, the task will be retried; otherwise it will be terminated. maxRetries sets the maximum number of times the task will be retried, and memory and time depend on how many attempts have already been made: 2 GB and 1 hour on the first attempt, 4 GB and 2 hours on the second, and so on, up to 8 GB and 4 hours on the fourth and final attempt (the initial run plus three retries).
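Since most of your failures are exit code 137 rather than 140, you would want the retry condition to cover that code too. Here is a sketch of how this could look in your config, using the resourceLimits directive (available in Nextflow 24.04 and later) instead of custom max_cpu/max_mem helpers to cap the escalation; the limit values are placeholders you should adapt to your cluster:

process {
    executor = 'slurm'

    // retry on the typical kill codes (137 = OOM kill, 140 = scheduler kill),
    // give up on anything else
    errorStrategy = { task.exitStatus in [137, 140] ? 'retry' : 'terminate' }
    maxRetries    = 3

    // escalate per attempt; resourceLimits caps the requests so Slurm never
    // receives a value larger than the cluster can satisfy
    cpus   = 4
    memory = { 6.GB * task.attempt }
    time   = { 4.h * task.attempt }
    resourceLimits = [ cpus: 16, memory: 128.GB, time: 48.h ]
}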