Slurm requiring multiple resumes for pipeline advancement

I am running a Nextflow pipeline with Slurm and the runs fail at different points. Running with -resume advances the pipeline further, but then it fails again. Here are the config settings (full config attached: nextflow.config (6.7 KB), resume.nextflow.log (1.1 MB)):

process {
    executor = 'slurm'
    cpus = { max_cpu(4) }
    memory = { max_mem(6.GB * task.attempt) }
    time = params.max_time
    container = "public.ecr.aws/o5l3p3e4/scale_methyl:v1.6"
}

The majority of the failures exit with code 137 (128 + SIGKILL, which on Slurm usually means the job was killed for exceeding its memory request), so memory is the issue.

How do I assign resource allocations correctly? I currently have cpus and memory set in nextflow.config in the process scope, in withName selectors, and via the params scope.
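For reference, this is a minimal sketch of how those three layers typically interact (the process name BIG_STEP and all values here are placeholders, not taken from the attached config): withName selectors override the generic process scope, and params values only take effect where a directive actually references them.

process {
    // defaults applied to every process
    cpus   = 2
    memory = 4.GB

    // overrides the defaults for the BIG_STEP process only
    withName: 'BIG_STEP' {
        cpus   = 8
        memory = { 16.GB * task.attempt }
    }

    // params values do nothing on their own; a directive must reference them
    time = params.max_time
}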

Whenever you're writing a process for a tool, first check whether there is already an nf-core module for it. If there is, use it, as it is curated by experts. Even if you ultimately want to do things differently, you can at least learn from the resources requested in the module. If there is no module for your tool, look for modules for similar tools and estimate the resource allocation from those.
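For context, here is a sketch of how nf-core pipelines typically wire this up: each module declares a resource label (such as process_low or process_medium), and the pipeline's conf/base.config maps those labels to concrete values that scale with task.attempt. The numbers below are illustrative, not from any specific pipeline:

process {
    // each nf-core module declares one of these labels;
    // the pipeline maps the label to actual resources here
    withLabel: 'process_low' {
        cpus   = { 2 * task.attempt }
        memory = { 12.GB * task.attempt }
        time   = { 4.h * task.attempt }
    }
    withLabel: 'process_medium' {
        cpus   = { 6 * task.attempt }
        memory = { 36.GB * task.attempt }
        time   = { 8.h * task.attempt }
    }
}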

You can also configure automatic retries with increasing resources. See the example below (and read here, here and here for more information):

process FOO {
    cpus 4
    // memory and time scale with the attempt number: 2 GB / 1 h on the
    // first attempt, 4 GB / 2 h on the second, and so on
    memory { 2.GB * task.attempt }
    time { 1.hour * task.attempt }
    // retry only when the task was killed with exit status 140
    errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }
    maxRetries 3

    script:
    """
    your_command --cpus $task.cpus --mem $task.memory
    """
}

In the snippet above, if the exit status of the task is 140, the task will be retried; otherwise it will be terminated. maxRetries sets the maximum number of times the task will be retried, and memory and time depend on how many attempts have already been made: 2 GB and 1 hour on the first attempt, 4 GB and 2 hours on the second, and so on, up to 8 GB and 4 hours on the fourth and final attempt (the initial run plus three retries).
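Since most of your failures are exit code 137 rather than 140, you would want the retry condition to cover that code too. Here is a sketch of how this could look in your config, using the resourceLimits directive (available in Nextflow 24.04 and later) instead of custom max_cpu/max_mem helpers to cap the escalation; the limit values are placeholders you should adapt to your cluster:

process {
    executor = 'slurm'

    // retry on the typical kill codes (137 = OOM kill, 140 = scheduler kill),
    // give up on anything else
    errorStrategy = { task.exitStatus in [137, 140] ? 'retry' : 'terminate' }
    maxRetries    = 3

    // escalate per attempt; resourceLimits caps the requests so Slurm never
    // receives a value larger than the cluster can satisfy
    cpus   = 4
    memory = { 6.GB * task.attempt }
    time   = { 4.h * task.attempt }
    resourceLimits = [ cpus: 16, memory: 128.GB, time: 48.h ]
}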