Troubleshooting why Nextflow is not capturing an Out of Memory (OOM) error (exit status 137) in a piped command within a process

Hi Everyone …

Issue with Capturing Exit Status in a Piped Command for Memory-Intensive Process
Environment:
Nextflow version: 24.10.0
Executor: local (running on an EC2 machine)
Docker: Enabled (process.container = 'fragment_env:latest')
Configuration:
I have the following main.nf script with nextflow.enable.dsl=2:

process fragmentfilter {

    cpus { 4 * task.attempt }
    memory { 16.GB * task.attempt }
    errorStrategy { task.exitStatus in 137..141 ? 'retry' : 'terminate' }
    maxRetries 3
    debug true

    publishDir "${params.pubdir}/fragment"

    input:
    file fragBed
    file targetBed

    output:
    path "filtered_file.bed"

    script:
    """
    echo "Shell options: \$SHELLOPTS"

    sort -k1,1V -k2,2n ${fragBed} | intersectBed -sorted -wa -a stdin -b ${targetBed} -f 0.5 | \\
    awk -F"\t" -v MQ=30 '(\$2 >= 6) && (\$5 >= MQ)' | cut -f 1,2,3,6 > filtered_file.bed
    """
}

workflow {
    fragmentfilter(file(params.fragBed), file(params.targetBed))
}

And here is my nextflow.config:

params {
    pubdir = "placeholder for absolute path to results"
    fragBed = "placeholder for absolute path to frag file"
    targetBed = "placeholder for absolute path to bedfile file"
}

docker {
    enabled = true
}

process.shell = ['/bin/bash', '-euo', 'pipefail']
process.container = 'fragment_env:latest'

Problem Description
In the fragmentfilter process, I have a command chain involving sort, intersectBed, awk, and cut. When I run this pipeline, it fails due to an out-of-memory (OOM) error in the first command, sort. Through debugging, I found that sort alone requires around 30 GB of memory to handle the input file (${fragBed}). The issue is that Nextflow doesn’t seem to capture the exit status 137 from the OOM failure within the piped command. As a result, the errorStrategy retry mechanism does not trigger, and Nextflow doesn’t retry the process with increased memory.
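
For reference, exit status 137 is 128 + 9, i.e. the process was killed with SIGKILL, which is what the kernel's OOM killer sends. A quick way to reproduce that status in plain bash, outside of Nextflow:

sleep 60 &
kill -9 $!
wait $!
echo "exit status: $?"    # prints 137 (128 + SIGKILL)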

According to previous help in the community, I added process.shell = ['/bin/bash', '-euo', 'pipefail'] in nextflow.config to ensure that Nextflow captures the first failed command’s exit status in a pipeline. Despite this setting, Nextflow doesn’t appear to recognize the 137 exit code, and the process does not automatically retry with increased memory.
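
For reference, this is the pipefail behaviour I was counting on, shown on a trivial pipeline outside Nextflow:

bash -c 'false | true; echo "without pipefail: $?"'
bash -c 'set -o pipefail; false | true; echo "with pipefail: $?"'

The first prints 0 (only the last command in the pipe determines the status), while the second prints 1 (the rightmost non-zero status in the pipeline is propagated).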
Observations and Debugging

When I run the sort command separately, it fails with exit code 137 (confirming an OOM error).
However, within the pipeline, the failure exit code is not properly captured.
I have set the retry strategy as follows to handle OOM errors:
errorStrategy { task.exitStatus in 137..141 ? 'retry' : 'terminate' }
My expectation is that, with pipefail enabled, Nextflow should detect the 137 exit status and automatically retry the process with double the memory (as defined in the memory directive).

Question: Why does the pipeline not capture the 137 exit status within the piped command sequence, and what adjustments should I make to ensure that Nextflow retries the process with increased memory upon an OOM error?
Thank you in advance for any guidance!

Hi, I spent a few hours today trying to solve the same issue. I managed to do it, but it's a bit of a hack (I hope bentsherman or pditommaso won't mind).

Create a file, e.g., fix_pipefail.sh, with the following code:

awk 'NR==1 {print} NR==2 {printf "%s || exit 137\n", $0} NR>2 {print}' $PWD/.command.sh > $PWD/.command.sh.tmp
mv $PWD/.command.sh.tmp $PWD/.command.sh

Add a label to the processes that involve piping:

label "pipe"

Then, in nextflow.config, add the following:

process {
    withLabel: pipe {
        beforeScript = "sh ${params.home}/fix_pipefail.sh"
    }
}

(where params.home holds the path to your scripts).

You can also refer to the Nextflow documentation: Nextflow Docs - beforeScript.

Explanation

This hack appends || exit 137 to the pipeline commands. If any of the pipes fail, they will return an error code 137, simulating an OOM (Out of Memory) kill.
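
For illustration, assuming a .command.sh whose first line is the shebang and whose second line holds the whole pipeline (the file names below are just placeholders), the rewrite turns

#!/bin/bash -euo pipefail
sort big.bed | cut -f1,2,3 > filtered.bed

into

#!/bin/bash -euo pipefail
sort big.bed | cut -f1,2,3 > filtered.bed || exit 137

so the task script exits with status 137 whenever any part of the pipeline fails (with pipefail enabled), and the errorStrategy retry can then kick in.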

Thank you, @xsvato01, for this workaround. I've been struggling with this for so many days. I will give it a try.

You said you added this modification:

process.shell = ['/bin/bash', '-euo', 'pipefail']

But that does not only ensure that you get the error when a pipeline member fails. It also ensures that any unbound variable or any other error stops the script. So chances are that you are never reaching the problematic command because the script fails earlier. Why don't you try enabling only the option that you are interested in?
process.shell = ['/bin/bash', '-o', 'pipefail']

Alternatively, make sure you are testing your entire script with these settings, so that you know the script actually reaches the command that causes the OOM.
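
To illustrate the point (TYPO_VAR and huge.bed are just made-up placeholders):

bash -euo pipefail -c 'echo "$TYPO_VAR"; sort huge.bed | cut -f1'
bash -o pipefail -c 'false | true'; echo $?

The first invocation stops at the unbound variable, so the sort pipeline never runs and the exit status has nothing to do with memory. The second runs the pipeline and propagates its failing status (1), which is the behaviour you need for the 137 from an OOM-killed sort to reach Nextflow.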