OOMs using Nextflow that don't happen when submitting jobs manually

Hi team,

I previously constructed a Nanopore analysis pipeline in an ad-hoc manner, using a bunch of bash scripts manually submitted to SLURM. I am now trying to make this easier to use and more reproducible by converting it to a Nextflow pipeline.

However, when my scripts are submitted via Nextflow, I am getting OOM crashes even with several times the resource allocation that worked for the individual shell scripts. For example, a basic Nanopore sequencing indexing script that takes FASTQ.GZ input, maps using minimap2, then sorts using samtools sort, works perfectly when submitted as an individual SLURM job with 48 GB and 16 CPUs.
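Roughly, the manual version looks like the following (the reference, input paths, and resource lines are simplified placeholders rather than my exact script):

    #!/bin/bash
    #SBATCH --mem=48G
    #SBATCH --cpus-per-task=16

    # Map Nanopore reads with minimap2 and sort the output with samtools,
    # connected by a pipe (reference and read paths are placeholders).
    minimap2 -ax map-ont -t 16 reference.fa reads.fastq.gz |
        samtools sort -@ 16 -o sorted.bam -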

However, when the equivalent script is submitted via Nextflow, I am consistently getting OOM crashes even after raising the memory to 96 GB (which I have tried with anywhere from 2 to 32 CPUs; nothing makes a difference). I have tried all sorts of reconfigurations, such as changing whether minimap2 and samtools sort are connected by a pipe or by writing to an intermediate file, and changing SLURM submission options such as hyperthreading and node exclusivity. None of it has worked, and I have not seen a single run of this script via Nextflow that hasn't failed due to running out of memory.
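For reference, the Nextflow process is structured along these lines (the process name, variable names, and commented-out options are illustrative rather than my exact code):

    process MAP_AND_SORT {
        cpus 16
        memory '96 GB'
        // among other things I have tried passing SLURM options, e.g.:
        // clusterOptions '--exclusive --hint=nomultithread'

        input:
        path reads
        path reference

        output:
        path 'sorted.bam'

        script:
        """
        minimap2 -ax map-ont -t ${task.cpus} ${reference} ${reads} |
            samtools sort -@ ${task.cpus} -o sorted.bam -
        """
    }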

Do you have any idea why resource requirements would be so incredibly different between a single bash script submitted manually and the same script submitted via Nextflow?

Many thanks,

Evelyn

Have you tried a smaller test job to check that the Nextflow pipeline is running as expected?
I also wonder if you could share which module(s) specifically are crashing with OOM.

Hi, jobs with trivially small input datasets (e.g. pre-filtered) work fine with the exact same code. When inputting a <100 MB FASTQ, everything works as expected, and input files up to around 50 GB also seem to be okay. However, very large inputs of >100 GB give an OOM crash without fail.

I contacted the support team for my HPC cluster and they suspect that something in the Nextflow configuration is causing SAMtools to store its temporary files in memory rather than on disk. When running normally via SLURM (without Nextflow), SAMtools will in some cases produce hundreds of GB of temporary files, then merge and delete them afterwards. If these are being held in memory rather than written to disk, that would explain why I am getting OOM crashes.
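For context, my understanding of samtools sort is that it keeps roughly the -m per-thread limit in memory (768 MiB per thread by default) and spills everything else to temporary BAM chunks, whose location is controlled by -T; without -T they are written next to the output file, or in the current working directory when sorting to stdout. So a manual run effectively does something like this (the scratch path is just a placeholder):

    # Sort an alignment, keeping ~2 GB per thread in memory and writing
    # temporary chunks to an explicit on-disk scratch directory (placeholder path).
    samtools sort -@ 16 -m 2G \
        -T /scratch/somewhere/sort_tmp \
        -o sorted.bam aligned.bam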

If this is the issue, how can I configure Nextflow so that SAMtools writes its temporary files to disk rather than trying to hold everything in memory?

Thanks,
Evelyn