Feature idea: Optimised memory allocation through preemptive adjustment to avoid anticipated failures

I'm a big fan of Nextflow's ability to retry failed processes with an increased memory allocation (e.g. memory { 20.GB * task.attempt }). It's a fantastic feature; however, I've been pondering the possibility of extending it to preemptively adapt processes before they even start.
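For reference, the retry pattern I mean looks roughly like this (a sketch; the process name, tool, and escalation factor are placeholders):

```groovy
process ALIGN {
  // Scale linearly with the attempt: 20 GB, 40 GB, 60 GB, ...
  memory { 20.GB * task.attempt }

  // Resubmit on failure (e.g. an out-of-memory kill), up to 3 attempts
  errorStrategy 'retry'
  maxRetries 3

  input:
  path reads

  script:
  """
  some_aligner ${reads}
  """
}
```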

Consider this scenario: I'm running a process in parallel 2,000 times. Initially, a few samples fail due to out-of-memory errors. After being resubmitted with increased memory, they still fail until they're given three times the original memory allocation. Consequently, the process ends up running a total of 6,000 times, substantially inflating costs, especially on cloud computing platforms where expenses accumulate quickly.

What if Nextflow could identify these initial failed samples and leverage their adjusted memory allocations to preemptively modify any yet-to-start processes? This proactive adjustment could potentially prevent two rounds of failures, significantly saving computational resources and costs.

Admittedly, this approach might not fit all scenarios, especially where sample sizes or types vary significantly within a single run. It would therefore need to be an optional switch, allowing users to decide when to employ this memory adjustment mechanism.

I’m curious to hear what others in the community think of this potential enhancement, or whether someone has somehow already hacked together something similar.

Hi Luke. I think there are different ways of addressing this.

On the Tower side, I think it'd be nice if the resource optimiser (which you can call after the pipeline has completed) could generate rules such as memory { 20.GB * task.attempt } for the next run. At the moment, I think it only extracts the maximum memory usage of each process, e.g. memory { 43.GB }. It could even consider different increases between attempts, like:

memory { task.attempt == 1 ? 4.GB : task.attempt == 2 ? 8.GB : task.attempt == 3 ? 24.GB : 48.GB * (task.attempt - 3) }

I would also request the ability to generate an optimal config from multiple pipeline runs. Not all pipelines run the same process 6,000 times.

If analysing the memory usage distribution works, I would love to see this moved into Nextflow itself and applied dynamically as processes run. memory auto would be so cool!

However, I think linking the memory request to the input size requires some understanding of the input data that may be difficult to bake into Nextflow. I personally think that, as pipeline developers, we can do our bit too. If you're able to test the pipeline on a good test dataset, first of all you can come up with the optimal memory { 20.GB * task.attempt } yourself. But you can also try to figure out how each tool uses memory based on its inputs and code that into the pipeline. For instance, you can add a process at the beginning of your pipeline to extract the size of the input files, and the rule would be something like:

memory { 4.GB * Math.ceil(meta.num_reads / 1000000000) * task.attempt }

i.e. 4 GB for every billion reads, multiplied by the attempt number.
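As a sketch of that first step (COUNT_READS is a hypothetical process name, and any read-counting method would do):

```groovy
// Hypothetical upstream process: count the reads in each sample so the
// count can be merged into the meta map for later memory rules
process COUNT_READS {
  input:
  tuple val(meta), path(reads)

  output:
  tuple val(meta), env 'NUM_READS'

  script:
  """
  # a FASTQ record is 4 lines
  NUM_READS=\$(( \$(zcat ${reads} | wc -l) / 4 ))
  """
}
```

Downstream, you would merge the emitted count back into the meta map (e.g. `meta + [num_reads: n as long]`) before feeding the memory-hungry process.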

I guess we could dream of memory auto [meta.num_reads] to tell Nextflow which parameters to consider?


Hi Luke, we are addressing this mainly through the resource optimization feature in the Seqera Platform. The best practice is to run on a small set of representative samples to create the optimized profile, then use that for all future runs.

(By the way Matthieu, you can make the optimized profile use the task attempt; there should be a toggle in the pipeline settings.)

There are a few improvements I would like to make to it (as the developer of it :wink: ):

  • create an optimized profile from multiple runs
  • predict resource usage based on task inputs like files and metadata
  • update the prediction model during the pipeline run

Now you can already use task inputs like the meta map in the memory directive, and if you give your input files a variable name then you can use them too:

process FOO {
  input:
  path samples

  exec:
  samples.each {
    println "${it.name} ${it.size()}"
  }
}

workflow {
  FOO( file('*.fastq') )
}

So you can certainly try to roll your own heuristics if you can find them.
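For instance, a heuristic based directly on the input file size might look like this (a sketch: some_tool and the scaling constants are placeholders, not recommendations):

```groovy
process FOO {
  input:
  path reads

  // Hypothetical heuristic: roughly 2 GB per GB of (single) input file,
  // plus a 2 GB floor, still multiplied by the attempt as a safety net
  memory { 2.GB * (1 + Math.ceil(reads.size() / 1e9)) * task.attempt }
  errorStrategy 'retry'
  maxRetries 2

  script:
  """
  some_tool ${reads}
  """
}
```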