Advice on best practices for submitting processes that wait for others to complete before starting


I’m seeking advice on the architecture of my Nextflow workflow, which currently works well but is approaching its capacity limit.

The workflow involves calling an internet API that allows approximately 10 concurrent calls per IP address. I’ve implemented a process that handles common rate-limiting errors by sleeping and retrying (up to 3 times), and I’ve found that 10 concurrent jobs is the optimal number to avoid too many jobs waiting on 429 errors.
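For context, the sleep-and-retry handling described above can be sketched as a Nextflow process like the one below. This is a minimal sketch under assumptions: `call_api.sh` is a hypothetical client script that exits non-zero on a rate-limit error, and the 30-second back-off step is illustrative.

```nextflow
process call_api {
    // re-run a failed task (e.g. an HTTP 429) up to 3 more times
    errorStrategy 'retry'
    maxRetries 3

    input:
    path batch

    output:
    path 'result.json'

    script:
    // task.attempt is 1 on the first try, so there is no delay up front
    def delay = (task.attempt - 1) * 30
    """
    sleep ${delay}
    call_api.sh ${batch} > result.json   # hypothetical API client
    """
}
```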

While Nextflow's -resume functionality does a great job of recovering failed batches, I want to ensure that batch sizes stay manageable for recovery (~5,000) and that no more than 10 batches are running at any time.

Specifically, I’m looking for advice on best practices for creating and submitting batches of batches. For example, Batch A should wait for Batch B to finish before being submitted. I’m also open to suggestions for implementing a process pool with a limit of 10, where the next process starts as soon as one completes, until all batches are processed.

Some additional notes:

  • Input and output to the process are file paths.
  • I’m not looking for specific code examples but rather for an architectural pattern.
  • Currently, I’m using a workaround that calls .collect() on the output of the previous process invocation to wait for it to finish before calling an alias of the same process, but I’m sure there’s a better approach.
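The .collect() workaround in the last bullet might look roughly like this (a sketch with hypothetical channel and process names; `call_api` is assumed to be the API-calling process from my workflow):

```nextflow
include { call_api; call_api as call_api_again } from './modules/call_api'

workflow {
    batches_a = Channel.fromPath('batches_a/*.txt')
    batches_b = Channel.fromPath('batches_b/*.txt')

    results_a = call_api(batches_a)

    // .collect() gathers every output into a single value, so the aliased
    // process only starts once all tasks of the first round have completed
    call_api_again(batches_b, results_a.collect())
}
```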

Thanks for any insights you can provide.

Have you thought of using maxForks so that you don’t have more than N tasks of this process calling the API at the same time? There is more info on this process directive in the Nextflow documentation.
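Applied to the scenario above, the directive looks like this (a minimal sketch; the process body and script name are placeholders):

```nextflow
process call_api {
    maxForks 10   // at most 10 tasks of this process run concurrently

    input:
    path batch

    output:
    path 'result.json'

    script:
    """
    call_api.sh ${batch} > result.json   # hypothetical API client
    """
}
```

Nextflow then schedules the next task as soon as one of the 10 running tasks finishes, which is exactly the process-pool behaviour asked about. If the input files still need to be chunked into ~5,000-record batches first, an operator such as `splitText(by: 5000, file: true)` on the input channel is one way to do it.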

Hi @mribeirodantas ,
This is exactly what I need.
A simple and elegant solution.

Thanks so much!

