Increasing throughput of Nextflow in GCP

I have been running epi2me/wf-clone-validation in Google Cloud Batch, but I've been observing a linear relationship between the number of samples and the total runtime. My assumption was that running in GCP would allow for full horizontal scaling, and I'd like to confirm whether that's possible.

I’m launching the pipeline from a Google Cloud Run Job with the following resources:

  • 8 vCPU
  • 32 GB RAM

Running a single sample takes around an hour, whereas running 43 samples takes over 5 hours (I haven't let it run long enough to get an accurate duration). There is clearly some parallelism, but given that the samples are independent, I would expect to be able to process them all concurrently.

I tried to implement my own parallel processing using Python's subprocess module, but the pipeline detected that another run already held the lock file, so some samples ended up not being processed.

I also tried the --threads parameter, setting it to 8 to match my vCPU count, but this didn't improve throughput.

Is my ability to scale limited by the resource constraints of my Cloud Run Job? Or is there something else I'm missing that would let me fully horizontally scale this workflow and run 43 samples in roughly the same time as one?

N.B.: I'm also trying to achieve the same thing with epi2me/wf-amplicon.

Welcome to the community, Joe!

Can you provide some additional details about your setup? What's the exact CLI command you're running?

In the interim, I'm going to guess that you might be running Nextflow with the default local executor, where it only has access to the 8 vCPUs and 32 GB of RAM of the machine you launched it on. It's definitely possible to do "full horizontal scaling" on Google Cloud, but it requires some additional configuration.
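
A quick way to confirm is the executor line near the top of the run's console output (it's also captured in the .nextflow.log file). With the local executor it looks something like this (the number in brackets is just the task count):

```
executor >  local (43)
```

If that's what you're seeing, all 43 samples are competing for the launcher's 8 vCPUs, which would explain the roughly linear runtime.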

Switching to the Google Batch executor involves adding some credentials (so Nextflow can start instances to work on tasks) and configuring a few additional details, such as the project. Everything should be covered here.
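
As a rough sketch (the project ID, region, and bucket below are placeholders, so swap in your own), the relevant configuration looks something like this:

```groovy
// e.g. in nextflow.config, or a separate file passed with -c
process.executor = 'google-batch'           // submit each task as its own Google Cloud Batch job
workDir          = 'gs://my-bucket/nf-work' // work dir must be a GCS bucket your credentials can write to

google {
    project  = 'my-gcp-project'             // placeholder: your GCP project ID
    location = 'us-central1'                // placeholder: region where Batch jobs run
}
```

Credentials are usually picked up from the environment (for example the application default credentials of the service account attached to your Cloud Run Job, or GOOGLE_APPLICATION_CREDENTIALS pointing at a key file). With that in place, each task runs in its own Batch job, so the 43 samples can genuinely run side by side instead of queuing on the launcher's 8 vCPUs.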

Let us know if you need any extra help!