Hello everyone! I am deploying my pipeline in Azure and have set everything up following these two great guides:
However, I noticed that if I run the typical “hello world” pipeline specifying the pool information as follows, it takes about 14 minutes, while if I remove all the pool provisioning info (autoPoolMode, autoPoolCreation, pools …) it runs in less than 2 minutes. This is a pipeline that takes seconds on a laptop, so I would like to know what exactly is going on when you run the pipeline in Azure (is there any detailed post, video, etc. to understand that?), and why setting up the pools takes significantly longer. Any help is very welcome! Also let me know if there is a more specialized channel for this. Thanks!
When you run Nextflow on your laptop, all of the resources are ready and waiting. The machine is turned on, the CPUs have electricity running through them, and the files sit on the local drive ready to go. The latency is practically zero, but our bottleneck is the machine we are running on. How many CPUs does it have? How many processes can it run in parallel? What happens when it runs out of storage?
When you run Nextflow on a university-style cluster, the machines are up and waiting, and the storage is there, but nothing is quite as local. Files have to be transferred over the network, jobs must sit in a queue, and scratch drives have to be managed. The latency has gone up, but we gain an increase in throughput because we have more machines, which means access to more CPUs, memory and storage.
When you run on the cloud, the machines might not even exist yet! When you start a pipeline, pools of virtual machines need to be created and booted up, and must join the scheduling system. Storage is all remote blob storage, so when a process starts it must download its input files, run the analysis, then re-upload the outputs. So our latency has gone up again, but our throughput is as big as our credit card.
Of course there is some nuance here. We could remove the scheduling system on a university cluster, which would reduce the latency but make running jobs a chaotic free-for-all. On a cloud executor, we could leave some ‘hot’ machines ready to run, which would provide instant response times but cost more than running on our own metal. It’s all a game of tradeoffs, and in my experience you can generally pick two from latency, throughput and cost efficiency.
As a side note, for your specific example you can watch all of this happening in real time using Azure Batch Explorer.
Thanks Adam for the explanation and the link! It is super useful for helping me understand the latency period.
However, I still have one question: why is there a difference in running time when you set up the pools in your config file (autoPoolMode, autoPoolCreation, pools …) versus when you do not set up the pools?
What are the two setups you are trying to compare here? Can you share the configurations and/or Azure Batch pool details?
Based on your example, when you run the pipeline Nextflow will add a pool of 1 node with 8 CPUs to Azure Batch. So you have a lag while the machine is created, before all tasks essentially run in seconds, practically the same as a run on your laptop. After the pipeline is complete, Nextflow deletes the pool.
Without seeing your pool configuration, I’m having to guess a bit for the other option. But I imagine what is happening is that you are assigning work to pre-created pools with zero nodes active. The autoscale formula is evaluated every 5 minutes, so you might have to wait up to 4 minutes 59 seconds before Azure Batch re-evaluates how many nodes it needs and scales them up.
Other possibilities are hitting a quota or an inappropriate autoscale formula, but I can’t say for certain without seeing the details.
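To make the two modes concrete, here is a minimal sketch of each setup in the azure scope of a Nextflow config (the account values and pool name are placeholders I’ve made up, not taken from your setup):

azure {
    batch {
        location = 'westeurope'           // placeholder region
        accountName = '<batch-account>'   // placeholder
        accountKey = '<batch-key>'        // placeholder

        // Option A: Nextflow creates a pool per run and removes it afterwards
        autoPoolMode = true
        allowPoolCreation = true
        deletePoolsOnCompletion = true

        // Option B: submit to a pre-created, autoscaling pool instead
        // (the pool name 'mypool' is hypothetical; tasks are routed to it
        // via the process.queue directive)
        // pools {
        //     mypool {
        //         vmType = 'Standard_E8d_v5'
        //         autoScale = true
        //         vmCount = 1
        //         maxVmCount = 4
        //     }
        // }
    }
}

With option B, a pool that has autoscaled down to zero nodes has to wait for the next evaluation of the scale formula before any machine comes up, which would match the multi-minute lag described above.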
I am trying to compare two runs with two different configs, one including and one not including the following (the rest of the config file is the same in both runs):
Including the pools configuration takes about 14 min, while not including it takes less than 2 min. So in other words, why is a run faster when you do not include any pool parameters?
# Run with pool config
nextflow run hello \
    -c azureWithPools.config \
    -with-report withPoolConfig.html &&

# Run without pool config
nextflow run hello \
    -c azureWithoutPools.config \
    -with-report withoutPoolConfig.html
When I kick off the pipeline with the pool configuration, the following happens:
A pool of 1 node is created in Azure Batch, with a randomly generated name: nf-pool-e16b0b807f538b9c7585c1bf4bc64fe3-Standard_E8d_v5
The machine takes a few minutes to start up.
Once it’s operational, it runs the hello world pipeline quite quickly, because with 8 CPUs it can run each HELLO in parallel.
When I kick off the pipeline without the pool configuration I get… the same, but with a Standard_D4_v3 machine and the default maximum number of nodes per pool.
Timing-wise, the first version took 2 minutes 45 seconds, with under a second of CPU time. The second took 3 minutes 5 seconds, which is a bit longer but nothing excessive.
I’m not sure why you get such a slow result. Do you have a different configuration somewhere which is being used when you run your pipeline?
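One way to check is to print the configuration Nextflow actually resolves, since implicit files such as $HOME/.nextflow/config or a nextflow.config in the launch directory are merged with the file you pass via -c. Assuming the same file names as in your commands above, something like:

# Print the fully resolved configuration for the run
nextflow -c azureWithPools.config config hello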
Hi Adam,
Apologies for getting back to you so late. I was trying to run the small test from my VM in Azure several times, and it did not succeed until I added nextflow.enable.dsl=2 to my config files. This was confusing for me, as I am running Nextflow version 23.10.1 and this is what the manual says:
" For versions of Nextflow prior to 22.10.0 , you must explicitly enable DSL2 by adding nextflow.enable.dsl=2 to the top of the script or by using the -dsl2 command-line option."
This has never happened to me in the pipeline I am building, but I see that I need this statement when running the hello pipeline or the nf-core pipelines. It would be nice to hear if you have any comment on this, but I understand it is another topic.
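For reference, this is the statement in question, placed at the top of the config file (or of the pipeline script itself):

// Explicitly enable DSL2 syntax; according to the docs quoted above this
// should only be needed on Nextflow versions before 22.10.0
nextflow.enable.dsl = 2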
Coming back to the test you proposed, it is certainly as you said: the duration is similar whether or not the config file includes pools. Perhaps without pools it usually takes a bit longer? See the image for the results and my config files shared at the bottom. In summary, with pools it takes almost 5 min and without pools it takes 7 min. I guess the small differences in time can depend on the vmType?
Regarding the differences with my initial example, I could see that in my modified version of “hello” I was calling the “nf-core/rnaseq” container and writing results to files, as we were testing something else. This could probably take a bit more time (especially downloading the rnaseq container in each pool?).
Please let me know if I am wrong in any of my assumptions. I definitely understood a lot from your explanations.
If you enable autoPoolMode but don’t enable deletePoolsOnCompletion, Nextflow will leave the worker pool in the Azure Batch account and not delete it.
When you run Nextflow a second time, it will pick up the existing pool and submit the jobs and tasks to it. If the pool has autoscaled to zero, you will need to wait up to 5 minutes until the autoscale formula is re-evaluated and the nodes are scaled up.
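If you would rather have each run start from a clean slate than reuse a scaled-to-zero pool, a sketch of the relevant settings would be (option names per the nf-azure docs):

azure.batch.autoPoolMode = true
// Delete the auto-created pool when the run finishes, so the next run
// builds a fresh pool instead of waiting for an empty one to scale up
azure.batch.deletePoolsOnCompletion = true

The tradeoff is that every run then pays the pool start-up cost again.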
I recommend downloading Azure Batch Explorer and watching your Nextflow pipelines running in real time.
Thank you very much for the explanation and the tool recommendation. I just downloaded it, and it is very nice to see in real time what is going on with the pipeline and the pools.