aws.batch.maxParallelTransfers Documentation Question

The documentation for aws.batch.maxParallelTransfers states “Max parallel upload/download transfer operations per job”. Can you please confirm my understanding and correct me if I am wrong?

This is describing the number of AWS CLI process executions (not the number of threads) on a worker node for a particular nextflow task.

You’re absolutely right. aws.batch.maxParallelTransfers (and the equivalent google.storage.maxParallelTransfers) controls the maximum number of concurrent aws CLI commands (or gsutil commands for Google Cloud) used for transferring data to and from object storage during Nextflow task execution. It’s important to emphasize that this refers to separate processes, not threads within a single process.

These settings directly influence the nxf_parallel function within Nextflow, which manages the parallel execution of these transfer commands (see the source code here). This function is employed by Nextflow’s internal staging (nxf_stage) and unstaging (nxf_unstage) mechanisms.

For example, consider the following (simplified) staging scenario:

nxf_stage() {
  downloads=( ) # Initialize an empty array
  # Each command is quoted so that it is stored as a single array element
  downloads+=("nxf_s3_download s3://example-bucket/file1.dat file1.dat")
  downloads+=("nxf_s3_download s3://example-bucket/file2.dat file2.dat")
  # ... more files ...
  nxf_parallel "${downloads[@]}"
}

Each element in the downloads array represents a separate aws s3 cp command (wrapped by nxf_s3_download, which you can find here). nxf_parallel executes these downloads concurrently while respecting the maxParallelTransfers limit, so no more than that many aws CLI processes run at the same time. The actual concurrency is further limited by the number of available CPU cores: the effective upper bound is min(cpus, maxParallelTransfers).
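To make the throttling idea concrete, here is a minimal sketch of a bounded parallel runner. It is not the actual nxf_parallel source: run_limited is a hypothetical name, and the polling interval and the fallback to one CPU when nproc is unavailable are assumptions made for illustration.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not Nextflow's nxf_parallel): run each command string
# in the background, keeping at most min(cpus, limit) running at once.
run_limited() {
  local limit=$1; shift
  local cpus
  cpus=$(nproc 2>/dev/null || echo 1)   # assumption: fall back to 1 CPU
  # Effective cap is the smaller of the CPU count and the requested limit
  local max=$(( cpus < limit ? cpus : limit ))
  local cmd
  for cmd in "$@"; do
    # Poll until the number of running background jobs drops below the cap
    while (( $(jobs -rp | wc -l) >= max )); do
      sleep 0.1
    done
    eval "$cmd" &   # each command runs as its own process
  done
  wait              # wait for the remaining commands to finish
}

# Stand-in commands; in the staging script these would be download commands
run_limited 2 "sleep 0.2; echo one" "sleep 0.2; echo two" "sleep 0.2; echo three"
```

With a limit of 2, the third command only starts once one of the first two has finished, which is the same behaviour maxParallelTransfers imposes on the aws CLI processes.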


@robsyme, this is very helpful. I have a follow-up question. I have been reading about an AWS CLI setting called max_concurrent_requests (see "AWS CLI S3 Configuration" in the AWS CLI 1.37.13 Command Reference). If this is set inside the Docker container, how would that impact transfers? It appears that maxParallelTransfers controls the number of AWS CLI processes, while max_concurrent_requests controls the number of threads within each of those processes. Would you agree with that statement, and what are Seqera's recommendations for using max_concurrent_requests in general?

Good question.

No, Nextflow doesn't expose the AWS CLI's max_concurrent_requests setting through its own configuration.
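Since max_concurrent_requests lives in the AWS CLI's own config file, it would have to be set inside the container rather than through Nextflow. A minimal sketch, assuming the default profile is the one in use; the scratch path and the value 4 are placeholders for illustration, and in a real image the file would typically be ~/.aws/config:

```bash
#!/usr/bin/env bash

# Assumption: write the config to a scratch directory and point the AWS CLI
# at it via AWS_CONFIG_FILE, so an existing ~/.aws/config is not overwritten.
config_dir="$(mktemp -d)"
export AWS_CONFIG_FILE="$config_dir/config"

# Nested s3 section of the AWS CLI config file; 4 is an arbitrary example value
cat > "$AWS_CONFIG_FILE" <<'EOF'
[default]
s3 =
  max_concurrent_requests = 4
EOF
```

Each aws s3 cp process can then issue up to that many requests at once (the AWS CLI default is 10), so the total for a task is roughly maxParallelTransfers multiplied by max_concurrent_requests.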

We've found limited utility in parallelizing via threads, given that transfers can already be parallelized at the process level (via the maxParallelTransfers mechanism discussed above). Efficient and robust data provision needed a more complete solution, which is one of the reasons why we developed the Fusion filesystem.