aws s3 cp with Fusion terminating multipart download early with v24.04.3

We have the following process:

process download_star_index {
  tag "Download STAR index"

  output:
  path("star"), emit: star_index

  script:
  """
  mkdir star
  aws s3 cp --no-sign-request ${params.star_index} ./star --recursive
  """
}

Where params.star_index is an S3 path that is publicly accessible (read-only).
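For context, the parameter looks something like this (the exact key below is hypothetical; any prefix under our public bucket behaves the same):

params {
  // hypothetical example key under our public read-only bucket
  star_index = 's3://csgx.public.readonly/references/star_index'
}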

In main.nf we pick up the star_index as:

if (params.star_index.startsWith("s3://csgx.public.readonly")) {
    star_index = download_star_index()
} else {
    star_index = file(params.star_index)
}

We use this process to obtain the STAR index when it is hosted in our public read-only bucket, because some of our users cannot access public resources using their configured AWS credentials. (We tried setting aws.client.anonymous = true in the configuration, but this caused an issue of its own, as the user credentials are still required for other, non-public S3 objects.)
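For reference, this is a minimal sketch of the configuration we tried before abandoning that approach:

aws {
  client {
    // global setting: with anonymous access, the user credentials are not
    // used, so private S3 objects become unreachable
    anonymous = true
  }
}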

This process runs fine when launching the pipeline from Seqera Platform using Nextflow v23.10.1 and Wave/Fusion. However, after updating to v24.04.3, this process has started causing issues (only when launched from Platform using Wave/Fusion; it runs fine without them). download_star_index completes, but when our star process (which makes use of star_index) is reached, it throws an error like:

EXITING because of FATAL ERROR: could not open genome file: star//SA
SOLUTION: check that the path to genome files, specified in --genomeDir is correct and the files are present, and have user read permissions
Jul 29 12:07:15 ...... FATAL ERROR, exiting
    /opt/conda/envs/star_env/bin/STAR-avx2 --runThreadN 8 --genomeDir star --readFilesIn Sample2.polyAtrimmed.fastq.gz --outFileNamePrefix Sample2_ --outReadsUnmapped Fastx --outSAMtype BAM Unsorted --readFilesCommand zcat --outSAMattributes Standard --outFilterMultimapNmax 1000
    STAR version: 2.7.11b   compiled: 2024-03-19T08:38:59+0000 :/opt/conda/conda-bld/star_1710837244939/work/source
Jul 29 12:07:15 ..... started STAR run
Jul 29 12:07:15 ..... loading genome
12:07PM INF shutdown filesystem start
12:07PM INF shutdown filesystem done

Upon closer inspection, the issue is with the download_star_index process: it completes before the STAR genome download has actually finished, so the Genome and SA files are not staged for the star process. Hence the error. When using a much smaller STAR index, the error doesn't occur (presumably because aws s3 cp doesn't have to make use of a multipart transfer).
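As a diagnostic, the race can be made more visible by asserting on the expected files at the end of the script block; a sketch (Genome, SA and SAindex are the standard STAR index members; note the check may still pass if partially written files look complete from inside the task):

  script:
  """
  mkdir star
  aws s3 cp --no-sign-request ${params.star_index} ./star --recursive
  # fail fast if the core index files are missing or empty
  for f in Genome SA SAindex; do
    [ -s "star/\$f" ] || { echo "star/\$f missing after download" >&2; exit 1; }
  done
  """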

After some time, I found this post, where it was suggested to set scratch = true for the process in question, i.e.:

withName: 'download_star_index' {
    cpus    = { check_max( 1, 'cpus' ) }
    memory  = { check_max( 4.GB * task.attempt, 'memory' ) }
    scratch = true
}
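Equivalently, scratch can be set as a directive in the process definition itself, which keeps the workaround next to the code it affects:

process download_star_index {
  tag "Download STAR index"
  scratch true  // run in a node-local scratch dir; outputs are copied back on completion

  output:
  path("star"), emit: star_index

  script:
  """
  mkdir star
  aws s3 cp --no-sign-request ${params.star_index} ./star --recursive
  """
}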

This worked(!), but I have a few questions:

  • Why isn't the process waiting for the download to finish? It doesn't even seem to wait for all the chunks to finish downloading. Is it because the multipart chunks are being renamed? This could be related to this or this issue.
  • Is there a way to apply aws.client.anonymous = true on a resource-by-resource basis, i.e. so we don't need the download_star_index process for the STAR index (or other publicly hosted resources)?
  • Why is this not an issue on v23.10.1?
  • Is setting scratch = true the best solution?