We have the following process:
process download_star_index {
    tag "Download STAR index"

    output:
    path("star"), emit: star_index

    script:
    """
    mkdir star
    aws s3 cp --no-sign-request ${params.star_index} ./star --recursive
    """
}
Where params.star_index is an S3 path that is publicly accessible (read only).
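For example, the parameter might be set along these lines (the key prefix below is made up for illustration; only the bucket name matters for the check in main.nf):

params {
    // hypothetical default; users can override this with their own local or S3 index
    star_index = 's3://csgx.public.readonly/references/star'
}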
In main.nf we pick up the star_index as:
if (params.star_index.startsWith("s3://csgx.public.readonly")) {
    star_index = download_star_index()
} else {
    star_index = file(params.star_index)
}
We have this process to obtain the STAR index when it is hosted in our public, read-only bucket, because some of our users cannot access public resources with their configured AWS credentials. (We tried setting aws.client.anonymous = true in the configuration, but that caused its own issue, as the user credentials are still required for other, non-public S3 objects.)
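For reference, this is roughly what that global setting looks like in nextflow.config (a minimal sketch; because it applies to all S3 access, it clashes with the credentialed access we need for private objects):

aws {
    client {
        // allow reading public buckets without credentials (global, not per bucket)
        anonymous = true
    }
}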
This process runs fine when launching the pipeline from Seqera Platform with Nextflow v23.10.1 and Wave/Fusion. However, after updating to v24.04.3, it has started causing issues (only when launched from the Platform with Wave/Fusion; it runs fine without Wave/Fusion). The download_star_index process completes, but when our star process (which makes use of the star_index) is reached, it throws an error like:
EXITING because of FATAL ERROR: could not open genome file: star//SA
SOLUTION: check that the path to genome files, specified in --genomeDir is correct and the files are present, and have user read permissions
Jul 29 12:07:15 ...... FATAL ERROR, exiting
/opt/conda/envs/star_env/bin/STAR-avx2 --runThreadN 8 --genomeDir star --readFilesIn Sample2.polyAtrimmed.fastq.gz --outFileNamePrefix Sample2_ --outReadsUnmapped Fastx --outSAMtype BAM Unsorted --readFilesCommand zcat --outSAMattributes Standard --outFilterMultimapNmax 1000
STAR version: 2.7.11b compiled: 2024-03-19T08:38:59+0000 :/opt/conda/conda-bld/star_1710837244939/work/source
Jul 29 12:07:15 ..... started STAR run
Jul 29 12:07:15 ..... loading genome
12:07PM INF shutdown filesystem start
12:07PM INF shutdown filesystem done
Upon closer inspection, the issue is with the download_star_index process. The process is completing before the STAR genome download has finished, so the Genome and SA files are not staged for the star process, hence the error. When using a much smaller STAR index, the error doesn’t occur (presumably because aws s3 cp doesn’t have to make use of a multipart transfer).
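One way to narrow this down (a sketch, not part of the pipeline; it assumes the index contains the usual Genome, SA and SAindex files, as suggested by the error above) would be to check for those files at the end of the download script, to see whether they are already missing when the task script exits or only go missing during Fusion's output staging:

script:
"""
mkdir star
aws s3 cp --no-sign-request ${params.star_index} ./star --recursive

# Hypothetical sanity check: fail the task if the big index files are not
# visible once aws s3 cp returns (the \$ escapes prevent Groovy interpolation).
for f in Genome SA SAindex; do
    [ -s "star/\${f}" ] || { echo "Missing or empty star/\${f}" >&2; exit 1; }
done
"""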
After some time, I found this post, where it was suggested to set scratch = true for the process in question, i.e.:
withName: 'download_star_index' {
    cpus    = { check_max( 1, 'cpus' ) }
    memory  = { check_max( 4.GB * task.attempt, 'memory' ) }
    scratch = true
}
This worked(!) but I have a couple of questions:
- Why isn’t the process waiting for the download to finish? The process doesn’t even seem to be waiting for all the chunks to finish downloading. Is it because the multipart chunks are being renamed? This could be related to this or this issue.
- Is there a way to apply aws.client.anonymous = true on a resource-by-resource basis, i.e. to avoid needing the download_star_index process for the STAR index (or other publicly hosted resources)?
- Why is this not an issue on v23.10.1?
- Is setting scratch = true the best solution?