Question about nf-core/nextflow reuse of data

Hi

Thank you for setting up the Seqera platform.
I have a question about nf-core/Nextflow behavior and how Seqera's Fusion deals with it.
nf-core seems to use one task per command, so there are separate samtools sort, samtools index, and samtools stats tasks.
Each of these tasks needs the BAM file as input.

Does this mean that Nextflow spins up multiple instances, and each instance downloads the 5 GB BAM file from S3 every time? Looking at the timeline, it seems so, but that seems inefficient.
The Nextflow docs (Working with files — Nextflow documentation) seem to indicate that Nextflow can download a file once and reuse it, but if each process runs on a different AWS EC2 instance, doesn't that force a re-download?

When I wrote pipelines in WDL, I usually had a single samtools task that ran all the commands, so the BAM file was downloaded only once. Does the Fusion file system avoid downloading the file more than once?

Am I missing something here?

Thank you,

Uri David

It depends on how you write the pipeline.

If you write your pipeline so that each task handles one sample, and every sample needs this 5 GB file stored in an S3 bucket, then each VM instance needs the file and will mount the S3 bucket to access it. Fusion helps here by being smart about the file transfer.
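For reference, Fusion is enabled through the Nextflow configuration. A minimal sketch (the bucket name is a placeholder):

```groovy
// nextflow.config — minimal sketch for enabling Fusion (placeholder values)
fusion.enabled = true                  // access object storage via the Fusion file system
wave.enabled   = true                  // Fusion is delivered through Wave-provisioned containers
workDir        = 's3://my-bucket/work' // placeholder: pipeline work directory on S3
```

With this in place, tasks read S3 objects through Fusion instead of each task staging files with separate download steps.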

One thing you can do, though, is have a single task manage multiple samples. We call that task batching, and you can see an example here.
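A minimal sketch of what batching could look like in Nextflow DSL2 (the process name, bucket path, and batch size are hypothetical, not taken from nf-core):

```groovy
// Hypothetical sketch: one task handles a batch of BAMs, so each
// VM stages its inputs once and runs sort/index/stats locally.
process SAMTOOLS_BATCH {
    input:
    path bams   // a batch of BAM files staged into this task's work dir

    output:
    path '*.stats'

    script:
    """
    for bam in ${bams}; do
        samtools sort -o \${bam%.bam}.sorted.bam \$bam
        samtools index \${bam%.bam}.sorted.bam
        samtools stats \${bam%.bam}.sorted.bam > \${bam%.bam}.stats
    done
    """
}

workflow {
    Channel
        .fromPath('s3://my-bucket/*.bam')   // placeholder bucket
        .buffer(size: 10, remainder: true)  // group samples into batches of 10
        | SAMTOOLS_BATCH
}
```

Because all three samtools commands run inside one task, each BAM is transferred to the instance only once per batch, at the cost of coarser-grained parallelism and caching.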