Hello, this is my second question this week, but I'm truly lost here.
I'm trying to run a piece of software on AWS Batch that requires a few reference directories / data packs. We've been advised to store these on S3 and have Nextflow stream them to the EC2 instances, instead of copying them manually inside the process with something like "aws s3 cp s3://ref_dir ./".
I can't work out the best way to feed these into the processes. For one of the tools we did something like the channel below.
Channel.fromPath('s3://bucket/samplesheets/samples.csv')
    .splitCsv(header: true)
    .map { row ->
        def meta = [
            sample_id: row.sample_id,
            condition: row.condition
        ]
        // emit this metadata plus the S3 paths
        def s3_R1_path = "s3://bucket/fastqs/${row.sample_id}_R1_001.fastq.gz"
        def s3_R2_path = "s3://bucket/fastqs/${row.sample_id}_R2_001.fastq.gz"
        def ref_dir = "s3://bucket/hash/"
        // pcgr_dir is another S3 data pack path, defined elsewhere in the script (not shown)
        tuple(meta, file(s3_R1_path), file(s3_R2_path), file(ref_dir), file(pcgr_dir))
    }
    .set { sample_channel }
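For context, the process consuming that channel looks roughly like this; the process name, tool name, and flags here are placeholders, not the real command:

process RUN_TOOL {
    input:
    tuple val(meta), path(r1), path(r2), path(ref_dir), path(pcgr_dir)

    script:
    """
    some_tool --fastq1 ${r1} --fastq2 ${r2} --ref_dir ${ref_dir} --pcgr_dir ${pcgr_dir}
    """
}

// in the workflow block: RUN_TOOL(sample_channel)

Declaring everything as path inputs like this is what gets Nextflow to stage the S3 files/dirs into the task work dir, so this does work.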
I was striving for something where I could just use params, but I've seen that this isn't their intended use, and Nextflow won't stream the data if it isn't declared as an input.
Is attaching the reference dirs to each sample, like in my first channel, the best way to do this?
And what if I'm combining a few data packs used across several different samples: how do I reference one version out of multiple? (I guess I could put that in the samplesheet.)
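To illustrate what I mean by putting it in the samplesheet, I'm picturing something like this, with a made-up ref_version column and made-up version paths:

def ref_versions = [
    v1: 's3://bucket/refdata/v1/',
    v2: 's3://bucket/refdata/v2/'
]

Channel.fromPath('s3://bucket/samplesheets/samples.csv')
    .splitCsv(header: true)
    .map { row ->
        def meta = [sample_id: row.sample_id]
        // look up the data pack version listed for this sample
        tuple(meta, file(ref_versions[row.ref_version]))
    }
    .set { versioned_sample_channel }

For the tool I'm currently stuck on, this is what I tried first, just referencing params in the script: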
process ANNOTATE_VCF {
    input:
    tuple val(meta), path(raw_vcf)

    script:
    """
    software ... --ref_dir ${params.ref_dir}
    """
}
I tried this too and had no luck:
process ANNOTATE_VCF {
    input:
    tuple val(meta), path(raw_vcf), path('s3://bucket/new_dir/ref_dir')

    script:
    """
    software ... --ref_dir ./new_dir/ref_dir
    """
}
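Is the intended pattern something more like the below instead, where the ref dir is declared as its own path input and passed in explicitly? (vcf_channel is just a stand-in for my real channel of meta + VCF tuples, and the S3 path is the same one as above.)

process ANNOTATE_VCF {
    input:
    tuple val(meta), path(raw_vcf)
    path ref_dir

    script:
    """
    software ... --ref_dir ${ref_dir}
    """
}

workflow {
    // params.ref_dir would be e.g. 's3://bucket/new_dir/ref_dir'
    ANNOTATE_VCF(vcf_channel, file(params.ref_dir))
}

From what I've read, passing the file object like that gives a value channel that is reused for every task, and Nextflow should stage the S3 dir itself, but I'm not sure whether that's the idiomatic way to handle shared data packs.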