Hello, this is my second question this week, but I'm truly lost here.
I'm trying to run a piece of software on AWS Batch that requires a few reference directories / data packs. We've been advised to store these on S3 and have Nextflow stream them to the EC2 instances, instead of copying them manually inside the process with something like "aws s3 cp s3://ref_dir ./".
I can't work out the best way to feed these into the processes. For one of the tools we did something like the channel below.
Channel.fromPath('s3://bucket/samplesheets/samples.csv')
    .splitCsv(header: true)
    .map { row ->
        def meta = [
            sample_id: row.sample_id,
            condition: row.condition
        ]
        // emit this metadata plus the S3 paths
        def s3_R1_path = "s3://bucket/fastqs/${row.sample_id}_R1_001.fastq.gz"
        def s3_R2_path = "s3://bucket/fastqs/${row.sample_id}_R2_001.fastq.gz"
        def ref_dir = "s3://bucket/hash/"
        // pcgr_dir is another S3 data pack path, defined elsewhere in the script (not shown)
        tuple(meta, file(s3_R1_path), file(s3_R2_path), file(ref_dir), file(pcgr_dir))
    }
    .set { sample_channel }
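For context, the process consuming that channel looks roughly like this; the process name, tool name, and flags here are placeholders, not the real command:

process RUN_TOOL {
    input:
    tuple val(meta), path(r1), path(r2), path(ref_dir), path(pcgr_dir)

    script:
    """
    some_tool --fastq1 ${r1} --fastq2 ${r2} --ref_dir ${ref_dir} --pcgr_dir ${pcgr_dir}
    """
}

// in the workflow block: RUN_TOOL(sample_channel)

Declaring everything as path inputs like this is what gets Nextflow to stage the S3 files/dirs into the task work dir, so this does work.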
I was striving for something where I could just use params, but I've seen that this isn't their intended use, and Nextflow won't stream the data if it isn't declared as an input.
Is attaching the reference dirs to each sample, like in my first channel, the best way to do this?
And what if I'm combining a few data packs used across several different samples: how do I reference one version out of multiple? (I guess I could put that in the samplesheet.)
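To illustrate what I mean by putting it in the samplesheet, I'm picturing something like this, with a made-up ref_version column and made-up version paths:

def ref_versions = [
    v1: 's3://bucket/refdata/v1/',
    v2: 's3://bucket/refdata/v2/'
]

Channel.fromPath('s3://bucket/samplesheets/samples.csv')
    .splitCsv(header: true)
    .map { row ->
        def meta = [sample_id: row.sample_id]
        // look up the data pack version listed for this sample
        tuple(meta, file(ref_versions[row.ref_version]))
    }
    .set { versioned_sample_channel }

For the tool I'm currently stuck on, this is what I tried first, just referencing params in the script: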
process ANNOTATE_VCF {
    input:
    tuple val(meta), path(raw_vcf)

    script:
    """
    software ... --ref_dir ${params.ref_dir}
    """
}
I tried this too and had no luck:
process ANNOTATE_VCF {
    input:
    tuple val(meta), path(raw_vcf), path('s3://bucket/new_dir/ref_dir')

    script:
    """
    software ... --ref_dir ./new_dir/ref_dir
    """
}
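Is the intended pattern something more like the below instead, where the ref dir is declared as its own path input and passed in explicitly? (vcf_channel is just a stand-in for my real channel of meta + VCF tuples, and the S3 path is the same one as above.)

process ANNOTATE_VCF {
    input:
    tuple val(meta), path(raw_vcf)
    path ref_dir

    script:
    """
    software ... --ref_dir ${ref_dir}
    """
}

workflow {
    // params.ref_dir would be e.g. 's3://bucket/new_dir/ref_dir'
    ANNOTATE_VCF(vcf_channel, file(params.ref_dir))
}

From what I've read, passing the file object like that gives a value channel that is reused for every task, and Nextflow should stage the S3 dir itself, but I'm not sure whether that's the idiomatic way to handle shared data packs.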