Dear developers,
I have two data types, WES and RNA. Per sample, the WES (DNA) data has both tumor and normal, while the RNA data has tumor only.
I run the WES pipeline without distinguishing tumor from normal for as long as possible; at later stages I branch the channel and join as needed.
To run MultiQC per sample, the WES tumor and WES normal fastp outputs need to be joined, and similarly the applybqsr files.
There is RNA data as well; however, not all WES samples have RNA.
I have multiple channels: fastp_wes_tumor, fastp_wes_normal, fastp_tumor_rna, wes_applybqsr_normal, wes_applybqsr_tumor, and so on.
I'd like to run MultiQC on all samples, whether or not RNA data is available.
I have the following code:
ch_fastp_normal = Channel.of(
    [ [batch:'SEMA-MM-001', timepoint:'MM-0486-T-01', tissue:'normal', sequencing_type:'wes'],
      tuple( file('normal_wes_fastq1.gz'), file('normal_wes_fastq2.gz') ), file('1_normal.html'), file('1_normal.json') ],
    [ [batch:'SEMA-MM-001', timepoint:'MM-0487-T-01', tissue:'normal', sequencing_type:'wes'],
      tuple( file('normal_wes_fastq1.gz'), file('normal_wes_fastq2.gz') ), file('1_normal.html'), file('1_normal.json') ]
)
ch_fastp_tumor = Channel.of(
    [ [batch:'SEMA-MM-001', timepoint:'MM-0486-T-01', tissue:'tumor', sequencing_type:'wes'],
      tuple( file('tumor_wes_fastq1.gz'), file('tumor_wes_fastq2.gz') ), file('1_tumor.html'), file('1_tumor.json') ],
    [ [batch:'SEMA-MM-001', timepoint:'MM-0487-T-01', tissue:'tumor', sequencing_type:'wes'],
      tuple( file('tumor_wes_fastq1.gz'), file('tumor_wes_fastq2.gz') ), file('1_tumor.html'), file('1_tumor.json') ]
)
ch_rna_fastp_tumor = Channel.of(
    [ [batch:'SEMA-MM-001', timepoint:'MM-0486-T-01', tissue:'tumor', sequencing_type:'rna'],
      tuple( file('rna_fastq1.gz'), file('rna_fastq2.gz') ), file('1_rna.html'), file('1_rna.json') ]
)
ch_fastp_normal
    .map { meta, reads_fq, html, json -> [ "${meta.timepoint}", meta, reads_fq, html, json ] }
    .join( ch_fastp_tumor.map { meta3, reads_fq3, html3, json3 -> [ "${meta3.timepoint}", meta3, reads_fq3, html3, json3 ] } )
    .join( ch_rna_fastp_tumor.map { meta2, reads_fq2, html2, json2 -> [ "${meta2.timepoint}", meta2, reads_fq2, html2, json2 ] }, failOnMismatch: false, failOnDuplicate: true )
    .view()
ch_fastp_normal
    .map { meta, reads_fq, html, json -> [ "${meta.timepoint}", meta, reads_fq, html, json ] }
    .join( ch_fastp_tumor.map { meta3, reads_fq3, html3, json3 -> [ "${meta3.timepoint}", meta3, reads_fq3, html3, json3 ] } )
    .join( ch_rna_fastp_tumor.map { meta2, reads_fq2, html2, json2 -> [ "${meta2.timepoint}", meta2, reads_fq2, html2, json2 ] }, failOnMismatch: false, failOnDuplicate: true )
    .groupTuple()
    .multiMap { pid, meta1, reads_fq, html, json, meta3, reads_fq3, html3, json3, meta2, reads_fq2, html2, json2 ->
        normal_fastp: [ meta1[0], reads_fq, html, json ]
        tumor_fastp: [ meta3[0], reads_fq3, html3, json3, meta2 ]
        rna_fastp: [ meta2[0], reads_fq2, html2, json2 ]
    }
    .set { ch_merged_fastp }
ch_merged_fastp.normal_fastp.view()
The join runs, but sample MM-0487-T-01, which has no RNA data, is left out of the result.
My MultiQC command would look like this:
multiqc sample_name/RNA/primary/fastp/tumor/ \
    sample_name/RNA/primary/star_align/ \
    sample_name/WES/primary/fastp/tumor/ \
    sample_name/WES/primary/fastp/normal/ \
    sample_name/WES/primary/applybqsr/tumor/ \
    sample_name/WES/primary/applybqsr/normal/ \
    --zip-data-dir --config config_multiqc.yaml --force
I do not know how to write code that performs a full outer join and includes all samples, or alternatively how to write a conditional in Nextflow that checks whether RNA data is unavailable for a sample and, if so, excludes that input.
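For illustration, here is a minimal, untested sketch of one direction that might work. It assumes the `remainder: true` option of the `join` operator, which emits unmatched items with `null` in place of the missing side. It reuses the three channels defined earlier in this post. The channel name `ch_multiqc_input` and the map keys `meta`, `reads`, `html` and `json` are hypothetical names introduced only for this example. It also assumes every timepoint has both WES normal and WES tumor, so that only the RNA side can be missing.

```nextflow
// Sketch only: keep every WES sample and attach RNA reports when they exist.
// Each item is keyed by timepoint and the rest is packed into ONE map, so a
// missing RNA entry appears as a single null after the join.
ch_fastp_normal
    .map { meta, reads, html, json -> [ meta.timepoint, [ meta: meta, reads: reads, html: html, json: json ] ] }
    // inner join: WES normal and WES tumor are both assumed to be present
    .join( ch_fastp_tumor.map { meta, reads, html, json -> [ meta.timepoint, [ meta: meta, reads: reads, html: html, json: json ] ] } )
    // remainder: true keeps timepoints that have no RNA item (rna becomes null)
    .join( ch_rna_fastp_tumor.map { meta, reads, html, json -> [ meta.timepoint, [ meta: meta, reads: reads, html: html, json: json ] ] }, remainder: true )
    .map { timepoint, normal, tumor, rna ->
        def qc_files = [ normal.html, normal.json, tumor.html, tumor.json ]
        if( rna ) {
            // add the RNA reports only when RNA data exists for this timepoint
            qc_files += [ rna.html, rna.json ]
        }
        [ timepoint, qc_files ]
    }
    .set { ch_multiqc_input }   // hypothetical channel name

ch_multiqc_input.view()
```

Packing each payload into a single map is a deliberate choice here: with multi-element tuples, the position of the `null` placeholder is harder to reason about in a fixed-arity closure. If a timepoint could also lack WES normal or WES tumor, the first join would need `remainder: true` as well, plus extra null checks.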