How to do full outer join? run multiqc

complexgenome · May 20, 2024, 10:43pm

Dear developers,

I have two data types, WES and RNA. Per sample, the DNA (WES) data have tumor and normal, while the RNA data have tumor only.
I run the WES pipeline without separating tumor and normal for as long as possible; at later stages I branch the channel and do a join as needed.

To run multiqc per sample, the WES tumor and WES normal fastp outputs have to be joined, and similarly the applybqsr files.
There are RNA data as well, however, not all WES samples have RNA.

I have multiple channels: fastp_wes_tumor, fastp_wes_normal, fastp_tumor_rna, wes_applybqsr_normal, wes_applybqsr_tumor, and so on.

I’d like to run multiqc on all samples, whether or not RNA data are available.

I have the following code:

ch_fastp_normal = Channel.of(
    [ [batch:'SEMA-MM-001', timepoint:'MM-0486-T-01', tissue:'normal', sequencing_type:'wes'], 
    tuple( file('normal_wes_fastq1.gz'),file('normal_wes_fastq2.gz')),file('1_normal.html'),file('1_normal.json') ],
    [ [batch:'SEMA-MM-001', timepoint:'MM-0487-T-01', tissue:'normal', sequencing_type:'wes'], 
    tuple( file('normal_wes_fastq1.gz'),file('normal_wes_fastq2.gz')),file('1_normal.html'),file('1_normal.json') ]
)

ch_fastp_tumor = Channel.of(
    [ [batch:'SEMA-MM-001', timepoint:'MM-0486-T-01', tissue:'tumor', sequencing_type:'wes'], 
    tuple( file('tumor_wes_fastq1.gz'),file('tumor_wes_fastq2.gz')),file('1_tumor.html'),file('1_tumor.json') ],
    [ [batch:'SEMA-MM-001', timepoint:'MM-0487-T-01', tissue:'tumor', sequencing_type:'wes'], 
    tuple( file('tumor_wes_fastq1.gz'),file('tumor_wes_fastq2.gz')),file('1_tumor.html'),file('1_tumor.json') ]
)

ch_rna_fastp_tumor = Channel.of(
    [ [batch:'SEMA-MM-001', timepoint:'MM-0486-T-01', tissue:'rna', sequencing_type:'wes'], 
    tuple( file('rna_fastq1.gz'),file('rna_fastq2.gz')),file('1_rna.html'),file('1_rna.json') ]
)

  ch_fastp_normal.map{meta,reads_fq,html,json->["${meta.timepoint}", meta,reads_fq,html,json]}
  .join(ch_fastp_tumor.map{meta3,reads_fq3,html3,json3->["${meta3.timepoint}", meta3,reads_fq3,html3,json3]})
  .join(ch_rna_fastp_tumor.map{meta2,reads_fq2,html2,json2->["${meta2.timepoint}", meta2,reads_fq2,html2,json2]}, failOnMismatch:false, failOnDuplicate:true)
  .view()

  ch_fastp_normal.map{meta,reads_fq,html,json->["${meta.timepoint}", meta,reads_fq,html,json]}
  .join(ch_fastp_tumor.map{meta3,reads_fq3,html3,json3->["${meta3.timepoint}", meta3,reads_fq3,html3,json3]})
  .join(ch_rna_fastp_tumor.map{meta2,reads_fq2,html2,json2->["${meta2.timepoint}", meta2,reads_fq2,html2,json2]}, failOnMismatch:false, failOnDuplicate:true)
  .groupTuple()
  .multiMap { pid, meta1, reads_fq,html,json, meta3, reads_fq3,html3,json3,meta2, reads_fq2,html2,json2->
    normal_fastp: [ meta1[0], reads_fq,html,json ]
    tumor_fastp: [ meta3[0], reads_fq3,html3,json3,meta2]
    rna_fastp: [ meta2[0], reads_fq2,html2,json2]
  }
  .set { ch_merged_fastp }

  ch_merged_fastp.normal_fastp.view()

The join runs, but sample MM-0487-T-01, which has no RNA data, is left out.

My multiqc command would look like this:

multiqc sample_name/RNA/primary/fastp/tumor/ \
  sample_name/RNA/primary/star_align/ \
  sample_name/WES/primary/fastp/tumor/ \
  sample_name/WES/primary/fastp/normal/ \
  sample_name/WES/primary/applybqsr/tumor/ \
  sample_name/WES/primary/applybqsr/normal/ \
  --zip-data-dir --config config_multiqc.yaml --force

I do not know how to write code that performs a full outer join and covers all samples, or how to write a conditional in Nextflow so that the RNA input is excluded when RNA data are unavailable.
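
One way to picture the full outer join asked about here, as a minimal sketch rather than the pipeline's actual code: if multiqc only needs every available report per sample, each channel could be reshaped to [key, file], combined with mix, and collected with groupTuple. Samples without RNA then simply get a shorter list, with no null placeholder. The channel names and file-name strings below are hypothetical stand-ins.

```nextflow
workflow {
    // Hypothetical toy channels: [timepoint, report]. Strings stand in for real files.
    ch_normal = Channel.of(
        ['MM-0486-T-01', 'normal_0486.json'],
        ['MM-0487-T-01', 'normal_0487.json']
    )
    ch_tumor = Channel.of(
        ['MM-0486-T-01', 'tumor_0486.json'],
        ['MM-0487-T-01', 'tumor_0487.json']
    )
    ch_rna = Channel.of(
        ['MM-0486-T-01', 'rna_0486.json']
    )

    // mix merges all items into one channel; groupTuple then collects
    // the values that share the same first element (the timepoint).
    ch_normal
        .mix(ch_tumor, ch_rna)
        .groupTuple()
        .view()
}
```

With this shape, MM-0486-T-01 should be grouped with three reports and MM-0487-T-01 with two. The order of items inside each group is not guaranteed.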

jmichaelegana · June 2, 2024, 8:09am

Have you tried using the remainder option set to true? See Operators — Nextflow documentation
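
As a minimal sketch of what that option does, assuming toy channels keyed by timepoint (ch_wes, ch_rna and the file-name strings are hypothetical stand-ins, not the channels from the pipeline above): by default join behaves like an inner join and drops keys that have no match, while remainder: true also emits the unmatched keys, with null in place of the missing side.

```nextflow
workflow {
    // Hypothetical toy channels: [timepoint, report].
    ch_wes = Channel.of(
        ['MM-0486-T-01', 'wes_0486.json'],
        ['MM-0487-T-01', 'wes_0487.json']
    )
    ch_rna = Channel.of(
        ['MM-0486-T-01', 'rna_0486.json']
    )

    // Default join: only keys present in both channels are emitted.
    ch_wes.join(ch_rna).view { v -> "inner: $v" }

    // remainder: true: unmatched keys are emitted as well, padded with null.
    ch_wes.join(ch_rna, remainder: true).view { v -> "outer: $v" }
}
```

Emission order is not guaranteed. MM-0487-T-01 should appear only in the second view, with null where the RNA entry would be.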

complexgenome · June 5, 2024, 7:04pm

@jmichaelegana Great, it works.
However, now I’m stuck at the multiMap step.

I’m unable to unpack such a large joined tuple in multiMap.

ch_fastp_normal.map{meta,reads_fq,html,json->["${meta.timepoint}", meta,reads_fq,html,json]}
  .join(ch_fastp_tumor.map{meta3,reads_fq3,html3,json3->["${meta3.timepoint}", meta3,reads_fq3,html3,json3]})
  .join(ch_rna_fastp_tumor.map{meta2,reads_fq2,html2,json2->["${meta2.timepoint}", meta2,reads_fq2,html2,json2]}, failOnMismatch:false, failOnDuplicate:true,remainder: true)
  .groupTuple()
  .view()

[MM-0486-T-01, [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:normal, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.html], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.json], [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:tumor, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.html], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.json], [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:rna, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/rna_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/rna_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_rna.html], [/mnt/data1/users/nextflow/learn_nextflow/1_rna.json]]

[MM-0487-T-01, [[batch:SEMA-MM-001, timepoint:MM-0487-T-01, tissue:normal, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.html], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.json], [[batch:SEMA-MM-001, timepoint:MM-0487-T-01, tissue:tumor, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.html], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.json], [null]]

I cannot get this into multiMap:

ch_fastp_normal.map{meta,reads_fq,html,json->["${meta.timepoint}", meta,reads_fq,html,json]}
  .join(ch_fastp_tumor.map{meta3,reads_fq3,html3,json3->["${meta3.timepoint}", meta3,reads_fq3,html3,json3]})
  .join(ch_rna_fastp_tumor.map{meta2,reads_fq2,html2,json2->["${meta2.timepoint}", meta2,reads_fq2,html2,json2]}, failOnMismatch:false, failOnDuplicate:true,remainder: true)
  .groupTuple()
  .multiMap { pid, meta, reads_fq,html,json->
    normal_fastp: [ meta[0], reads_fq,html,json, meta[0], tumor_wes_fastq, tumor_html,tumor_json]
  }
  .set { ch_merged_fastp }

I get this error:

ERROR ~ Invalid method invocation doCall with arguments: [MM-0486-T-01, [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:normal, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.html], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.json], [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:tumor, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.html], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.json]] (java.util.ArrayList) on _closure7 type

Thank you again for your response.
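
The doCall error above typically means the closure declares a different number of parameters than the tuple it receives: the multiMap closure names five parameters, while the tuple in the message has nine elements. With remainder: true the tuple length also varies between samples, since an unmatched sample carries a single null instead of the RNA fields (compare the two view outputs above). One possible workaround, sketched here with hypothetical toy tuples rather than the real fastp outputs, is to take the whole tuple as a single parameter and index into it.

```nextflow
workflow {
    // Hypothetical joined tuples mimicking the shapes above: a matched key
    // carries two RNA fields, an unmatched key carries a single null.
    ch_joined = Channel.of(
        ['MM-0486-T-01', 'normal_0486.json', 'tumor_0486.json', 'rna_0486.html', 'rna_0486.json'],
        ['MM-0487-T-01', 'normal_0487.json', 'tumor_0487.json', null]
    )

    ch_joined
        .multiMap { row ->
            // A single parameter receives the whole tuple, whatever its length.
            normal: [row[0], row[1]]
            tumor:  [row[0], row[2]]
            // Everything after the tumor field, minus the null placeholder.
            rna:    [row[0], row.drop(3).findAll { f -> f != null }]
        }
        .set { ch_split }

    ch_split.normal.view { v -> "normal: $v" }
    ch_split.tumor.view  { v -> "tumor: $v" }
    ch_split.rna.view    { v -> "rna: $v" }
}
```

In this sketch a sample without RNA ends up with an empty list in the rna output, which a downstream step could filter on.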
