How to do a full outer join to run MultiQC per sample?

Dear developers,

I have two data types, WES and RNA. Each sample has tumor and normal WES data, while RNA is tumor only.
I run the WES pipeline without distinguishing tumor from normal for as long as possible; at later stages I branch the channels and join them as needed.

To run MultiQC per sample, the WES tumor and WES normal fastp outputs need to be joined, and likewise the applybqsr files.
There are RNA data as well; however, not all WES samples have RNA.

I have multiple channels: fastp_wes_tumor, fastp_wes_normal, fastp_tumor_rna, wes_applybqsr_normal, wes_applybqsr_tumor, and so on.

I’d like to run MultiQC on all samples, whether or not RNA data are available.

I have the following code:

  ch_fastp_normal = Channel.of(
    [ [batch:'SEMA-MM-001', timepoint:'MM-0486-T-01', tissue:'normal', sequencing_type:'wes'], 
    tuple( file('normal_wes_fastq1.gz'),file('normal_wes_fastq2.gz')),file('1_normal.html'),file('1_normal.json') ],
    [ [batch:'SEMA-MM-001', timepoint:'MM-0487-T-01', tissue:'normal', sequencing_type:'wes'], 
    tuple( file('normal_wes_fastq1.gz'),file('normal_wes_fastq2.gz')),file('1_normal.html'),file('1_normal.json') ]
)

ch_fastp_tumor = Channel.of(
    [ [batch:'SEMA-MM-001', timepoint:'MM-0486-T-01', tissue:'tumor', sequencing_type:'wes'], 
    tuple( file('tumor_wes_fastq1.gz'),file('tumor_wes_fastq2.gz')),file('1_tumor.html'),file('1_tumor.json') ],
    [ [batch:'SEMA-MM-001', timepoint:'MM-0487-T-01', tissue:'tumor', sequencing_type:'wes'], 
    tuple( file('tumor_wes_fastq1.gz'),file('tumor_wes_fastq2.gz')),file('1_tumor.html'),file('1_tumor.json') ]
)

ch_rna_fastp_tumor = Channel.of(
    [ [batch:'SEMA-MM-001', timepoint:'MM-0486-T-01', tissue:'rna', sequencing_type:'wes'], 
    tuple( file('rna_fastq1.gz'),file('rna_fastq2.gz')),file('1_rna.html'),file('1_rna.json') ]
)

  ch_fastp_normal.map{meta,reads_fq,html,json->["${meta.timepoint}", meta,reads_fq,html,json]}
  .join(ch_fastp_tumor.map{meta3,reads_fq3,html3,json3->["${meta3.timepoint}", meta3,reads_fq3,html3,json3]})
  .join(ch_rna_fastp_tumor.map{meta2,reads_fq2,html2,json2->["${meta2.timepoint}", meta2,reads_fq2,html2,json2]}, failOnMismatch:false, failOnDuplicate:true)
  .view()

  ch_fastp_normal.map{meta,reads_fq,html,json->["${meta.timepoint}", meta,reads_fq,html,json]}
  .join(ch_fastp_tumor.map{meta3,reads_fq3,html3,json3->["${meta3.timepoint}", meta3,reads_fq3,html3,json3]})
  .join(ch_rna_fastp_tumor.map{meta2,reads_fq2,html2,json2->["${meta2.timepoint}", meta2,reads_fq2,html2,json2]}, failOnMismatch:false, failOnDuplicate:true)
  .groupTuple()
  .multiMap { pid, meta1, reads_fq, html, json, meta3, reads_fq3, html3, json3, meta2, reads_fq2, html2, json2 ->
    normal_fastp: [ meta1[0], reads_fq, html, json ]
    tumor_fastp: [ meta3[0], reads_fq3, html3, json3, meta2 ]
    rna_fastp: [ meta2[0], reads_fq2, html2, json2 ]
  }
  .set { ch_merged_fastp }
  ch_merged_fastp.normal_fastp.view()

The join works, but sample MM-0487-T-01, which has no RNA data, is left out.

My MultiQC command would be:

multiqc sample_name/RNA/primary/fastp/tumor/ \
  sample_name/RNA/primary/star_align/ \
  sample_name/WES/primary/fastp/tumor/ \
  sample_name/WES/primary/fastp/normal/ \
  sample_name/WES/primary/applybqsr/tumor/ \
  sample_name/WES/primary/applybqsr/normal/ \
  --zip-data-dir --config config_multiqc.yaml --force

I do not know how to write code that performs a full outer join and covers all samples. Alternatively, how can I write conditional code in Nextflow that checks whether RNA data are available and, if they are not, excludes the RNA inputs?

Have you tried the join operator with the remainder option set to true? See the Operators page of the Nextflow documentation.
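As a minimal sketch of what remainder does (the channel names and file names here are made up for illustration and are not taken from your pipeline): with remainder: true, a key that has no partner in the other channel is still emitted, and null stands in for the missing side.

```groovy
workflow {
    // Hypothetical WES channel: [ sample key, normal report, tumor report ]
    ch_wes = Channel.of(
        [ 'MM-0486-T-01', 'normal_86.json', 'tumor_86.json' ],
        [ 'MM-0487-T-01', 'normal_87.json', 'tumor_87.json' ]
    )

    // Hypothetical RNA channel: only one of the two samples has RNA
    ch_rna = Channel.of(
        [ 'MM-0486-T-01', 'rna_86.json' ]
    )

    // A plain join would drop MM-0487-T-01; remainder: true keeps it
    // and puts null where the RNA value would otherwise be
    ch_wes
        .join(ch_rna, remainder: true)
        .view()
}
```

Downstream code then has to cope with that null, for example by filtering it out before the files are handed to MultiQC.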

@jmichaelegana Great, it works.
However, now I’m stuck at the multiMap step.

I’m unable to unpack such a large joined tuple in multiMap.

ch_fastp_normal.map{meta,reads_fq,html,json->["${meta.timepoint}", meta,reads_fq,html,json]}
  .join(ch_fastp_tumor.map{meta3,reads_fq3,html3,json3->["${meta3.timepoint}", meta3,reads_fq3,html3,json3]})
  .join(ch_rna_fastp_tumor.map{meta2,reads_fq2,html2,json2->["${meta2.timepoint}", meta2,reads_fq2,html2,json2]}, failOnMismatch:false, failOnDuplicate:true,remainder: true)
  .groupTuple()
  .view()

[MM-0486-T-01, [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:normal, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.html], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.json], [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:tumor, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.html], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.json], [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:rna, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/rna_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/rna_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_rna.html], [/mnt/data1/users/nextflow/learn_nextflow/1_rna.json]]

[MM-0487-T-01, [[batch:SEMA-MM-001, timepoint:MM-0487-T-01, tissue:normal, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.html], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.json], [[batch:SEMA-MM-001, timepoint:MM-0487-T-01, tissue:tumor, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.html], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.json], [null]]

I cannot get this into multiMap:

ch_fastp_normal.map{meta,reads_fq,html,json->["${meta.timepoint}", meta,reads_fq,html,json]}
  .join(ch_fastp_tumor.map{meta3,reads_fq3,html3,json3->["${meta3.timepoint}", meta3,reads_fq3,html3,json3]})
  .join(ch_rna_fastp_tumor.map{meta2,reads_fq2,html2,json2->["${meta2.timepoint}", meta2,reads_fq2,html2,json2]}, failOnMismatch:false, failOnDuplicate:true,remainder: true)
  .groupTuple()
  
  .multiMap { pid, meta, reads_fq,html,json->
   normal_fastp: [ meta[0], reads_fq,html,json, meta[0], tumor_wes_fastq, tumor_html,tumor_json] 
  }.set { ch_merged_fastp }
  

I get this error:

ERROR ~ Invalid method invocation doCall with arguments: [MM-0486-T-01, [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:normal, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/normal_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.html], [/mnt/data1/users/nextflow/learn_nextflow/1_normal.json], [[batch:SEMA-MM-001, timepoint:MM-0486-T-01, tissue:tumor, sequencing_type:wes]], [[/mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq1.gz, /mnt/data1/users/nextflow/learn_nextflow/tumor_wes_fastq2.gz]], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.html], [/mnt/data1/users/nextflow/learn_nextflow/1_tumor.json]] (java.util.ArrayList) on _closure7 type

Thank you again for your response.
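The error most likely comes from the closure arity: the multiMap closure declares five parameters, while each joined item carries more elements, and the count differs between samples with and without RNA (the missing RNA side is a single null, as in the view output above). One possible way around this, sketched under simplifying assumptions, is to wrap each channel's payload in one nested list before joining, so that every joined item has the shape [ key, normal, tumor, rna ]. The channels below are hypothetical stand-ins that carry only [ meta, json ]; the names keyed_normal, keyed_tumor, keyed_rna, ch_joined and ch_reports are illustrative and not from the original code. groupTuple is left out on the assumption that each key occurs once after the join.

```groovy
workflow {
    // Hypothetical, simplified stand-ins for the fastp channels: [ meta, json ]
    ch_normal = Channel.of(
        [ [timepoint: 'MM-0486-T-01', tissue: 'normal'], 'n86.json' ],
        [ [timepoint: 'MM-0487-T-01', tissue: 'normal'], 'n87.json' ]
    )
    ch_tumor = Channel.of(
        [ [timepoint: 'MM-0486-T-01', tissue: 'tumor'], 't86.json' ],
        [ [timepoint: 'MM-0487-T-01', tissue: 'tumor'], 't87.json' ]
    )
    ch_rna = Channel.of(
        [ [timepoint: 'MM-0486-T-01', tissue: 'rna'], 'r86.json' ]
    )

    // Key by timepoint and keep the payload as ONE nested list per channel,
    // so a joined item always has four elements
    keyed_normal = ch_normal.map { meta, json -> [ meta.timepoint, [ meta, json ] ] }
    keyed_tumor  = ch_tumor.map  { meta, json -> [ meta.timepoint, [ meta, json ] ] }
    keyed_rna    = ch_rna.map    { meta, json -> [ meta.timepoint, [ meta, json ] ] }

    // Inner join for normal/tumor, remainder join for the optional RNA;
    // a missing RNA payload (null) is replaced by an empty list
    ch_joined = keyed_normal
        .join(keyed_tumor)
        .join(keyed_rna, remainder: true)
        .map { pid, normal, tumor, rna -> [ pid, normal, tumor, rna ?: [] ] }

    // The closure now always receives exactly four values
    ch_joined
        .multiMap { pid, normal, tumor, rna ->
            normal_fastp: normal
            tumor_fastp: tumor
            rna_fastp: rna
        }
        .set { ch_merged_fastp }

    ch_merged_fastp.normal_fastp.view { row -> "normal: ${row}" }
    ch_merged_fastp.rna_fastp.view { row -> "rna: ${row}" }

    // Per-sample report list for MultiQC, skipping the RNA slot when it is empty
    ch_reports = ch_joined.map { pid, normal, tumor, rna ->
        def reports = [ normal, tumor, rna ]
            .findAll { part -> !part.isEmpty() }
            .collect { part -> part[1] }
        [ pid, reports ]
    }
    ch_reports.view { row -> "reports: ${row}" }
}
```

With the real channels, the nested payload would be [ meta, reads, html, json ] instead of [ meta, json ]. A channel like ch_reports could then feed a MultiQC process that takes a list of files per sample, so the RNA inputs drop out when they do not exist.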