Grouping, mapping is shuffling the labels

I did some grouping and mapping:

REMOVE_UNMAPPED_READS.out.bam.mix(CREATEREADNUM.out.readnum).mix(CLIPPER.out.bed)
    .groupTuple(
        by: [0]
    )
    .map { meta,data  ->
         tuple(meta.sample, meta.replicate, meta.type, [bam:data[0],readnum:data[1],bed:data[2]])
    }
   .view()

and then I get what I need: bams, beds and readnums grouped together by sample, replicate and type (signal and background), for example…

[test1, replicate1, signal, [bam:test1_replicate1_signal_mapped.bam, readnum:test1_replicate1_signal.readnum.txt, bed:test1_replicate1_signal.clip.peakClusters.bed]]
[test1, replicate1, background, [bam:test1_replicate1_background_mapped.bam, readnum:test1_replicate1_background.clip.peakClusters.bed, bed:test1_replicate1_background.readnum.txt]]

but then, if I try to group it again - this time only by sample and replicate - for example:

REMOVE_UNMAPPED_READS.out.bam.mix(CREATEREADNUM.out.readnum).mix(CLIPPER.out.bed)
     .groupTuple(
         by: [0]
     )
     .map { meta,data  ->
          tuple(meta.sample, meta.replicate, meta.type, [bam:data[0],readnum:data[1],bed:data[2]])
     }
     .groupTuple(
          by: [0, 1]
     )
    .view()

though it seems to group things as it should, it also shuffles the readnums and the beds around in a strange way (now one of the readnums has the bed label and vice versa!):

[test1, replicate1, [background, signal], [[bam:test1_replicate1_background_mapped.bam, readnum:test1_replicate1_background.clip.peakClusters.bed, bed:test1_replicate1_background.readnum.txt], [bam:test1_replicate1_signal_mapped.bam, readnum:test1_replicate1_signal.readnum.txt, bed:test1_replicate1_signal.clip.peakClusters.bed]]]

Why is this happening? It’s changing the order of things!

What does the initial channel look like before you do any grouping? i.e. the output of this:

REMOVE_UNMAPPED_READS.out.bam
    .mix(CREATEREADNUM.out.readnum)
    .mix(CLIPPER.out.bed)
    .view()

My best guess is after the first groupTuple the indexing isn’t what you expect so this part is picking up the incorrect files:

[ bam: data[0], readnum: data[1], bed: data[2] ]

You can use the dump operator for some additional debugging.
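
For reference, here is a minimal sketch of how dump could be attached. The small Channel.of below is a made-up stand-in for your mixed channel, and the tag name test is arbitrary. Note that dump only prints when the run is started with the -dump-channels option.

```nextflow
workflow {
    // Hypothetical stand-in for the mixed channel from the question
    Channel.of(
        [[id: 'test1_replicate1_signal'], 'test1_replicate1_signal_mapped.bam'],
        [[id: 'test1_replicate1_signal'], 'test1_replicate1_signal.readnum.txt'],
        [[id: 'test1_replicate1_signal'], 'test1_replicate1_signal.clip.peakClusters.bed']
    )
        .groupTuple()
        // Prints each grouped item, but only when -dump-channels is given
        .dump(tag: 'test')
}
```

Run it with something like nextflow run main.nf -dump-channels test (main.nf being whatever your script is called), and look at the order of the files inside each grouped item.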

Thank you! To answer your question:

[[id:test1_replicate1_signal, sample:test1, replicate:replicate1, type:signal, single_end:false], test1_replicate1_signal_mapped.bam]
[[id:test1_replicate1_signal, sample:test1, replicate:replicate1, type:signal, single_end:false], test1_replicate1_signal.clip.peakClusters.bed]
[[id:test1_replicate1_signal, sample:test1, replicate:replicate1, type:signal, single_end:false], test1_replicate1_signal.readnum.txt]
[[id:test1_replicate1_background, sample:test1, replicate:replicate1, type:background, single_end:false], test1_replicate1_background_mapped.bam]
[[id:test1_replicate1_background, sample:test1, replicate:replicate1, type:background, single_end:false],test1_replicate1_background.clip.peakClusters.bed]
[[id:test1_replicate1_background, sample:test1, replicate:replicate1, type:background, single_end:false], test1_replicate1_background.readnum.txt]

But what I don’t understand is how/why the indexing seems to work fine before the second groupTuple, but changes once there is a second groupTuple?

but yes, I just used the dump operator (thank you for pointing that out) and indeed the indices are different each time:

[DUMP: test] [‘test1’, ‘replicate1’, ‘background’, [‘bam’:
9/test1_replicate1_background_mapped.bam, ‘readnum’:test1_replicate1_background.clip.peakClusters.bed, ‘bed’:test1_replicate1_background.readnum.txt]]
[DUMP: test] [‘test1’, ‘replicate1’, ‘signal’, [‘bam’:test1_replicate1_signal_mapped.bam, ‘readnum’:test1_replicate1_signal.readnum.txt, ‘bed’:test1_replicate1_signal.clip.peakClusters.bed]]

What can I do to guarantee the order, so that this assignment is accurate?

bam:data[0],readnum:data[1],bed:data[2]

OK I feel very silly.

The problem is the mix operator. mix merges the items from several channels into one new channel, emitting them in whatever order they arrive. Then we do the groupTuple to collect the items that belong to each sample. The problem is that channel order isn’t guaranteed; by its nature Nextflow is asynchronous. Sometimes we are grouping [a, b, c] and sometimes we are grouping [a, c, b], so data[1] and data[2] don’t always refer to the same kind of file.

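So what’s the solution? Well, what we are trying to do is essentially a join on channel items that share the same key (an inner join by default, so an item without a match in the other channel is dropped). Luckily for us, Nextflow comes with a join operator. Here’s an example of using it with your channel above: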

workflow {
    REMOVE_UNMAPPED_READS_out_bam = Channel.of(
        [[id:'test1_replicate1_signal', sample:'test1', replicate:'replicate1', type:'signal', single_end:false], 'test1_replicate1_signal_mapped.bam'],
        [[id:'test1_replicate1_background', sample:'test1', replicate:'replicate1', type:'background', single_end:false], 'test1_replicate1_background_mapped.bam']
    )
    CREATEREADNUM_out_readnum = Channel.of(
        [[id:'test1_replicate1_background', sample:'test1', replicate:'replicate1', type:'background', single_end:false], 'test1_replicate1_background.readnum.txt'],
        [[id:'test1_replicate1_signal', sample:'test1', replicate:'replicate1', type:'signal', single_end:false], 'test1_replicate1_signal.readnum.txt']
    )
    CLIPPER_out_bed = Channel.of(
        [[id:'test1_replicate1_signal', sample:'test1', replicate:'replicate1', type:'signal', single_end:false], 'test1_replicate1_signal.clip.peakClusters.bed'],
        [[id:'test1_replicate1_background', sample:'test1', replicate:'replicate1', type:'background', single_end:false], 'test1_replicate1_background.clip.peakClusters.bed']
    )

    // join matches on the first element (the meta map) by default,
    // so every item comes out as [meta, bam, readnum, bed]
    REMOVE_UNMAPPED_READS_out_bam
        .join(CREATEREADNUM_out_readnum)
        .join(CLIPPER_out_bed)
        .view()
}

The result looks like this:

[[id:test1_replicate1_signal, sample:test1, replicate:replicate1, type:signal, single_end:false], test1_replicate1_signal_mapped.bam, test1_replicate1_signal.readnum.txt, test1_replicate1_signal.clip.peakClusters.bed]
[[id:test1_replicate1_background, sample:test1, replicate:replicate1, type:background, single_end:false], test1_replicate1_background_mapped.bam, test1_replicate1_background.readnum.txt, test1_replicate1_background.clip.peakClusters.bed]
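
If you still want the labelled map from your original snippet, one way to build it after the join is sketched below. The three short channels are hypothetical stand-ins for your process outputs (bam_ch, readnum_ch and bed_ch are names I made up). The failOnMismatch option is not required; I have added it on the assumption that every sample should have all three files, so that a missing one raises an error instead of being silently dropped.

```nextflow
workflow {
    // Hypothetical stand-ins for the three process outputs
    bam_ch = Channel.of(
        [[sample: 'test1', replicate: 'replicate1', type: 'signal'], 'test1_replicate1_signal_mapped.bam']
    )
    readnum_ch = Channel.of(
        [[sample: 'test1', replicate: 'replicate1', type: 'signal'], 'test1_replicate1_signal.readnum.txt']
    )
    bed_ch = Channel.of(
        [[sample: 'test1', replicate: 'replicate1', type: 'signal'], 'test1_replicate1_signal.clip.peakClusters.bed']
    )

    bam_ch
        .join(readnum_ch, failOnMismatch: true)
        .join(bed_ch, failOnMismatch: true)
        // Positions are fixed by the order of the joins: meta, bam, readnum, bed
        .map { meta, bam, readnum, bed ->
            tuple(meta.sample, meta.replicate, meta.type, [bam: bam, readnum: readnum, bed: bed])
        }
        .view()
}
```

Because each label is taken from a named closure parameter rather than from data[0], data[1], data[2], it no longer depends on the order in which items happened to arrive.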

I think in the original example you wanted to group on sample and replicate. After we’ve performed the join, it becomes relatively simple to do a groupTuple afterwards. I’ve added an extra replicate for test1 and a second sample, test2, to make it more realistic; sorry for the wall of code:

workflow {
    REMOVE_UNMAPPED_READS_out_bam = Channel.of(
        [[id:'test1_replicate1_signal', sample:'test1', replicate:'replicate1', type:'signal', single_end:false], 'test1_replicate1_signal_mapped.bam'],
        [[id:'test1_replicate1_background', sample:'test1', replicate:'replicate1', type:'background', single_end:false], 'test1_replicate1_background_mapped.bam'],
        [[id:'test1_replicate2_signal', sample:'test1', replicate:'replicate2', type:'signal', single_end:false], 'test1_replicate2_signal_mapped.bam'],
        [[id:'test1_replicate2_background', sample:'test1', replicate:'replicate2', type:'background', single_end:false], 'test1_replicate2_background_mapped.bam'],
        [[id:'test2_replicate1_signal', sample:'test2', replicate:'replicate1', type:'signal', single_end:false], 'test2_replicate1_signal_mapped.bam'],
        [[id:'test2_replicate1_background', sample:'test2', replicate:'replicate1', type:'background', single_end:false], 'test2_replicate1_background_mapped.bam'],
    )
    CREATEREADNUM_out_readnum = Channel.of(
        [[id:'test1_replicate1_background', sample:'test1', replicate:'replicate1', type:'background', single_end:false], 'test1_replicate1_background.readnum.txt'],
        [[id:'test1_replicate1_signal', sample:'test1', replicate:'replicate1', type:'signal', single_end:false], 'test1_replicate1_signal.readnum.txt'],
        [[id:'test1_replicate2_background', sample:'test1', replicate:'replicate2', type:'background', single_end:false], 'test1_replicate2_background.readnum.txt'],
        [[id:'test1_replicate2_signal', sample:'test1', replicate:'replicate2', type:'signal', single_end:false], 'test1_replicate2_signal.readnum.txt'],
        [[id:'test2_replicate1_background', sample:'test2', replicate:'replicate1', type:'background', single_end:false], 'test2_replicate1_background.readnum.txt'],
        [[id:'test2_replicate1_signal', sample:'test2', replicate:'replicate1', type:'signal', single_end:false], 'test2_replicate1_signal.readnum.txt'],
    )
    CLIPPER_out_bed = Channel.of(
        [[id:'test1_replicate1_signal', sample:'test1', replicate:'replicate1', type:'signal', single_end:false], 'test1_replicate1_signal.clip.peakClusters.bed'],
        [[id:'test1_replicate1_background', sample:'test1', replicate:'replicate1', type:'background', single_end:false], 'test1_replicate1_background.clip.peakClusters.bed'],
        [[id:'test1_replicate2_signal', sample:'test1', replicate:'replicate2', type:'signal', single_end:false], 'test1_replicate2_signal.clip.peakClusters.bed'],
        [[id:'test1_replicate2_background', sample:'test1', replicate:'replicate2', type:'background', single_end:false], 'test1_replicate2_background.clip.peakClusters.bed'],
        [[id:'test2_replicate1_signal', sample:'test2', replicate:'replicate1', type:'signal', single_end:false], 'test2_replicate1_signal.clip.peakClusters.bed'],
        [[id:'test2_replicate1_background', sample:'test2', replicate:'replicate1', type:'background', single_end:false], 'test2_replicate1_background.clip.peakClusters.bed'],
    )

    // Each joined item is [meta, bam, readnum, bed]
    joined_ch = REMOVE_UNMAPPED_READS_out_bam
        .join(CREATEREADNUM_out_readnum)
        .join(CLIPPER_out_bed)

    joined_ch
        // Put a reduced key (sample + replicate only) in front to group on
        .map { meta, bam, readnum, bed ->
            tuple(
                meta.subMap('sample', 'replicate'),
                meta,
                bam,
                readnum,
                bed
            )
        }
        // groupTuple groups on the first element by default
        .groupTuple()
        // Drop the grouping key again
        .map { sample_map, meta, bam, readnum, bed ->
            tuple(
                meta,
                bam,
                readnum,
                bed
            )
        }
        .view()

}

The output should be:

[ metas, bams, readnums, beds ]
[[[id:test1_replicate1_background, sample:test1, replicate:replicate1, type:background, single_end:false], [id:test1_replicate1_signal, sample:test1, replicate:replicate1, type:signal, single_end:false]], [test1_replicate1_background_mapped.bam, test1_replicate1_signal_mapped.bam], [test1_replicate1_background.readnum.txt, test1_replicate1_signal.readnum.txt], [test1_replicate1_background.clip.peakClusters.bed, test1_replicate1_signal.clip.peakClusters.bed]]
[[[id:test1_replicate2_background, sample:test1, replicate:replicate2, type:background, single_end:false], [id:test1_replicate2_signal, sample:test1, replicate:replicate2, type:signal, single_end:false]], [test1_replicate2_background_mapped.bam, test1_replicate2_signal_mapped.bam], [test1_replicate2_background.readnum.txt, test1_replicate2_signal.readnum.txt], [test1_replicate2_background.clip.peakClusters.bed, test1_replicate2_signal.clip.peakClusters.bed]]
[[[id:test2_replicate1_background, sample:test2, replicate:replicate1, type:background, single_end:false], [id:test2_replicate1_signal, sample:test2, replicate:replicate1, type:signal, single_end:false]], [test2_replicate1_background_mapped.bam, test2_replicate1_signal_mapped.bam], [test2_replicate1_background.readnum.txt, test2_replicate1_signal.readnum.txt], [test2_replicate1_background.clip.peakClusters.bed, test2_replicate1_signal.clip.peakClusters.bed]]
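
If you would rather keep mix and groupTuple, another option (just a sketch, not something taken from your pipeline) is to attach the label to each file before mixing, so the label travels with the file and arrival order stops mattering. The channel names bam_ch, readnum_ch and bed_ch are again hypothetical stand-ins for your process outputs.

```nextflow
workflow {
    // Hypothetical stand-ins for the three process outputs
    bam_ch     = Channel.of([[id: 'test1_replicate1_signal'], 'test1_replicate1_signal_mapped.bam'])
    readnum_ch = Channel.of([[id: 'test1_replicate1_signal'], 'test1_replicate1_signal.readnum.txt'])
    bed_ch     = Channel.of([[id: 'test1_replicate1_signal'], 'test1_replicate1_signal.clip.peakClusters.bed'])

    // Wrap every file in a one-entry map that carries its own label
    bam_ch.map { meta, f -> [meta, [bam: f]] }
        .mix(
            readnum_ch.map { meta, f -> [meta, [readnum: f]] },
            bed_ch.map { meta, f -> [meta, [bed: f]] }
        )
        .groupTuple()
        // Merge the one-entry maps into a single map per meta
        .map { meta, files -> [meta, files.collectEntries { it }] }
        .view()
}
```

The key order inside the merged map can still vary between runs, but each key always points at the right file, which is the part that matters for the downstream lookup.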