Group tuples based on two meta map keys, and sort based on another one

mribeirodantas · October 2, 2023, 2:56pm

Let’s say you have a channel where every element is a tuple that consists of a
meta map and a FASTQ file. This meta map, which is the first item of your tuple,
contains multiple key-value pairs, and you want to group the elements in this
channel based on two of these keys and then sort the grouped tuples based on
another key of this meta map. Check the channel created below for a better
understanding of the problem.

Channel
  .of(
    [[id:'testA', end:2, chunk:0, total_chunks:2], file('/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_2.0.fastq')],
    [[id:'testA', end:2, chunk:1, total_chunks:2], file('/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_2.1.fastq')],
    [[id:'testA', end:1, chunk:0, total_chunks:2], file('/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_1.0.fastq')],
    [[id:'testA', end:1, chunk:1, total_chunks:2], file('/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_1.1.fastq')],
    [[id:'testB', end:1, chunk:0, total_chunks:2], file('/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_1.0.fastq')],
    [[id:'testB', end:1, chunk:1, total_chunks:2], file('/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_1.1.fastq')],
    [[id:'testB', end:2, chunk:0, total_chunks:2], file('/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_2.0.fastq')],
    [[id:'testB', end:2, chunk:1, total_chunks:2], file('/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_2.1.fastq')]
  )
  .view()
  .set { my_ch }

If you run the snippet above, you’ll see something like the output below:

N E X T F L O W  ~  version 23.09.2-edge
Launching `x.nf` [backstabbing_mahavira] DSL2 - revision: cb93185077
[[id:testA, end:2, chunk:0, total_chunks:2], /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_2.0.fastq]
[[id:testA, end:2, chunk:1, total_chunks:2], /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_2.1.fastq]
[[id:testA, end:1, chunk:0, total_chunks:2], /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_1.0.fastq]
[[id:testA, end:1, chunk:1, total_chunks:2], /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_1.1.fastq]
[[id:testB, end:1, chunk:0, total_chunks:2], /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_1.0.fastq]
[[id:testB, end:1, chunk:1, total_chunks:2], /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_1.1.fastq]
[[id:testB, end:2, chunk:0, total_chunks:2], /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_2.0.fastq]
[[id:testB, end:2, chunk:1, total_chunks:2], /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_2.1.fastq]

Based on the example above, you want to group these elements by the id and
chunk, and then sort by end. You’re probably aware of the groupTuple
channel operator that… groups tuples . It also has a sort option (Read
more about the operator here).
However, the by option of groupTuple can’t access values inside a map. The
solution is to extract the information outside the map and then pass its
position to groupTuple, like below:

my_ch
  .map { metadata, fastq ->
    tuple(metadata.id, metadata.chunk, [metadata, fastq])
  }
  .groupTuple(
    by: [0, 1],
  )
  .view()

Output:

N E X T F L O W  ~  version 23.09.2-edge
Launching `x.nf` [insane_mendel] DSL2 - revision: 07c0453c47
[testA, 0, [[[id:testA, end:2, chunk:0, total_chunks:2], /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_2.0.fastq], [[id:testA, end:1, chunk:0, total_chunks:2], /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_1.0.fastq]]]
[testA, 1, [[[id:testA, end:2, chunk:1, total_chunks:2], /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_2.1.fastq], [[id:testA, end:1, chunk:1, total_chunks:2], /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_1.1.fastq]]]
[testB, 0, [[[id:testB, end:1, chunk:0, total_chunks:2], /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_1.0.fastq], [[id:testB, end:2, chunk:0, total_chunks:2], /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_2.0.fastq]]]
[testB, 1, [[[id:testB, end:1, chunk:1, total_chunks:2], /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_1.1.fastq], [[id:testB, end:2, chunk:1, total_chunks:2], /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_2.1.fastq]]]

It’s grouped as we want, but it’s not sorted yet, and there’s too much redundant
information. The next step is to sort, which we will do through a closure that
accesses the value of the end key.

my_ch
  .map { metadata, fastq ->
    tuple(metadata.id, metadata.chunk, [metadata, fastq])
  }
  .groupTuple(
    by: [0, 1],
    sort: { e1, e2 -> e1[0].end <=> e2[0].end }
  )

And then, a map to apply some cleaning to every element of the channel.

my_ch
  .map { metadata, fastq ->
    tuple(metadata.id, metadata.chunk, [metadata, fastq])
  }
  .groupTuple(
    by: [0, 1],
    sort: { e1, e2 -> e1[0].end <=> e2[0].end }
  )
  .map {
    tuple(it[0], [it[2][0][1], it[2][1][1]])
  }
  .view()

You can find the complete snippet below, followed by the output:

Channel
  .of(
    [[id:'testA', end:2, chunk:0, total_chunks:2], file('/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_2.0.fastq')],
    [[id:'testA', end:2, chunk:1, total_chunks:2], file('/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_2.1.fastq')],
    [[id:'testA', end:1, chunk:0, total_chunks:2], file('/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_1.0.fastq')],
    [[id:'testA', end:1, chunk:1, total_chunks:2], file('/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_1.1.fastq')],
    [[id:'testB', end:1, chunk:0, total_chunks:2], file('/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_1.0.fastq')],
    [[id:'testB', end:1, chunk:1, total_chunks:2], file('/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_1.1.fastq')],
    [[id:'testB', end:2, chunk:0, total_chunks:2], file('/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_2.0.fastq')],
    [[id:'testB', end:2, chunk:1, total_chunks:2], file('/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_2.1.fastq')]
  )
  .set { my_ch }

my_ch
  .map { metadata, fastq ->
    tuple(metadata.id, metadata.chunk, [metadata, fastq])
  }
  .groupTuple(
    by: [0, 1],
    sort: { e1, e2 -> e1[0].end <=> e2[0].end }
  )
  .map {
    tuple(it[0], [it[2][0][1], it[2][1][1]])
  }
  .view()

Output:

N E X T F L O W  ~  version 23.09.2-edge
Launching `x.nf` [stupefied_church] DSL2 - revision: dad4f10db1
[testA, [/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_1.0.fastq, /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testA_2.0.fastq]]
[testA, [/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_1.1.fastq, /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testA_2.1.fastq]]
[testB, [/home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_1.0.fastq, /home/user/nf/work/04/9488ff844ef80e070e8b538ae553c5/testB_2.0.fastq]]
[testB, [/home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_1.1.fastq, /home/user/nf/work/ff/4b8ebb5a54d0532ef10a61d1642196/testB_2.1.fastq]]

Special thanks to Jordi Camps for the final version of this snippet

Topic		Replies	Views
Grouping,mapping is shuffling the labels Ask for help nextflow	4	84	June 21, 2024
How to view all multi map data at once? Ask for help	7	269	March 26, 2024
I don't know how to sort Ask for help	3	76	June 28, 2024
How to sort tuples Ask for help	4	182	June 13, 2024
Join channels based on meta field Ask for help	3	333	May 8, 2024

Group tuples based on two meta map keys, and sort based on another one

Related topics