Collect and sort by key but keep only the file paths

Hello,

I have a per-chromosome process that outputs a tuple in the following format: tuple val(chr_id), path(chr_vcf). In the next process I want to concatenate all my VCFs into a whole-genome VCF, but I want to make sure my chromosomes are in the correct order.

I managed to collect and sort by the key chr_id using either collect(flat: false, sort: {it[0]}) or toSortedList{ a, b -> a[0] <=> b[0] }, but then I struggle with the input cardinality and with getting a list that contains only the VCF paths for the following process.

A more concrete example:

Channel.of(["1","test_1.vcf.gz"], ["MU123", "test_MU123.vcf.gz"], ["JA01.1", "test_JA01.1.vcf.gz"], ["2", "test_2.vcf.gz"], ["10", "test_10.vcf.gz"], ["MT", "test_MT.vcf.gz"])
       .view()
       .set{ my_ch }

// sorting works
my_ch.collect(flat: false, sort: {it[0]})
     .view()

// sorting works
my_ch.toSortedList{ a, b -> a[0] <=> b[0] }
     .view()

// sorting works but how to get list of paths? => does nothing
my_ch.collect(flat: false, sort: {it[0]})
     .map{ it[][1] }
     .view()

process CONCAT_TEST{
       input:
       // or how to manage input cardinality?
       tuple val(chr_id), path(chr_vcfs)

       script:
       """
       bcftools concat ...

       """
}

workflow{
       CONCAT_TEST(my_ch.collect(flat: false, sort: {it[0]}))
}

Output:

Launching `example_nf.nf` [distracted_legentil] DSL2 - revision: 3470727da3

[1, test_1.vcf.gz]
[MU123, test_MU123.vcf.gz]
[JA01.1, test_JA01.1.vcf.gz]
[2, test_2.vcf.gz]
[10, test_10.vcf.gz]
[MT, test_MT.vcf.gz]

// collect:
[[1, test_1.vcf.gz], [10, test_10.vcf.gz], [2, test_2.vcf.gz], [JA01.1, test_JA01.1.vcf.gz], [MT, test_MT.vcf.gz], [MU123, test_MU123.vcf.gz]]

// toSortedList:
[['1', 'test_1.vcf.gz'], ['10', 'test_10.vcf.gz'], ['2', 'test_2.vcf.gz'], ['JA01.1', 'test_JA01.1.vcf.gz'], ['MT', 'test_MT.vcf.gz'], ['MU123', 'test_MU123.vcf.gz']]

WARN: Input tuple does not match tuple declaration in process `CONCAT_TEST` -- offending value: [[1, test_1.vcf.gz], [10, test_10.vcf.gz], [2, test_2.vcf.gz], [JA01.1, test_JA01.1.vcf.gz], [MT, test_MT.vcf.gz], [MU123, test_MU123.vcf.gz]]
ERROR ~ Error executing process > 'CONCAT_TEST'

Caused by:
  Not a valid path value: '10'

You need a bit of extra operator logic to transform the list of tuples into a tuple of lists:

process CONCAT_TEST {
   input:
   tuple val(chr_ids), path(chr_vcfs)

   script:
   """
   bcftools concat ...
   """
}

workflow{
    my_ch = channel.of(
        ["1","test_1.vcf.gz"],
        ["MU123", "test_MU123.vcf.gz"],
        ["JA01.1", "test_JA01.1.vcf.gz"],
        ["2", "test_2.vcf.gz"],
        ["10", "test_10.vcf.gz"],
        ["MT", "test_MT.vcf.gz"]
    )

    samples_val = my_ch.collect(flat: false)
        .map { samples ->
            samples = samples.toSorted { id, path -> id }
            def ids = samples.collect { id, path -> id }
            def files = samples.collect { id, path -> path }
            tuple(ids, files)
        }
        .view()

    CONCAT_TEST(samples_val)
}

You can find all of the extra methods like toSorted and collect in the standard library docs: Types | Seqera Docs

I like to do this:

my_ch
    .collect(flat: false, sort: {it[0]})
    .map {
        def (ids, vcfs) = it.transpose()
        tuple(ids, vcfs)
    }
    .view()
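
For readers unfamiliar with transpose(), here is a minimal plain-Groovy sketch of what it does to an already-sorted list of [chr_id, vcf] pairs. The three-element list is an illustrative subset of the data from the question, not output from a real run.

```groovy
// Illustrative subset of the sorted tuples from the question: [chr_id, vcf_name]
def sorted = [
    ["1", "test_1.vcf.gz"],
    ["10", "test_10.vcf.gz"],
    ["2", "test_2.vcf.gz"],
]

// transpose() turns a list of pairs into a pair of lists,
// and the multiple assignment unpacks that pair into two variables.
def (ids, vcfs) = sorted.transpose()

println ids   // all chromosome ids, in sorted order
println vcfs  // the matching VCF names, in the same order
```

Because both lists come from the same transposed structure, the ids and the paths stay aligned with each other.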

Thanks for the answer. When I try it with Nextflow version 25.10.4 I get an error on the samples = samples.toSorted { id, path -> id } line:

ERROR ~ Cannot cast object '[MU123, test_MU123.vcf.gz]' with class 'java.util.ArrayList' to class 'int'

but Alexander’s answer below seems to work using a similar logic!

You’re right, the tuple unpacking won’t work with the toSorted closure. So I guess you would have to do samples.toSorted { it -> it[0] }

(my answer is geared more towards setting you up for static typing in the future :smiley: )
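
As a rough illustration of why the one-parameter form works, here is a plain-Groovy sketch using the sample data from the question. Groovy's toSorted treats a two-parameter closure as a comparator that must return an int, while a one-parameter closure is used to extract the sort key. The variable names are only for illustration, and the same logic should carry over to the body of the map closure in the workflow.

```groovy
// Sample tuples from the question: [chr_id, vcf_name]
def samples = [
    ["1", "test_1.vcf.gz"],
    ["MU123", "test_MU123.vcf.gz"],
    ["JA01.1", "test_JA01.1.vcf.gz"],
    ["2", "test_2.vcf.gz"],
    ["10", "test_10.vcf.gz"],
    ["MT", "test_MT.vcf.gz"],
]

// One-parameter closure: the returned value (the chr_id) is the sort key.
// A two-parameter closure such as { id, path -> id } would instead be
// called as a comparator with two whole tuples, hence the cast-to-int error.
def sorted = samples.toSorted { it -> it[0] }

// Split the sorted tuples into two parallel lists.
def ids = sorted.collect { it -> it[0] }
def files = sorted.collect { it -> it[1] }

println ids
println files
```

Note that the ids are strings, so the order is lexicographic ("10" sorts before "2"), which matches the collect and toSortedList output shown in the question.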