Batching samples into groups of 5 within patients

Hi,
I have a channel like this:

[[sample_id:patient1_1, donor_id:patient1], patient1_1.vcf],
...
[[sample_id:patient1_11, donor_id:patient1], patient1_11.vcf],
[[sample_id:patient2_1, donor_id:patient2], patient2_1.vcf],
...
[[sample_id:patient2_11, donor_id:patient2], patient2_11.vcf]

I would like to batch the samples within each donor into groups of 5, keeping any remainder as a smaller final group, so I would hope to get output something like this:

[[donor_id:patient1], [patient1_1.vcf,patient1_2.vcf,patient1_3.vcf,patient1_4.vcf,patient1_5.vcf]],
[[donor_id:patient1], [patient1_6.vcf,patient1_7.vcf,patient1_8.vcf,patient1_9.vcf,patient1_10.vcf]],
[[donor_id:patient1], [patient1_11.vcf]],
[[donor_id:patient2], [patient2_1.vcf,patient2_2.vcf,patient2_3.vcf,patient2_4.vcf,patient2_5.vcf]],
[[donor_id:patient2], [patient2_6.vcf,patient2_7.vcf,patient2_8.vcf,patient2_9.vcf,patient2_10.vcf]],
[[donor_id:patient2], [patient2_11.vcf]]

Any ideas? Thanks!

Hi @alextidd,

Welcome to the community forum! :slight_smile:

Whenever you want to manipulate a channel, channel operators should be the first thing to come to mind :wink:. In your example, you want to group the samples based on a specific key, so a quick solution is to isolate this key and then group based on it. See the snippet below for a solution to your problem.

workflow {
  // Generating fake data
  Channel
    .of([[sample_id:'patient1_1', donor_id:'patient1'], file('patient1_1.vcf')],
        [[sample_id:'patient1_11', donor_id:'patient1'], file('patient1_11.vcf')],
        [[sample_id:'patient1_111', donor_id:'patient1'], file('patient1_111.vcf')],
        [[sample_id:'patient2_1', donor_id:'patient2'], file('patient2_1.vcf')],
        [[sample_id:'patient2_11', donor_id:'patient2'], file('patient2_11.vcf')],
        [[sample_id:'patient2_111', donor_id:'patient2'], file('patient2_111.vcf')],
        [[sample_id:'patient2_1111', donor_id:'patient2'], file('patient2_1111.vcf')],
        [[sample_id:'patient2_11111', donor_id:'patient2'], file('patient2_11111.vcf')]
    )
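    // Isolate the key of interest (donor_id) as its own map, keeping the file as the second element
    .map { keys, files -> tuple([donor_id: keys.donor_id], files) }
    // Group on the donor_id key, emitting groups of 2 plus any incomplete remainder
    .groupTuple(size: 2, remainder: true)
    .view()
}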

For simplicity, I set the size to 2, but it’s just a matter of setting it to 5 to get exactly what you asked for. Check the screenshot below for the output of the snippet above.
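For reference, here is a minimal sketch of the same approach with `size: 5`, applied to 11 samples from a single donor as in the original question. It assumes DSL2, and the sample names and `.vcf` paths are made up for illustration (the files do not need to exist for the channel logic to run).

```groovy
workflow {
  // Hypothetical input: 11 samples for one donor, mirroring the question
  Channel
    .of(1..11)
    .map { i -> tuple([sample_id: "patient1_${i}", donor_id: 'patient1'], file("patient1_${i}.vcf")) }
    // Keep only donor_id in the key so that all samples of a donor share the same key
    .map { meta, vcf -> tuple([donor_id: meta.donor_id], vcf) }
    // Emit a group as soon as 5 files are collected; remainder: true also emits the incomplete last group
    .groupTuple(size: 5, remainder: true)
    .view()
}
```

With 11 samples this should give two groups of 5 files and one group with a single file for `patient1`. Note that `groupTuple` collects items in arrival order, so the files inside each group are not guaranteed to be sorted by sample number.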