Batching samples into groups of 5 within patients

alextidd · October 31, 2024, 4:07pm

Hi,
I have a channel like this:

[[sample_id:patient1_1, donor_id:patient1], patient1_1.vcf],
...
[[sample_id:patient1_11, donor_id:patient1], patient1_11.vcf],
[[sample_id:patient2_1, donor_id:patient2], patient2_1.vcf],
...
[[sample_id:patient2_11, donor_id:patient2], patient2_11.vcf]

I would like to batch samples within each donor into groups of 5, keeping any remainder as a smaller group, so I would hope to get output something like this:

[[donor_id:patient1], [patient1_1.vcf,patient1_2.vcf,patient1_3.vcf,patient1_4.vcf,patient1_5.vcf]],
[[donor_id:patient1], [patient1_6.vcf,patient1_7.vcf,patient1_8.vcf,patient1_9.vcf,patient1_10.vcf]],
[[donor_id:patient1], [patient1_11.vcf]],
[[donor_id:patient2], [patient2_1.vcf,patient2_2.vcf,patient2_3.vcf,patient2_4.vcf,patient2_5.vcf]],
[[donor_id:patient2], [patient2_6.vcf,patient2_7.vcf,patient2_8.vcf,patient2_9.vcf,patient2_10.vcf]],
[[donor_id:patient2], [patient2_11.vcf]]

Any ideas? Thanks!

mribeirodantas · November 1, 2024, 11:50am

Hi @alextidd,

Welcome to the community forum!

Whenever you want to manipulate a channel, channel operators should be the first thing to come to mind. In your example, you want to group the samples by a specific key, so a quick solution is to isolate that key with map and then group on it with groupTuple. The snippet below shows a solution to your problem.

workflow {
  // Generating fake data
  Channel
    .of([[sample_id:'patient1_1', donor_id:'patient1'], file('patient1_1.vcf')],
        [[sample_id:'patient1_11', donor_id:'patient1'], file('patient1_11.vcf')],
        [[sample_id:'patient1_111', donor_id:'patient1'], file('patient1_111.vcf')],
        [[sample_id:'patient2_1', donor_id:'patient2'], file('patient2_1.vcf')],
        [[sample_id:'patient2_11', donor_id:'patient2'], file('patient2_11.vcf')],
        [[sample_id:'patient2_111', donor_id:'patient2'], file('patient2_111.vcf')],
        [[sample_id:'patient2_1111', donor_id:'patient2'], file('patient2_1111.vcf')],
        [[sample_id:'patient2_11111', donor_id:'patient2'], file('patient2_11111.vcf')]
    )
    // Isolating the key of interest - donor_id (written as an explicit map)
    .map { keys, files -> tuple([donor_id: keys.donor_id], files) }
    // Group based on the key donor_id, in batches of `size`;
    // remainder: true also emits the final, smaller batch
    .groupTuple(size: 2, remainder: true)
    .view()
}

For simplicity, I set size to 2; setting it to 5 gives exactly what you asked for. The remainder: true option makes sure that leftover groups with fewer than size items are still emitted instead of being discarded. The screenshot shows the output of the snippet above.
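
As a minimal sketch of the size-5 case using the layout from the question, the same pattern could look like the snippet below. The input is generated on the fly here (two donors with 11 samples each), and the .vcf paths are placeholders that are assumed not to need to exist, so adapt the first part to however your real channel is built.

```nextflow
workflow {
  // Hypothetical input: 11 samples for each of two donors, mirroring the question
  Channel
    .of('patient1', 'patient2')
    .flatMap { donor ->
      (1..11).collect { i ->
        [[sample_id: "${donor}_${i}", donor_id: donor], file("${donor}_${i}.vcf")]
      }
    }
    // Keep only donor_id in the first element so it becomes the grouping key
    .map { meta, vcf -> tuple([donor_id: meta.donor_id], vcf) }
    // Emit a group as soon as 5 files share a key; emit leftovers at the end
    .groupTuple(size: 5, remainder: true)
    .view()
}
```

With 11 samples per donor, this should give three groups per donor: two with 5 files and one with the single remaining file. The order in which the groups are printed may differ from the listing in the question.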

system · November 8, 2024, 11:50am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
