Resolving variable number of s3 paths in a list of lists fed to a process

We are trying to set up a process that uses FastQ Screen to map a sample against a number of references for contamination screening. We want to be able to provide a bunch of different references (each with its name, path and preferred aligner), have their paths resolved and symlinked into the work dir of the process, and written to a config file of the form

DATABASE ref1 workdir/symlink/to/ref1 bowtie2
DATABASE ref2 workdir/symlink/to/ref2 bowtie2
etc

The following simple example runs, and illustrates what we want to do:

process TEST {
    input:
    tuple val(db_name), path(db_path, name: "db_path*"), val(aligner)

    script:
    """
    echo "DATABASE ${db_name} ./${db_path}/genome ${aligner}" >> fastq_screen.conf
    """
}

workflow {
    ch_db = Channel
        .fromList([
            ["Ecoli", "s3://ngi-igenomes/igenomes/Escherichia_coli_K_12_MG1655/NCBI/2001-10-15/Sequence/Bowtie2Index/", "bowtie2"],
        ])
        .collect()
        .view()

    TEST(ch_db)
}

but we can’t expand it to handle more than one entry, e.g. adding ["Scerevisiae","s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/NCBI/build3.1/Sequence/Bowtie2Index/","bowtie2"] to the list of lists.
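For what it's worth, here is a standalone sketch of why the second entry breaks the tuple input (the refs/... paths are placeholders, and the behaviour hinges on collect's default flattening, which comes up again further down the thread):

workflow {
    // one entry: collect flattens the single nested list, so the process receives roughly
    //   [Ecoli, refs/Ecoli/Bowtie2Index, bowtie2]
    // which happens to match tuple val(db_name), path(db_path), val(aligner)
    Channel
        .fromList([["Ecoli", "refs/Ecoli/Bowtie2Index", "bowtie2"]])
        .collect()
        .view()

    // two entries: the collected value is flattened into one six-element list, roughly
    //   [Ecoli, refs/Ecoli/Bowtie2Index, bowtie2, Scerevisiae, refs/Scerevisiae/Bowtie2Index, bowtie2]
    // which no longer lines up with the three-element tuple input
    Channel
        .fromList([
            ["Ecoli", "refs/Ecoli/Bowtie2Index", "bowtie2"],
            ["Scerevisiae", "refs/Scerevisiae/Bowtie2Index", "bowtie2"]
        ])
        .collect()
        .view()
}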

PR and discussion here

This is a limitation of the process input/output syntax. You can have a path, a path list, a tuple with path elements, and even a tuple with path list elements, but you can’t have a list of tuples containing paths, which seems to be what you want.

To make this work, you’ll need to transpose the way you provide inputs to the process:

process TEST {
    input:
    val(db_names)
    path(db_paths, name: "db_path*")
    val(aligners)

    // ...
}
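The script block is elided above; as one possible sketch (my own, reusing the fastq_screen.conf line format from the question), you could build one echo command per reference on the Groovy side:

process TEST {
    input:
    val(db_names)
    path(db_paths, name: "db_path*")
    val(aligners)

    script:
    // pair the parallel lists back up and emit one config line per reference
    def conf_cmds = [db_names, db_paths, aligners]
        .transpose()
        .collect { name, db, aligner ->
            "echo 'DATABASE ${name} ./${db}/genome ${aligner}' >> fastq_screen.conf"
        }
        .join('\n')
    """
    ${conf_cmds}
    """
}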

You can then use multiMap (or just map) to split your ch_db into the three process inputs:

ch_db_multi = ch_db.multiMap { dbs ->
  db_name: dbs.collect { db -> db[0] }
  db_path: dbs.collect { db -> db[1] }
  aligner: dbs.collect { db -> db[2] }
}

TEST (
  ch_db_multi.db_name,
  ch_db_multi.db_path,
  ch_db_multi.aligner
)
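The "just map" variant mentioned above could look something like this (my sketch, same channel and process names as before):

TEST (
    ch_db.map { dbs -> dbs.collect { db -> db[0] } },
    ch_db.map { dbs -> dbs.collect { db -> db[1] } },
    ch_db.map { dbs -> dbs.collect { db -> db[2] } }
)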

EDIT: corrected workflow logic


One more thing – for my example to work you need to use toList instead of collect to create ch_db:

    ch_db = Channel
        .fromList([
            ["Ecoli", "s3://ngi-igenomes/igenomes/Escherichia_coli_K_12_MG1655/NCBI/2001-10-15/Sequence/Bowtie2Index/", "bowtie2"],
        ])
        .toList()
        .view()

This is because collect flattens the collected list items by one level by default (its flat option defaults to true), whereas toList keeps each emitted item as-is.
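A quick way to see the difference (placeholder paths, outputs shown roughly as comments):

workflow {
    dbs = [["Ecoli", "refs/Ecoli/Bowtie2Index", "bowtie2"]]

    // flattened by one level, roughly: [Ecoli, refs/Ecoli/Bowtie2Index, bowtie2]
    Channel.fromList(dbs).collect().view()

    // nesting preserved, roughly: [[Ecoli, refs/Ecoli/Bowtie2Index, bowtie2]]
    Channel.fromList(dbs).toList().view()
}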


Another option:

workflow {
    Channel.of(
        ["Ecoli", "Escherichia_coli_K_12_MG1655//Bowtie2Index", "bowtie2"],
        ["Scerevisiae","Saccharomyces_cerevisiae/Bowtie2Index","bowtie2"]
    )
    | toList
    | transpose
    | toList
    | Test
}

Or using the more conventional dot notation:

workflow {
    dbs = Channel.of(
        ["Ecoli", "Escherichia_coli_K_12_MG1655/Bowtie2Index", "bowtie2"],
        ["Scerevisiae","Saccharomyces_cerevisiae/Bowtie2Index","bowtie2"]
    )
        .toList()
        .transpose()
        .toList()

    Test(dbs)
}
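To spell out what that toList / transpose / toList chain does (my annotation, with the shapes shown schematically):

// Channel.of(...)  emits two items:
//   [Ecoli, path1, bowtie2]  and  [Scerevisiae, path2, bowtie2]
// .toList()        gathers them into a single item:
//   [[Ecoli, path1, bowtie2], [Scerevisiae, path2, bowtie2]]
// .transpose()     pairs the nested lists up index-wise, emitting three items:
//   [Ecoli, Scerevisiae], [path1, path2], [bowtie2, bowtie2]
// .toList()        gathers those into the one item the process expects:
//   [[Ecoli, Scerevisiae], [path1, path2], [bowtie2, bowtie2]]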

Where the Test process might look like:

process Test {

    input:
    tuple val(db_names), path(db_paths, name: "db_path*"), val(aligners)

    script:
    """
    read -a species_array <<< '${db_names.join(' ')}'
    read -a db_paths_array <<< '${db_paths.join(' ')}'
    read -a tools_array <<< '${aligners.join(' ')}'
    for i in "\${!species_array[@]}"; do
        echo -e "DATABASE\t\${species_array[i]}\t\${db_paths_array[i]}\t\${tools_array[i]}" >> "fastq_screen.conf"
    done
    ls -lh
    """
}
