Resolving variable number of s3 paths in a list of lists fed to a process

We are trying to set up a process that uses FastQ Screen to map a sample against a number of references for contamination screening. We want to be able to provide a bunch of different references (each with its name, path and preferred aligner), have their paths resolved and symlinked into the work dir of the process, and written to a config file of the form

DATABASE ref1 workdir/symlink/to/ref1 bowtie2
DATABASE ref2 workdir/symlink/to/ref2 bowtie2
etc

The following simple example runs, and illustrates what we want to do:

process TEST {
    input:
    tuple val(db_name), path(db_path, name: "db_path*"), val(aligner)

    script:
    """
    echo "DATABASE ${db_name} ./${db_path}/genome ${aligner}" >> fastq_screen.conf
    """
}

workflow {
    ch_db = Channel
        .fromList([
            ["Ecoli", "s3://ngi-igenomes/igenomes/Escherichia_coli_K_12_MG1655/NCBI/2001-10-15/Sequence/Bowtie2Index/", "bowtie2"],
        ])
        .collect()
        .view()

    TEST(ch_db)
}

but we can’t expand it to handle more than one entry, e.g. adding ["Scerevisiae","s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/NCBI/build3.1/Sequence/Bowtie2Index/","bowtie2"] to the list of lists.
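For what it's worth, here is a standalone sketch of why the second entry breaks the tuple input (the refs/... paths are placeholders, and the behaviour hinges on collect's default flattening, which comes up again further down the thread):

workflow {
    // one entry: collect flattens the single nested list, so the process receives roughly
    //   [Ecoli, refs/Ecoli/Bowtie2Index, bowtie2]
    // which happens to match tuple val(db_name), path(db_path), val(aligner)
    Channel
        .fromList([["Ecoli", "refs/Ecoli/Bowtie2Index", "bowtie2"]])
        .collect()
        .view()

    // two entries: the collected value is flattened into one six-element list, roughly
    //   [Ecoli, refs/Ecoli/Bowtie2Index, bowtie2, Scerevisiae, refs/Scerevisiae/Bowtie2Index, bowtie2]
    // which no longer lines up with the three-element tuple input
    Channel
        .fromList([
            ["Ecoli", "refs/Ecoli/Bowtie2Index", "bowtie2"],
            ["Scerevisiae", "refs/Scerevisiae/Bowtie2Index", "bowtie2"]
        ])
        .collect()
        .view()
}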

PR and discussion here

This is a limitation of the process input/output syntax. You can have a path, a path list, a tuple with path elements, and even a tuple with path list elements, but you can’t have a list of tuples containing paths, which seems to be what you want.

To make this work, you’ll need to transpose the way you provide inputs to the process:

process TEST {
    input:
    val(db_names)
    path(db_paths, name: "db_path*")
    val(aligners)

    // ...
}
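The script block is elided above; as one possible sketch (my own, reusing the fastq_screen.conf line format from the question), you could build one echo command per reference on the Groovy side:

process TEST {
    input:
    val(db_names)
    path(db_paths, name: "db_path*")
    val(aligners)

    script:
    // pair the parallel lists back up and emit one config line per reference
    def conf_cmds = [db_names, db_paths, aligners]
        .transpose()
        .collect { name, db, aligner ->
            "echo 'DATABASE ${name} ./${db}/genome ${aligner}' >> fastq_screen.conf"
        }
        .join('\n')
    """
    ${conf_cmds}
    """
}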

You can then use multiMap (or just map) to split your ch_db into the three process inputs:

ch_db_multi = ch_db.multiMap { dbs ->
  db_name: dbs.collect { db -> db[0] }
  db_path: dbs.collect { db -> db[1] }
  aligner: dbs.collect { db -> db[2] }
}

TEST (
  ch_db_multi.db_name,
  ch_db_multi.db_path,
  ch_db_multi.aligner
)
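The "just map" variant mentioned above could look something like this (my sketch, same channel and process names as before):

TEST (
    ch_db.map { dbs -> dbs.collect { db -> db[0] } },
    ch_db.map { dbs -> dbs.collect { db -> db[1] } },
    ch_db.map { dbs -> dbs.collect { db -> db[2] } }
)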

EDIT: corrected workflow logic


One more thing – for my example to work you need to use toList instead of collect to create ch_db:

    ch_db = Channel
        .fromList([
            ["Ecoli", "s3://ngi-igenomes/igenomes/Escherichia_coli_K_12_MG1655/NCBI/2001-10-15/Sequence/Bowtie2Index/", "bowtie2"],
        ])
        .toList()
        .view()

This is because collect flattens the collected list items by one level by default (its flat option defaults to true), whereas toList keeps each emitted item as-is.
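A quick way to see the difference (placeholder paths, outputs shown roughly as comments):

workflow {
    dbs = [["Ecoli", "refs/Ecoli/Bowtie2Index", "bowtie2"]]

    // flattened by one level, roughly: [Ecoli, refs/Ecoli/Bowtie2Index, bowtie2]
    Channel.fromList(dbs).collect().view()

    // nesting preserved, roughly: [[Ecoli, refs/Ecoli/Bowtie2Index, bowtie2]]
    Channel.fromList(dbs).toList().view()
}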


Another option:

workflow {
    Channel.of(
        ["Ecoli", "Escherichia_coli_K_12_MG1655//Bowtie2Index", "bowtie2"],
        ["Scerevisiae","Saccharomyces_cerevisiae/Bowtie2Index","bowtie2"]
    )
    | toList
    | transpose
    | toList
    | Test
}

Or using the more conventional dot notation:

workflow {
    dbs = Channel.of(
        ["Ecoli", "Escherichia_coli_K_12_MG1655/Bowtie2Index", "bowtie2"],
        ["Scerevisiae","Saccharomyces_cerevisiae/Bowtie2Index","bowtie2"]
    )
        .toList()
        .transpose()
        .toList()

    Test(dbs)
}
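To spell out what that toList / transpose / toList chain does (my annotation, with the shapes shown schematically):

// Channel.of(...)  emits two items:
//   [Ecoli, path1, bowtie2]  and  [Scerevisiae, path2, bowtie2]
// .toList()        gathers them into a single item:
//   [[Ecoli, path1, bowtie2], [Scerevisiae, path2, bowtie2]]
// .transpose()     pairs the nested lists up index-wise, emitting three items:
//   [Ecoli, Scerevisiae], [path1, path2], [bowtie2, bowtie2]
// .toList()        gathers those into the one item the process expects:
//   [[Ecoli, Scerevisiae], [path1, path2], [bowtie2, bowtie2]]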

Where the Test process might look like:

process Test {

    input:
    tuple val(db_names), path(db_paths, name: "db_path*"), val(aligners)

    script:
    """
    read -a species_array <<< '${db_names.join(' ')}'
    read -a db_paths_array <<< '${db_paths.join(' ')}'
    read -a tools_array <<< '${aligners.join(' ')}'
    for i in "\${!species_array[@]}"; do
        echo -e "DATABASE\t\${species_array[i]}\t\${db_paths_array[i]}\t\${tools_array[i]}" >> "fastq_screen.conf"
    done
    ls -lh
    """
}
