Recommended way of passing a file that lists filenames as input to a process

Hi!

I think this is a very common use case with bioinformatics software, which often requires that the list of input files be read from a file.

In the example I am concerned with, it’s the only way of providing multiple bam files to the graphtyper genotype command (with --sams).

The individual bam files must typically be delivered by a channel, and ideally so should the list file. So what is the recommended approach in Nextflow? I see several ways, but none of them seems clearly best:

  1. Create the list file on the fly inside the same script as the command of interest:

    process genotype {
        input:
        path '*'
    
        """
        ls *.bam > bamlist.txt
        graphtyper genotype ... --sams=bamlist.txt
        """
    }
    
    • pros:
      • simple
      • the list file always contains exactly the given inputs
    • cons:
      • inconvenient for reusing the list file in other processes
  2. Create the list file in a separate process

    process listfiles {
        input:
        path '*'
    
        output:
        path 'bamlist.txt'
    
        "ls *.bam > bamlist.txt"
    }
    
    process genotype {
        input:
        path bamlistfile
        path allbams  // they need to be a dependency as well, to be staged
    
        "graphtyper genotype ... --sams=${bamlistfile}"
    }
    
    workflow {
        allbams = Channel.fromPath('*.bam').collect()
        bamlistfile = allbams | listfiles
        genotype(bamlistfile, allbams)
    }
    
    • pros:
      • the list file is easily reusable
    • cons:
      • feels like writing a lot of code just to save a file listing.
      • the double dependency of the genotype process seems redundant.
  3. Only use Nextflow's built-in channel operators. Maybe my command is clumsy because I am still discovering the language, but I tested the following and it works:

    workflow {
        bams = Channel.fromPath('*.bam')
    
        bamlistfile = bams | map { "${it.name}" }  // list of basenames
        | collectFile(name: 'bam.list', newLine: true)
    
        genotype(bamlistfile, bams.collect())
    }
    
    • Pros:
      • only builtin commands, no process to define
    • Cons:
      • I don’t know if it’s possible to publish the list file
      • the double dependency of the genotype process seems redundant
      • potentially dangerous (see below)
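
About the publishing concern in the last approach: if I read the documentation correctly, collectFile accepts a storeDir parameter, so the list file can be copied outside the work directory without defining a process at all (the 'results' directory name here is just an example):

    bamlistfile = bams
        | map { it.name }
        | collectFile(name: 'bam.list', newLine: true, storeDir: 'results')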

What are your recommendations?


Bonus (DO NOT DO THIS)

I unfortunately wrote a command using collectFile that overwrote an input file.

My problem when first attempting it was that the construct Channel.fromPath('*.bam') | collectFile(name: 'supposedly_listfile') actually concatenates the contents of the files. However, if the channel emits plain values such as strings, collectFile writes each value into the file.

This is why I thought that converting the channel to a value channel was necessary. I then tried the following WRONG command, which overwrites the first input file:

/* DO NOT RUN !!!
 * (on your real inputs)
 */
Channel.fromPath('very_important_raw_input{1,2,3}.txt')
.collect()  // trying to make a value channel: WRONG
.collectFile(name: 'file.list', newLine: true)
.view { "written to: $it" }

Which prints… written to: very_important_raw_input1.txt!

And now, the content of very_important_raw_input1.txt is the concatenation of very_important_raw_input{2,3}.txt.

What the collect step did was to pass the file names as arguments to the next operator, effectively replacing name: 'file.list' by name: 'very_important_raw_input1.txt'.
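
This matches what happens, if I understand the documentation correctly, when collectFile receives tuples: the first element is interpreted as the target file name and the second as the content to append. A toy illustration with plain strings instead of real files:

    Channel.of( ['out.txt', 'first line'], ['out.txt', 'second line'] )
        | collectFile(newLine: true)
        | view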

I am to blame for not properly understanding Groovy/Nextflow’s syntax before playing around with real input files (that I was able to regenerate), but I am sharing it here just in case!
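
For the record, the safe version is the one from my third approach above: map each path to its basename first, so that collectFile receives plain strings instead of paths:

    Channel.fromPath('very_important_raw_input{1,2,3}.txt')
        | map { it.name }
        | collectFile(name: 'file.list', newLine: true)
        | view { "written to: $it" }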

I would clearly recommend your first approach. I don't see the point of reusing the file list among processes: you want to make sure that the content of the list actually matches the files in the input channel, so it's much better to create a fresh list in every process that needs it. It's just a single line of code anyway.
