Recommended way of passing a file that lists filenames as input to a process

Hi!

I think this is a very common use case with bioinformatics software, which often requires that the list of input files be read from a file.

In the example I am concerned with, it’s the only way of providing multiple bam files to the graphtyper genotype command (with --sams).

The individual bam files must typically be delivered by a channel, and ideally so should the list file. So what is the recommended approach in Nextflow? I see several ways, but none of them seems clearly best:

  1. Create the list file on the fly inside the same script as the command of interest:

    process genotype {
        input:
        path '*'
    
        """
        ls *.bam > bamlist.txt
        graphtyper genotype ... --sams=bamlist.txt
        """
    }
    
    • pros:
      • simple
      • the list file always contains exactly the given inputs
    • cons:
      • inconvenient for reusing the list file in other processes
  2. Create the list file in a separate process

    process listfiles {
        input:
        path '*'
    
        output:
        path 'bamlist.txt'
    
        "ls *.bam > bamlist.txt"
    }
    
    process genotype {
        input:
        path bamlistfile
        path allbams  // they need to be a dependency as well, to be staged
    
        "graphtyper genotype ... --sams=${bamlistfile}"
    }
    
    workflow {
        allbams = Channel.fromPath('*.bam').collect()
        bamlistfile = allbams | listfiles
        genotype(bamlistfile, allbams)
    }
    
    • pros:
      • the list file is easily reusable
    • cons:
      • feels like writing a lot of code just to save a file listing.
      • the double dependency of the genotype process seems redundant.
  3. Only use Nextflow's built-in channel operators. Maybe my command is clumsy because I am still discovering the language, but I tested the following and it works:

    workflow {
        bams = Channel.fromPath('*.bam')
    
        bamlistfile = bams | map { "${it.name}" }  // list of basenames
        | collectFile(name: 'bam.list', newLine: true)
    
        genotype(bamlistfile, bams.collect())
    }
    
    • Pros:
      • only builtin commands, no process to define
    • Cons:
      • I don’t know if it’s possible to publish the list file
      • the double dependency of the genotype process seems redundant
      • potentially dangerous (see below)
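
About the publishing concern in the last approach: if I read the documentation correctly, collectFile accepts a storeDir parameter, so the list file can be copied outside the work directory without defining a process at all (the 'results' directory name here is just an example):

    bamlistfile = bams
        | map { it.name }
        | collectFile(name: 'bam.list', newLine: true, storeDir: 'results')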

What are your recommendations?


Bonus (DO NOT DO THIS)

I unfortunately wrote a command using collectFile that overwrote an input file.

My problem when first attempting it was that the construct Channel.fromPath('*.bam') | collectFile(name: 'supposedly_listfile') actually concatenates the contents of the files. However, if the channel emits plain values such as strings, collectFile writes each value into the file.

This is why I thought that converting the channel to a value channel was necessary. I then tried the following WRONG command, which overwrites the first input file:

/* DO NOT RUN !!!
 * (on your real inputs)
 */
Channel.fromPath('very_important_raw_input{1,2,3}.txt')
.collect()  // trying to make a value channel: WRONG
.collectFile(name: 'file.list', newLine: true)
.view { "written to: $it" }

Which prints… written to: very_important_raw_input1.txt!

And now, the content of very_important_raw_input1.txt is the concatenation of very_important_raw_input{2,3}.txt.

What the collect step did was to pass the file names as arguments to the next operator, effectively replacing name: 'file.list' by name: 'very_important_raw_input1.txt'.
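
This matches what happens, if I understand the documentation correctly, when collectFile receives tuples: the first element is interpreted as the target file name and the second as the content to append. A toy illustration with plain strings instead of real files:

    Channel.of( ['out.txt', 'first line'], ['out.txt', 'second line'] )
        | collectFile(newLine: true)
        | view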

I am to blame for not properly understanding Groovy/Nextflow’s syntax before playing around with real input files (that I was able to regenerate), but I am sharing it here just in case!
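
For the record, the safe version is the one from my third approach above: map each path to its basename first, so that collectFile receives plain strings instead of paths:

    Channel.fromPath('very_important_raw_input{1,2,3}.txt')
        | map { it.name }
        | collectFile(name: 'file.list', newLine: true)
        | view { "written to: $it" }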

I would clearly recommend your first approach. I don't see the point of reusing the file list among processes: you want to make sure that the content of the list actually matches the files in the input channel, so it's much better to create a fresh list in every process that needs it. It's just a single line of code anyway.
