Hi!
I think this is a very common situation in bioinformatics: many tools require the list of input files to be read from a file.
In my case, this is the only way to pass multiple BAM files to the `graphtyper genotype` command (with `--sams`).
The individual BAM files typically have to be delivered through a channel, and ideally the list file should be too. So what is the recommended way to do this in Nextflow? I see several options, but none seems clearly best:
- Create the list file on the fly, in the same script as the command of interest:

      process genotype {
          input:
          path '*'

          script:
          """
          ls *.bam > bamlist.txt
          graphtyper genotype ... --sams=bamlist.txt
          """
      }

  - Pros:
    - simple
    - the list file always contains exactly the given inputs
  - Cons:
    - inconvenient for reusing the list file in other processes
- Create the list file in a separate process:

      process listfiles {
          input:
          path '*'

          output:
          path 'bamlist.txt'

          script:
          "ls *.bam > bamlist.txt"
      }

      process genotype {
          input:
          path bamlistfile
          path allbams  // the BAM files must be inputs as well, so that they are staged

          script:
          "graphtyper genotype ... --sams=${bamlistfile}"
      }

      workflow {
          allbams = Channel.fromPath('*.bam').collect()
          bamlistfile = allbams | listfiles
          genotype(bamlistfile, allbams)
      }

  - Pros:
    - the list file is easily reusable
  - Cons:
    - feels like a lot of code just to save a file listing
    - the double dependency of the `genotype` process seems redundant
- Use only Nextflow's built-in channel operators. My code may be clumsy since I am still learning, but I tested the following and it works:

      workflow {
          bams = Channel.fromPath('*.bam')
          bamlistfile = bams | map { "${it.name}" }  // list of basenames
              | collectFile(name: 'bam.list', newLine: true)
          genotype(bamlistfile, bams.collect())
      }

  - Pros:
    - only built-in operators, no `process` to define
  - Cons:
    - I don't know if it's possible to publish the list file
    - the double dependency of the `genotype` process seems redundant
    - potentially dangerous (see below)
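On the question of publishing the list file with the operator-only approach: as far as I can tell, `collectFile` accepts a `storeDir` option that writes the resulting file into a folder of your choice instead of a temporary location. Here is a minimal sketch, assuming a hypothetical output folder called `results` (that name is mine, not something required by Nextflow):

```groovy
workflow {
    Channel.fromPath('*.bam')
        // pass plain strings (file names), so that names are written, not file contents
        .map { bam -> bam.name }
        // 'results' is a hypothetical folder name; storeDir keeps bam.list there
        .collectFile(name: 'bam.list', newLine: true, storeDir: 'results')
        .view { listfile -> "list file stored at: ${listfile}" }
}
```

The channel returned by `collectFile` still emits the path of the stored file, so it should be usable as the list-file input of `genotype` in the same way as above.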
What are your recommendations?
Bonus (DO NOT DO THIS)
I unfortunately wrote a command using `collectFile` that overwrote an input file.
My problem on the first attempt was that the construct `Channel.fromPath('*.bam') | collectFile(name: 'supposedly_listfile')` actually concatenates the file contents. However, if the channel items are plain values (such as strings) rather than files, it writes each value into the file.
I took this to mean that the channel had to be converted to a value channel, so I tried the following WRONG command, which overwrites the first input file:
/* DO NOT RUN !!!
* (on your real inputs)
*/
Channel.fromPath('very_important_raw_input{1,2,3}.txt')
.collect() // trying to make a value channel: WRONG
.collectFile(name: 'file.list', newLine: true)
.view( "written to: $it" )
This prints… `written to: very_important_raw_input1.txt`!
And now the content of `very_important_raw_input1.txt` is the concatenation of `very_important_raw_input{2,3}.txt`.
What the `collect` step did was pass the file names as arguments to the next operator, effectively replacing `name: 'file.list'` with `name: 'very_important_raw_input1.txt'`.
I am to blame for not properly understanding Groovy/Nextflow’s syntax before playing around with real input files (that I was able to regenerate), but I am sharing it here just in case!
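For anyone who wants to experiment with `collectFile` more safely, here is a sketch of what I would try instead: leave out `collect()`, convert each path to a string with `map`, and print every item before it reaches `collectFile` to check what is actually being passed. The `scratch_copy{1,2,3}.txt` names are hypothetical throwaway copies, not real inputs:

```groovy
workflow {
    Channel.fromPath('scratch_copy{1,2,3}.txt')   // hypothetical throwaway files
        // convert each path to a plain string before collectFile sees it
        .map { f -> f.name }
        // inspect each item and its type before anything is written
        .view { item -> "item: ${item} (${item.getClass().simpleName})" }
        .collectFile(name: 'file.list', newLine: true)
        .view { out -> "written to: ${out}" }
}
```

The idea is only that `collectFile` then receives one string per file rather than a single list of paths; I would still run it in a scratch directory first.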