Hi!
This is more of a question about convention and style.
An example workflow would be to apply a process to different chromosomes in parallel, say a process QC_filter on a VCF input, and then concatenate all the outputs.
I am wondering how to pass along the chromosome info (or any arbitrary metadata). It could be one of the following two ways:
- Through an input/output variable:

      process QC_filter {
          input:
          tuple val(chrom), path(vcf)

          output:
          tuple val(chrom), path('qc_filtered.vcf')

          script:
          "command ... > qc_filtered.vcf"
      }

- Through the file name:

      process QC_filter {
          input:
          path(vcf)

          output:
          path('*.qc_filtered.vcf')

          script:
          "command ... > ${vcf.simpleName}.qc_filtered.vcf"
      }
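To make the first option concrete, here is a minimal sketch of how I picture the surrounding workflow. The chromosome list, the `data/<chrom>.vcf` layout and the `cat` stand-in for the real filter command are all assumptions made up for illustration, not part of my actual pipeline.

```nextflow
// Hypothetical chromosome labels; the real list would come from the data set
params.chroms = ['chr1', 'chr2', 'chrX']

process QC_filter {
    input:
    tuple val(chrom), path(vcf)

    output:
    tuple val(chrom), path('qc_filtered.vcf')

    script:
    """
    # stand-in for the real filter command
    cat ${vcf} > qc_filtered.vcf
    """
}

workflow {
    // Pair each label with its file (assumed layout: data/<chrom>.vcf)
    vcf_ch = Channel.fromList(params.chroms)
        .map { chrom -> tuple(chrom, file("data/${chrom}.vcf")) }

    QC_filter(vcf_ch)

    // Each emitted item is a [chrom, path] pair, so the label travels with the file
    QC_filter.out.view()
}
```

Note that every output is called `qc_filtered.vcf` here, which is exactly what leads to the collision concern described further down.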
The Nextflow documentation states:
With Nextflow, in most cases, you don’t need to manage the naming of output files, because each task is executed in its own unique directory, so files produced by different tasks can’t overwrite each other. Also, metadata can be associated with outputs by using the tuple output qualifier, instead of including them in the output file name.
However, in my example I see several drawbacks:
- We pass “useless” variables to processes, and this can cause difficulties such as in this question: “Process input that is not cached and does not affect task hash”.
- Without distinct file names, we might encounter input file name collisions when collecting files into a concatenation process. As discussed here, there is no channel operator that would allow renaming the files, so it must be done from within the source process.
- With tuple input, we cannot use the each modifier.
So even though manipulating file names for retrieving metadata is cumbersome, it might be better here.
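For the file-name option, this is roughly how I imagine recovering the metadata downstream. It is only a sketch: the two file names are hypothetical examples of what the second QC_filter variant would produce, and `simpleName` is the Nextflow path attribute that returns the file name without any extension.

```nextflow
workflow {
    // Hypothetical outputs of the file-name variant of QC_filter
    Channel.of('chr1.qc_filtered.vcf', 'chrX.qc_filtered.vcf')
        .map { name -> file(name) }
        // Rebuild the label from the name: chr1.qc_filtered.vcf gives chr1
        .map { vcf -> tuple(vcf.simpleName, vcf) }
        .view()
}
```

This works only as long as the label itself contains no dot, which is part of why I find the approach cumbersome.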
Sorry in advance if this is a minor detail, but I feel that it slightly disrupts the fluidity of Nextflow syntax…
So I am curious, what are the recipes for this?