I’m writing a bioinformatics pipeline in which one of the first steps runs the basecalled FASTQ files through NanoFilt/Chopper. Since I didn’t have any extra files in my virtual environment, I made copies of the same file (under different names) to simulate having multiple files or directories for testing purposes. There I noticed that the outputs produced from copies of the same file end up with different lengths and contents, differing by a few kB.
My worry is that I am losing data to pipeline, buffer, or input/output errors, which could affect the final results.
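For reference, a minimal sketch of how such test copies can be sanity-checked up front (file names and contents are hypothetical stand-ins for the real FASTQ inputs), to rule out corruption on the input side:

```shell
# Build a tiny gzipped FASTQ and two copies of it, then confirm all three
# are byte-identical before they enter the pipeline.
printf '@read1\nACGT\n+\nIIII\n' | gzip > original.fastq.gz
cp original.fastq.gz copy1.fastq.gz
cp original.fastq.gz copy2.fastq.gz
md5sum original.fastq.gz copy1.fastq.gz copy2.fastq.gz   # all three hashes match
```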
Main workflow:
params.fastqfiles = "$HOME/test/*.fastq.gz"

process qscoreFilter {
    conda "$HOME/miniconda3/envs/software"
    publishDir "filtered", mode: "copy"

    input:
    path filename

    output:
    path '*_filtered.fastq.gz'

    script:
    def noextension = "$filename".replace(".fastq.gz", "")
    """
    gunzip -c $filename | chopper -q 10 | gzip > ${noextension}_filtered.fastq.gz
    """
}

workflow {
    allfiles = files(params.fastqfiles)
    input_ch = channel.from(allfiles)
    output_qscore = qscoreFilter(input_ch)
}
When I run the same commands directly from the command line, I do not get this problem: the output files are identical, no matter when or how often I run them. The same holds when I remove the "chopper -q 10" part of the pipe, but that does not explain why the behaviour differs inside Nextflow and not outside it, using the same conda environment.
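One way to narrow this down (a diagnostic sketch; file names are hypothetical) is to compare the decompressed payloads rather than the .gz archives themselves: the gzip header carries a modification timestamp, so two archives can differ byte-for-byte while holding identical reads.

```shell
# Simulate two "runs" that compress the same payload, with the source
# file's mtime changed in between.
printf '@read1\nACGT\n+\nIIII\n' > sample.fastq
gzip -c sample.fastq > run1.fastq.gz
sleep 2
touch sample.fastq                         # new mtime -> different gzip header
gzip -c sample.fastq > run2.fastq.gz

# The archives differ at the byte level...
cmp -s run1.fastq.gz run2.fastq.gz || echo "archives differ"
# ...but the actual reads are still identical:
cmp <(zcat run1.fastq.gz) <(zcat run2.fastq.gz) && echo "payloads identical"
```

Note that a timestamp difference only shifts a few header bytes; if the decompressed payloads themselves differ, the problem is upstream of gzip.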
Edit: I noticed that even if I split these pipes into separate processes, or into separate steps within the same process, the intermediate files are identical up to the gzip step; gzip then compresses the same file(s) with up to 100 kB of difference between tasks and runs, ONLY inside Nextflow.
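If the divergence really is confined to the gzip container rather than the reads, one possible mitigation (an assumption to verify, not a confirmed fix for a 100 kB gap) is gzip's -n flag, which omits the original file name and timestamp from the header, so identical input always compresses to an identical archive:

```shell
# With -n, the gzip header carries no file name or timestamp, so repeated
# compressions of the same data are byte-identical even after the source
# file's mtime changes.
printf '@read1\nACGT\n+\nIIII\n' > sample.fastq
gzip -nc sample.fastq > run1.fastq.gz
sleep 2
touch sample.fastq
gzip -nc sample.fastq > run2.fastq.gz
cmp run1.fastq.gz run2.fastq.gz && echo "byte-identical"
```

In the process script this would mean piping through `gzip -n` instead of `gzip`. A suppressed timestamp is only a four-byte change, though, so it cannot by itself account for differences of tens of kilobytes in size.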
How can I make identical inputs produce identical outputs? Is some buffer or configuration set incorrectly? I just used the defaults.