Copies of the same file leading to different, shorter output files

I’m writing a bioinformatics pipeline in which one of the first steps runs the basecalled fastq files through nanofilt/chopper. Since I didn’t have any extra files in my virtual environment, I made copies of the same file (with different names) to simulate having multiple files or directories for testing purposes. I then noticed that the outputs for copies of the same file end up with different lengths and contents, with differences of a few kb.

My worry is that I am losing data through pipeline, buffer, or input/output errors, which could affect the final results.

Main workflow:

params.fastqfiles = "$HOME/test/*.fastq.gz"

process qscoreFilter {
    conda "$HOME/miniconda3/envs/software"
    publishDir "filtered", mode: "copy"

    input:
    path filename

    output:
    path '*_filtered.fastq.gz'

    script:
    def noextension = "${filename}".replace(".fastq.gz", "")
    """
    gunzip -c ${filename} | chopper -q 10 | gzip > ${noextension}_filtered.fastq.gz
    """
}

workflow {
    allfiles = files(params.fastqfiles)
    input_ch = channel.from(allfiles)
    output_qscore = qscoreFilter(input_ch)
}

When I run the same commands directly from the command line, I do not get this problem: the output files are identical, no matter when or how often I run them. The outputs are also identical when I remove the “chopper -q 10” part of the code, but that does not explain why it behaves this way within Nextflow and not outside it, using the same conda environment.

Edit: I noticed that even if I split the pipe into individual processes, or into separate steps within the same process, the files are identical up to the gzip step. gzip then compresses the same file(s) with up to 100 kb difference between tasks and runs, but ONLY inside Nextflow.

How can I make sure that identical inputs produce identical outputs? Is a buffer or config setting wrong? I just used the defaults.

Hi @SloEye,

Apologies - your post was flagged automatically as spam for some reason, I’m not sure why.

There’s no easy answer to your question. Nextflow doesn’t do anything special when publishing results, and on the face of it there’s no reason the files should not be identical. I would debug this further by going into the work directories and checking the .command.run and .command.sh files. Try running these manually and looking for any differences. Nextflow only orchestrates the commands, so at the end of the day, if there are differences in outputs then there should typically be differences either in the commands or in the environments that the commands run in.
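
One way to narrow it down is to check whether the decompressed reads differ or only the compressed bytes do. Here is a minimal sketch, assuming bash with `gzip`, `cmp` and `sha256sum` on the PATH. The helper names `payload_hash` and `compare_gz`, and the file names in the usage comment, are made up for illustration:

```shell
#!/usr/bin/env bash

# Hash the decompressed content of a gzip file, so that differences in the
# gzip container itself (header fields, compression settings) are ignored.
payload_hash() {
    gzip -dc -- "$1" | sha256sum | cut -d ' ' -f 1
}

# Report whether two gzip files differ in content or only in how they
# were compressed.
compare_gz() {
    if cmp -s -- "$1" "$2"; then
        echo "identical bytes"
    elif [ "$(payload_hash "$1")" = "$(payload_hash "$2")" ]; then
        echo "same content, different compression"
    else
        echo "different content"
    fi
}

# Hypothetical usage on two published outputs:
#   compare_gz filtered/copy1_filtered.fastq.gz filtered/copy2_filtered.fastq.gz
```

If this reports the same content, no reads are being lost and the difference lies in how the file was compressed. If the content differs, the next step would be to diff the decompressed files from the two work directories.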

I hope this helps. Good luck!

Phil