I’m writing a bioinformatics pipeline in which one of the first steps runs the basecalled FASTQ files through NanoFilt/Chopper. Since I didn’t have any extra files in my virtual environment, I made copies of the same file (under different names) to simulate having multiple files or directories for testing purposes. There I noticed that the outputs produced from copies of the same file end up with different lengths and contents, differing by a few kB.
My worry is that I am losing data to pipeline, buffer, or input/output errors, which could affect the final results.
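For reference, a minimal sketch of how such test copies can be sanity-checked up front (file names and contents are hypothetical stand-ins for the real FASTQ inputs), to rule out corruption on the input side:

```shell
# Build a tiny gzipped FASTQ and two copies of it, then confirm all three
# are byte-identical before they enter the pipeline.
printf '@read1\nACGT\n+\nIIII\n' | gzip > original.fastq.gz
cp original.fastq.gz copy1.fastq.gz
cp original.fastq.gz copy2.fastq.gz
md5sum original.fastq.gz copy1.fastq.gz copy2.fastq.gz   # all three hashes match
```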
Main workflow:
params.fastqfiles = "$HOME/test/*.fastq.gz"

process qscoreFilter {
    conda "$HOME/miniconda3/envs/software"
    publishDir "filtered", mode: "copy"

    input:
    path filename

    output:
    path '*_filtered.fastq.gz'

    script:
    def noextension = "$filename".replace(".fastq.gz", "")
    """
    gunzip -c $filename | chopper -q 10 | gzip > ${noextension}_filtered.fastq.gz
    """
}

workflow {
    allfiles = files(params.fastqfiles)
    input_ch = channel.from(allfiles)
    output_qscore = qscoreFilter(input_ch)
}
When I run the same commands directly from the command line, I do not get this problem: the output files are identical, no matter when or how often I run them. The same holds when I remove the "chopper -q 10" part of the pipe, but that does not explain why the behaviour differs inside Nextflow and not outside it, using the same conda environment.
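One way to narrow this down (a diagnostic sketch; file names are hypothetical) is to compare the decompressed payloads rather than the .gz archives themselves: the gzip header carries a modification timestamp, so two archives can differ byte-for-byte while holding identical reads.

```shell
# Simulate two "runs" that compress the same payload, with the source
# file's mtime changed in between.
printf '@read1\nACGT\n+\nIIII\n' > sample.fastq
gzip -c sample.fastq > run1.fastq.gz
sleep 2
touch sample.fastq                         # new mtime -> different gzip header
gzip -c sample.fastq > run2.fastq.gz

# The archives differ at the byte level...
cmp -s run1.fastq.gz run2.fastq.gz || echo "archives differ"
# ...but the actual reads are still identical:
cmp <(zcat run1.fastq.gz) <(zcat run2.fastq.gz) && echo "payloads identical"
```

Note that a timestamp difference only shifts a few header bytes; if the decompressed payloads themselves differ, the problem is upstream of gzip.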
Edit: I noticed that even if I split these pipes into separate processes, or into separate steps within the same process, the intermediate files are identical up to the gzip step; gzip then compresses the same file(s) with up to 100 kB of difference between tasks and runs, ONLY inside Nextflow.
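If the divergence really is confined to the gzip container rather than the reads, one possible mitigation (an assumption to verify, not a confirmed fix for a 100 kB gap) is gzip's -n flag, which omits the original file name and timestamp from the header, so identical input always compresses to an identical archive:

```shell
# With -n, the gzip header carries no file name or timestamp, so repeated
# compressions of the same data are byte-identical even after the source
# file's mtime changes.
printf '@read1\nACGT\n+\nIIII\n' > sample.fastq
gzip -nc sample.fastq > run1.fastq.gz
sleep 2
touch sample.fastq
gzip -nc sample.fastq > run2.fastq.gz
cmp run1.fastq.gz run2.fastq.gz && echo "byte-identical"
```

In the process script this would mean piping through `gzip -n` instead of `gzip`. A suppressed timestamp is only a four-byte change, though, so it cannot by itself account for differences of tens of kilobytes in size.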
How can I make identical inputs produce identical outputs? Is some buffer or configuration set incorrectly? I just used the defaults.