I’ve been working on a Nextflow RNA-Seq pipeline on AWS Batch and encountered an issue with MultiQC’s fastp module that I wanted to discuss before proposing any changes.
The Problem:
When processing multiple samples in parallel where input fastq files have identical names (e.g., forward.fastq/reverse.fastq across samples), MultiQC only captures one fastp report instead of aggregating all samples.
I looked into the MultiQC source and found that, when no YAML configuration file is provided, the fastp module takes the sample name from the input filename options (‘-i’, ‘-I’, ‘--in1’, ‘--in2’). When these names are identical across samples, each subsequent report overwrites the previous one in the data dictionary.
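To illustrate the collision, here is a minimal Python sketch. It is not MultiQC's actual code: the regex, the helper name `sample_name_from_input`, and the example commands are all assumptions made up for illustration. It only shows how keying a dictionary on a name derived from the input file loses reports when inputs share a name.

```python
import re

# Hypothetical pattern: capture the value given to -i or --in1.
INPUT_OPT = re.compile(r"(?:^|\s)(?:-i|--in1)[=\s]+(\S+)")


def sample_name_from_input(command: str) -> str:
    """Derive a sample name from the first input file in a fastp command."""
    match = INPUT_OPT.search(command)
    if match is None:
        raise ValueError("no input option found in command")
    filename = match.group(1).rsplit("/", 1)[-1]  # drop any directory part
    return filename.split(".", 1)[0]              # drop the extension(s)


# Two samples whose input files have the same generic names (assumed commands).
commands = [
    "fastp -i forward.fastq -I reverse.fastq -o sampleA_R1.fastq -O sampleA_R2.fastq",
    "fastp -i forward.fastq -I reverse.fastq -o sampleB_R1.fastq -O sampleB_R2.fastq",
]

reports = {}
for command in commands:
    # Same key both times, so the second report replaces the first.
    reports[sample_name_from_input(command)] = command

print(sorted(reports))  # ['forward']
```

Both commands map to the key `forward`, so only the last report survives in `reports`.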
Proposed Solution:
I’m considering modifying the detection logic to use the output filename options (‘-o’, ‘-O’, ‘--out1’, ‘--out2’) instead, since output files make more sense for sample naming than input files. Inputs usually arrive from sequencing facilities with generic names that don’t tell you much. Outputs, on the other hand, are what users actually name to reflect the real sample and the analysis they care about, so they’re unique to each run and avoid collisions. Since outputs are what move forward into downstream analysis and reports, it’s logical to base sample names on them rather than on the raw input files.
I’ve locally tested a fix that changes the regex pattern to look for output options, and it resolved the issue in my pipeline.
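The idea can be sketched roughly as follows. This is an illustrative Python example, not the actual patch or MultiQC's code: the regex, the helper name `sample_name_from_output`, and the example commands are assumptions.

```python
import re

# Hypothetical pattern: capture the value given to -o or --out1.
OUTPUT_OPT = re.compile(r"(?:^|\s)(?:-o|--out1)[=\s]+(\S+)")


def sample_name_from_output(command: str) -> str:
    """Derive a sample name from the first output file in a fastp command."""
    match = OUTPUT_OPT.search(command)
    if match is None:
        raise ValueError("no output option found in command")
    filename = match.group(1).rsplit("/", 1)[-1]  # drop any directory part
    return filename.split(".", 1)[0]              # drop the extension(s)


# Same generic inputs, but per-sample output names (assumed commands).
commands = [
    "fastp -i forward.fastq -I reverse.fastq -o sampleA_R1.fastq -O sampleA_R2.fastq",
    "fastp -i forward.fastq -I reverse.fastq -o sampleB_R1.fastq -O sampleB_R2.fastq",
]

reports = {sample_name_from_output(command): command for command in commands}

print(sorted(reports))  # ['sampleA_R1', 'sampleB_R1']
```

With output-based keys, each sample keeps its own entry as long as the output names are unique per sample.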
Sorry to hear that you’re having problems and thanks for your post! Having clashing sample names is a very common problem for MultiQC users and not an easy one to fix. Using output file names is a solid idea, but I’m afraid that it’s not something that I would like to implement for now. Whilst it solves your issue, it’s not a universal solution: in other cases, folks may use generic names for intermediate files during processing, and this change would break their reports, for example. In short, MultiQC has so many users that there is very rarely a common usage pattern that you can rely on. MultiQC uses input files because we aim to share sample names across tools, and the input file names are usually the most “pure”, which gives the best chance of this.
Thank you so much for the detailed explanation and for pointing me to the right resources.
On a side note, I also wanted to mention that your Nextflow tutorial videos on YouTube were incredibly helpful for me when I was getting started. They really made the concepts click.
Thanks again for all your great work and contributions to the community!