Sample name cleaning/truncation

MarieLataretu · November 19, 2025, 4:17pm

Hi folks,

I run MultiQC on Nextclade output and my samples have a lot of _s and |s, which seems to mess with the sample name cleaning in the general statistics and the Nextclade run table. The sample name I get is everything after the last _.
I found fn_clean_sample_names: false, but that didn’t solve the problem; it only adds the last _ back.

Is there any other config option?

Thanks
Marie

ewels · November 20, 2025, 8:00pm

Hi @MarieLataretu,

Sorry you’re having problems. fn_clean_sample_names: false does the opposite of what you want: it turns off sample name cleaning.

Could you please provide some more exact examples of the filenames you’re processing, the sample names you see in reports, and the sample names you want to see in reports?

The docs for this are here: Configuration | Seqera Docs

To remove everything after the last _ (and only the last) in every sample name you can do something like this:

extra_fn_clean_exts:
  - type: regex
    pattern: "_[^_]*$"

You can use this website to test regular expressions: https://regex101.com/
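
For a quick local check of the same pattern, here is a minimal Python sketch. It assumes that a type: regex entry simply removes whatever the pattern matches; the function name and the example sample names are made up for illustration and are not part of MultiQC.

```python
import re

# Same pattern as in the config snippet: the last "_" and everything after it.
LAST_UNDERSCORE_SUFFIX = re.compile(r"_[^_]*$")


def strip_after_last_underscore(name: str) -> str:
    """Remove the final "_" and whatever follows it (hypothetical helper)."""
    return LAST_UNDERSCORE_SUFFIX.sub("", name)


# Made-up sample names, only to show what the pattern does.
for name in ["sample_1_R1", "sample_1", "no-underscore"]:
    print(name, "->", strip_after_last_underscore(name))
```

Names without any _ are left untouched, because the pattern has nothing to match.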

Phil

MarieLataretu · November 21, 2025, 10:32am

Hi @ewels , thanks for getting back to me!

Let me try to explain the situation - after revisiting the problem today, I figured that I might have used the wrong terms, and I’m not sure what to configure where

I have multiple Nextclade output tables. Each table contains fasta headers in the seqName column. Each fasta header is unique across all Nextclade tables (this, of course, depends on the input).

The fasta headers look like this (in different Nextclade tables):
ID1_lor-em_ipsum|foo|bar|A_/_H1N1|NS
ID2_lor-em_ipsum|foo|bar|A_/_H1N1|NS
ID3_lor-em_ipsum|foo|bar|A_/_H1N1|NS
ID4_lor-em_ipsum|foo|bar|A_/_H1N1|NS

ID1_lor-em_ipsum|foo|bar|A_/_H1N1|HA
ID2_lor-em_ipsum|foo|bar|A_/_H1N1|HA
ID42_lor-em_ipsum|foo|bar|A_/_H1N1|HA
ID43_lor-em_ipsum|foo|bar|A_/_H1N1|HA

MultiQC merges the Nextclade tables and appears to modify the sample names (seqName values across all tables) so that only the last part of the header is retrieved, leading to duplicate samples that get overwritten.

With fn_clean_sample_names: false, I get sample names _H1N1|NS and _H1N1|HA.
Without that, I get H1N1|NS and H1N1|HA. Thus, tables in the MultiQC report have two rows.

I would like to keep the original fasta header/seqName values. As they are unique, duplicated sample names shouldn’t be a problem. The tables in the MultiQC report should have 8 rows for the example above.

ewels · December 10, 2025, 9:18pm

Thanks @MarieLataretu, that helps. It’s not the _s and |s that are the problem, it’s the /s - these characters are path separators, and as part of the sample name cleaning, MultiQC calls os.path.basename() on the name to get rid of parent directories. This trims each name down to just the part after the last /, which is why the samples are overwriting one another.
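
The effect of the basename call can be reproduced outside MultiQC. The sketch below only calls os.path.basename() on a few of the headers from the post above; it is not MultiQC's actual cleaning code, and the variable names are illustrative.

```python
import os.path

# Four of the FASTA headers from the example, two per Nextclade table.
headers = [
    "ID1_lor-em_ipsum|foo|bar|A_/_H1N1|NS",
    "ID2_lor-em_ipsum|foo|bar|A_/_H1N1|NS",
    "ID1_lor-em_ipsum|foo|bar|A_/_H1N1|HA",
    "ID42_lor-em_ipsum|foo|bar|A_/_H1N1|HA",
]

# basename() treats "/" as a path separator and keeps only what follows it.
basenames = [os.path.basename(h) for h in headers]
unique = sorted(set(basenames))

print(len(headers), "headers ->", len(unique), "distinct names")
print(unique)  # ['_H1N1|HA', '_H1N1|NS']
```

Four distinct headers collapse to two names, one per segment, which matches the two-row tables described above.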

I expected there to be some way to get around this with configuration, but I can’t find anything, sorry - that particular cleanup happens before all the rest, so it’s not possible to skip or modify it. The easiest fix is to rename your input data so that the sample identifiers don’t contain these characters. The only alternative is for me to add a new config option to selectively disable this basename call, but it’s a bit ugly and this is the first time it’s come up in over 10 years.
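
If renaming is an option, one way to do it is to rewrite the seqName column before running MultiQC. This is a rough sketch under a few assumptions: the table is delimiter-separated text with a seqName header (a semicolon is assumed here, adjust the delimiter for TSV output), "-" is an arbitrary replacement character, and the function name and the clade value in the example are hypothetical.

```python
import csv
import io
import os.path


def replace_slashes_in_seqname(table_text: str, delimiter: str = ";",
                               replacement: str = "-") -> str:
    """Return the table with every "/" in the seqName column replaced."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    col = header.index("seqName")  # raises ValueError if the column is missing
    for row in body:
        row[col] = row[col].replace("/", replacement)
    out = io.StringIO()
    csv.writer(out, delimiter=delimiter, lineterminator="\n").writerows([header] + body)
    return out.getvalue()


# Tiny made-up table; "x" is a placeholder value, not real Nextclade output.
table = "seqName;clade\nID1_lor-em_ipsum|foo|bar|A_/_H1N1|NS;x\n"
fixed = replace_slashes_in_seqname(table)
new_name = fixed.splitlines()[1].split(";")[0]

print(new_name)
# With no "/" left, basename() returns the name unchanged.
print(os.path.basename(new_name) == new_name)
```

The replacement character should be one that keeps the identifiers unique, and any remaining sample name cleaning in MultiQC would still apply to the renamed values.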

How easy is it for you to omit that character in the sample identifiers?
