I run MultiQC on Nextclade output and my samples have a lot of _s and |s, which seems to mess with the sample name cleaning in the general statistics and the Nextclade run table. The pattern I get is everything after the last _.
I found fn_clean_sample_names: false, but that didn’t solve the problem; it only adds the last _ back.
Sorry you’re having problems. fn_clean_sample_names: false does the opposite of what you want: it turns off sample name cleaning.
Could you please provide some more exact examples of the filenames you’re processing, the sample names you see in reports, and the sample names you want to see in reports?
Let me try to explain the situation - after revisiting the problem today, I realised that I might have used the wrong terms, and I’m not sure what to configure where.
I have multiple Nextclade output tables. Each table contains fasta headers in the seqName column. Each fasta header is unique across all Nextclade tables (this, of course, depends on the input).
The fasta headers look like this (in different Nextclade tables):
ID1_lor-em_ipsum|foo|bar|A_/_H1N1|NS
ID2_lor-em_ipsum|foo|bar|A_/_H1N1|NS
ID3_lor-em_ipsum|foo|bar|A_/_H1N1|NS
ID4_lor-em_ipsum|foo|bar|A_/_H1N1|NS
MultiQC merges the Nextclade tables and appears to modify the sample names (seqName values across all tables) so that only the last part of the header is retrieved, leading to duplicate samples that get overwritten.
With fn_clean_sample_names: false, I get sample names _H1N1|NA and _H1N1|HA.
Without that, I get H1N1|NA and H1N1|HA. Thus, tables in the MultiQC report have two rows.
I would like to keep the original fasta header/seqName values. As they are unique, duplicated sample names shouldn’t be a problem. The tables in the MultiQC report should have 8 rows for the example above.
Thanks @MarieLataretu, that helps. It’s not the _ and |s that are the problem, it’s the /s - these characters are path separators, and as part of the sample name cleaning, MultiQC calls os.path.basename() on the name to get rid of parent directories. This trims it down to just the part after the last /, which is why the samples are overwriting one another.
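To make the effect concrete, here is a minimal sketch of what a basename call does to headers like the ones above. It is not MultiQC's actual code: the helper name basename_only is made up for illustration, and posixpath is used instead of os.path only so the result does not depend on the operating system.

```python
import posixpath

# Two of the example fasta headers from the Nextclade seqName column.
headers = [
    "ID1_lor-em_ipsum|foo|bar|A_/_H1N1|NS",
    "ID2_lor-em_ipsum|foo|bar|A_/_H1N1|NS",
]


def basename_only(name: str) -> str:
    """Hypothetical helper: keep only the text after the last '/'."""
    return posixpath.basename(name)


trimmed = [basename_only(h) for h in headers]

print(trimmed)
print(len(set(headers)), "unique headers ->", len(set(trimmed)), "unique names")
```

Everything up to and including the / is treated as a parent directory and dropped, so headers that differ only in their ID collapse to the same name.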
I expected there to be some way to get around this with configuration, but I can’t find anything, sorry - that particular cleanup happens before all the rest, so it’s not possible to skip or modify. The easiest fix is to rename your input data so that the sample identifiers don’t contain these characters. The only alternative is for me to add a new config option to selectively disable this basename call, but it’s a bit ugly and this is the first time it’s come up in over 10 years.
How easy is it for you to omit that character in the sample identifiers?
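If renaming is an option, one possible approach is to rewrite the seqName column before running MultiQC. The sketch below is only an illustration under assumptions: the function names, the ; delimiter, the - replacement character and the clade value in the sample row are all made up, so adjust them to match your actual Nextclade tables.

```python
import csv
import io


def sanitize_header(name: str, replacement: str = "-") -> str:
    """Replace the path separator '/' so a basename call has nothing to trim."""
    return name.replace("/", replacement)


def sanitize_table(text: str, column: str = "seqName", delimiter: str = ";") -> str:
    """Return a copy of a delimited table with one column sanitized.

    The column name comes from the thread; the delimiter is an assumption.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[column] = sanitize_header(row[column])
        writer.writerow(row)
    return out.getvalue()


# Tiny made-up table: only the seqName value mirrors the thread.
example = "seqName;clade\nID1_lor-em_ipsum|foo|bar|A_/_H1N1|NS;hypothetical\n"
print(sanitize_table(example))
```

With the / replaced, each header has no path separator left, so the names should stay unique after the basename step.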