Sample name cleaning/truncation

Hi folks,

I run MultiQC on Nextclade output, and my sample names contain a lot of _ and | characters, which seems to interfere with the sample name cleaning in the General Statistics table and the Nextclade run table. The sample name I end up with is everything after the last _.
I found fn_clean_sample_names: false, but that didn’t solve the problem; it only adds the last _ back to the name.

Is there any other config option?

Thanks
Marie

Hi @MarieLataretu,

Sorry you’re having problems. fn_clean_sample_names: false does the opposite of what you want: it turns off sample name cleaning.

Could you please provide some exact examples of the filenames you’re processing, the sample names you see in reports, and the sample names you want to see in reports?

The docs for this are here: Configuration | Seqera Docs

To remove everything after the last _ (and only the last) in every sample name, you can do something like this:

extra_fn_clean_exts:
  - type: regex
    pattern: "_[^_]*$"

You can use this website to test regular expressions: https://regex101.com/
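
If you prefer to check the pattern locally, here is a minimal Python sketch that applies the same regular expression to one of the headers from this thread. It only mimics the regex itself, not the rest of MultiQC's sample name cleaning, and the function name strip_last_underscore_part is made up for illustration.

```python
import re

# Same pattern as in the config above: the last "_" and everything after it.
PATTERN = re.compile(r"_[^_]*$")


def strip_last_underscore_part(name: str) -> str:
    """Remove the final underscore and whatever follows it (hypothetical helper)."""
    return PATTERN.sub("", name)


if __name__ == "__main__":
    examples = [
        "ID1_lor-em_ipsum|foo|bar|A_/_H1N1|NS",
        "sample_R1",
        "nounderscore",
    ]
    for example in examples:
        # Print the name before and after applying the pattern.
        print(f"{example!r} -> {strip_last_underscore_part(example)!r}")
```

Note that for headers like the ones in this thread the last _ sits in the middle of the name, so the pattern would cut away the trailing part (for example |NS) rather than preserve it.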

Phil

Hi @ewels , thanks for getting back to me!

Let me try to explain the situation - after revisiting the problem today, I figured that I might have used the wrong terms, and I’m not sure what to configure where 🙈

I have multiple Nextclade output tables. Each table contains fasta headers in the seqName column. Each fasta header is unique across all Nextclade tables (this, of course, depends on the input).

The fasta headers look like this (in different Nextclade tables):
ID1_lor-em_ipsum|foo|bar|A_/_H1N1|NS
ID2_lor-em_ipsum|foo|bar|A_/_H1N1|NS
ID3_lor-em_ipsum|foo|bar|A_/_H1N1|NS
ID4_lor-em_ipsum|foo|bar|A_/_H1N1|NS

ID1_lor-em_ipsum|foo|bar|A_/_H1N1|HA
ID2_lor-em_ipsum|foo|bar|A_/_H1N1|HA
ID42_lor-em_ipsum|foo|bar|A_/_H1N1|HA
ID43_lor-em_ipsum|foo|bar|A_/_H1N1|HA

MultiQC merges the Nextclade tables and appears to modify the sample names (seqName values across all tables) so that only the last part of the header is retrieved, leading to duplicate samples that get overwritten.

With fn_clean_sample_names: false, I get sample names _H1N1|NA and _H1N1|HA.
Without that, I get H1N1|NA and H1N1|HA. As a result, the tables in the MultiQC report have only two rows.

I would like to keep the original fasta header/seqName values. As they are unique, duplicated sample names shouldn’t be a problem. The tables in the MultiQC report should have 8 rows for the example above.
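
One possible explanation, which is only a guess and not something confirmed in this thread: the / inside the header might be treated like a path separator, so that only the part after it survives, and the leading _ might then be trimmed when cleaning is switched on. The Python sketch below reproduces that pattern on the example headers. The two helper functions are hypothetical stand-ins and are not MultiQC's actual code.

```python
import posixpath

# Headers copied from the example above (two Nextclade tables, four rows each).
HEADERS = [
    "ID1_lor-em_ipsum|foo|bar|A_/_H1N1|NS",
    "ID2_lor-em_ipsum|foo|bar|A_/_H1N1|NS",
    "ID3_lor-em_ipsum|foo|bar|A_/_H1N1|NS",
    "ID4_lor-em_ipsum|foo|bar|A_/_H1N1|NS",
    "ID1_lor-em_ipsum|foo|bar|A_/_H1N1|HA",
    "ID2_lor-em_ipsum|foo|bar|A_/_H1N1|HA",
    "ID42_lor-em_ipsum|foo|bar|A_/_H1N1|HA",
    "ID43_lor-em_ipsum|foo|bar|A_/_H1N1|HA",
]


def path_like_basename(name: str) -> str:
    """Assumption: the name is handled like a file path, keeping only the last component."""
    return posixpath.basename(name)


def trim_leading_underscore(name: str) -> str:
    """Assumption: cleaning trims a leading underscore from the name."""
    return name.lstrip("_")


# Deduplicate the way a table keyed by sample name would.
without_cleaning = sorted({path_like_basename(h) for h in HEADERS})
with_cleaning = sorted({trim_leading_underscore(n) for n in without_cleaning})

print(f"{len(HEADERS)} headers collapse to {len(without_cleaning)} names")
print("cleaning off:", without_cleaning)
print("cleaning on: ", with_cleaning)
```

If that guess is right, the eight unique headers collapse to two names before any regex-based cleaning is applied, which would match the two-row tables I see.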