Parsing both lane-split and full-run sample names for fastp

Hello
We run our NovaSeq in both XP mode (split by lane) and standard mode (same samples over all lanes). As a result, BCL Convert gives us filenames in two basic formats:

SampleName_S123_R1_001.fastq.gz
SampleName_S123_L002_R1_001.fastq.gz

For some reason, I cannot get the sample name cleaning to work for fastp with the latter format. The non-lane samples work as expected, but in every case for files with the lane info, fastp puts the metrics on a separate line in the table with the “SampleName_S123_L002_R1_001” sample name. All other tools in the table use “SampleName”, so this is only an issue with fastp. I am using a custom config file provided with the -c flag when running MultiQC.

fastp is set to write its output to a file called SampleName.fastp.json, which it does for both input filename formats.

I have tried including

use_filename_as_sample_name:
  - fastp

in the config file. I expected this to work, as the SampleName.fastp.json filename is the same regardless of the input filename.

I have also tried adding a custom extra_fn_clean_exts entry with a regular expression (which correctly identifies the text to clean on both filename types when checked with regex101.com):

extra_fn_clean_exts:
  - type: regex
    pattern: "_S[0-9]+[_L[0-9]+]?_R[1-2]_001"
    module: fastp

What stupid thing am I missing here?

Thanks in advance

Hi @Tony_Brooks,

Please can you attach an example file which we can replicate this with?

Thanks!

Phil

Test.zip (643.5 KB)

I have attached two .fastp.json files generated from files with both types of filename (same FASTQ data, just renamed). I have also attached metrics from Picard CollectRnaSeqMetrics and my config.yaml file. In the resulting report, Sample B lines up in the table, but Sample A is split.

Hi @Tony_Brooks,

Thanks for this!

Couple of quick things to note:

  • The report you’ve generated was created with MultiQC v1.12, which is pretty old now - released over 2 years ago.
    • The latest version is v1.21. I’d recommend always updating to the latest version if you ever hit problems, as we are constantly fixing bugs and things are often already resolved.
  • You seem to have created a multiqc_config.yaml file based on the defaults from MultiQC with all attributes specified.
    • I’d recommend against doing this. It effectively stops us from being able to ship config changes for you in MultiQC and can lead to unexpected behaviour. Better to only specify the attributes that you want to change (you can always leave the others there if you want, just comment them out with a #).
    • The config has a defined order of parsing, but by putting everything in this file you’re making all the defaults have top priority, which could cause problems.

It’s difficult to bugtest with all the config stuff there, so I tried on your logs without any config at all. As expected, I get the following:

SampleA
SampleA_S2__L001_R1_001
SampleB
SampleB_S2_R1_001

I made a minimal config with just your snippet above:

extra_fn_clean_exts:
  - type: regex
    pattern: "_S[0-9]+[_L[0-9]+]?_R[1-2]_001"
    module: fastp

That gives me the following:

SampleA
SampleB
SampleB_S2_R1_001

So Sample A is correctly collapsed but not Sample B. That’s expected, because it doesn’t have a lane number in it.

I then tried again with the simplest config that I can think of:

extra_fn_clean_exts:
  - _S

And that correctly collapses the names:

SampleA
SampleB
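A plain string entry like `_S` is treated as a truncation: the name is cut at the first occurrence of the string and everything to the right is dropped. The sketch below mimics that behaviour in plain Python so the side effect is easy to see; `truncate_at` is a hypothetical helper, not a MultiQC function, and the `Tumor_Sample1_...` name is an invented example.

```python
def truncate_at(name: str, marker: str) -> str:
    """Keep only the text before the first occurrence of `marker` (illustrative helper)."""
    return name.split(marker, 1)[0]


# Names from the test above collapse as expected.
print(truncate_at("SampleA_S2__L001_R1_001", "_S"))  # SampleA
print(truncate_at("SampleB_S2_R1_001", "_S"))        # SampleB

# Invented example: a sample name that itself contains "_S" is cut too early.
print(truncate_at("Tumor_Sample1_S5_R1_001", "_S"))  # Tumor
```

So `_S` is fine for names like the ones in this thread, but a more specific pattern is safer if any sample names contain `_S` themselves.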

All of these sample name patterns are working as I’d expect, so I think we’re ok here. I guess that there is some issue with config options fighting in your mega config, which might be causing your issue.

Let me know if you still have problems once you’ve cut your config down to just the things you want to change and I can take another look.

Phil