Collapse sample names in one table but not another

Hanh_Hoang · October 2, 2023, 2:47pm

Hello,

I am testing nf-core/sarek with the test_full profile (somatic). Everything runs fine, but I have an issue with the General Stats table.

With the default MultiQC config, the table is quite sparse and therefore hard for report readers to follow.
I tried cleaning the file names, but this leads to rows being overwritten in bcftools stats.
Would it make sense to remove those stats from the General Stats table and create a separate stats table in the bcftools section, just like VEP has?

Thank you very much, I truly appreciate your support!

ewels · October 3, 2023, 6:37am

Hi @Hanh_Hoang,

Thanks for posting and welcome to the community! You’ve started us off with a really excellent question.

To summarise (please correct me if I’m wrong), the problem here is that you want table-specific sample name cleaning: to clean and collapse sample names in General Stats, but not in other tables.

This is not currently possible within MultiQC, but it could be interesting to do. I think it’s basically the same as the ancient GitHub issue #542 from way back in 2017:

github.com/ewels/MultiQC

New base function to group samples into sets

opened 08:16AM - 08 Aug 17 UTC

ewels

core: back end

To avoid having half-empty rows from broken up samples, have the option to merge… the statistics given in General Statistics. This shouldn't affect plots or any data in sections, _only_ General Stats.

For example, have the following config (modelled on [sample name cleaning](http://multiqc.info/docs/#sample-name-cleaning)):

```yaml
general_stats_merge:
  - '_R1'
  - '_R2'
  - type: regex_keep
    pattern: '[A-Z]{3}[1-9]{2}'
```

Then group any samples matching these and merge the stats in a module-specific and statistic-specific manner. This would also allow situations such as multiplexed lanes etc.

Benefits:

* Sections in report maintain full original data
* Saved data in `multiqc_data` maintains full original data
* Summary table more concise and easy to skim, lines up with other modules
* Doesn't affect front-end or any major existing code infrastructure

Downsides:

* Requires specific code to be written for each module that will support this.
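The grouping idea in the issue above can be sketched in a few lines of plain Python. This is only an illustration, not MultiQC code: `MERGE_RULES`, `group_name` and `merge_general_stats` are hypothetical names, plain strings are assumed to truncate the sample name at the first match, and columns are merged with a simple "last value wins" rule rather than the module-specific and statistic-specific merging the issue asks for.

```python
import re
from collections import defaultdict

# Hypothetical rules, mirroring the general_stats_merge config proposed in the issue.
MERGE_RULES = ["_R1", "_R2", {"type": "regex_keep", "pattern": "[A-Z]{3}[1-9]{2}"}]


def group_name(sample, rules=MERGE_RULES):
    """Return the group a sample name falls into after applying each rule in order."""
    name = sample
    for rule in rules:
        if isinstance(rule, str):
            # Assumption: a plain string truncates the name at its first occurrence.
            name = name.split(rule, 1)[0]
        elif rule.get("type") == "regex_keep":
            # Keep only the part of the name that matches the pattern, if any.
            match = re.search(rule["pattern"], name)
            if match:
                name = match.group(0)
    return name


def merge_general_stats(rows, rules=MERGE_RULES):
    """Merge rows that share a group; later values overwrite duplicate columns."""
    merged = defaultdict(dict)
    for sample, stats in rows.items():
        merged[group_name(sample, rules)].update(stats)
    return dict(merged)


# Two half-empty rows (made-up column names) collapse into one full row.
rows = {
    "ABC12_R1": {"pct_gc": 41.0},
    "ABC12_R2": {"pct_dup": 12.5},
}
merged = merge_general_stats(rows)
```

Because only the copy used for General Stats is merged, the original `rows` stay untouched for the per-module sections, which is the main benefit the issue lists.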

I suspect that this would be the best solution, @vlad.savelyev maybe we can take a fresh look at this and see if we can move it up the roadmap a little.

Another approach would be to look at how this data is getting into the report in the first place, from the nf-core/sarek pipeline. @maxulysse / @FriederikeHanssen - have you guys had any similar requests in the past, or any ideas on this topic?

Phil

maxulysse · October 3, 2023, 9:09am

I think it’s a good idea, and I’m all for improving the MultiQC reports however we can.

ewels · October 3, 2023, 9:13am

I just made an issue for the simpler per-table sample name cleaning idea here:

github.com/ewels/MultiQC

Per-table sample name cleaning

opened 09:13AM - 03 Oct 23 UTC

ewels

core: back end

### Description of feature

Related to https://github.com/ewels/MultiQC/issues/5…42 but a simpler concept, to get us started.

New config option + back end code to allow sample name cleaning that is scoped to a specific table ID. This allows us to collapse non-overlapping table rows for a specific table (eg. General Stats) without collapsing samples in other sections of the report.

Suggested config could look like:

```yml
table_specific_fn_clean_exts:
  general_stats_table:
    - _R1
    - _R2
```

As with current sample name cleaning, if duplicates are encountered, they are overwritten. (Better handling will be tackled in #542).
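A rough sketch of what table-scoped cleaning could do, written as standalone Python rather than actual MultiQC code. The mapping mirrors the config suggested in the issue; `clean_name` and `rows_for_table` are hypothetical helpers, and truncating at the first match is an assumption about how the cleaning would behave.

```python
# Hypothetical mapping from table ID to extensions, modelled on the suggested config.
TABLE_SPECIFIC_FN_CLEAN_EXTS = {"general_stats_table": ["_R1", "_R2"]}


def clean_name(sample, exts):
    """Truncate the sample name at the first occurrence of each extension."""
    for ext in exts:
        sample = sample.split(ext, 1)[0]
    return sample


def rows_for_table(table_id, rows, table_exts=TABLE_SPECIFIC_FN_CLEAN_EXTS):
    """Return a cleaned copy of the rows for one table; other tables get no cleaning."""
    exts = table_exts.get(table_id, [])
    cleaned = {}
    for sample, stats in rows.items():
        # Non-overlapping columns are combined; duplicate columns are overwritten.
        cleaned.setdefault(clean_name(sample, exts), {}).update(stats)
    return cleaned


# Made-up sample names and column names, for illustration only.
rows = {"S1_R1": {"pct_gc": 40}, "S1_R2": {"pct_dup": 10}}
general_stats = rows_for_table("general_stats_table", rows)
other_table = rows_for_table("some_other_table", rows)
```

Here `general_stats` holds a single collapsed `S1` row, while `other_table` keeps both `S1_R1` and `S1_R2`, since no extensions are registered for that table ID.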

Hanh_Hoang · October 3, 2023, 9:47am

Hi Phil,

Thank you for your really useful suggestion. I wonder whether sample_merge_groups would lead to overwriting in the General Stats table? For a bit more context, all my screenshots were from the General Stats table.

In addition, cleaning file names does affect the other sections of the report (it reduces and overwrites the samples; I think your suggestion could help with this). Please take a look at the default and cleaned-filename reports here

I also wonder why bcftools stats is in the General Stats table, while VEP is not and has its own general stats table in the VEP section.

Also, should I do the same with bcftools as with VEP? If I do, rows with sample names like HCC1395T_vs_HCC1395N.strelka.somatic_snvs, which come from bcftools, might no longer appear in the General Stats table.

Then I would only need to clean the following (though .md and .recal also lead to overwriting):

  - "_val"
  - "_1"
  - "_2"
  - ".md"
  - ".recal"
  - "-1"
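To see which rows would be overwritten by a list like this before building a report, a small standalone check can help. The sketch below is not part of MultiQC: `clean_name` and `find_collisions` are hypothetical helpers, truncation at the first match is assumed, and apart from the bcftools name quoted above the sample names are made up.

```python
from collections import defaultdict

# The extensions listed above.
EXTS = ["_val", "_1", "_2", ".md", ".recal", "-1"]


def clean_name(sample, exts=EXTS):
    """Truncate the sample name at the first occurrence of each extension."""
    for ext in exts:
        sample = sample.split(ext, 1)[0]
    return sample


def find_collisions(samples, exts=EXTS):
    """Map each cleaned name to its original names, keeping only names that collide."""
    groups = defaultdict(list)
    for sample in samples:
        groups[clean_name(sample, exts)].append(sample)
    return {name: members for name, members in groups.items() if len(members) > 1}


# Hypothetical names, except the bcftools one taken from the post.
samples = [
    "HCC1395T.md",
    "HCC1395T.recal",
    "HCC1395N.md",
    "HCC1395T_vs_HCC1395N.strelka.somatic_snvs",
]
collisions = find_collisions(samples)
```

With these names, `HCC1395T.md` and `HCC1395T.recal` both clean to `HCC1395T`, which is the overwrite mentioned for .md and .recal; the bcftools row is left unchanged by this list.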

Please correct me if I misunderstand something! Thank you very much!

ewels · October 3, 2023, 12:20pm

Aha, I didn’t realise this - thank you for clarifying! OK, then issue #2097 will not help you (at least, not with respect to Bcftools data being overwritten in the table).

This is a good point - that would be another easy fix at MultiQC level, to move that info into a separate table. The reason they are in the General Stats table at the moment is to have these statistics alongside other “general” stats from other tools, for comparison (eg. do samples with high % duplicates have low SNP counts, or whatever). But in this case, pairing with #2097 it’d be easier to have them in a separate table.

We could think about moving them, or even having a module-specific config flag to choose whether they go in General Stats (as now, default behaviour) or a separate table instead (opt-in). @maxulysse / @FriederikeHanssen - any opinions on this behaviour for Bcftools specifically?

Hanh_Hoang · October 6, 2023, 2:24am

Sincerely thank you!

Quote: that would be another easy fix at MultiQC level, to move that info into a separate table.

Could you please specify how to do that? Thank you very much!

ewels · October 6, 2023, 7:35am

This is a change to core MultiQC module code, so for this please submit a new issue on the MultiQC GitHub repository requesting the change.

Hanh_Hoang · October 6, 2023, 5:50pm

Ahh, I got it, thank you! (Sorry, I was thinking it was something that could be done via config settings, like selecting which tools' stats get added to General Stats.)

I think this is not a common need, and it might not be a problem if only one or two tools are run. So maybe I will try customizing things a bit on my own instead of creating a request!

Once again, sincere thanks to you and the Seqera Labs team for the timely support!

FriederikeHanssen · October 27, 2023, 9:27am

Hey! Apologies, I didn’t get any notifications about this thread. In general, I have no strong opinion about where to put the bcftools stats table. I agree that the general table at the top is pretty hard to read and it would be nice to split it up a bit. I don’t know if we want to collapse multiple variant callers, but at least splitting up preprocessing and variant calling into several stats tables would probably already improve readability a lot.
