UMI-tagged cfDNA sources

Hello, I need simulated UMI-tagged cfDNA files (forward and reverse) to test my pipeline. Having a ground truth would be a huge plus. Do you know where I can get such simulated data?

Consider asking Seqera’s AI (Seqera AI Chat). This isn’t my field, but perhaps you can evaluate whether its answer is useful to you.

Here’s its output:

Based on my research, I can provide you with several excellent options for simulated UMI-tagged cfDNA files with ground truth data for testing your pipeline:

:dna: Simulation Tools for UMI-tagged cfDNA Data

1. SafeMut - UMI-aware Variant Simulator

SafeMut is a UMI-aware NGS variant simulator that spikes variants into existing reads, which makes it a natural fit for cfDNA analysis:

  • GitHub: Available as open-source software
  • Key Features:
    • Specifically designed for cell-free DNA with UMIs
    • Incorporates allele-fraction overdispersion modeling
    • Generates realistic tumor/normal FASTQ pairs
    • Validated against FDA SEQC2 datasets
    • Ground truth variants with known VAF (variant allele frequency)
    • Forward and reverse read simulation
  • Input Requirements: BAM file + VCF file
  • Output: Paired tumor/normal FASTQ files with UMIs

2. Combination Approach: ART + BAMSurgeon

For more general simulation with cfDNA characteristics (neither tool adds UMIs, so UMI tags have to be attached in a separate step):

  • ART (Illumina mode): For realistic read simulation with cfDNA parameters
    • Mean insert size: 166bp (typical for cfDNA fragments)
    • Insert size standard deviation: 24
    • Read length: 100bp
    • Profile: HiSeqX TruSeq for realistic quality scores
  • BAMSurgeon: For spiking in variants at known frequencies
    • Can add variants at specific VAF levels (0.1% and higher)
    • Maintains realistic coverage patterns
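A quick arithmetic check helps when choosing the simulated depth for low-VAF variants: the expected number of variant-supporting reads at a site is roughly depth × VAF. The sketch below only illustrates that relationship; the function names are made up for this example, and it ignores sampling noise and sequencing errors.

```shell
# Expected number of variant-supporting reads at a site: depth * VAF.
expected_alt_reads() {  # usage: expected_alt_reads DEPTH VAF
  awk -v d="$1" -v v="$2" 'BEGIN { printf "%.2f\n", d * v }'
}

# Depth at which the expected alt-read count reaches K: K / VAF.
depth_for_alt_reads() {  # usage: depth_for_alt_reads K VAF
  awk -v k="$1" -v v="$2" 'BEGIN { printf "%.0f\n", k / v }'
}

expected_alt_reads 30 0.001     # 30x coverage, 0.1% VAF   -> 0.03
expected_alt_reads 5000 0.001   # 5000x coverage, 0.1% VAF -> 5.00
depth_for_alt_reads 5 0.001     # depth for ~5 expected alt reads -> 5000
```

At 30x, a 0.1% variant is expected in far less than one read, so a test set aimed at such VAFs needs a simulated depth in the thousands.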

3. Public Reference Datasets with Ground Truth

SEQC2 Consortium Datasets

  • Source: FDA-led Sequencing Quality Control Phase 2
  • Contains: Well-characterized cfDNA samples with known variants
  • Advantage: Rigorously produced with detailed protocols
  • Access: Publicly available for research use
  • Ground Truth: Includes validated variant calls and confident regions

SeraCare Reference Materials

  • Purpose: Commercial cfDNA reference standards
  • Features: Known mutation profiles with certified VAFs
  • Applications: Widely used for assay validation
  • Availability: Can be purchased for research use

:hammer_and_wrench: Recommended Workflow for Creating Test Data

Option 1: Using SafeMut (Recommended)

```shell
# 1. Get a normal cfDNA BAM file (from SEQC2 or similar)
# 2. Create a VCF with the variants you want to test
# 3. Run SafeMut to generate synthetic data
#    (option names are unverified; confirm them with the tool's own usage message)
safemut -i normal.bam -v variants.vcf -o output_prefix
```

Option 2: ART + BAMSurgeon Pipeline

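```shell
# 1. Generate baseline paired-end cfDNA reads with ART
#    (-ss selects the HiSeqX TruSeq profile; -p enables paired-end mode)
art_illumina -ss HSXt -p -i reference.fa -l 100 -f 30 -m 166 -s 24 -o cfDNA_reads

# 2. Align to the reference (bwa and samtools are one possible choice;
#    requires a prior `bwa index reference.fa`)
bwa mem reference.fa cfDNA_reads1.fq cfDNA_reads2.fq | samtools sort -o aligned.bam -
samtools index aligned.bam

# 3. Spike in variants with BAMSurgeon's addsnv.py
#    (-v expects a BED-like file: chrom, start, end, VAF)
addsnv.py -v variants.bed -f aligned.bam -r reference.fa -o spiked.bam
```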
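ART writes plain reads without UMIs, so a pipeline that expects UMI-tagged input needs them added afterwards. Below is a minimal sketch of one way to do that with awk, assuming the UMI should be appended to the read name with an underscore and that read names contain no spaces. The `add_umi` helper, the seed and the 8 bp UMI length are all hypothetical choices. It gives each read pair its own random UMI and does not model PCR duplicate families or UMI sequencing errors.

```shell
# Append a random UMI to every FASTQ read name (hypothetical helper).
# usage: add_umi SEED UMI_LENGTH < in.fq > out.fq
add_umi() {
  awk -v seed="$1" -v n="$2" '
    BEGIN { srand(seed); split("A C G T", base, " ") }
    NR % 4 == 1 {
      umi = ""
      for (i = 0; i < n; i++) umi = umi base[int(rand() * 4) + 1]
      name = $0; mate = ""
      # Keep a trailing /1 or /2 mate suffix at the very end of the name.
      if (name ~ /\/[12]$/) {
        mate = substr(name, length(name) - 1)
        name = substr(name, 1, length(name) - 2)
      }
      print name "_" umi mate
      next
    }
    { print }
  '
}

# Same seed for both mates, so R1 and R2 of a pair receive the same UMI
# (this relies on both files listing the pairs in the same order).
# add_umi 42 8 < cfDNA_reads1.fq > umi_R1.fq
# add_umi 42 8 < cfDNA_reads2.fq > umi_R2.fq
```

Because the UMIs are assigned by the script, the mapping from read name to UMI doubles as ground truth for checking UMI extraction.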

:bar_chart: Ground Truth Components You’ll Get

  1. Known Variants: VCF files with exact positions and allele frequencies
  2. UMI Information: Synthetic UMIs with realistic error patterns
  3. Fragment Characteristics: Realistic cfDNA fragment size distributions
  4. Coverage Patterns: Authentic sequencing depth variations
  5. Quality Scores: Technology-specific error profiles

:bullseye: Pipeline Testing Strategy

  1. Start Small: Use SafeMut with a few known variants at different VAFs
  2. Gradual Complexity: Add more variants and lower allele frequencies
  3. Validation: Compare your pipeline’s calls against ground truth
  4. Performance Metrics: Calculate sensitivity, specificity, and precision
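The validation and metrics steps can be sketched as a simple set comparison between the ground-truth variants and the pipeline's calls. This is a minimal example, assuming both lists have been reduced to unique `CHROM POS REF ALT` lines and that an exact match counts as a true positive; `score_calls` and the file names are hypothetical. Specificity is left out because it needs a count of true negatives, which depends on how the set of evaluated sites is defined.

```shell
# Compare called variants against the ground truth (hypothetical helper).
# Both inputs: one variant per line as CHROM POS REF ALT, no duplicates.
# usage: score_calls truth.tsv calls.tsv
score_calls() {
  awk '
    { key = $1 ":" $2 ":" $3 ":" $4 }
    FILENAME == ARGV[1] { truth[key] = 1; n_truth++; next }
    (key in truth) { tp++; next }   # call matches a ground-truth variant
    { fp++ }                        # call not present in the ground truth
    END {
      fn = n_truth - tp
      sens = (n_truth > 0) ? tp / n_truth : 0
      prec = (tp + fp > 0) ? tp / (tp + fp) : 0
      printf "TP=%d FP=%d FN=%d sensitivity=%.3f precision=%.3f\n", tp, fp, fn, sens, prec
    }
  ' "$1" "$2"
}

# Reducing VCFs to the four key columns (CHROM, POS, REF, ALT):
# grep -v '^#' truth.vcf | cut -f1,2,4,5 > truth.tsv
# grep -v '^#' calls.vcf | cut -f1,2,4,5 > calls.tsv
# score_calls truth.tsv calls.tsv
```

Exact matching is strict: indels written with different representations will count as misses unless both files are normalized first.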

Would you like me to help you implement any of these simulation approaches, or would you prefer guidance on accessing the SEQC2 datasets? I can also assist with setting up the specific tools mentioned above.
