How do BAM preprocessing differences impact segmentation in CNVkit? #1049

@minhyung0621

Hi everyone,

I am currently building a CNV detection pipeline using CNVkit with BAM files processed via GATK. To validate my pipeline, I am comparing my output with a *.cns file previously generated by a senior researcher in our lab.

The problem is a large discrepancy in segment counts: the senior researcher’s results show around 1,000 segments per sample, whereas my pipeline yields only about 250.
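To see where the counts diverge, a per-chromosome tally of the two *.cns files may help. The sketch below assumes the usual tab-separated CNVkit layout (one header line, chromosome name in the first column); `count_segments` and the file names in the usage comment are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Tally segments per chromosome in a CNVkit .cns file.
# Assumption: tab-separated, first line is a header, column 1 is the chromosome.
count_segments() {
    awk -F'\t' '
        NR > 1 && NF > 0 {
            # Remember chromosomes in order of first appearance.
            if (!($1 in n)) order[++k] = $1
            n[$1]++
            total++
        }
        END {
            for (i = 1; i <= k; i++) printf "%s\t%d\n", order[i], n[order[i]]
            printf "total\t%d\n", total
        }
    ' "$1"
}

# Usage (placeholder file names):
#   count_segments senior_sample.cns > senior_counts.tsv
#   count_segments my_sample.cns     > my_counts.tsv
#   diff senior_counts.tsv my_counts.tsv
```

If the extra segments are spread evenly over all chromosomes, that would point to a genome-wide difference in coverage noise rather than a handful of regions; this is only a first diagnostic, not a conclusion.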

We used the exact same CNVkit pipeline commands, but we differ in how we generated the recal.bam files. Unfortunately, I cannot verify the exact code or parameters the senior researcher used for the BAM preprocessing steps.
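One possible way to recover part of the earlier preprocessing: the SAM/BAM header can carry `@PG` records whose `CL` tag holds the command line of each tool that wrote the file, provided the tools recorded them and the header was not stripped along the way. The sketch below filters those records from header text on stdin; `list_pg_records` and the BAM file name are hypothetical, and `samtools` is assumed to be on the PATH.

```shell
#!/usr/bin/env bash
# Print "ID<TAB>command line" for every @PG record in a SAM header read from stdin.
# Assumption: header fields are tab-separated TAG:VALUE pairs, as in the SAM spec.
list_pg_records() {
    awk -F'\t' '
        $1 == "@PG" {
            id = ""; cl = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/)      id = substr($i, 4)
                else if ($i ~ /^CL:/) cl = substr($i, 4)
            }
            printf "%s\t%s\n", id, (cl == "" ? "(no CL tag)" : cl)
        }
    '
}

# Usage (placeholder file name):
#   samtools view -H senior_recal.bam | list_pg_records
```

Running this on both the senior researcher's BAM and mine, then comparing the output, might show differences in aligner options, duplicate marking, or the intervals passed to the GATK steps.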

I am currently troubleshooting and evaluating various possibilities. Could anyone share insights on which specific steps or parameters during BAM file generation (e.g., alignment or GATK post-processing) can heavily influence downstream segmentation in CNVkit?

Any advice or pointers would be greatly appreciated. Thank you!

target.bed = Illumina twist version 2.5
REF="/data/minyu0242/pipeline_WES_code_integration/ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
REFFLAT="/data/minyu0242/pipeline_WES_code_integration/ref/nochr_refFlat.txt"

        ${CNVKIT} batch \
            "${TUMOR_BAM}" \
            --normal  "${NORMAL_BAM}" \
            --targets "${TARGET_BED}" \
            --fasta   "${REF}" \
            --access  "${ACCESS_BED}" \
            --output-reference "${SAMPLE_OUT}/${TUMOR_SM}_reference.cnn" \
            --annotate "${REFFLAT}" \
            --output-dir       "${SAMPLE_OUT}" \
            --drop-low-coverage \
            -p "${THREADS}"

# --- Reference ---------------------------------------------------------------

REF_DIR="${PROJECT_DIR}/ref"
REF_FASTA="${REF_DIR}/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
TARGET_BED="${REF_DIR}/ensembl115_GRCh38_exons.bed"

# --- Known-sites (BQSR) ------------------------------------------------------

DBSNP_VCF="${REF_DIR}/Homo_sapiens_assembly38.nochdbsnp138.bgzip.vcf.gz"
MILLS_VCF="${REF_DIR}/Mills_and_1000G_gold_standard.indels.hg38.noch.bgzip.vcf.gz"
KNOWN_INDELS_VCF="${REF_DIR}/Homo_sapiens_assembly38.nochknown_indels.bgzip.vcf.gz"

  gatk BaseRecalibrator \
    -R "${REF_FASTA}" \
    -I "${DEDUP_BAM}" \
    --known-sites "${DBSNP_VCF}" \
    --known-sites "${MILLS_VCF}" \
    --known-sites "${KNOWN_INDELS_VCF}" \
    -L "${TARGET_BED}" \
    -O "${RECAL_TABLE}"

  gatk ApplyBQSR \
    -R "${REF_FASTA}" \
    -I "${DEDUP_BAM}" \
    --bqsr-recal-file "${RECAL_TABLE}" \
    -L "${TARGET_BED}" \
    -O "${FINAL_RECAL_BAM}"
