How do BAM preprocessing differences impact segmentation in CNVkit?

Hi everyone,

I am currently building a CNV detection pipeline using CNVkit with BAM files processed via GATK. To validate my pipeline, I am comparing my output with a *.cns file previously generated by a senior researcher in our lab.

The issue is that there is a significant discrepancy in the number of segments: the senior researcher’s results show around 1,000 segments per sample, whereas my pipeline yields only about 250 segments per sample.
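
To make the comparison concrete, the two *.cns files can be summarised side by side. The following is a minimal sketch, assuming the standard tab-separated CNVkit .cns layout with a header row containing `chromosome`, `start` and `end`; the function name `cns_summary` and the file names in the usage comment are placeholders, not part of either pipeline.

```shell
#!/usr/bin/env bash
# Summarise a CNVkit .cns file: segment count, mean segment length,
# and segments per chromosome. Columns are looked up by header name.
cns_summary() {
    local cns="$1"
    awk -F'\t' '
        NR == 1 {
            # Map header names to column positions.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            n++
            per_chrom[$(col["chromosome"])]++
            total_bp += $(col["end"]) - $(col["start"])
        }
        END {
            printf "segments\t%d\n", n
            if (n > 0) printf "mean_length_bp\t%d\n", total_bp / n
            # Note: awk does not guarantee the order of this loop.
            for (c in per_chrom) printf "chrom\t%s\t%d\n", c, per_chrom[c]
        }
    ' "$cns"
}

# Usage (placeholder file names):
#   cns_summary senior_sample.cns
#   cns_summary my_sample.cns
```

If the extra segments in one file are concentrated on a few chromosomes or are very short, that points to a different cause than a uniform fourfold difference across the genome.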

We used the exact same CNVkit pipeline commands, but we differ in how we generated the recal.bam files. Unfortunately, I cannot verify the exact code or parameters the senior researcher used for the BAM preprocessing steps.
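
If the senior researcher's BAM files are still available, part of the preprocessing may be recoverable from their headers: the SAM format allows each tool to leave an `@PG` record, often with the full command line in the `CL` field. The sketch below assumes `samtools` is installed and that the tools actually wrote these records; `pg_commands` and the file name are hypothetical.

```shell
#!/usr/bin/env bash
# Read SAM header text on stdin and print, for each @PG record,
# the program ID and the recorded command line (CL), tab-separated.
pg_commands() {
    awk -F'\t' '
        /^@PG/ {
            id = ""; cl = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)
                if ($i ~ /^CL:/) cl = substr($i, 4)
            }
            printf "%s\t%s\n", id, (cl == "" ? "(no CL recorded)" : cl)
        }
    '
}

# Usage (placeholder file name):
#   samtools view -H senior_recal.bam | pg_commands
```

Running the same command on both sets of BAM files and comparing the output would show differences in aligner options, duplicate marking, or interval restrictions, as far as they were recorded.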

I am currently troubleshooting and evaluating various possibilities. Could anyone share insights on which specific steps or parameters during BAM file generation (e.g., alignment or GATK post-processing) can heavily influence downstream segmentation in CNVkit?

Any advice or pointers would be greatly appreciated. Thank you!

target.bed = Illumina twist version 2.5
REF="/data/minyu0242/pipeline_WES_code_integration/ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
REFFLAT="/data/minyu0242/pipeline_WES_code_integration/ref/nochr_refFlat.txt"

            ${CNVKIT} batch \
                "${TUMOR_BAM}" \
                --normal  "${NORMAL_BAM}" \
                --targets "${TARGET_BED}" \
                --fasta   "${REF}" \
                --access  "${ACCESS_BED}" \
                --output-reference "${SAMPLE_OUT}/${TUMOR_SM}_reference.cnn" \
                --annotate "${REFFLAT}" \
                --output-dir       "${SAMPLE_OUT}" \
                --drop-low-coverage \
                -p "${THREADS}"
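
Because the reference and refFlat paths above suggest contig names without a `chr` prefix, one thing that can be ruled out quickly is a naming mismatch between the target BED and the reference. This is only a sketch: it assumes a `.fai` index exists next to the FASTA (for example from `samtools faidx`), and `bed_contigs_missing_from_fai` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print each contig name that appears in a BED file but not in a FASTA index (.fai).
bed_contigs_missing_from_fai() {
    local bed="$1" fai="$2"
    awk -F'\t' '
        # First file: remember every contig name in the index.
        FILENAME == ARGV[1] { known[$1] = 1; next }
        # Second file: skip BED comment and header lines.
        /^(#|track|browser)/ { next }
        # Report each unknown contig once.
        !($1 in known) && !seen[$1]++ { print $1 }
    ' "$fai" "$bed"
}

# Usage (variables as defined in the pipeline above):
#   bed_contigs_missing_from_fai "${TARGET_BED}" "${REF}.fai"
```

Empty output means every BED contig is present in the reference; any names printed (for example `chr1` against a reference that uses `1`) would mean those targets cannot be matched to the reference.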


# --- Reference ---------------------------------------------------------------
REF_DIR="${PROJECT_DIR}/ref"
REF_FASTA="${REF_DIR}/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
TARGET_BED="${REF_DIR}/ensembl115_GRCh38_exons.bed"

# --- Known-sites (BQSR) ------------------------------------------------------
DBSNP_VCF="${REF_DIR}/Homo_sapiens_assembly38.nochdbsnp138.bgzip.vcf.gz"
MILLS_VCF="${REF_DIR}/Mills_and_1000G_gold_standard.indels.hg38.noch.bgzip.vcf.gz"
KNOWN_INDELS_VCF="${REF_DIR}/Homo_sapiens_assembly38.nochknown_indels.bgzip.vcf.gz"


      gatk BaseRecalibrator \
        -R "${REF_FASTA}" \
        -I "${DEDUP_BAM}" \
        --known-sites "${DBSNP_VCF}" \
        --known-sites "${MILLS_VCF}" \
        --known-sites "${KNOWN_INDELS_VCF}" \
        -L "${TARGET_BED}" \
        -O "${RECAL_TABLE}"



      gatk ApplyBQSR \
        -R "${REF_FASTA}" \
        -I "${DEDUP_BAM}" \
        --bqsr-recal-file "${RECAL_TABLE}" \
        -L "${TARGET_BED}" \
        -O "${FINAL_RECAL_BAM}"
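
One preprocessing detail that may matter here: both GATK steps above pass `-L`, and as far as I understand, `-L` on `ApplyBQSR` limits the output BAM to reads overlapping those intervals. CNVkit's default hybrid-capture workflow also uses off-target reads in antitarget bins, so a BAM without them would contribute far fewer informative bins. The sketch below is a rough check under those assumptions: it reads a CNVkit `.antitargetcoverage.cnn` or `.cnr` file and reports how many antitarget bins have zero depth. The gene label `Antitarget` (older files: `Background`) and the helper name are assumptions to verify against your own output.

```shell
#!/usr/bin/env bash
# Report antitarget bins with zero depth in a CNVkit coverage file (.cnn or .cnr).
# Output: total antitarget bins, empty bins, fraction empty (tab-separated).
antitarget_empty_fraction() {
    local cov="$1"
    awk -F'\t' '
        NR == 1 {
            # Map header names to column positions.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["gene"]) == "Antitarget" || $(col["gene"]) == "Background" {
            n++
            if ($(col["depth"]) + 0 == 0) empty++
        }
        END {
            if (n == 0) { print "no antitarget bins found"; exit }
            printf "%d\t%d\t%.3f\n", n, empty, empty / n
        }
    ' "$cov"
}

# Usage (placeholder file names):
#   antitarget_empty_fraction my_tumor.antitargetcoverage.cnn
#   antitarget_empty_fraction senior_tumor.antitargetcoverage.cnn
```

A much higher empty fraction in one pipeline than in the other would suggest that the BAM files differ in off-target read content rather than in anything CNVkit itself did.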
