Hi everyone,
I am currently building a CNV detection pipeline using CNVkit with BAM files processed via GATK. To validate my pipeline, I am comparing my output with a *.cns file previously generated by a senior researcher in our lab.
The problem is a large discrepancy in the number of segments: the senior researcher’s results show around 1,000 segments per sample, whereas my pipeline yields only about 250.
We used the exact same CNVkit pipeline commands, but we differ in how we generated the recal.bam files. Unfortunately, I cannot verify the exact code or parameters the senior researcher used for the BAM preprocessing steps.
I am still troubleshooting. Could anyone share which specific steps or parameters during BAM file generation (e.g., alignment or GATK post-processing) can strongly influence downstream segmentation in CNVkit?
Any advice or pointers would be greatly appreciated. Thank you!
target.bed = Illumina twist version 2.5
REF="/data/minyu0242/pipeline_WES_code_integration/ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
REFFLAT="/data/minyu0242/pipeline_WES_code_integration/ref/nochr_refFlat.txt"
# Run the full CNVkit pipeline on one tumor/normal pair
${CNVKIT} batch \
    "${TUMOR_BAM}" \
    --normal "${NORMAL_BAM}" \
    --targets "${TARGET_BED}" \
    --fasta "${REF}" \
    --access "${ACCESS_BED}" \
    --output-reference "${SAMPLE_OUT}/${TUMOR_SM}_reference.cnn" \
    --annotate "${REFFLAT}" \
    --output-dir "${SAMPLE_OUT}" \
    --drop-low-coverage \
    -p "${THREADS}"
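For reference, the segment counts quoted above can be obtained by counting the data rows of each *.cns file. This is a minimal sketch, assuming the file is tab-separated with a single header row whose first column is `chromosome`; the function name `count_segments` and the example path are hypothetical.

```shell
# Count segments (data rows) in a CNVkit .cns file.
# Assumption: tab-separated, one header row starting with "chromosome".
count_segments() {
    awk -F'\t' '$1 != "chromosome" && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example call (hypothetical path):
# count_segments "${SAMPLE_OUT}/sample.cns"
```

Running it on both my file and the senior researcher's file for the same sample gives the two numbers being compared.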
# --- Reference ---------------------------------------------------------------
REF_DIR="${PROJECT_DIR}/ref"
REF_FASTA="${REF_DIR}/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
TARGET_BED="${REF_DIR}/ensembl115_GRCh38_exons.bed"
# --- Known-sites (BQSR) ------------------------------------------------------
DBSNP_VCF="${REF_DIR}/Homo_sapiens_assembly38.nochdbsnp138.bgzip.vcf.gz"
KNOWN_INDELS_VCF="${REF_DIR}/Homo_sapiens_assembly38.nochknown_indels.bgzip.vcf.gz"
MILLS_VCF="${REF_DIR}/Mills_and_1000G_gold_standard.indels.hg38.noch.bgzip.vcf.gz"
# Build the base quality recalibration table from the deduplicated BAM
gatk BaseRecalibrator \
    -R "${REF_FASTA}" \
    -I "${DEDUP_BAM}" \
    --known-sites "${DBSNP_VCF}" \
    --known-sites "${MILLS_VCF}" \
    --known-sites "${KNOWN_INDELS_VCF}" \
    -L "${TARGET_BED}" \
    -O "${RECAL_TABLE}"
# Apply the recalibration; -L limits processing (and the output BAM) to reads overlapping TARGET_BED
gatk ApplyBQSR \
    -R "${REF_FASTA}" \
    -I "${DEDUP_BAM}" \
    --bqsr-recal-file "${RECAL_TABLE}" \
    -L "${TARGET_BED}" \
    -O "${FINAL_RECAL_BAM}"
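Since I cannot see the senior researcher's preprocessing commands directly, one thing that may help is reading the `@PG` records in their BAM header, where aligners and GATK tools typically record the program name and command line. This is a minimal sketch, assuming `samtools` is installed; the helper name `pg_commands` and the BAM file name in the usage comment are hypothetical.

```shell
# Print "ID<TAB>command line" for every @PG record in a SAM header read on stdin.
pg_commands() {
    awk -F'\t' '$1 == "@PG" {
        id = ""; cl = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^ID:/) id = substr($i, 4)
            if ($i ~ /^CL:/) cl = substr($i, 4)
        }
        print id "\t" cl
    }'
}

# Example call (hypothetical file name):
# samtools view -H senior_recal.bam | pg_commands
```

Comparing this output between the two recal.bam files should show whether the aligner, its options, or the GATK arguments (for example the intervals passed with -L) differ, as long as the header records were not stripped.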