
Fix pileup indel support: correct D/I CIGAR inversion and bounds guard #168

Merged: pkerpedjiev merged 1 commit into main from fix/pileup-indel-support on Apr 20, 2026

@pkerpedjiev
Member

Summary

  • D/I CIGAR inversion (clodius/tiles/pileup.py): nw_trace_scan_profile_16 builds the alignment profile from the reference, so the returned CIGAR describes the alignment from the profile's perspective — its D and I ops are swapped relative to SAM convention. tile_functions_parasail now normalises the CIGAR before passing it to cigar_to_subs, so insertions and deletions are reported with the correct type.
  • Bounds guard in cigar_to_subs: an X (mismatch) op that extends past the end of the reference is now clipped instead of raising IndexError. The missing guard was harmless while all query sequences had the same length as the reference (SNPs only), but the error surfaced as soon as queries had variable lengths.
  • sequence_pileup.py benchmark: mutate_sequence now applies Poisson-sampled insertions and deletions (~3 each) per sequence in addition to SNPs, and the benchmark reports per-event-type counts from the alignment output.
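
To illustrate the D/I normalisation described in the first bullet, here is a minimal sketch. It assumes the CIGAR is available as a plain decoded string such as "3=2D4=1I2X"; the helper name swap_indel_ops is hypothetical and is not taken from clodius/tiles/pileup.py, whose actual implementation may differ.

```python
# Hypothetical helper: exchange D and I in a decoded CIGAR string so that a
# profile-oriented CIGAR follows SAM convention. Run lengths (digits) and all
# other op characters pass through unchanged.
_SWAP_D_I = str.maketrans("DI", "ID")


def swap_indel_ops(cigar: str) -> str:
    """Return the CIGAR with every D turned into I and every I into D."""
    return cigar.translate(_SWAP_D_I)


if __name__ == "__main__":
    print(swap_indel_ops("3=2D4=1I2X"))  # 3=2I4=1D2X
```

Because the swap is its own inverse, applying it twice returns the original string, which makes it easy to check in a unit test.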

Test plan

  • TestCigarToSubs::test_x_op_clipped_at_ref_boundary — verifies the bounds guard clips X ops that would index past the reference end
  • TestGetPileupAlignmentDataParasail::test_insertion_detected — a query longer than the reference produces an I event
  • TestGetPileupAlignmentDataParasail::test_deletion_detected — a query shorter than the reference produces a D event
  • TestGetPileupAlignmentDataParasail::test_mixed_indels_and_snps — a sequence with insertion + deletion + SNP produces all three event types (I, D, X)
  • TestGetPileupAlignmentDataParasail::test_variable_length_sequences_align_without_error — a batch of sequences with different lengths all align without error and produce the expected event types
  • All pre-existing pileup tests continue to pass
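
The behaviour that test_x_op_clipped_at_ref_boundary checks can be sketched as follows. This is not the real cigar_to_subs: the function name, signature, and return shape are assumptions made for illustration, and the sketch also clips against the query length, which the PR does not mention.

```python
import re

# One CIGAR operation: a run length followed by a SAM op character.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def mismatches_clipped(cigar: str, ref: str, query: str):
    """Collect (ref_pos, ref_base, query_base) for X ops, clipped to bounds.

    Hypothetical stand-in for the guard added to cigar_to_subs.
    """
    subs = []
    ref_pos = 0
    query_pos = 0
    for length_str, op in _CIGAR_OP.findall(cigar):
        length = int(length_str)
        if op == "X":
            # Guard: visit only positions that exist in both sequences, so an
            # X run that overshoots the reference is truncated, not an error.
            usable = max(0, min(length, len(ref) - ref_pos, len(query) - query_pos))
            for k in range(usable):
                subs.append((ref_pos + k, ref[ref_pos + k], query[query_pos + k]))
        # SAM convention: M, =, X, D, N consume the reference;
        # M, =, X, I, S consume the query.
        if op in "M=XDN":
            ref_pos += length
        if op in "M=XIS":
            query_pos += length
    return subs


if __name__ == "__main__":
    # The 3X run starts at reference position 2, but the reference has only
    # 4 bases, so just two mismatches are reported.
    print(mismatches_clipped("2=3X", "ACGT", "ACTAA"))
```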

nw_trace_scan_profile_16 returns CIGAR ops from the profile's perspective,
so D and I were swapped relative to SAM convention; normalize before parsing
in tile_functions_parasail so insertions and deletions are reported correctly.

Also guard cigar_to_subs against X ops that index past the end of the
reference, which was exposed once query sequences had variable lengths.

Extend sequence_pileup.py benchmark to generate sequences with Poisson-
sampled insertions and deletions (~3 each) in addition to SNPs.

Add tests covering the bounds guard, end-to-end insertion detection,
deletion detection, mixed indels+SNPs, and variable-length sequence batches.
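
As a rough illustration of the benchmark change, the sketch below applies Poisson-sampled insertions and deletions to a sequence using only the standard library. It is not the mutate_sequence from sequence_pileup.py: the function names, the parameters mean_ins and mean_del, and the choice of Knuth's sampling method are all assumptions, and SNPs are left out for brevity.

```python
import math
import random


def poisson(rng: random.Random, lam: float) -> int:
    """Draw a Poisson(lam) sample with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def mutate_with_indels(seq, rng, mean_ins=3.0, mean_del=3.0, alphabet="ACGT"):
    """Return (mutated_seq, n_insertions, n_deletions). Hypothetical helper."""
    bases = list(seq)
    # Never delete the whole sequence: keep at least one base when possible.
    n_del = min(poisson(rng, mean_del), max(len(bases) - 1, 0))
    for _ in range(n_del):
        del bases[rng.randrange(len(bases))]
    n_ins = poisson(rng, mean_ins)
    for _ in range(n_ins):
        # Insert a random base at a random position, including either end.
        bases.insert(rng.randrange(len(bases) + 1), rng.choice(alphabet))
    return "".join(bases), n_ins, n_del


if __name__ == "__main__":
    rng = random.Random(0)
    mutated, n_ins, n_del = mutate_with_indels("ACGT" * 25, rng)
    print(len(mutated), n_ins, n_del)
```

Returning the event counts alongside the sequence gives a ground truth that can be compared with the per-event-type counts reported from the alignment output.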

@pkerpedjiev merged commit d60c3d0 into main on Apr 20, 2026
2 checks passed