Fix pileup indel support: correct D/I CIGAR inversion and bounds guard#168
Merged
Conversation
nw_trace_scan_profile_16 returns CIGAR ops from the profile's perspective, so D and I were swapped relative to SAM convention; normalize before parsing in tile_functions_parasail so insertions and deletions are reported correctly. Also guard cigar_to_subs against X ops that index past the end of the reference, which was exposed once query sequences had variable lengths. Extend sequence_pileup.py benchmark to generate sequences with Poisson- sampled insertions and deletions (~3 each) in addition to SNPs. Add tests covering the bounds guard, end-to-end insertion detection, deletion detection, mixed indels+SNPs, and variable-length sequence batches.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
clodius/tiles/pileup.py):nw_trace_scan_profile_16builds the alignment profile from the reference, so the returned CIGAR describes the alignment from the profile's perspective — itsDandIops are swapped relative to SAM convention.tile_functions_parasailnow normalises the CIGAR before passing it tocigar_to_subs, so insertions and deletions are reported with the correct type.cigar_to_subs: anX(mismatch) op that extends past the end of the reference now clips gracefully instead of raisingIndexError. This was benign when all query sequences had the same length as the reference (SNPs only) but was triggered as soon as queries had variable lengths.sequence_pileup.pybenchmark: updatedmutate_sequenceto apply Poisson-sampled insertions (~3) and deletions (~3) per sequence in addition to SNPs, and reports per-event-type counts from the alignment output.Test plan
TestCigarToSubs::test_x_op_clipped_at_ref_boundary— verifies the bounds guard clips X ops that would index past the reference endTestGetPileupAlignmentDataParasail::test_insertion_detected— a query longer than the reference produces anIeventTestGetPileupAlignmentDataParasail::test_deletion_detected— a query shorter than the reference produces aDeventTestGetPileupAlignmentDataParasail::test_mixed_indels_and_snps— a sequence with insertion + deletion + SNP produces all three event types (I,D,X)TestGetPileupAlignmentDataParasail::test_variable_length_sequences_align_without_error— a batch of sequences with different lengths all align without error and produce the expected event types