
Add parasail alignment backend to pileup tile functions #167

Merged
pkerpedjiev merged 5 commits into main from parasail-pileup-alignment on Apr 19, 2026
Conversation

@pkerpedjiev
Member

Summary

  • Adds tile_functions_parasail(), which builds a parasail scoring profile from the reference once and aligns every query sequence against it using Needleman-Wunsch (NW) global alignment with CIGAR traceback. This is approximately 9× faster than the existing BioPython PairwiseAligner path on large batches (benchmarked at 50k × 250 nt sequences)
  • Adds cigar_to_subs() to convert parasail's extended-CIGAR (=/X/I/D) to the HiGlass pileup subs format, matching the coordinate convention of alignment_to_subs()
  • Extends get_pileup_alignment_data() with a method parameter ("biopython" default, "parasail")
  • Fixes the tile ID passed internally from "0.0" (which would raise IndexError) to "x.0.0"; renames the return key "type" to "tileset_info" for consistency
  • Makes the csv_sequence_tileset_functions import lazy to avoid pulling in smart_open (and its transitive deps) whenever clodius.tiles.pileup is imported
  • Adds examples/sequence_pileup.py benchmark script
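
To make the extended-CIGAR conversion concrete, here is a minimal pure-Python sketch of walking a =/X/I/D string and recording where each non-match operation falls on the reference. It is not the cigar_to_subs() implementation from this PR: the helper name cigar_to_events, the ref_start parameter, and the (ref_pos, op, length) tuple layout are assumptions made for illustration, and the actual HiGlass subs format may differ. It also assumes the usual convention that D consumes the reference and I consumes the query; which sequence plays which role depends on how the aligner was called.

```python
import re

# Extended-CIGAR tokens: a run length followed by one of =, X, I, D.
_CIGAR_TOKEN = re.compile(r"(\d+)([=XID])")


def cigar_to_events(cigar, ref_start=0):
    """Return (ref_pos, op, length) tuples for every non-match operation.

    Hypothetical helper for illustration only; the tuple layout is an
    assumption, not the HiGlass subs format.
    """
    events = []
    ref_pos = ref_start
    for length_text, op in _CIGAR_TOKEN.findall(cigar):
        length = int(length_text)
        if op == "=":
            # Matches advance along the reference and emit nothing.
            ref_pos += length
        elif op in ("X", "D"):
            # Mismatches and deletions both consume reference positions.
            events.append((ref_pos, op, length))
            ref_pos += length
        else:
            # Insertions consume query bases only, so ref_pos is unchanged.
            events.append((ref_pos, "I", length))
    return events


print(cigar_to_events("3=1X2I4=1D"))
```

An all-match string such as "10=" yields an empty list under this sketch, which corresponds to the "identical sequence" case covered by the tests.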

Test plan

  • Run existing pileup tests: pytest test/tiles/pileup_test.py
  • Run examples/sequence_pileup.py and verify parasail output is produced without error
  • Confirm that importing get_pileup_alignment_data from clodius.tiles.pileup does not raise ModuleNotFoundError for smart_open in a minimal environment
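
The lazy-import change checked in the last item can be illustrated with a generic sketch. This is not the code from the PR: load_optional and read_rows are hypothetical names, and the standard-library csv module stands in for a heavy optional dependency such as smart_open. The point is only that the import runs when the function is called, so importing the enclosing module never touches the dependency.

```python
import importlib


def load_optional(name):
    """Import a module on demand, with a clearer error if it is missing."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{name!r} is required for this feature but is not installed"
        ) from exc


def read_rows(text):
    # The import happens on the first call, not when this module is imported.
    # "csv" is a stand-in for a heavy optional dependency.
    csv = load_optional("csv")
    return list(csv.reader(text.splitlines()))


print(read_rows("a,b\n1,2"))
```

With this pattern, code paths that never call read_rows work in an environment where the optional dependency is absent, which is the behaviour the test plan item verifies.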

- Add cigar_to_subs() to convert parasail extended-CIGAR to HiGlass subs format
- Add tile_functions_parasail() that pre-builds one scoring profile per
  reference sequence and aligns all queries against it, ~9x faster than
  the BioPython PairwiseAligner path for large batches
- Update get_pileup_alignment_data() with a method parameter
  ("biopython" default, or "parasail") and fix the tile ID from "0.0"
  to "x.0.0"; return key renamed from "type" to "tileset_info"
- Make the csv_sequence_tileset_functions import lazy to avoid pulling
  in smart_open on every module import
- Add examples/sequence_pileup.py benchmark script
- Restructure test file so mappy-dependent tests skip individually
  (via autouse fixture) rather than skipping the whole module, allowing
  parasail and BioPython tests to run without mappy installed
- Add TestCigarToSubs: covers all-match, single mismatch, insertion,
  deletion, and multiple-mismatch CIGAR patterns
- Add TestGetPileupAlignmentDataParasail: covers result structure,
  identical sequence, substitution detection, multiple sequences,
  tileset_info fields, and read IDs
@pkerpedjiev pkerpedjiev merged commit 13f950b into main Apr 19, 2026
2 checks passed
@pkerpedjiev pkerpedjiev deleted the parasail-pileup-alignment branch April 19, 2026 16:11