Add parasail alignment backend to pileup tile functions #167
Merged
- Add cigar_to_subs() to convert parasail extended-CIGAR to HiGlass subs format
- Add tile_functions_parasail() that pre-builds one scoring profile per
reference sequence and aligns all queries against it, ~9x faster than
the BioPython PairwiseAligner path for large batches
- Update get_pileup_alignment_data() with a method parameter
("biopython" default, or "parasail") and fix the tile ID from "0.0"
to "x.0.0"; return key renamed from "type" to "tileset_info"
- Make the csv_sequence_tileset_functions import lazy to avoid pulling
in smart_open on every module import
- Add examples/sequence_pileup.py benchmark script
- Restructure test file so mappy-dependent tests skip individually (via
autouse fixture) rather than skipping the whole module, allowing
parasail and BioPython tests to run without mappy installed
- Add TestCigarToSubs: covers all-match, single mismatch, insertion,
deletion, and multiple-mismatch CIGAR patterns
- Add TestGetPileupAlignmentDataParasail: covers result structure,
identical sequence, substitution detection, multiple sequences,
tileset_info fields, and read IDs
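
For readers unfamiliar with extended CIGAR, the patterns that TestCigarToSubs exercises (all-match, mismatch, insertion, deletion) can be illustrated with a small standalone sketch. This is not the PR's cigar_to_subs(): the helper name cigar_to_events() and the (ref_pos, op, length) tuple layout are assumptions made for illustration, and the actual HiGlass subs format and coordinate convention may differ.

```python
import re

# One run in an extended CIGAR string: a count followed by an operation code.
_CIGAR_RUN = re.compile(r"(\d+)([=XID])")


def cigar_to_events(cigar):
    """Hypothetical helper: list non-match runs as (ref_pos, op, length).

    ref_pos is a 0-based offset into the reference. The tuple layout is an
    illustrative assumption, not the HiGlass subs format itself.
    """
    events = []
    ref_pos = 0
    for count, op in _CIGAR_RUN.findall(cigar):
        length = int(count)
        if op != "=":
            events.append((ref_pos, op, length))
        # '=', 'X' and 'D' consume reference bases; 'I' does not.
        if op != "I":
            ref_pos += length
    return events


print(cigar_to_events("10="))             # all-match
print(cigar_to_events("3=1X2=2I4=1D3="))  # mismatch, insertion, deletion
```

An all-match CIGAR yields an empty list, and an insertion is reported at the reference position where it occurs without advancing that position.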
Summary
- Adds tile_functions_parasail(), which pre-builds a parasail scoring profile from the reference once and aligns all query sequences against it using NW global alignment with CIGAR traceback; approximately 9× faster than the existing BioPython PairwiseAligner path on large batches (benchmarked at 50k × 250 nt sequences)
- Adds cigar_to_subs() to convert parasail's extended-CIGAR (=/X/I/D) to the HiGlass pileup subs format, matching the coordinate convention of alignment_to_subs()
- Updates get_pileup_alignment_data() with a method parameter ("biopython" default, "parasail")
- Fixes the tile ID from "0.0" (which would raise IndexError) to "x.0.0"; renames the return key "type" → "tileset_info" for consistency
- Makes the csv_sequence_tileset_functions import lazy to avoid pulling in smart_open (and its transitive deps) on every import clodius.tiles.pileup
- Adds examples/sequence_pileup.py benchmark script

Test plan
- pytest test/tiles/pileup_test.py
- Run examples/sequence_pileup.py and verify parasail output is produced without error
- Check that from clodius.tiles.pileup import get_pileup_alignment_data does not raise ModuleNotFoundError for smart_open in a minimal environment
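
The last check relies on the lazy-import pattern: an import placed inside a function runs only when that function is called, so a missing optional dependency cannot break the import of the enclosing module. The sketch below shows the general pattern only. The function names are hypothetical, the stdlib csv module stands in for csv_sequence_tileset_functions, and _hypothetical_missing_dependency stands in for an uninstalled package such as smart_open.

```python
def parse_row(line):
    """Hypothetical function whose dependency is imported on first use."""
    # Deferred import: runs when parse_row() is called, not at module load.
    import csv

    return next(csv.reader([line]))


def needs_missing_dependency():
    """Hypothetical function that depends on a package that is not installed."""
    # Defining this function is safe; the import only fails when it is called.
    import _hypothetical_missing_dependency  # noqa: F401


# Both definitions above succeed even though one dependency is absent.
print(parse_row("a,b,c"))

try:
    needs_missing_dependency()
except ModuleNotFoundError as exc:
    print("raised only on use:", exc.name)
```

The trade-off is that a missing dependency surfaces at call time rather than at import time, which is usually acceptable for an optional code path.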