Add tbx_itr_regions: multi-region iterator for tabix indexed text#2022
Open
carstenerickson wants to merge 2 commits into
Open
Add tbx_itr_regions: multi-region iterator for tabix indexed text#2022carstenerickson wants to merge 2 commits into
carstenerickson wants to merge 2 commits into
Conversation
Constructs an hts_itr_t driven by hts_itr_multi_next that yields records from many regions in one pass over a tabix indexed text file (VCF.gz, BED.gz, GFF.gz, custom TSV preset). Adjacent regions are coalesced into combined index lookups instead of one tbx_itr_queryi call per region. This brings tabix indexed text to parity with BAM and CRAM, which have had sam_itr_regions since the multi region driver was added. Affected workloads include ancestry and population genetics subsetting VCFs to large SNP panels (the motivating case in samtools/bcftools#2557 identified an 11 plus hour single chromosome subset against an 84M site panel), exome capture target subsetting, annotation overlap against large bed tracks, and trio analysis against candidate de novo position lists. Measured against tbx_itr_queryi per region on a 176 MB chr22 1kGP SNP only VCF.gz: panel single (s) multi (s) speedup 1819 6.0 5.7 1.05x 4797 15.8 10.2 1.56x 18197 60.6 12.8 4.72x 47975 159.4 12.8 12.45x 175909 562.9 13.3 42.48x Multi path wallclock plateaus at sequential scan time as the panel grows; single grows roughly linearly with region count. Every row above was byte verified between paths (matching fnv1a64 hash plus matching byte length); the 100k row corresponds to 1.79 GB of byte identical output. The implementation deep copies the caller reglist so callers always own their input regardless of outcome, pre resolves reglist[i].reg name strings via tbx_name2id so callers may use either tids or reference name strings, and sorts intervals defensively before handing off to hts_itr_multi_bam. tbx_readrec_multi retrieves tbx_t from BGZF private_data; this mirrors how bcf_readrec uses the same slot for BCF header context and inherits the same one multi region iterator per htsFile constraint (documented in tbx.h). test/test_tbx_multi.c verifies byte identical parity between the single region and multi region paths across BED, VCF, and GFF preset fixtures, plus dedup of overlapping intervals, plus six failure path inputs, plus two name resolution paths. test/bench_tbx_regions.c is the reproducible source of the table above, streaming an fnv1a64 hash over both paths' output for byte level parity at constant memory. Addresses samtools/bcftools#2557. Signed-off-by: Carsten Erickson <carstene@mailbox.org> Assisted-by: Claude (claude-opus-4-7)
Cirrus reported four configurations failing on the PR head:
rockylinux-gcc builds with -std=gnu90 -Werror and rejected the
for-loop initial declarations and mid-block variable
declarations introduced in tbx.c, test_tbx_multi.c,
and bench_tbx_regions.c.
debian-gcc, ubuntu-arm, ubuntu-clang three configs run a
DO_UNTRACKED_FILE_CHECK that flagged the two new
test binaries (test/test_tbx_multi and
test/bench_tbx_regions) as untracked.
This commit:
- Moves all variable declarations to block heads in tbx.c (in
tbx_reglist_dup and tbx_itr_regions), in test/test_tbx_multi.c,
and in test/bench_tbx_regions.c, so the files compile under
-std=gnu90.
- Adds /test/bench_tbx_regions and /test/test_tbx_multi entries
to .gitignore, alphabetised next to the existing test binary
entries.
Verified locally: gcc -std=gnu90 -Wall -Wformat=2 -Wextra -Werror
on all three .c files reports no errors, and the full ASan + UBSan
htslib test suite (361 tests) passes.
Signed-off-by: Carsten Erickson <carstene@mailbox.org>
Assisted-by: Claude (claude-opus-4-7)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
tbx_itr_regions(), a multi-region iterator constructor for tabix indexed text files. Returns anhts_itr_tdriven byhts_itr_multi_next, so callers can submit a list of intervals and get coalesced index lookups instead of one lookup per region.This brings tabix indexed text (VCF.gz, BED.gz, GFF.gz, custom TSV preset) to parity with the existing
sam_itr_regionsfor BAM and CRAM.Addresses samtools/bcftools#2557.
Why
tbx_itr_queryiis the right primitive for a handful of regions. It degrades when the region list is dense, because per region overhead dominates: each call does its own bin computation, index lookup, BGZF seek, and chunk walk; adjacent regions in the same BGZF block force that block to be decompressed twice; sorted dense lists (panels, capture targets, sites files) spend most of their wallclock on redundant index walks rather than data. The multi region driver already solves this for binary alignments by precomputing the full offset list, merging adjacent offsets, and decompressing each BGZF block once. What was missing was atbx_*constructor compatible with that driver.Workloads this unblocks:
Measurements
Measured against
tbx_itr_queryiper region on a 176 MB chr22 1kGP SNP only VCF.gz, using the bundledtest/bench_tbx_regions.c:Multi path wallclock plateaus at sequential scan time once the region count is large (it is bounded by file scan, not by region count). Single grows roughly linearly with region count. Every row above was byte verified between paths via FNV 1a 64 hash plus byte length comparison.
Implementation notes
Deep copy of reglist. The caller's reglist is deep copied internally before being handed to
hts_itr_regions. The caller always owns their original input regardless of outcome. This sidestepshts_itr_regions's ownership inconsistency on failure (where the reglist may or may not have been freed depending on which branch failed).Internal
.regresolution.tbx_itr_regionsresolves anyreglist[i].regname strings viatbx_name2idbefore handing off (passinggetid=NULLtohts_itr_regions). Callers may use either pre resolved tids or reference name strings, matching the ergonomics ofsam_itr_regions.Defensive sort. Intervals within each reglist entry are sorted before construction;
hts_itr_multi_bamrequires this. The exported comparatorcompare_hts_pair_pos_tfromregion.cis reused (previouslystatic, now declared inhts_internal.h).Shared readrec body.
tbx_readrec(single region path) andtbx_readrec_multi(multi region path) delegate to a common static helper. Future fixes tometa_charhandling orget_intvparsing apply to both paths automatically.BGZF private_data slot.
tbx_readrec_multiretrievestbx_tfrom BGZF private_data becausehts_itr_multi_nextdoes not thread a per iterator context to its readrec callback. This mirrors howbcf_readrecuses the same slot for BCF header context. The constraint "only one multi region tabix iterator per htsFile at a time" is documented intbx.hand matches the existing constraint on binary BCF iteration.Testing
test/test_tbx_multi.cwired throughtest/test.plcovers:test/bench_tbx_regions.cis the reproducible source of the measurement table above. It streams an FNV 1a 64 hash over both paths' output during an untimed parity pass, then runs timed reps with no hashing overhead. Constant memory regardless of panel size (verified at the 1.79 GB stream byte mark).Full htslib test suite (361 tests including the new ones) is clean under `-fsanitize=address,undefined`.
Scope notes
bcftools view -Tthrough the new constructor when the targets file is tabix indexed will be a separate PR.hts_itr_multi_nextand lift that constraint; out of scope here.Test plan
Assisted-by: Claude (claude-opus-4-7)