Skip to content

Add tbx_itr_regions: multi-region iterator for tabix indexed text#2022

Open
carstenerickson wants to merge 2 commits into
samtools:developfrom
carstenerickson:feat/tbx-readrec-multi
Open

Add tbx_itr_regions: multi-region iterator for tabix indexed text#2022
carstenerickson wants to merge 2 commits into
samtools:developfrom
carstenerickson:feat/tbx-readrec-multi

Conversation

@carstenerickson
Copy link
Copy Markdown

Summary

Adds tbx_itr_regions(), a multi-region iterator constructor for tabix indexed text files. Returns an hts_itr_t driven by hts_itr_multi_next, so callers can submit a list of intervals and get coalesced index lookups instead of one lookup per region.

This brings tabix indexed text (VCF.gz, BED.gz, GFF.gz, custom TSV preset) to parity with the existing sam_itr_regions for BAM and CRAM.

Addresses samtools/bcftools#2557.

Why

tbx_itr_queryi is the right primitive for a handful of regions. It degrades when the region list is dense, because per region overhead dominates: each call does its own bin computation, index lookup, BGZF seek, and chunk walk; adjacent regions in the same BGZF block force that block to be decompressed twice; sorted dense lists (panels, capture targets, sites files) spend most of their wallclock on redundant index walks rather than data. The multi region driver already solves this for binary alignments by precomputing the full offset list, merging adjacent offsets, and decompressing each BGZF block once. What was missing was a tbx_* constructor compatible with that driver.

Workloads this unblocks:

  • Ancestry, population genetics, and polygenic risk score pipelines that subset whole genome VCFs to large SNP panels (typically 10M to 100M sites)
  • Targeted sequencing pipelines that subset to exome capture targets (around 200k intervals)
  • Annotation overlap against multi million interval bed tracks (regulatory elements, conservation, repeats)
  • Trio and family workflows that subset a parent VCF to candidate de novo positions from a child

Measurements

Measured against tbx_itr_queryi per region on a 176 MB chr22 1kGP SNP only VCF.gz, using the bundled test/bench_tbx_regions.c:

panel regions records stream bytes parity single (s) multi (s) speedup
1k 1,819 1,819 18.5 MB OK (`0x7bf3...`) 6.0 5.7 1.05x
5k 4,797 4,797 48.7 MB OK (`0x8809...`) 15.8 10.2 1.56x
10k 18,197 18,197 185 MB OK (`0x2d01...`) 60.6 12.8 4.72x
50k 47,975 47,975 487 MB OK (`0xd775...`) 159.4 12.8 12.45x
100k 175,909 175,910 1.79 GB OK (`0xbc6e...`) 562.9 13.3 42.48x

Multi path wallclock plateaus at sequential scan time once the region count is large (it is bounded by file scan, not by region count). Single grows roughly linearly with region count. Every row above was byte verified between paths via FNV 1a 64 hash plus byte length comparison.

Implementation notes

  • Deep copy of reglist. The caller's reglist is deep copied internally before being handed to hts_itr_regions. The caller always owns their original input regardless of outcome. This sidesteps hts_itr_regions's ownership inconsistency on failure (where the reglist may or may not have been freed depending on which branch failed).

  • Internal .reg resolution. tbx_itr_regions resolves any reglist[i].reg name strings via tbx_name2id before handing off (passing getid=NULL to hts_itr_regions). Callers may use either pre resolved tids or reference name strings, matching the ergonomics of sam_itr_regions.

  • Defensive sort. Intervals within each reglist entry are sorted before construction; hts_itr_multi_bam requires this. The exported comparator compare_hts_pair_pos_t from region.c is reused (previously static, now declared in hts_internal.h).

  • Shared readrec body. tbx_readrec (single region path) and tbx_readrec_multi (multi region path) delegate to a common static helper. Future fixes to meta_char handling or get_intv parsing apply to both paths automatically.

  • BGZF private_data slot. tbx_readrec_multi retrieves tbx_t from BGZF private_data because hts_itr_multi_next does not thread a per iterator context to its readrec callback. This mirrors how bcf_readrec uses the same slot for BCF header context. The constraint "only one multi region tabix iterator per htsFile at a time" is documented in tbx.h and matches the existing constraint on binary BCF iteration.

Testing

test/test_tbx_multi.c wired through test/test.pl covers:

  • Byte identical parity between single region and multi region paths across BED, VCF, and GFF preset fixtures, over five region shapes (point, wide, empty, point after empty, full contig spread).
  • Dedup of overlapping intervals: three overlapping intervals covering three records each at least twice yield exactly three records.
  • Six failure path inputs (NULL fp, NULL tbx, both NULL, NULL reglist, count zero, negative count) return NULL cleanly without leaking or crashing.
  • Two name resolution paths (known and unknown contig name).

test/bench_tbx_regions.c is the reproducible source of the measurement table above. It streams an FNV 1a 64 hash over both paths' output during an untimed parity pass, then runs timed reps with no hashing overhead. Constant memory regardless of panel size (verified at the 1.79 GB stream byte mark).

Full htslib test suite (361 tests including the new ones) is clean under `-fsanitize=address,undefined`.

Scope notes

  • This PR is the constructor only. A small follow up in bcftools to route bcftools view -T through the new constructor when the targets file is tabix indexed will be a separate PR.
  • The "one iterator per htsFile" constraint inherited from the BGZF private_data slot is documented but not removed. A future PR could thread per iterator context through hts_itr_multi_next and lift that constraint; out of scope here.

Test plan

  • 361 tests pass under ASan + UBSan
  • Byte identical parity across BED, VCF, GFF presets (test/test_tbx_multi.c)
  • Byte identical parity across 5 panel sizes, up to 1.79 GB (test/bench_tbx_regions.c)
  • Six failure path inputs return NULL cleanly
  • CSI works (the constructor is index format agnostic)

Assisted-by: Claude (claude-opus-4-7)

Constructs an hts_itr_t driven by hts_itr_multi_next that yields
records from many regions in one pass over a tabix indexed text file
(VCF.gz, BED.gz, GFF.gz, custom TSV preset). Adjacent regions are
coalesced into combined index lookups instead of one tbx_itr_queryi
call per region.

This brings tabix indexed text to parity with BAM and CRAM, which
have had sam_itr_regions since the multi region driver was added.
Affected workloads include ancestry and population genetics
subsetting VCFs to large SNP panels (the motivating case in
samtools/bcftools#2557 identified an 11 plus hour single chromosome
subset against an 84M site panel), exome capture target subsetting,
annotation overlap against large bed tracks, and trio analysis
against candidate de novo position lists.

Measured against tbx_itr_queryi per region on a 176 MB chr22 1kGP
SNP only VCF.gz:

  panel    single (s)   multi (s)   speedup
   1819         6.0          5.7      1.05x
   4797        15.8         10.2      1.56x
  18197        60.6         12.8      4.72x
  47975       159.4         12.8     12.45x
 175909       562.9         13.3     42.48x

Multi path wallclock plateaus at sequential scan time as the panel
grows; single grows roughly linearly with region count. Every row
above was byte verified between paths (matching fnv1a64 hash plus
matching byte length); the 100k row corresponds to 1.79 GB of byte
identical output.

The implementation deep copies the caller reglist so callers always
own their input regardless of outcome, pre resolves reglist[i].reg
name strings via tbx_name2id so callers may use either tids or
reference name strings, and sorts intervals defensively before
handing off to hts_itr_multi_bam.

tbx_readrec_multi retrieves tbx_t from BGZF private_data; this
mirrors how bcf_readrec uses the same slot for BCF header context
and inherits the same one multi region iterator per htsFile
constraint (documented in tbx.h).

test/test_tbx_multi.c verifies byte identical parity between the
single region and multi region paths across BED, VCF, and GFF
preset fixtures, plus dedup of overlapping intervals, plus six
failure path inputs, plus two name resolution paths.
test/bench_tbx_regions.c is the reproducible source of the table
above, streaming an fnv1a64 hash over both paths' output for
byte level parity at constant memory.

Addresses samtools/bcftools#2557.

Signed-off-by: Carsten Erickson <carstene@mailbox.org>
Assisted-by: Claude (claude-opus-4-7)
Cirrus reported four configurations failing on the PR head:

  rockylinux-gcc  builds with -std=gnu90 -Werror and rejected the
                  for-loop initial declarations and mid-block variable
                  declarations introduced in tbx.c, test_tbx_multi.c,
                  and bench_tbx_regions.c.

  debian-gcc, ubuntu-arm, ubuntu-clang  three configs run a
                  DO_UNTRACKED_FILE_CHECK that flagged the two new
                  test binaries (test/test_tbx_multi and
                  test/bench_tbx_regions) as untracked.

This commit:

  - Moves all variable declarations to block heads in tbx.c (in
    tbx_reglist_dup and tbx_itr_regions), in test/test_tbx_multi.c,
    and in test/bench_tbx_regions.c, so the files compile under
    -std=gnu90.

  - Adds /test/bench_tbx_regions and /test/test_tbx_multi entries
    to .gitignore, alphabetised next to the existing test binary
    entries.

Verified locally: gcc -std=gnu90 -Wall -Wformat=2 -Wextra -Werror
on all three .c files reports no errors, and the full ASan + UBSan
htslib test suite (361 tests) passes.

Signed-off-by: Carsten Erickson <carstene@mailbox.org>
Assisted-by: Claude (claude-opus-4-7)
@daviesrob daviesrob self-assigned this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants