Use the PyUnicode API for str-returning functions instead of computing in bytes and decoding by jmarshall · Pull Request #1411 · pysam-developers/pysam

jmarshall · 2026-05-25T10:13:48Z

Numerous fields construct strings that we know contain only ASCII, but the Python 2 provenance of the code means that we build them as bytes and then convert to strings. By using the PyUnicode_* API directly we can write the final str directly and avoid extra copies and decoding.

…ties PyUnicode_MAX_CHAR_VALUE() is not declared by Cython's cpython/unicode.pxd so these wrappers around PyUnicode_New() encapsulate its use.

Implement it via low-level code for str, bytes, and int. A small script was used to generate pysam_seq_comp_table: import pysam iupac = "=ACMGRSVTWYHKDBN" for c in range(256): if c % 16 == 0: print("\n ", end='') base = chr(c).upper() if base == 'U': base = 'T' if (b := iupac.find(base)) > 0: comp = iupac[pysam.reverse_complement(b)] if chr(c).islower(): comp = comp.lower() print(f" '{comp}', ", end='') else: print(f" 0x{c:02x},", end='') Use the new function in AlignedSegment.get_forward_sequence().

This code wants the SEQ bases as bytes rather than str. Recode it to unpack each base individually instead, as get_query_sequences() does. There is only one caller of build_alignment_sequence() and that raises ValueError if it returns None. Instead raise ValueError here directly so we can provide a more precise message. Fix CIGAR length check (pysam_bam_get_cigar() never returns NULL) and check SEQ length. Remove src==NULL check as by construction this is never NULL (and other methods don't check src). Check all _delegate/src initialisations so that this statement is actually true.

jmarshall · 2026-05-25T10:36:05Z

The first candidate is AlignedSegment.get_forward_sequence(), for which we reimplement the existing

s.translate(str.maketrans("ACGTacgtNnXx", "TGCAtgcaNnXx"))[::-1]

by introducing a pysam.reverse_complement() function which fixes a bug by also complementing IUPAC ambiguity characters. Benchmarking various versions of a revcomp function indicates that this is about 5× the status quo:

-------------------------------------------------------------------------------------------
LEN=300 (time in ns)    Min               Mean              StdDev             OPS (Mops/s)
-------------------------------------------------------------------------------------------
rc_kind1            137.0796 (1.0)    139.4722 (1.0)     10.4052 (1.16)       7.1699 (1.0)    
reverse_complement  138.3299 (1.01)   140.5439 (1.01)    10.0244 (1.12)       7.1152 (0.99)   
rc_kindN            139.1600 (1.02)   141.3200 (1.01)     8.9315 (1.0)        7.0761 (0.99)   
rc_decode           332.9478 (2.43)   397.9166 (2.85)   278.2396 (31.15)      2.5131 (0.35)   
rc_xlate_base       495.8012 (3.62)   513.0620 (3.68)    51.7693 (5.80)       1.9491 (0.27)   
rc_xlate_iupac      499.9980 (3.65)   525.4157 (3.77)    47.0180 (5.26)       1.9033 (0.27)   
rc_xlate_mktrans    707.9798 (5.16)   828.5009 (5.94)    87.5041 (9.80)       1.2070 (0.17)   

-------------------------------------------------------------------------------------------
LEN=50000 (time in us)  Min               Mean              StdDev             OPS (Kops/s)
-------------------------------------------------------------------------------------------
rc_kind1             14.2910 (1.0)     14.4766 (1.00)     0.9447 (1.04)      69.0771 (1.00)
rc_kindN             14.2910 (1.0)     14.4912 (1.00)     0.9086 (1.00)      69.0072 (1.00)
reverse_complement   14.2910 (1.0)     14.4726 (1.0)      0.9053 (1.0)       69.0960 (1.0)
rc_decode            29.1250 (2.04)    29.4319 (2.03)     1.2421 (1.37)      33.9767 (0.49)
rc_xlate_base        64.3750 (4.50)    64.9110 (4.49)     2.2945 (2.53)      15.4057 (0.22)
rc_xlate_iupac       64.3750 (4.50)    65.0765 (4.50)     2.4087 (2.66)      15.3665 (0.22)
rc_xlate_mktrans     64.7080 (4.53)    65.2052 (4.51)     2.2621 (2.50)      15.3362 (0.22)
-------------------------------------------------------------------------------------------

rc_xlate_mktrans() is the existing code with str.mktrans inline and done repeatedly
rc_xlate_base() pre-computes the ATGCN mktrans once, for a noticeable improvement at LEN=300 but no real difference at LEN=50000
rc_xlate_iupac() is similar, but precomputes a table that complements all the IUPAC codes
rc_decode() encodes the input str as bytes, applies the C-based pysam_seq_comp_table reverse-complementing on bytes, and decodes the result back to str; for a noticeable improvement over the str.translate() functions
reverse_complement() applies the C-based translation directly to a newly-constructed str; for about a doubling of the rc_decode performance
rc_kind1() and rc_kindN() go directly to the 1BYTE_KIND and generic versions of reverse_complement(); sadly this is not a noticeable improvement, so there may be no need to write 1BYTE versions of these functions.

Also optimise it to unpack high and low nibbles explicitly.

jmarshall · 2026-05-28T11:59:09Z

Next candidate is getSequenceInRange(), which powers AlignedSegment.query_sequence. Again we've benchmarked various versions:

----------------------------------- benchmark: 5 tests ----------------------------------------
300 (time in ns)      Min                Mean             StdDev            OPS (Mops/s)          
-----------------------------------------------------------------------------------------------
qseq_bytable      95.4198 (1.0)       97.2691 (1.0)       3.4470 (1.0)           10.2808 (1.0)    
qseq_by2         122.0795 (1.28)     124.1991 (1.28)     14.0612 (4.08)           8.0516 (0.78)   
qseq_by1         247.0001 (2.59)     257.5154 (2.65)     18.8781 (5.48)           3.8833 (0.38)   
qseq_nils        305.4981 (3.20)     319.1197 (3.28)     25.9907 (7.54)           3.1336 (0.30)   
qseq_orig        332.9478 (3.49)     375.7928 (3.86)     29.9684 (8.69)           2.6610 (0.26)   

----------------------------------- benchmark: 5 tests ----------------------------------------
50000 (time in us)   Min               Mean            StdDev            OPS (Kops/s)          
-----------------------------------------------------------------------------------------------
qseq_bytable      7.1660 (1.0)       7.3359 (1.0)      0.3868 (1.0)          136.3158 (1.0)    
qseq_by2         12.9170 (1.80)     13.1533 (1.79)     0.5736 (1.48)          76.0267 (0.56)   
qseq_by1         34.1670 (4.77)     34.5578 (4.71)     0.8682 (2.24)          28.9370 (0.21)   
qseq_nils        39.4580 (5.51)     40.1932 (5.48)     2.6939 (6.96)          24.8798 (0.18)   
qseq_orig        40.2500 (5.62)     40.7149 (5.55)     2.0642 (5.34)          24.5610 (0.18)   
-----------------------------------------------------------------------------------------------

qseq_orig() is the existing code
qseq_nils() removes the redundant charptr_to_str() call from the existing code (cf PR Remove redundant copy in getSequenceInRange #1385), leading to a noticeable improvement for short reads
qseq_by1() applies the existing code directly to a str rather than a bytes
qseq_by2() is the code in this PR
qseq_bytable() translates htslib/sam_internal.h's code2base approach to Cython.

This PR's code represents about a 3× improvement on the status quo. The code2base approach is another 30% to 80% improvement but requires another table to be added. Instead of doing that, when samtools/htslib#1974 lands we will be able to take advantage of SIMD-optimised unpacking, which should be a much larger improvement.

jmarshall added 4 commits May 10, 2026 16:09

Add PysamUnicode_NewClone() and PysamUnicode_NewCloneWithSize() utili…

11dea5e

…ties PyUnicode_MAX_CHAR_VALUE() is not declared by Cython's cpython/unicode.pxd so these wrappers around PyUnicode_New() encapsulate its use.

Add pysam.reverse_complement_inplace() utility function

d731535

jmarshall mentioned this pull request May 25, 2026

Cache complement translation table in get_forward_sequence #1386

Closed

1 task

Recode getSequenceInRange() to return str directly

07801ea

Also optimise it to unpack high and low nibbles explicitly.

jmarshall mentioned this pull request May 28, 2026

Remove redundant copy in getSequenceInRange #1385

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use the PyUnicode API for str-returning functions instead of computing in bytes and decoding#1411

Use the PyUnicode API for str-returning functions instead of computing in bytes and decoding#1411
jmarshall wants to merge 5 commits into
pysam-developers:masterfrom
jmarshall:unicode-api

jmarshall commented May 25, 2026

Uh oh!

jmarshall commented May 25, 2026

Uh oh!

jmarshall commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jmarshall commented May 25, 2026

Uh oh!

jmarshall commented May 25, 2026

Uh oh!

jmarshall commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant