Skip to content

Use the PyUnicode API for str-returning functions instead of computing in bytes and decoding#1411

Open
jmarshall wants to merge 5 commits into
pysam-developers:masterfrom
jmarshall:unicode-api
Open

Use the PyUnicode API for str-returning functions instead of computing in bytes and decoding#1411
jmarshall wants to merge 5 commits into
pysam-developers:masterfrom
jmarshall:unicode-api

Conversation

@jmarshall
Copy link
Copy Markdown
Member

Numerous fields construct strings that we know contain only ASCII, but the Python 2 provenance of the code means that we build them as bytes and then convert to strings. By using the PyUnicode_* API directly we can write the final str directly and avoid extra copies and decoding.

jmarshall added 4 commits May 10, 2026 16:09
…ties

PyUnicode_MAX_CHAR_VALUE() is not declared by Cython's cpython/unicode.pxd
so these wrappers around PyUnicode_New() encapsulate its use.
Implement it via low-level code for str, bytes, and int.
A small script was used to generate pysam_seq_comp_table:

    import pysam
    iupac = "=ACMGRSVTWYHKDBN"
    for c in range(256):
	if c % 16 == 0: print("\n   ", end='')
	base = chr(c).upper()
	if base == 'U': base = 'T'
	if (b := iupac.find(base)) > 0:
	    comp = iupac[pysam.reverse_complement(b)]
	    if chr(c).islower(): comp = comp.lower()
	    print(f" '{comp}', ", end='')
	else:
	    print(f" 0x{c:02x},", end='')

Use the new function in AlignedSegment.get_forward_sequence().
This code wants the SEQ bases as bytes rather than str. Recode it to
unpack each base individually instead, as get_query_sequences() does.

There is only one caller of build_alignment_sequence() and that raises
ValueError if it returns None. Instead raise ValueError here directly
so we can provide a more precise message. Fix CIGAR length check
(pysam_bam_get_cigar() never returns NULL) and check SEQ length.
Remove src==NULL check as by construction this is never NULL (and
other methods don't check src). Check all _delegate/src initialisations
so that this statement is actually true.
@jmarshall
Copy link
Copy Markdown
Member Author

The first candidate is AlignedSegment.get_forward_sequence(), for which we reimplement the existing

s.translate(str.maketrans("ACGTacgtNnXx", "TGCAtgcaNnXx"))[::-1]

by introducing a pysam.reverse_complement() function which fixes a bug by also complementing IUPAC ambiguity characters. Benchmarking various versions of a revcomp function indicates that this is about 5× the status quo:

-------------------------------------------------------------------------------------------
LEN=300 (time in ns)    Min               Mean              StdDev             OPS (Mops/s)
-------------------------------------------------------------------------------------------
rc_kind1            137.0796 (1.0)    139.4722 (1.0)     10.4052 (1.16)       7.1699 (1.0)    
reverse_complement  138.3299 (1.01)   140.5439 (1.01)    10.0244 (1.12)       7.1152 (0.99)   
rc_kindN            139.1600 (1.02)   141.3200 (1.01)     8.9315 (1.0)        7.0761 (0.99)   
rc_decode           332.9478 (2.43)   397.9166 (2.85)   278.2396 (31.15)      2.5131 (0.35)   
rc_xlate_base       495.8012 (3.62)   513.0620 (3.68)    51.7693 (5.80)       1.9491 (0.27)   
rc_xlate_iupac      499.9980 (3.65)   525.4157 (3.77)    47.0180 (5.26)       1.9033 (0.27)   
rc_xlate_mktrans    707.9798 (5.16)   828.5009 (5.94)    87.5041 (9.80)       1.2070 (0.17)   

-------------------------------------------------------------------------------------------
LEN=50000 (time in us)  Min               Mean              StdDev             OPS (Kops/s)
-------------------------------------------------------------------------------------------
rc_kind1             14.2910 (1.0)     14.4766 (1.00)     0.9447 (1.04)      69.0771 (1.00)
rc_kindN             14.2910 (1.0)     14.4912 (1.00)     0.9086 (1.00)      69.0072 (1.00)
reverse_complement   14.2910 (1.0)     14.4726 (1.0)      0.9053 (1.0)       69.0960 (1.0)
rc_decode            29.1250 (2.04)    29.4319 (2.03)     1.2421 (1.37)      33.9767 (0.49)
rc_xlate_base        64.3750 (4.50)    64.9110 (4.49)     2.2945 (2.53)      15.4057 (0.22)
rc_xlate_iupac       64.3750 (4.50)    65.0765 (4.50)     2.4087 (2.66)      15.3665 (0.22)
rc_xlate_mktrans     64.7080 (4.53)    65.2052 (4.51)     2.2621 (2.50)      15.3362 (0.22)
-------------------------------------------------------------------------------------------
  • rc_xlate_mktrans() is the existing code with str.mktrans inline and done repeatedly
  • rc_xlate_base() pre-computes the ATGCN mktrans once, for a noticeable improvement at LEN=300 but no real difference at LEN=50000
  • rc_xlate_iupac() is similar, but precomputes a table that complements all the IUPAC codes
  • rc_decode() encodes the input str as bytes, applies the C-based pysam_seq_comp_table reverse-complementing on bytes, and decodes the result back to str; for a noticeable improvement over the str.translate() functions
  • reverse_complement() applies the C-based translation directly to a newly-constructed str; for about a doubling of the rc_decode performance
  • rc_kind1() and rc_kindN() go directly to the 1BYTE_KIND and generic versions of reverse_complement(); sadly this is not a noticeable improvement, so there may be no need to write 1BYTE versions of these functions.

Also optimise it to unpack high and low nibbles explicitly.
@jmarshall
Copy link
Copy Markdown
Member Author

Next candidate is getSequenceInRange(), which powers AlignedSegment.query_sequence. Again we've benchmarked various versions:

----------------------------------- benchmark: 5 tests ----------------------------------------
300 (time in ns)      Min                Mean             StdDev            OPS (Mops/s)          
-----------------------------------------------------------------------------------------------
qseq_bytable      95.4198 (1.0)       97.2691 (1.0)       3.4470 (1.0)           10.2808 (1.0)    
qseq_by2         122.0795 (1.28)     124.1991 (1.28)     14.0612 (4.08)           8.0516 (0.78)   
qseq_by1         247.0001 (2.59)     257.5154 (2.65)     18.8781 (5.48)           3.8833 (0.38)   
qseq_nils        305.4981 (3.20)     319.1197 (3.28)     25.9907 (7.54)           3.1336 (0.30)   
qseq_orig        332.9478 (3.49)     375.7928 (3.86)     29.9684 (8.69)           2.6610 (0.26)   

----------------------------------- benchmark: 5 tests ----------------------------------------
50000 (time in us)   Min               Mean            StdDev            OPS (Kops/s)          
-----------------------------------------------------------------------------------------------
qseq_bytable      7.1660 (1.0)       7.3359 (1.0)      0.3868 (1.0)          136.3158 (1.0)    
qseq_by2         12.9170 (1.80)     13.1533 (1.79)     0.5736 (1.48)          76.0267 (0.56)   
qseq_by1         34.1670 (4.77)     34.5578 (4.71)     0.8682 (2.24)          28.9370 (0.21)   
qseq_nils        39.4580 (5.51)     40.1932 (5.48)     2.6939 (6.96)          24.8798 (0.18)   
qseq_orig        40.2500 (5.62)     40.7149 (5.55)     2.0642 (5.34)          24.5610 (0.18)   
-----------------------------------------------------------------------------------------------
  • qseq_orig() is the existing code
  • qseq_nils() removes the redundant charptr_to_str() call from the existing code (cf PR Remove redundant copy in getSequenceInRange #1385), leading to a noticeable improvement for short reads
  • qseq_by1() applies the existing code directly to a str rather than a bytes
  • qseq_by2() is the code in this PR
  • qseq_bytable() translates htslib/sam_internal.h's code2base approach to Cython.

This PR's code represents about a 3× improvement on the status quo. The code2base approach is another 30% to 80% improvement but requires another table to be added. Instead of doing that, when samtools/htslib#1974 lands we will be able to take advantage of SIMD-optimised unpacking, which should be a much larger improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant