Use the PyUnicode API for str-returning functions instead of computing in bytes and decoding #1411
jmarshall wants to merge 5 commits
…ties PyUnicode_MAX_CHAR_VALUE() is not declared by Cython's cpython/unicode.pxd so these wrappers around PyUnicode_New() encapsulate its use.
Implement it via low-level code for str, bytes, and int.
A small script was used to generate pysam_seq_comp_table:
    import pysam
    iupac = "=ACMGRSVTWYHKDBN"
    for c in range(256):
        if c % 16 == 0: print("\n ", end='')
        base = chr(c).upper()
        if base == 'U': base = 'T'
        if (b := iupac.find(base)) > 0:
            comp = iupac[pysam.reverse_complement(b)]
            if chr(c).islower(): comp = comp.lower()
            print(f" '{comp}', ", end='')
        else:
            print(f" 0x{c:02x},", end='')
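The same kind of table can be sketched in pure Python without a pysam build that has the new function. This is only an illustration: it assumes the BAM 4-bit base codes (A=1, C=2, G=4, T=8), under which complementing a code amounts to reversing its four bits. The names complement_code, build_table and revcomp_sketch are hypothetical helpers for this sketch, not part of pysam.

```python
IUPAC = "=ACMGRSVTWYHKDBN"  # index == 4-bit BAM base code


def complement_code(code: int) -> int:
    """Reverse the four bits of a base code: A(1)<->T(8), C(2)<->G(4)."""
    return ((code & 1) << 3) | ((code & 2) << 1) | ((code & 4) >> 1) | ((code & 8) >> 3)


def build_table() -> bytes:
    """Build a 256-entry table; bytes that are not base letters map to themselves."""
    table = bytearray(range(256))
    for c in range(128):  # base letters are all ASCII
        ch = chr(c)
        base = ch.upper()
        if base == "U":  # treat U as T, as the generator script does
            base = "T"
        code = IUPAC.find(base)
        if code > 0:  # skips '=' (code 0) and non-base characters (-1)
            comp = IUPAC[complement_code(code)]
            table[c] = ord(comp.lower() if ch.islower() else comp)
    return bytes(table)


TABLE = build_table()


def revcomp_sketch(seq: str) -> str:
    """Complement every base via the table, then reverse the sequence."""
    return seq.encode("ascii").translate(TABLE)[::-1].decode("ascii")
```

Characters outside the IUPAC alphabet pass through unchanged, which mirrors the 0x.. identity entries the script prints for them.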
Use the new function in AlignedSegment.get_forward_sequence().
This code wants the SEQ bases as bytes rather than str. Recode it to unpack each base individually instead, as get_query_sequences() does.

There is only one caller of build_alignment_sequence() and that raises ValueError if it returns None. Instead raise ValueError here directly so we can provide a more precise message.

Fix CIGAR length check (pysam_bam_get_cigar() never returns NULL) and check SEQ length. Remove src==NULL check as by construction this is never NULL (and other methods don't check src). Check all _delegate/src initialisations so that this statement is actually true.
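The "raise here rather than return None" pattern can be illustrated with a rough Python sketch. It is not the PR's Cython code: check_alignment is a hypothetical helper, the (op, length) tuple representation is an assumption, and the final consistency check between CIGAR and SEQ goes beyond what the commit message states.

```python
QUERY_CONSUMING = frozenset("MIS=X")  # CIGAR ops that consume SEQ bases


def check_alignment(cigar, seq):
    """Validate inputs up front and raise with a specific message.

    cigar: list of (op, length) tuples, e.g. [("M", 4)]  (assumed layout)
    seq:   the read's bases as a str
    """
    if len(cigar) == 0:
        raise ValueError("alignment has no CIGAR")
    if len(seq) == 0:
        raise ValueError("alignment has no SEQ")
    # Extra illustration only: the query length implied by CIGAR should match SEQ.
    expected = sum(n for op, n in cigar if op in QUERY_CONSUMING)
    if expected != len(seq):
        raise ValueError(
            f"CIGAR implies {expected} query bases but SEQ has {len(seq)}"
        )
    return expected
```

Because each failure raises where it is detected, the caller no longer needs to translate a None return into a generic error.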
The first candidate is s.translate(str.maketrans("ACGTacgtNnXx", "TGCAtgcaNnXx"))[::-1] by introducing a
Also optimise it to unpack high and low nibbles explicitly.
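As a rough picture of what unpacking nibbles explicitly means, here is a minimal Python sketch. It assumes the BAM layout in which each byte of SEQ holds two 4-bit base codes, with the earlier base in the high nibble; unpack_seq is a hypothetical name and the real code operates on the C buffer instead.

```python
IUPAC = "=ACMGRSVTWYHKDBN"  # index == 4-bit BAM base code


def unpack_seq(packed: bytes, length: int) -> str:
    """Decode `length` bases from 4-bit packed SEQ data."""
    out = []
    # Full bytes: emit the high nibble, then the low nibble.
    for byte in packed[: length // 2]:
        out.append(IUPAC[byte >> 4])
        out.append(IUPAC[byte & 0x0F])
    # An odd length leaves one final base in the high nibble of the next byte.
    if length % 2:
        out.append(IUPAC[packed[length // 2] >> 4])
    return "".join(out)
```

Handling two bases per loop iteration avoids computing an index and a shift separately for every base.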
Next candidate is
This PR's code represents about a 3× improvement on the status quo.
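To reproduce this kind of comparison locally, the str.translate candidate quoted earlier can be timed with a small sketch like the one below. The test sequence and repeat count are arbitrary assumptions, and revcomp_translate is a hypothetical wrapper name; results will vary by machine and Python version.

```python
import timeit

# Translation table from the candidate quoted in the conversation.
COMPLEMENT = str.maketrans("ACGTacgtNnXx", "TGCAtgcaNnXx")


def revcomp_translate(s: str) -> str:
    """Complement with str.translate, then reverse with a slice."""
    return s.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    seq = "ACGTTGCANNacgt" * 10  # arbitrary 140-base test read
    calls = 100_000
    seconds = timeit.timeit(lambda: revcomp_translate(seq), number=calls)
    print(f"str.translate candidate: {seconds / calls * 1e9:.0f} ns per call")
```

Timing other candidates with the same sequence and call count keeps the figures comparable.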
Numerous fields construct strings that we know contain only ASCII, but the Python 2 provenance of the code means that we build them as bytes and then convert to strings. By using the PyUnicode_* API we can write the final str directly and avoid extra copies and decoding.