Encode/decode astral characters as surrogate pairs on UTF-16 hosts by snmsts · Pull Request #67 · cl-babel/babel

snmsts · 2026-06-30T08:52:27Z

Encode/decode astral characters as surrogate pairs on UTF-16 hosts

Problem

On Lisp implementations whose char-code-limit is exactly #x10000 -- ABCL
(JVM char), and others backed by a UTF-16 string type -- a string holds
UTF-16 code units, so a character above the BMP (a code point > #xFFFF) is
stored as a surrogate pair (two characters). Babel's codecs work in full
code-point space, but the string<->code-point bridge (string-get /
string-set, i.e. char-code / code-char) assumes one character per code
point. As a result astral text is mangled on these hosts:

;; ABCL 1.9.3, stock babel:
(babel:string-to-octets (string (code-char #x1F600)) :encoding :utf-8)
;; => #(#xED #xA0 #xBD #xED #xB8 #x80)   ; CESU-8, should be #(#xF0 #x9F #x98 #x80)

(length (babel:octets-to-string #(#xF0 #x9F #x98 #x80) :encoding :utf-8))
;; => 1, re-encodes to #(#xEF #x98 #x80)  ; U+1F600 silently truncated to U+F600

- Encode: each surrogate half is encoded separately -> CESU-8, not UTF-8.
- Decode: an astral code point is passed to code-char; ABCL silently
  masks it to 16 bits (U+1F600 -> U+F600), other UTF-16 hosts signal
  character-out-of-range.
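
The byte sequences in the reproduction can be checked by hand. The following is a minimal sketch of that arithmetic, not code from the patch: `split-code-point` and `utf-8-octets` are hypothetical helper names, and the encoder deliberately applies the generic UTF-8 bit layout without rejecting surrogates, which is what makes the CESU-8 output appear.

```lisp
;; Illustrative helpers only; neither name is part of babel.

(defun split-code-point (cp)
  "Return the UTF-16 high and low surrogates for an astral code point CP."
  (assert (<= #x10000 cp #x10FFFF))
  (let ((v (- cp #x10000)))
    (values (+ #xD800 (ash v -10))          ; top 10 bits
            (+ #xDC00 (logand v #x3FF)))))  ; bottom 10 bits

(defun utf-8-octets (cp)
  "Encode CP with the generic UTF-8 bit layout, surrogates included."
  (cond ((< cp #x80) (list cp))
        ((< cp #x800)
         (list (logior #xC0 (ash cp -6))
               (logior #x80 (logand cp #x3F))))
        ((< cp #x10000)
         (list (logior #xE0 (ash cp -12))
               (logior #x80 (logand (ash cp -6) #x3F))
               (logior #x80 (logand cp #x3F))))
        (t
         (list (logior #xF0 (ash cp -18))
               (logior #x80 (logand (ash cp -12) #x3F))
               (logior #x80 (logand (ash cp -6) #x3F))
               (logior #x80 (logand cp #x3F))))))

;; U+1F600 as one code point: (#xF0 #x9F #x98 #x80)
(utf-8-octets #x1F600)

;; Each surrogate half encoded on its own: (#xED #xA0 #xBD #xED #xB8 #x80)
(multiple-value-bind (hi lo) (split-code-point #x1F600)
  (append (utf-8-octets hi) (utf-8-octets lo)))

;; Masked to 16 bits first (U+F600): (#xEF #x98 #x80)
(utf-8-octets (logand #x1F600 #xFFFF))
```

The three results correspond to the correct UTF-8 encoding, the CESU-8 output on encode, and the truncated output on decode shown above.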

(Found while bringing babel up on a .NET-backed Lisp; ABCL exhibits the same
bug, which is how this is reproduced above.)

Fix

Bridge surrogates at the string<->code-point boundary, leaving every codec
(UTF-8/16/32, CJK, 8-bit, etc.) and the mapping machinery completely untouched
-- they already work in full code-point space:

- string-to-octets: combine surrogate pairs into a code-point vector
  (%string-to-codepoints), then run the existing encoder over it.
- octets-to-string: decode into a code-point vector, then split astral code
  points back into surrogate pairs (%codepoints-to-string).
- babel-streams stream-write-char: buffer a high surrogate until its low
  half arrives, then encode the combined code point.
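
The combine and split steps can be sketched as follows. This is not the patch's %string-to-codepoints / %codepoints-to-string; the function names are hypothetical, and the sketch works on sequences of integer code units rather than on strings so that it behaves the same on any host. It assumes lone surrogates are passed through unchanged, in line with the lone-surrogate behavior described in this PR.

```lisp
;; Hypothetical names; a sketch of the technique, not babel's implementation.

(defun high-surrogate-p (u) (<= #xD800 u #xDBFF))
(defun low-surrogate-p  (u) (<= #xDC00 u #xDFFF))

(defun units-to-code-points (units)
  "Combine surrogate pairs in UNITS (a sequence of UTF-16 code units)
into code points. Unpaired surrogates are passed through unchanged."
  (let ((units (coerce units 'list))
        (out '()))
    (loop while units
          do (let ((u (pop units)))
               (if (and (high-surrogate-p u)
                        units
                        (low-surrogate-p (first units)))
                   ;; A well-formed pair: consume the low half as well.
                   (push (+ #x10000
                            (ash (- u #xD800) 10)
                            (- (pop units) #xDC00))
                         out)
                   (push u out))))
    (nreverse out)))

(defun code-points-to-units (code-points)
  "Split every code point above #xFFFF in CODE-POINTS into a surrogate pair."
  (let ((out '()))
    (map nil
         (lambda (cp)
           (if (> cp #xFFFF)
               (let ((v (- cp #x10000)))
                 (push (+ #xD800 (ash v -10)) out)
                 (push (+ #xDC00 (logand v #x3FF)) out))
               (push cp out)))
         code-points)
    (nreverse out)))

;; "A", U+1F600 as a pair, "B"
(units-to-code-points '(#x41 #xD83D #xDE00 #x42)) ; => (#x41 #x1F600 #x42)
(code-points-to-units '(#x41 #x1F600 #x42))       ; => (#x41 #xD83D #xDE00 #x42)
```

Because the codecs only ever see the combined code points, they need no changes; the pair handling stays at the boundary.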

Everything is gated on +utf16-host-p+, defined as (= char-code-limit #x10000).
The test is exact, so a non-Unicode build with char-code-limit 256 is
excluded, and the new code is dead on UTF-32 hosts (SBCL, CCL, ECL, ...) --
no behavior change there. BMP-only strings take an unchanged fast path (no extra allocation). A
new *codepoint-vector-mappings* reuses the existing codecs via aref
accessors. Lone surrogates keep babel's current behavior (encoded as-is, i.e.
WTF-8), identical on all hosts.

The detection assumes a host with 16-bit characters is UTF-16 (true for ABCL
and the other current implementations of this kind); the char-code-limit
autodetect needs no per-implementation feature list and handles future UTF-16
hosts as well.
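
A rough sketch of how such a gate and fast-path decision could look is given below. The names `surrogate-code-p`, `bmp-only-p` and `encode-path` are hypothetical and are not taken from the patch; the sketch also assumes the fast-path check is a simple scan for surrogate code units, which the PR does not spell out. The limit is a parameter here only so that the three cases can be tried on any host.

```lisp
;; Hypothetical names; assumes the fast path is chosen by scanning for
;; surrogate code units, which may differ from the actual patch.

(defun surrogate-code-p (code)
  (<= #xD800 code #xDFFF))

(defun bmp-only-p (string)
  "True when STRING contains no surrogate code units."
  (notany (lambda (ch) (surrogate-code-p (char-code ch))) string))

(defun encode-path (string &optional (limit char-code-limit))
  "Name the path an encoder would take for STRING on a host whose
CHAR-CODE-LIMIT is LIMIT."
  (cond ((/= limit #x10000) :unchanged)        ; UTF-32 or 8-bit host
        ((bmp-only-p string) :fast-path)       ; UTF-16 host, no surrogates
        (t :combine-surrogates)))              ; UTF-16 host, astral text

(encode-path "hello" 1114112)  ; => :UNCHANGED
(encode-path "hello" 256)      ; => :UNCHANGED
(encode-path "hello" #x10000)  ; => :FAST-PATH
```

Comparing against #x10000 exactly, rather than testing for a limit below #x110000, is what keeps 8-bit builds out of the surrogate path.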

Tests

Adds host-independent astral round-trip tests for UTF-8/16/32 (the octets are
the source of truth, so they assert octets -> string -> octets is identity):
astral.utf-8.roundtrip, .mixed, .consecutive, .boundary (U+FFFF /
U+10000), astral.utf-16le.roundtrip, astral.utf-32le.roundtrip.

Verification

host	char-code-limit	result
ABCL 1.9.3	65536 (UTF-16)	astral round-trips correctly (was CESU-8 / silent truncation)
SBCL	1114112 (UTF-32)	patch inert; new tests pass; no regression

(Two pre-existing test failures on SBCL -- encoder/decoder-retvals and
rw-equiv.1 -- are unrelated to this change and present on master as well.)

Notes

- No new conditions or codecs; no public API change.
- Diff: src/strings.lisp (boundary transcode + helpers), src/streams.lisp
  (stream-write-char surrogate buffering), tests/tests.lisp (astral tests).
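
The surrogate buffering in stream-write-char can be pictured with a small closure. This is only a sketch under assumptions: `make-unit-writer` is a hypothetical name, it receives integer code units instead of characters, and it flushes an unpaired high surrogate as-is when the next unit is not a low surrogate. What the real method does with a high surrogate still pending when the stream is closed is not stated in this PR, so the sketch leaves that case out.

```lisp
;; Hypothetical helper; not the stream-write-char method from the patch.

(defun make-unit-writer (emit)
  "Return a function that accepts one UTF-16 code unit per call and
calls EMIT with each complete code point."
  (let ((pending nil))                  ; buffered high surrogate, if any
    (lambda (unit)
      (cond ((and pending (<= #xDC00 unit #xDFFF))
             ;; The low half arrived: emit the combined code point.
             (funcall emit (+ #x10000
                              (ash (- pending #xD800) 10)
                              (- unit #xDC00)))
             (setf pending nil))
            (t
             (when pending
               ;; The buffered high surrogate was unpaired: emit it as-is.
               (funcall emit pending)
               (setf pending nil))
             (if (<= #xD800 unit #xDBFF)
                 (setf pending unit)    ; wait for a possible low half
                 (funcall emit unit)))))))

(let* ((seen '())
       (write-unit (make-unit-writer (lambda (cp) (push cp seen)))))
  (dolist (u '(#x41 #xD83D #xDE00 #x42))
    (funcall write-unit u))
  (reverse seen))                       ; => (#x41 #x1F600 #x42)
```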

On Lisp implementations whose CHAR-CODE-LIMIT is exactly #x10000 (ABCL, and others backed by a UTF-16 string type) a string holds UTF-16 code units, so a character above the BMP is a surrogate pair (two characters). Babel's codecs work in full code-point space, but the string<->code-point bridge assumed one character per code point, so astral text was mangled:

- STRING-TO-OCTETS encoded each surrogate half separately (CESU-8): e.g. U+1F600 -> ED A0 BD ED B8 80 instead of F0 9F 98 80.
- OCTETS-TO-STRING handed an astral code point to CODE-CHAR; ABCL silently truncated it to 16 bits (U+1F600 -> U+F600), other UTF-16 hosts errored.

Fix the bridge, leaving every codec untouched. STRING-TO-OCTETS combines surrogate pairs into a code-point vector before encoding; OCTETS-TO-STRING decodes into a code-point vector and splits astral code points back into surrogate pairs. BMP-only strings keep an unchanged fast path.

Everything is gated on +UTF16-HOST-P+ ((= CHAR-CODE-LIMIT #x10000), an exact test so a non-Unicode build with CHAR-CODE-LIMIT 256 is excluded), and is dead code on UTF-32 hosts, so SBCL/CCL/etc. are unaffected.

babel-streams' STREAM-WRITE-CHAR buffers a high surrogate until its low half arrives.

Add host-independent astral round-trip tests (UTF-8/16/32; octets are the source of truth). Verified on ABCL 1.9.3 (UTF-16) and SBCL (UTF-32, inert).

snmsts wants to merge 1 commit into cl-babel:master from snmsts:utf-16-host-surrogate-support
