Encode/decode astral characters as surrogate pairs on UTF-16 hosts #67

Open
snmsts wants to merge 1 commit into cl-babel:master from snmsts:utf-16-host-surrogate-support

snmsts commented Jun 30, 2026

Encode/decode astral characters as surrogate pairs on UTF-16 hosts

Problem

On Lisp implementations whose char-code-limit is exactly #x10000 -- ABCL
(JVM char), and others backed by a UTF-16 string type -- a string holds
UTF-16 code units, so a character above the BMP (a code point > #xFFFF) is
stored as a surrogate pair (two characters). Babel's codecs work in full
code-point space, but the string<->code-point bridge (string-get /
string-set, i.e. char-code / code-char) assumes one character per code
point. As a result astral text is mangled on these hosts:

;; ABCL 1.9.3, stock babel:
(babel:string-to-octets (string (code-char #x1F600)) :encoding :utf-8)
;; => #(#xED #xA0 #xBD #xED #xB8 #x80)   ; CESU-8, should be #(#xF0 #x9F #x98 #x80)

(length (babel:octets-to-string #(#xF0 #x9F #x98 #x80) :encoding :utf-8))
;; => 1, re-encodes to #(#xEF #x98 #x80)  ; U+1F600 silently truncated to U+F600

  • Encode: each surrogate half is encoded separately -> CESU-8, not UTF-8.
  • Decode: an astral code point is passed to code-char; ABCL silently
    masks it to 16 bits (U+1F600 -> U+F600), other UTF-16 hosts signal
    character-out-of-range.
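
The octet sequences quoted above can be reproduced with plain integer arithmetic on any host. The following is a minimal sketch, not babel code: SPLIT-ASTRAL and UTF-8-OCTETS are hypothetical helper names, and the encoder deliberately does not reject surrogates so that the CESU-8 result is visible.

```lisp
;; Illustration only: these helpers are hypothetical, not part of babel.
(defun split-astral (code-point)
  "Return the high and low UTF-16 surrogates for CODE-POINT (> #xFFFF)."
  (let ((v (- code-point #x10000)))
    (values (+ #xD800 (ldb (byte 10 10) v))   ; top 10 bits
            (+ #xDC00 (ldb (byte 10 0) v))))) ; bottom 10 bits

(defun utf-8-octets (code)
  "Encode one integer CODE as UTF-8 octets, without rejecting surrogates."
  (cond ((< code #x80) (list code))
        ((< code #x800)
         (list (logior #xC0 (ldb (byte 5 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        ((< code #x10000)
         (list (logior #xE0 (ldb (byte 4 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        (t
         (list (logior #xF0 (ldb (byte 3 18) code))
               (logior #x80 (ldb (byte 6 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))))

;; Encoding the whole code point gives proper UTF-8:
(utf-8-octets #x1F600)
;; equals (#xF0 #x9F #x98 #x80)

;; Encoding each surrogate half separately gives CESU-8:
(multiple-value-bind (hi lo) (split-astral #x1F600)
  (append (utf-8-octets hi) (utf-8-octets lo)))
;; equals (#xED #xA0 #xBD #xED #xB8 #x80)
```

The second result is what the stock encoder produces on a UTF-16 host, because it only ever sees the two halves #xD83D and #xDE00.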

(Found while bringing babel up on a .NET-backed Lisp; ABCL exhibits the same
bug, which is how this is reproduced above.)

Fix

Bridge surrogates at the string<->code-point boundary, leaving every codec
(UTF-8/16/32, CJK, 8-bit, etc.) and the mapping machinery completely untouched
-- they already work in full code-point space:

  • string-to-octets: combine surrogate pairs into a code-point vector
    (%string-to-codepoints), then run the existing encoder over it.
  • octets-to-string: decode into a code-point vector, then split astral code
    points back into surrogate pairs (%codepoints-to-string).
  • babel-streams stream-write-char: buffer a high surrogate until its low
    half arrives, then encode the combined code point.
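
The two boundary conversions above can be sketched roughly as follows. This is not the PR's code: COMBINE-SURROGATES and SPLIT-CODE-POINTS are hypothetical names, and they operate on vectors of integers (UTF-16 code units and code points) rather than on strings so that the sketch runs identically on any host. Passing lone surrogates through unchanged is an assumption based on the description's note about lone surrogates.

```lisp
;; Hypothetical helpers on integer vectors; babel's real helpers may differ.
(defun combine-surrogates (units)
  "Map a vector of UTF-16 code units to a vector of code points."
  (let ((out (make-array (length units) :fill-pointer 0))
        (n (length units))
        (i 0))
    (loop while (< i n)
          do (let ((u (aref units i)))
               (if (and (<= #xD800 u #xDBFF)
                        (< (1+ i) n)
                        (<= #xDC00 (aref units (1+ i)) #xDFFF))
                   ;; Valid pair: fold both halves into one code point.
                   (progn
                     (vector-push (+ #x10000
                                     (ash (- u #xD800) 10)
                                     (- (aref units (1+ i)) #xDC00))
                                  out)
                     (incf i 2))
                   ;; BMP unit or lone surrogate: pass through as-is.
                   (progn
                     (vector-push u out)
                     (incf i)))))
    out))

(defun split-code-points (code-points)
  "Map a vector of code points to a vector of UTF-16 code units."
  (let ((out (make-array (* 2 (length code-points)) :fill-pointer 0)))
    (loop for cp across code-points
          do (if (> cp #xFFFF)
                 ;; Astral code point: emit high then low surrogate.
                 (let ((v (- cp #x10000)))
                   (vector-push (+ #xD800 (ash v -10)) out)
                   (vector-push (+ #xDC00 (logand v #x3FF)) out))
                 (vector-push cp out)))
    out))
```

For example, (combine-surrogates #(#x41 #xD83D #xDE00 #x42)) yields a vector holding #x41, #x1F600 and #x42, and SPLIT-CODE-POINTS inverts it.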

Everything is gated on +utf16-host-p+, i.e. (= char-code-limit #x10000). The
test is exact, so a non-Unicode build with char-code-limit 256 is excluded,
and the new code is dead on UTF-32 hosts (SBCL, CCL, ECL, ...): no behavior
change there. BMP-only strings take an unchanged fast path (no extra
allocation). A new *codepoint-vector-mappings* reuses the existing codecs via
aref accessors. Lone surrogates keep babel's current behavior (encoded as-is,
i.e. WTF-8), identical on all hosts.

The detection assumes a host with 16-bit characters is UTF-16 (true for ABCL
and the other current implementations of this kind); the char-code-limit
autodetect needs no per-implementation feature list and handles future UTF-16
hosts as well.
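
As a rough illustration of the gating and the fast-path check, the sketch below defines the constant with the exact test quoted above and a predicate that decides whether a string needs the surrogate bridge at all. The constant's name comes from the description, but this definition is only a guess at its shape; SURROGATE-CODE-P and NEEDS-SURROGATE-BRIDGE-P are hypothetical names.

```lisp
;; Sketch only; the PR's actual definitions may differ.
(defconstant +utf16-host-p+ (= char-code-limit #x10000)
  "True when host characters are assumed to be UTF-16 code units.")

(defun surrogate-code-p (code)
  "True if CODE lies in the UTF-16 surrogate range."
  (<= #xD800 code #xDFFF))

(defun needs-surrogate-bridge-p (string)
  "True only on a UTF-16 host, and only if STRING contains a surrogate.
When this is false the caller can use the original one-char-per-code-point path."
  (and +utf16-host-p+
       (some (lambda (c) (surrogate-code-p (char-code c))) string)
       t))
```

On a host such as SBCL the constant is NIL, so the predicate is false for every string and the bridging branch is never taken.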

Tests

Adds host-independent astral round-trip tests for UTF-8/16/32 (the octets are
the source of truth, so they assert octets -> string -> octets is identity):
astral.utf-8.roundtrip, .mixed, .consecutive, .boundary (U+FFFF /
U+10000), astral.utf-16le.roundtrip, astral.utf-32le.roundtrip.

Verification

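| host       | char-code-limit  | result                                                         |
|------------|------------------|----------------------------------------------------------------|
| ABCL 1.9.3 | 65536 (UTF-16)   | astral round-trips correctly (was CESU-8 / silent truncation)  |
| SBCL       | 1114112 (UTF-32) | patch inert; new tests pass; no regression                     |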

(Two pre-existing test failures on SBCL -- encoder/decoder-retvals and
rw-equiv.1 -- are unrelated to this change and present on master as well.)

Notes

  • No new conditions or codecs; no public API change.
  • Diff: src/strings.lisp (boundary transcode + helpers), src/streams.lisp
    (stream-write-char surrogate buffering), tests/tests.lisp (astral tests).

On Lisp implementations whose CHAR-CODE-LIMIT is exactly #x10000 (ABCL, and
others backed by a UTF-16 string type) a string holds UTF-16 code units, so a
character above the BMP is a surrogate pair (two characters).  Babel's codecs
work in full code-point space, but the string<->code-point bridge assumed one
character per code point, so astral text was mangled:

  - STRING-TO-OCTETS encoded each surrogate half separately (CESU-8): e.g.
    U+1F600 -> ED A0 BD ED B8 80 instead of F0 9F 98 80.
  - OCTETS-TO-STRING handed an astral code point to CODE-CHAR; ABCL silently
    truncated it to 16 bits (U+1F600 -> U+F600), other UTF-16 hosts errored.

Fix the bridge, leaving every codec untouched.  STRING-TO-OCTETS combines
surrogate pairs into a code-point vector before encoding; OCTETS-TO-STRING
decodes into a code-point vector and splits astral code points back into
surrogate pairs.  BMP-only strings keep an unchanged fast path.  Everything is
gated on +UTF16-HOST-P+ ((= CHAR-CODE-LIMIT #x10000), an exact test so a
non-Unicode build with CHAR-CODE-LIMIT 256 is excluded), and is dead code on
UTF-32 hosts, so SBCL/CCL/etc. are unaffected.  babel-streams'
STREAM-WRITE-CHAR buffers a high surrogate until its low half arrives.
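
The high-surrogate buffering described for STREAM-WRITE-CHAR can be sketched with a small closure. This is an illustration under assumptions, not the code in src/streams.lisp: MAKE-SURROGATE-JOINER is a hypothetical name, it works on integer code units instead of characters, and flushing an unpaired high surrogate as-is is assumed from the stated lone-surrogate behavior.

```lisp
;; Hypothetical sketch of the buffering idea; not babel's stream code.
(defun make-surrogate-joiner (emit)
  "Return a function of one UTF-16 code unit that calls EMIT with code points.
A high surrogate is held back until the next unit shows whether it pairs up."
  (let ((pending nil))
    (lambda (unit)
      (cond ((and pending (<= #xDC00 unit #xDFFF))
             ;; Low half arrived: emit the combined code point.
             (funcall emit (+ #x10000
                              (ash (- pending #xD800) 10)
                              (- unit #xDC00)))
             (setf pending nil))
            (t
             ;; Any held-back high surrogate is unpaired: emit it as-is.
             (when pending
               (funcall emit pending)
               (setf pending nil))
             (if (<= #xD800 unit #xDBFF)
                 (setf pending unit)      ; wait for a possible low half
                 (funcall emit unit)))))))

;; Usage: collect the code points produced for A, U+1F600, B.
(let* ((seen '())
       (feed (make-surrogate-joiner (lambda (cp) (push cp seen)))))
  (dolist (unit '(#x41 #xD83D #xDE00 #x42))
    (funcall feed unit))
  (reverse seen))
;; equals (#x41 #x1F600 #x42)
```

A real stream would also have to decide what to do with a high surrogate still pending when the stream is flushed or closed; this sketch leaves that out.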

Add host-independent astral round-trip tests (UTF-8/16/32; octets are the
source of truth).  Verified on ABCL 1.9.3 (UTF-16) and SBCL (UTF-32, inert).