Encode/decode astral characters as surrogate pairs on UTF-16 hosts #67
Open
snmsts wants to merge 1 commit into
On Lisp implementations whose CHAR-CODE-LIMIT is exactly #x10000 (ABCL, and
others backed by a UTF-16 string type) a string holds UTF-16 code units, so a
character above the BMP is a surrogate pair (two characters). Babel's codecs
work in full code-point space, but the string<->code-point bridge assumed one
character per code point, so astral text was mangled:
- STRING-TO-OCTETS encoded each surrogate half separately (CESU-8): e.g.
U+1F600 -> ED A0 BD ED B8 80 instead of F0 9F 98 80.
- OCTETS-TO-STRING handed an astral code point to CODE-CHAR; ABCL silently
truncated it to 16 bits (U+1F600 -> U+F600), other UTF-16 hosts errored.
Fix the bridge, leaving every codec untouched. STRING-TO-OCTETS combines
surrogate pairs into a code-point vector before encoding; OCTETS-TO-STRING
decodes into a code-point vector and splits astral code points back into
surrogate pairs. BMP-only strings keep an unchanged fast path. Everything is
gated on +UTF16-HOST-P+ ((= CHAR-CODE-LIMIT #x10000), an exact test so a
non-Unicode build with CHAR-CODE-LIMIT 256 is excluded), and is dead code on
UTF-32 hosts, so SBCL/CCL/etc. are unaffected. babel-streams'
STREAM-WRITE-CHAR buffers a high surrogate until its low half arrives.
Add host-independent astral round-trip tests (UTF-8/16/32; octets are the
source of truth). Verified on ABCL 1.9.3 (UTF-16) and SBCL (UTF-32, inert).
Encode/decode astral characters as surrogate pairs on UTF-16 hosts
Problem
On Lisp implementations whose char-code-limit is exactly #x10000 -- ABCL (JVM
char), and others backed by a UTF-16 string type -- a string holds UTF-16 code
units, so a character above the BMP (a code point > #xFFFF) is stored as a
surrogate pair (two characters). Babel's codecs work in full code-point space,
but the string<->code-point bridge (string-get/string-set, i.e.
char-code/code-char) assumes one character per code point. As a result astral
text is mangled on these hosts:
- string-to-octets encodes each surrogate half separately (CESU-8): e.g.
U+1F600 -> ED A0 BD ED B8 80 instead of F0 9F 98 80.
- octets-to-string hands an astral code point to code-char; ABCL silently
masks it to 16 bits (U+1F600 -> U+F600), other UTF-16 hosts signal
character-out-of-range.
(Found while bringing babel up on a .NET-backed Lisp; ABCL exhibits the same
bug, which is how this is reproduced above.)
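To make the byte sequences concrete, here is a minimal sketch in portable Common Lisp that works on plain integers, so it runs the same way on any host. The helper names split-astral and utf-8-octets are hypothetical and are not part of babel; the encoder deliberately treats a surrogate value like any other three-byte code, which is the mistake that produces CESU-8.

```lisp
;; Hypothetical helpers for illustration only; not babel's code.
(defun split-astral (code-point)
  "Return the high and low UTF-16 surrogates for CODE-POINT (> #xFFFF)."
  (let ((offset (- code-point #x10000)))
    (values (+ #xD800 (ash offset -10))
            (+ #xDC00 (logand offset #x3FF)))))

(defun utf-8-octets (code)
  "Encode the integer CODE as a list of UTF-8 octets, without rejecting surrogates."
  (cond ((< code #x80) (list code))
        ((< code #x800)
         (list (logior #xC0 (ash code -6))
               (logior #x80 (logand code #x3F))))
        ((< code #x10000)
         (list (logior #xE0 (ash code -12))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))
        (t
         (list (logior #xF0 (ash code -18))
               (logior #x80 (logand (ash code -12) #x3F))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))))

;; Each surrogate half encoded on its own: octets ED A0 BD ED B8 80.
(multiple-value-bind (high low) (split-astral #x1F600)
  (append (utf-8-octets high) (utf-8-octets low)))

;; The combined code point encoded once: octets F0 9F 98 80.
(utf-8-octets #x1F600)
```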
Fix
Bridge surrogates at the string<->code-point boundary, leaving every codec
(UTF-8/16/32, CJK, 8-bit, etc.) and the mapping machinery completely untouched
-- they already work in full code-point space:
- string-to-octets: combine surrogate pairs into a code-point vector
(%string-to-codepoints), then run the existing encoder over it.
- octets-to-string: decode into a code-point vector, then split astral code
points back into surrogate pairs (%codepoints-to-string).
- babel-streams stream-write-char: buffer a high surrogate until its low
half arrives, then encode the combined code point.
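The combine and split steps can be sketched roughly as below. This is not the code of %string-to-codepoints or %codepoints-to-string; the function names are hypothetical, and the sketch uses lists of integer code units instead of strings so that it behaves identically on UTF-16 and UTF-32 hosts. Lone surrogates are passed through unchanged, matching the behavior described in this PR.

```lisp
;; Hypothetical stand-ins for the boundary helpers; illustration only.
(defun high-surrogate-p (unit) (<= #xD800 unit #xDBFF))
(defun low-surrogate-p (unit) (<= #xDC00 unit #xDFFF))

(defun units-to-code-points (units)
  "Combine surrogate pairs in the list UNITS into single code points."
  (loop while units
        collect (let ((unit (pop units)))
                  (if (and (high-surrogate-p unit)
                           units
                           (low-surrogate-p (first units)))
                      ;; A well-formed pair: consume the low half too.
                      (+ #x10000
                         (ash (- unit #xD800) 10)
                         (- (pop units) #xDC00))
                      ;; BMP unit or lone surrogate: keep as-is.
                      unit))))

(defun code-points-to-units (code-points)
  "Split every astral code point in the list CODE-POINTS into a surrogate pair."
  (loop for cp in code-points
        if (> cp #xFFFF)
          append (let ((offset (- cp #x10000)))
                   (list (+ #xD800 (ash offset -10))
                         (+ #xDC00 (logand offset #x3FF))))
        else
          collect cp))

;; A, U+1F600 as a surrogate pair, B  ->  code points 41, 1F600, 42
(units-to-code-points '(#x41 #xD83D #xDE00 #x42))

;; And back again to the original four code units.
(code-points-to-units '(#x41 #x1F600 #x42))
```

Because the encoders and decoders only ever see the code-point sequence, a bridge of this shape does not require any change to the codecs themselves.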
Everything is gated on +utf16-host-p+ ((= char-code-limit #x10000) -- an
exact test, so a non-Unicode build with char-code-limit 256 is excluded), so
it is dead code on UTF-32 hosts (SBCL, CCL, ECL, ...) -- no behavior change
there. BMP-only strings take an unchanged fast path (no extra allocation). A
new *codepoint-vector-mappings* reuses the existing codecs via aref
accessors. Lone surrogates keep babel's current behavior (encoded as-is, i.e.
WTF-8), identical on all hosts.
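As a rough illustration of the gate, the sketch below separates the exact-equality test from the host constant so it can be tried with several limits. The names utf16-host-limit-p and needs-surrogate-bridge-p are hypothetical, and scanning for surrogates is only one assumed way to decide whether the fast path applies; the PR text does not say how the fast path is selected.

```lisp
;; Hypothetical sketch; the PR's actual constant is +utf16-host-p+.
(defun utf16-host-limit-p (limit)
  "True only when LIMIT is exactly #x10000, i.e. characters are 16-bit units."
  (= limit #x10000))

(defun needs-surrogate-bridge-p (units)
  "True when the list of code units UNITS contains any surrogate."
  (some (lambda (unit) (<= #xD800 unit #xDFFF)) units))

;; The exact test accepts a 16-bit host and rejects both larger and
;; smaller limits; the last entry reports the host running this code.
(list (utf16-host-limit-p #x10000)          ; 16-bit characters
      (utf16-host-limit-p #x110000)         ; full code-point characters
      (utf16-host-limit-p 256)              ; 8-bit, non-Unicode build
      (utf16-host-limit-p char-code-limit)) ; whatever this host reports
```

An exact comparison is what excludes the 256 case: a test such as (<= char-code-limit #x10000) would wrongly treat an 8-bit build as a UTF-16 host.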
The detection assumes a host with 16-bit characters is UTF-16 (true for ABCL
and the other current implementations of this kind); the char-code-limit
autodetect needs no per-implementation feature list and handles future UTF-16
hosts as well.
Tests
Adds host-independent astral round-trip tests for UTF-8/16/32 (the octets are
the source of truth, so they assert octets -> string -> octets is identity):
astral.utf-8.roundtrip, .mixed, .consecutive, .boundary (U+FFFF / U+10000),
astral.utf-16le.roundtrip, astral.utf-32le.roundtrip.
Verification
(Two pre-existing test failures on SBCL -- encoder/decoder-retvals and
rw-equiv.1 -- are unrelated to this change and present on master as well.)
Notes
src/strings.lisp (boundary transcode + helpers), src/streams.lisp
(stream-write-char surrogate buffering), tests/tests.lisp (astral tests).
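The surrogate buffering noted for stream-write-char can be pictured with the closure-based sketch below. It is not the method in src/streams.lisp: make-unit-writer and write-units are hypothetical names, the writer receives integer code units instead of characters, and flushing a stranded high surrogate as-is is an assumption chosen to match the lone-surrogate behavior described above.

```lisp
;; Hypothetical closure-based writer; illustration only.
(defun make-unit-writer (emit)
  "Return a function that accepts one UTF-16 code unit per call.
EMIT is called with each complete code point."
  (let ((pending nil))                  ; buffered high surrogate, if any
    (lambda (unit)
      (cond ((and pending (<= #xDC00 unit #xDFFF))
             ;; Low half arrived: emit the combined astral code point.
             (funcall emit (+ #x10000
                              (ash (- pending #xD800) 10)
                              (- unit #xDC00)))
             (setf pending nil))
            (t
             ;; Flush a stranded high surrogate as-is, then handle UNIT.
             (when pending
               (funcall emit pending)
               (setf pending nil))
             (if (<= #xD800 unit #xDBFF)
                 (setf pending unit)
                 (funcall emit unit)))))))

(defun write-units (units)
  "Feed the list UNITS through a fresh writer; return the emitted code points."
  (let* ((seen '())
         (writer (make-unit-writer (lambda (cp) (push cp seen)))))
    (dolist (unit units)
      (funcall writer unit))
    (nreverse seen)))

;; A, high half, low half, B  ->  code points 41, 1F600, 42
(write-units '(#x41 #xD83D #xDE00 #x42))
```

The sketch leaves out what happens to a high surrogate that is still pending when the stream is flushed or closed; a real stream has to handle that case as well.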