Optimize TextEncoder.encode: restore SIMD ASCII fast paths lost in the Rust port#31385
Conversation
…e Rust port The Rust port of TextEncoder.encode was slower than the Zig implementation it replaced: every call zero-filled a 2 KB stack buffer the Zig version left undefined, encode8/encode16 copied the output twice (stack buffer then JSC memory), and the Latin-1 ASCII copy loop lost its fused vector/SWAR implementation (it re-scanned with simdutf and then memcpy'd each span, which is especially costly for the per-leaf callbacks of the rope fast path). - Add a CopyAsciiPrefix highway kernel (fused scan+copy of the leading ASCII run, dynamic dispatch like the rest of highway_strings.cpp) and expose it as bun_highway::copy_ascii_prefix; allowlist its per-target variants for the baseline ISA verifier. - Rewrite copy_latin1_into_utf8_stop_on_non_ascii around a fused copy helper: SWAR u64 for short runs (rope leaves), the highway kernel for long runs, writing only buf[..written]. - element_length_latin1_into_utf8 now uses simdutf (as the Zig original did) with a scalar count for short inputs. - copy_utf16_into_utf8 skips the exact-length pass when the destination can hold the 3-bytes-per-unit worst case, and a _with_utf8_len variant lets callers that already sized the buffer reuse that length. - TextEncoder__encode8/encode16 now encode straight into an exactly-sized, JSC-owned Uint8Array (no stack buffer, no double copy); encode16 keeps a small 192-byte stack path for tiny strings and falls back to the allocating U+FFFD path for unpaired surrogates. TextEncoder__encodeRopeString allocates the array up front and lets the rope iterator write into it directly. - Extend text-encoder.test.js with boundary coverage for the new paths (SWAR/ SIMD threshold lengths, non-ASCII at word boundaries, encodeInto partial fits, large rope segments, long UTF-16 with unpaired surrogates).
|
Updated 6:51 PM PT - May 25th, 2026
❌ @robobun, your commit 83d3e32 has 1 failures in
🧪 To try this PR locally: bunx bun-pr 31385That installs a local version of the PR into your bun-31385 --bun |
|
Found 1 issue this PR may fix:
🤖 Generated with Claude Code |
|
Warning Review limit reached
More reviews will be available in 38 minutes and 32 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (5)
WalkthroughThis PR adds a SIMD-accelerated ChangesASCII Prefix Copy SIMD Optimization
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Not adding |
|
CI status for reviewers: 74 of the 76 checks are green, including every build, test, and verify-baseline lane that actually exercises this diff (linux x64/musl/baseline, linux+macOS aarch64, windows x64, ASAN, etc.). The two remaining non-green entries are infrastructure, not code:
Retrying just the expired job in Buildkite should turn the build fully green — I'm deliberately not pushing an empty commit for it, to avoid restarting the entire pipeline and throwing away the existing green results. Update: the expired |
Jarred-Sumner
left a comment
There was a problem hiding this comment.
Delete all the code comments you added.
|
Delete all the code comments you added. |
|
Done — removed the added code comments in 83d3e32 (only the |
There was a problem hiding this comment.
No bugs found, but this rewrites hot-path encoding logic (new Highway SIMD kernel with overlapping tail stores, SWAR fallback, direct writes into uninitialized JSC-owned buffers sized by precomputed lengths, and a new surrogate fallback path in encode16) — worth a human pass on the buffer-sizing invariants and the rope/encode16 allocation changes.
Extended reasoning...
Overview
This PR restores SIMD fast paths in TextEncoder.encode that were lost in the Zig→Rust port. It touches eight files: a new ~50-line Highway SIMD kernel CopyAsciiPrefixImpl in highway_strings.cpp with overlapping tail-vector handling; a Rust FFI wrapper in src/highway/lib.rs; a rewrite of copy_latin1_into_utf8_stop_on_non_ascii and element_length_latin1_into_utf8 plus a new copy_utf16_into_utf8_with_utf8_len variant in src/bun_core/lib.rs; a substantial rewrite of TextEncoder__encode8/16/encodeRopeString in TextEncoder.rs to allocate exact-sized uninitialized Uint8Arrays and write directly into them; mechanical allowlist additions for the new per-target symbols; and ~170 lines of new boundary tests.
Security risks
The main risk surface is memory safety rather than classic injection/auth: the new code creates uninitialized JSC Uint8Array buffers sized by element_length_latin1_into_utf8 / element_length_utf16_into_utf8 and then relies on the encoder filling exactly that many bytes. If the length computation and the writer ever disagree, uninitialized heap bytes would be exposed to JavaScript. The Highway kernel does an overlapping StoreU at dst + (len - N) for the tail, and HWY_RESTRICT is applied to src/dst — both look correct (src and dst don't alias each other; the overlap is dst-with-dst, which restrict doesn't forbid), but this is exactly the kind of low-level invariant that benefits from a second pair of eyes. No auth, crypto, or permission code is touched.
Level of scrutiny
High. TextEncoder.encode is a Web-standard hot path called from arbitrary user JS; the change introduces hand-written SIMD with manual tail handling, SWAR bit tricks (trailing_zeros()/8 on a high-bit mask), unsafe FFI, and a new control-flow branch in encode16_impl where an under-sized exact buffer triggers a second allocation via to_utf8_alloc_with_type (the first Uint8Array is abandoned to GC). encodeRopeString now always heap-allocates up front and discards the array if a non-ASCII segment is hit. These are all plausible and well-tested, but they're design/behavior changes in a critical path, not mechanical edits.
Other factors
CI is fully green across all platforms including ASAN. The PR adds thorough boundary tests (every length around the SWAR/SIMD thresholds, non-ASCII at every vector boundary, encodeInto past-written checks, rope segments, unpaired-surrogate fallback) cross-checked against a pure-JS reference encoder. The maintainer has already engaged (requested comment removal, which was done in 83d3e32). The bug-hunting system found nothing. Given the scope and the fact that a human reviewer is already in the loop, deferring rather than auto-approving is the right call.
|
|
||
| const HIGH_BITS: u64 = 0x8080_8080_8080_8080; | ||
| let mut copied = 0usize; | ||
| for (d, s) in dst.chunks_exact_mut(8).zip(src.chunks_exact(8)) { |
* oven/main (10 new commits): Optimize TextEncoder.encode: restore SIMD ASCII fast paths lost in the Rust port (oven-sh#31385) js_parser: sanitize auto-generated default export name for digit-named modules (oven-sh#31403) fetch: run checkServerIdentity before writing the request (oven-sh#31325) ffi: avoid copying the threadsafe callback wrapper on the calling thread (oven-sh#31332) install: gate the exit-callback cache teardown to the main thread (oven-sh#31376) fix(node:module): don't register native helpers as their own constructors (oven-sh#31393) css: escape custom pseudo-class/element names when printing (oven-sh#31404) Deepen the lots-of-for-loop fixture so the transpiler stack-overflow tests throw on Windows (oven-sh#31382) Hardening: input validation and bounds tightening across 36 subsystems (round 4) (oven-sh#31339) Speed up FormData multipart serialization (oven-sh#31379) Auto-merged: src/install/PackageManager.rs, src/runtime/cli/upgrade_command.rs, src/runtime/webcore/Blob.rs, src/sys/lib.rs
### What does this PR do? Restores the block-comment SIMD skip that the Rust port of the JS lexer lost, by moving it to the same Google Highway kernels the lexer already uses for single-line comments. **Before this PR** - `scan_single_line_comment` kept its SIMD path in the port (`bun_highway::index_of_newline_or_non_ascii_or_hash_or_at`). - `scan_multi_line_comment_body` kept the Zig structure (only take the fast path while the current code point is ASCII and ≥ 512 bytes remain), but the inner scan — Zig's `skipToInterestingCharacterInMultilineComment`, a 16-byte `@Vector` loop — was ported as a scalar byte-at-a-time loop with a `TODO(port): SIMD reimplementation` note (`src/js_parser/lexer.rs`). Large license headers / JSDoc blocks were scanned one byte per iteration. **This PR** - Adds an `IndexOfInterestingCharacterInMultilineCommentImpl` kernel to `src/jsc/bindings/highway_strings.cpp` (same `HWY_EXPORT` + `HWY_DYNAMIC_DISPATCH` pattern as the existing kernels). It returns the index of the first `*` (potential `*/` terminator), `\r`, `\n` (newline tracking for ASI), or non-ASCII byte (so U+2028/U+2029 and other multi-byte sequences are still decoded by the scalar path) — exactly the byte classes the Zig `@Vector` version stopped at. - Exposes it as `bun_highway::index_of_interesting_character_in_multiline_comment` (with the same debug-only result validation as the neighboring wrappers). - `skip_to_interesting_character_in_multiline_comment` in the lexer now calls it instead of the scalar loop. The `Environment::ENABLE_SIMD` gate at the call site is dropped: it exists for portable-vector codegen, but Highway dispatches per-CPU at runtime, so baseline builds take the fast path too — matching how `scan_single_line_comment` and the string-literal scan already call Highway unconditionally. The ≥ 512-byte threshold and the ASCII-code-point check are unchanged. - Adds the new per-target kernel symbols to the `verify-baseline-static` allowlists (x64, x64-windows, aarch64), with feature ceilings copied from the structurally identical `IndexOfNewlineOrNonASCIIImpl` entries. If the baseline CI scan reports different feature sets, I'll update the entries to match its report. ### Benchmarks `Bun.Transpiler.transformSync` on linux x64 (AVX2), comparing release builds of main (`19dd34df3`) and this branch (`84fe8aa26`) built back-to-back from this tree. Interleaved runs, median of 9 inner rounds, two outer rounds per binary (values were stable to ~1%); transpiled output hashes are identical between the two binaries for every case. Shared container, so ratios are the signal. | case | main | this PR | Δ | | --- | --- | --- | --- | | 200 × ~2 KB JSDoc block comments + 200 small fns (385 KB) | 1.071 ms | 0.834 ms | **1.28× faster** | | 1 MB block comment, ~100-char lines | 1.331 ms | 0.822 ms | **1.62× faster** | | 1 MB block comment, ~1 KB lines | 1.300 ms | 0.681 ms | **1.91× faster** | | 16 KB license header + 200 fns (31 KB total) | 0.476 ms | 0.464 ms | ~2% (run time dominated by the code, not the comment) | | control: same 200 fns, no comments | 0.443 ms | 0.438 ms | parity | | control: 200 fns each behind a `//` line comment | 0.439 ms | 0.439 ms | parity | The two controls confirm the delta is attributable to the block-comment path (nothing else in the lexer changed between the two revisions). ### How did you verify your code works? All of the following ran against the debug (ASAN) build with this change: - `bun bd test test/bundler/transpiler/transpiler.test.js` — 165 pass / 0 fail, including a new `multi-line comment scanning` group: comment-size sweep across the 512-byte threshold and vector-width boundaries, a lone `*` at every offset in the first 80 bytes of a large comment, all-`*` bodies, `\n` / `\r` / `\r\n` / U+2028 / U+2029 inside large comments observable through ASI, non-ASCII bodies (2-byte, 4-byte, mixed), comments ending exactly at EOF, and unterminated large comments producing `Expected "*/" to terminate multi-line comment`. - `bun bd test test/bundler/bundler_comments.test.ts` — 45 pass / 0 fail, including new end-to-end bundles: a large `/*! ... */` legal comment (CRLF + non-ASCII) preserved verbatim with the code after it intact, and ASI behavior across a large block comment verified by running the bundled output. - `bun bd test test/js/bun/transpiler/transpiler-truncated-utf8.test.ts` — the guard-page fixture now also places ≥ 512-byte block comments (terminated, unterminated, trailing `*`, trailing truncated UTF-8 lead) so that the input ends exactly at a `PROT_NONE` page; any read past the end of the source by the new kernel faults deterministically. - `cargo clippy -p bun_highway -p bun_js_parser` clean, `cargo fmt` applied, `highway_strings.cpp` is clang-format clean. - The expected outputs for every new test were cross-checked against the current scalar implementation before making the change, so this is verified behavior-preserving. Like #31385, the new tests are regression coverage for the rewritten path rather than a failing-before/passing-after proof — there is no functional delta to assert for a pure performance restoration; the before/after evidence for the optimization itself is the benchmark table above.
What does this PR do?
TextEncoder.encode's Rust port was measurably slower than the Zig implementation it replaced (4-char ASCII ~2× slower, 12 KB ASCII ~40% slower; only the large UTF-8 case was a tie). This PR restores the lost fast paths and vectorization:Cause
TextEncoder__encode8/16/encodeRopeStringzero-filled a 2 KB stack buffer on every call ([0u8; 2048]) where the Zig code usedundefined, then copied the output a second time from that buffer into JSC memory. For a 4-byte input the memset alone is most of the regression."Hello World!".repeat(1024)stays a rope): ~1024 12-byte leaves hitcopy_latin1_into_utf8_stop_on_non_ascii, whose port replaced the Zig fused@Vector(16,u8)/SWAR copy with a scan (first_non_ascii) followed by amemcpyper span — two passes plus per-leaf call overhead.element_length_latin1_into_utf8was ported as a scalar span loop instead ofsimdutf.length.utf8.from.latin1.utf8_length_from_utf16lepass even when the destination buffer already provably fit the worst case (the Zigout_lenshortcut was dropped).Fix
CopyAsciiPrefixhighway kernel inhighway_strings.cpp(sameHWY_EXPORT+HWY_DYNAMIC_DISPATCHpattern as the existing kernels): fused scan+copy of the leading ASCII run that stops at the first byte ≥ 0x80 and writes only the bytes it reports. Exposed asbun_highway::copy_ascii_prefix, with per-target symbols added to theverify-baseline-staticallowlists.copy_latin1_into_utf8_stop_on_non_asciinow streams through a fusedcopy_ascii_prefixhelper: SWARu64for short runs (rope leaves are usually a dozen bytes, so FFI dispatch isn't worth it), the highway kernel for runs ≥ 64 bytes. Onlybuf[..written]is ever written, same as before.element_length_latin1_into_utf8uses simdutf (scalar count for ≤ 32 bytes), matching the Zig original.copy_utf16_into_utf8skips the exact-length pass whenbuf.len() >= 3 * utf16.len()(worst case), mirroring the Zigout_lenselection; acopy_utf16_into_utf8_with_utf8_lenvariant lets callers that already computed the length (to size the destination) avoid recomputing it.TextEncoder__encode8/__encode16now encode straight into an exactly-sized, JSC-ownedUint8Array— no stack buffer, no second copy, and the former large-string path no longer hands JSC an externalVec.encode16keeps a small 192-byte stack path for tiny strings and falls back to the allocating U+FFFD path when unpaired surrogates make the exact-size buffer insufficient.TextEncoder__encodeRopeStringallocates the array up front and lets the rope iterator write each segment directly into it.Output bytes,
encodeIntoread/written semantics, and the "never write pastwritten" guarantee are unchanged.Benchmarks
bench/snippets/text-encoder.mjs, median of 3 interleaved rounds on linux x64 (AVX2/AVX-512), comparing the last Zig-based release (1.3.9), the current Rust implementation, and this PR (both built with thereleaseprofile from this tree):The tiny-string cases are now bounded by shared machinery (call dispatch,
Uint8Arrayallocation), so they land at parity with Zig instead of 1.5–2× behind; the large cases win outright. Additional paths not in the stock benchmark, same methodology:(The benchmark host is a shared/noisy container; medians of interleaved runs are reported. Ratios, not absolute numbers, are the signal.)
How did you verify your code works?
bun bd test test/js/web/encoding/— all TextEncoder/TextEncoderStream/TextDecoder tests pass with the debug (ASAN) build. The only failures in that directory are the two pre-existingTextDecoder ... should not leak the output bufferRSS tests, which fail on any local ASAN debug build because the binary isn't namedbun-asan(the measured delta is within the test's own ASAN allowance).test/js/web/encoding/text-encoder.test.js(10 new tests, ~10k assertions): every length around the SWAR/SIMD thresholds, a non-ASCII byte at every word/vector boundary position,encodeIntoexact-fit and partial-fit behavior including "bytes pastwrittenare untouched", rope strings with >64-byte segments and with non-ASCII segments (bail path), and long UTF-16 strings with and without unpaired surrogates, all cross-checked against a pure-JS reference encoder.cargo clippy -p bun_core -p bun_highwayclean; websocket/inspect suites (which share the rewritten Latin-1 kernels) pass apart from tests that dial external network endpoints unavailable in the sandbox.FirstNonAscii8Impl/FillWithSkipMaskImplceilings; ifverify-baseline-staticreports different feature sets on the baseline/aarch64 CI builds I'll update the entries to match its report.bench/snippets/text-encoder.mjs.