js_parser: restore SIMD block comment skipping via Highway#31433
Conversation
|
Warning Review limit reached
More reviews will be available in 28 minutes and 11 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (10)
Comment |
|
Updated 8:38 AM PT - May 26th, 2026
✅ @robobun, your commit 1f598d20cfab74d6a4132fa13b0f9d5d49cc264f passed in 🧪 To try this PR locally: bunx bun-pr 31433That installs a local version of the PR into your bun-31433 --bun |
|
Found 1 issue this PR may fix:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
I didn't find any issues — the kernel mirrors the neighboring Highway scanners and the test coverage (size sweep, ASI, guard-page OOB) is thorough — but since this rewrites a hot path in the JS lexer with new SIMD code and subtly changes the call-site control flow, it's worth a human look.
Extended reasoning...
Overview
This PR restores SIMD scanning for /* ... */ block comment bodies in the JS lexer. It adds a new Highway kernel IndexOfInterestingCharacterInMultilineCommentImpl in src/jsc/bindings/highway_strings.cpp (finds first *, \r, \n, or non-ASCII byte), exposes it via a Rust FFI wrapper in src/highway/lib.rs, and replaces the scalar TODO loop in src/js_parser/lexer.rs with a call to it. The call site is simplified: the Environment::ENABLE_SIMD gate and the Option<u32> match are removed (the old function never actually returned None, and the new one returns text_.len() for not-found, which correctly drives the lexer to EOF → unterminated-comment error). Three baseline-static allowlists gain the new per-target symbol entries, and four test files gain extensive coverage.
Security risks
None identified. The kernel only reads within [text, text + text_len) using LoadU over whole-vector chunks plus a bounded scalar tail; the Rust wrapper guards slice.is_empty() and the lexer only calls it with ≥512 bytes. The guard-page fixture (transpiler-truncated-utf8-fixture.ts) places block-comment inputs flush against a PROT_NONE page to deterministically catch OOB reads. No untrusted-input parsing semantics change.
Level of scrutiny
High. The JS lexer runs on every source file Bun parses, so a subtle off-by-one or missed byte class here would mis-tokenize real code. The new C++ kernel is structurally a near-copy of IndexOfNewlineOrNonASCIIOrHashOrAtImpl, and the Rust wrapper follows the same pattern as its neighbors, but the call-site refactor changes the return type and removes a branch, and the ENABLE_SIMD gate removal means baseline builds now take this path too. That's all well-reasoned in the PR description, but it's parser-critical logic rather than a mechanical change.
Other factors
Test coverage is unusually strong: a 480–576 byte size sweep across the 512-byte threshold, lone * at every offset 0–79, all-* bodies, all five line terminators verified via observable ASI, non-ASCII bodies, EOF-adjacent comments, unterminated-comment errors, end-to-end bundler runs, and the mmap guard-page OOB check. The author cross-checked expected outputs against the prior scalar implementation. The allowlist feature ceilings were copied from a structurally identical kernel and the author notes they'll adjust if baseline CI reports differently. No CODEOWNERS paths are touched. Given the criticality of the code path, deferring to a human reviewer rather than auto-approving.
There was a problem hiding this comment.
I didn't find any issues — the kernel, FFI wrapper, and call-site change all look correct and the test coverage is excellent — but since this rewrites a hot path in the JS lexer with new SIMD/FFI code and drops the ENABLE_SIMD gate, it's worth a human sanity check before merge.
Extended reasoning...
Overview
This PR restores SIMD scanning for /* ... */ block-comment bodies in the JS lexer by adding a new Google Highway kernel. It touches:
src/jsc/bindings/highway_strings.cpp— newIndexOfInterestingCharacterInMultilineCommentImplkernel (scans for*,\r,\n, or non-ASCII), plusHWY_EXPORTand anextern "C"dispatch shim. Structurally a near-copy of the adjacentIndexOfNewlineOrNonASCIIOrHashOrAtImpl.src/highway/lib.rs— FFI declaration + safe wrapper with the same empty-check / len-sentinel / debug-assert pattern as its neighbors.src/js_parser/lexer.rs— replaces the scalarskip_to_interesting_character_in_multiline_commentwith a one-line call into the new wrapper, and simplifies the call site inscan_multi_line_comment_body(drops theEnvironment::ENABLE_SIMDgate and the deadNonearm; return type changesOption<u32>→usizewithlen()meaning not-found).- Three
verify-baseline-staticallowlists — mechanical additions of the new per-target symbol. - Four test files — substantial new coverage (size sweep across the 512-byte threshold, lone-
*offset sweep, all newline flavors via ASI, non-ASCII bodies, EOF/unterminated cases, and guard-page OOB checks).
Security risks
None identified. The kernel reads within [text, text+text_len) via hn::LoadU over full-vector chunks plus a scalar tail; the Rust wrapper guards is_empty() and the only call site guarantees len >= 512. The guard-page fixture in transpiler-truncated-utf8.test.ts would fault deterministically on any overread. No untrusted-length arithmetic or writes.
Level of scrutiny
This warrants a human look despite being pattern-following. The JS lexer is on the critical path for every file Bun parses/bundles, and the change includes (a) new C++ SIMD with raw pointer arithmetic, (b) a Rust↔C FFI surface, (c) a semantic shift at the call site — the old scalar helper returned at most the vector-aligned prefix length and never None, while the new path scans the full tail and returns len() on miss, relying on the subsequent step_with to land on EOF and emit the unterminated-comment error — and (d) removal of the Environment::ENABLE_SIMD gate so baseline builds now take this path. Each of these looks correct to me and is well-argued in the PR description, but they're exactly the kind of subtle lexer-state interactions a maintainer should eyeball.
Other factors
- No bugs flagged by the bug-hunting system.
- No CODEOWNERS for the touched paths.
- Test coverage is unusually thorough for a perf change (threshold/boundary sweeps, ASI observability, ASAN + guard-page OOB).
- The allowlist feature ceilings were copied from
IndexOfNewlineOrNonASCIIImplrather than measured; the author notes they'll adjust if CI's baseline scan disagrees, so CI will catch any mismatch.
The Zig lexer skipped block comment bodies with a 16-byte @vector scan (skipToInterestingCharacterInMultilineComment). The Rust port kept the >=512-byte fast-path gate but replaced the inner scan with a scalar byte loop. Move it to the Google Highway kernels that the single-line comment scan already uses: - new IndexOfInterestingCharacterInMultilineCommentImpl kernel in highway_strings.cpp (first '*', CR, LF, or non-ASCII byte), exposed as bun_highway::index_of_interesting_character_in_multiline_comment - skip_to_interesting_character_in_multiline_comment now calls it; the ENABLE_SIMD gate is dropped because Highway dispatches at runtime, so baseline builds take the fast path too - verify-baseline-static allowlists list the new per-target symbols - tests: block comment size sweep around the 512-byte threshold, '*' offsets, CRLF/U+2028/U+2029 ASI, non-ASCII bodies, unterminated comments, large legal comment preservation, and guard-page cases that fault on any read past the end of the input
1f65a2c to
1f598d2
Compare
|
CI status notes:
If |
There was a problem hiding this comment.
I didn't find any bugs, but this adds a new Highway SIMD kernel wired through FFI into the JS lexer's hot path (and drops the ENABLE_SIMD gate), so it's worth a human pass before merging.
Extended reasoning...
Overview
This PR restores SIMD scanning for /* ... */ block-comment bodies in the Rust JS lexer by adding a new Google Highway kernel. It touches:
src/jsc/bindings/highway_strings.cpp— newIndexOfInterestingCharacterInMultilineCommentImplkernel +HWY_EXPORT/HWY_DYNAMIC_DISPATCHplumbingsrc/highway/lib.rs— newextern "C"decl and safe wrapper with debug-assert validationsrc/js_parser/lexer.rs— replaces the scalar fallback inskip_to_interesting_character_in_multiline_comment, changes its return type fromOption<u32>tousize, and removes theEnvironment::ENABLE_SIMDgate at the call site- Three
verify-baseline-staticallowlists with the new per-target symbols - Four test files with thorough new coverage (threshold/offset sweeps, ASI via all line-terminator forms, non-ASCII bodies, EOF/unterminated, guard-page OOB)
Security risks
None identified. The kernel reads within [text, text+text_len) only (full-vector loop bounded by text_len - (text_len % N), scalar tail to text_len); the Rust wrapper passes slice.as_ptr()/len() and guards is_empty(). The guard-page fixture explicitly tests for OOB reads at the buffer boundary. No untrusted-input parsing semantics change.
Level of scrutiny
This deserves a real human look rather than a bot-only approval:
- It's a hot path in the production JS/TS lexer that runs on every source file Bun parses; a subtle off-by-one here would mis-tokenize user code.
- It introduces a new C++ SIMD kernel with FFI surface and per-CPU dispatch — correctness depends on the byte-class set exactly matching what
scan_multi_line_comment_bodyexpects. - There's a deliberate design decision to drop the
Environment::ENABLE_SIMDgate (justified in the description as matching the single-line-comment/string-literal paths), which a maintainer should sign off on. - The return-value semantics changed slightly: the old scalar helper returned the vector-truncated length when nothing interesting was found, while the new one returns the full
text_.len(). The caller was simplified to a single branch accordingly. This looks correct (and is covered by the new unterminated-comment tests), but it's a behavioral delta worth a second pair of eyes.
Other factors
The new test coverage is unusually thorough (size sweeps across the 512-byte threshold and vector widths, every * offset in the first 80 bytes, all five JS line terminators via observable ASI, mmap/PROT_NONE guard-page OOB checks), and the kernel is structurally a near-copy of the adjacent IndexOfNewlineOrNonASCIIOrHashOrAtImpl. CI shows a Windows binary-size check failure (+~16 MB), which is implausible for one small kernel and almost certainly a stale canary baseline from the recent Windows cross-compile change on main rather than something this PR introduced — but it's still a red X the author/reviewer will want to resolve.
What does this PR do?
Restores the block-comment SIMD skip that the Rust port of the JS lexer lost, by moving it to the same Google Highway kernels the lexer already uses for single-line comments.
Before this PR
scan_single_line_commentkept its SIMD path in the port (bun_highway::index_of_newline_or_non_ascii_or_hash_or_at).scan_multi_line_comment_bodykept the Zig structure (only take the fast path while the current code point is ASCII and ≥ 512 bytes remain), but the inner scan — Zig'sskipToInterestingCharacterInMultilineComment, a 16-byte@Vectorloop — was ported as a scalar byte-at-a-time loop with aTODO(port): SIMD reimplementationnote (src/js_parser/lexer.rs). Large license headers / JSDoc blocks were scanned one byte per iteration.This PR
IndexOfInterestingCharacterInMultilineCommentImplkernel tosrc/jsc/bindings/highway_strings.cpp(sameHWY_EXPORT+HWY_DYNAMIC_DISPATCHpattern as the existing kernels). It returns the index of the first*(potential*/terminator),\r,\n(newline tracking for ASI), or non-ASCII byte (so U+2028/U+2029 and other multi-byte sequences are still decoded by the scalar path) — exactly the byte classes the Zig@Vectorversion stopped at.bun_highway::index_of_interesting_character_in_multiline_comment(with the same debug-only result validation as the neighboring wrappers).skip_to_interesting_character_in_multiline_commentin the lexer now calls it instead of the scalar loop. TheEnvironment::ENABLE_SIMDgate at the call site is dropped: it exists for portable-vector codegen, but Highway dispatches per-CPU at runtime, so baseline builds take the fast path too — matching howscan_single_line_commentand the string-literal scan already call Highway unconditionally. The ≥ 512-byte threshold and the ASCII-code-point check are unchanged.verify-baseline-staticallowlists (x64, x64-windows, aarch64), with feature ceilings copied from the structurally identicalIndexOfNewlineOrNonASCIIImplentries. If the baseline CI scan reports different feature sets, I'll update the entries to match its report.Benchmarks
Bun.Transpiler.transformSyncon linux x64 (AVX2), comparing release builds of main (19dd34df3) and this branch (84fe8aa26) built back-to-back from this tree. Interleaved runs, median of 9 inner rounds, two outer rounds per binary (values were stable to ~1%); transpiled output hashes are identical between the two binaries for every case. Shared container, so ratios are the signal.//line commentThe two controls confirm the delta is attributable to the block-comment path (nothing else in the lexer changed between the two revisions).
How did you verify your code works?
All of the following ran against the debug (ASAN) build with this change:
bun bd test test/bundler/transpiler/transpiler.test.js— 165 pass / 0 fail, including a newmulti-line comment scanninggroup: comment-size sweep across the 512-byte threshold and vector-width boundaries, a lone*at every offset in the first 80 bytes of a large comment, all-*bodies,\n/\r/\r\n/ U+2028 / U+2029 inside large comments observable through ASI, non-ASCII bodies (2-byte, 4-byte, mixed), comments ending exactly at EOF, and unterminated large comments producingExpected "*/" to terminate multi-line comment.bun bd test test/bundler/bundler_comments.test.ts— 45 pass / 0 fail, including new end-to-end bundles: a large/*! ... */legal comment (CRLF + non-ASCII) preserved verbatim with the code after it intact, and ASI behavior across a large block comment verified by running the bundled output.bun bd test test/js/bun/transpiler/transpiler-truncated-utf8.test.ts— the guard-page fixture now also places ≥ 512-byte block comments (terminated, unterminated, trailing*, trailing truncated UTF-8 lead) so that the input ends exactly at aPROT_NONEpage; any read past the end of the source by the new kernel faults deterministically.cargo clippy -p bun_highway -p bun_js_parserclean,cargo fmtapplied,highway_strings.cppis clang-format clean.