Deduplicate number scanning and escape decoding across the js, json, and toml lexers#31997
Deduplicate number scanning and escape decoding across the js, json, and toml lexers#31997alii wants to merge 4 commits into
Conversation
|
Updated 7:04 AM PT - Jun 10th, 2026
✅ @robobun, your commit 000af9785d30b3316b95760d52511e17f6a39694 passed in 🧪 To try this PR locally: bunx bun-pr 31997That installs a local version of the PR into your bun-31997 --bun |
|
@robobun adopt |
6aa3e98 to
4b72a0a
Compare
WalkthroughThis PR extracts shared escape-sequence and decimal-number scanning utilities into new trait-based infrastructure used by JavaScript, JSON, and TOML lexers; refactors parser initialization patterns; and deduplicates empty-input fast paths across JSON parser entry points. ChangesLexer utilities and parser cleanup
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
The shared number_scan and decode_escape_sequences helpers encode these behaviors in const parameters; these tests lock in the observable rules: underscore separator placement, exponent digits, the \x single-line vs multiline asymmetry, line continuations, and unicode escape range checks.
|
Adopted. Audited every hunk against main, ran the affected suites (toml, jsonc/json5, transpiler, bundler, bunfig) on the debug build, and A/B diffed a corpus of valid and invalid TOML/JSON/JS inputs against the release build: output is byte-identical, error spans included. Added characterization tests to toml-parse.test.ts pinning the quirks the shared decoder preserves. CI note: the remaining red on build 61508 is unrelated to this diff. bunx.test.ts "should handle package that requires node 24" fails identically on the release build of main: @angular/cli 22.0.0 (published after main last built) requires node ^22.22.3 || ^24.15.0 || >=26.0.0 while bun reports v24.3.0, so Angular exits 3. Fixes are already in flight (#31820 pins the version in that test, #31991 bumps the reported node version). The other two failures (serve-body-leak on asan, node-http-connect on windows) were auto-retried lanes CI itself tags as flaky, in files this PR does not touch. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/ast/lexer_log.rs`:
- Around line 365-399: The variable-length Unicode escape loop in the
EscapeLexer helper currently breaks on iterator.next() EOF and falls through,
allowing inputs like "\u{41" to be accepted; modify the 'variable_length' loop
so that when iterator.next(&mut iter) returns false you set an appropriate error
span via *lexer.end_mut() (using start + iter.i and widths similar to the
existing EOF/brace handling) and immediately return lexer.syntax_error() instead
of breaking; update the handling inside the loop (the branch that currently does
`if !iterator.next(&mut iter) { break 'variable_length; }`) to detect EOF and
call lexer.syntax_error() so malformed `\u{...` is rejected consistently by the
EscapeLexer code path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1971277f-11fd-43e5-bbdb-06c473e171c0
📒 Files selected for processing (9)
src/ast/lexer_log.rssrc/js_parser/lexer.rssrc/js_parser/parse/parse_entry.rssrc/parsers/json.rssrc/parsers/json_lexer.rssrc/parsers/lib.rssrc/parsers/number_scan.rssrc/parsers/toml/lexer.rstest/js/bun/resolve/toml/toml-parse.test.ts
What this does
Four mechanical deduplications across the lexers, zero intended behavior change:
src/parsers/number_scan.rs(new): the byte-identical ~80 line decimal digit scan (underscore separator rules, fraction, exponent) that the json and toml lexers each carried, extracted into one#[inline]generic over a small lexer-accessor trait. Monomorphizes per lexer type, so codegen matches the previous inline copies.src/ast/lexer_log.rs: the ~330 line string escape-sequence decoder that the js, json, and toml lexers each carried, extracted into one generic (decode_escape_sequencesover anEscapeLexertrait). The behavioral differences between the three are encoded in const parameters:IS_JSON(strict JSON escape set),ALLOW_LINE_CONTINUATIONSandREJECT_HEX_ESCAPE(toml multiline vs single-line basic strings), andLEGACY_ERROR_SPANS(toml keeps its historical error span shape, seebun-rustdiagnostic error message has incorrect location for template strings #31134).src/parsers/json.rs: seven copies of the empty-source fast path collapse into one helper. Inparsethis also moves the check ahead of parser init, matching the other six entry points; init on an empty source has no observable effect, so nothing changes.src/js_parser/parse/parse_entry.rs: the three identical move-lexer-out-and-init-in-place prologues become one macro.Split from #31912 (whole-repo simplification pass, closed in favor of module-scoped splits).
Verification
\xand line-continuation asymmetries), bundler output for json imports, tsconfigpathsresolution, and empty sources all produce byte-identical output on both binaries.toml-parse.test.tspin the quirks the const parameters preserve: underscore placement, exponent digits,\xallowed in single-line but not multiline basic strings, line continuations in multiline only, unicode escape range checks. They pass unchanged on both this branch and the current release.