
Deduplicate number scanning and escape decoding across the js, json, and toml lexers#31997

Open
alii wants to merge 4 commits into main from claude/split/parsers

Conversation

@alii alii commented Jun 8, 2026

Member

What this does

Four mechanical deduplications across the lexers, zero intended behavior change:

  • src/parsers/number_scan.rs (new): the byte-identical ~80 line decimal digit scan (underscore separator rules, fraction, exponent) that the json and toml lexers each carried, extracted into one #[inline] generic over a small lexer-accessor trait. Monomorphizes per lexer type, so codegen matches the previous inline copies.
  • src/ast/lexer_log.rs: the ~330 line string escape-sequence decoder that the js, json, and toml lexers each carried, extracted into one generic (decode_escape_sequences over an EscapeLexer trait). The behavioral differences between the three are encoded in const parameters: IS_JSON (strict JSON escape set), ALLOW_LINE_CONTINUATIONS and REJECT_HEX_ESCAPE (toml multiline vs single-line basic strings), and LEGACY_ERROR_SPANS (toml keeps its historical error span shape; see #31134, "bun-rust diagnostic error message has incorrect location for template strings").
  • src/parsers/json.rs: seven copies of the empty-source fast path collapse into one helper. In parse this also moves the check ahead of parser init, matching the other six entry points; init on an empty source has no observable effect, so nothing changes.
  • src/js_parser/parse/parse_entry.rs: the three identical move-lexer-out-and-init-in-place prologues become one macro.
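
As an illustration of the first bullet, here is a minimal sketch of a digit scan written once as an #[inline] generic over a small accessor trait. Every name in it (ByteCursor, SliceLexer, ScanError, scan_digit_run, scan_decimal, scan_str) is hypothetical, and the underscore, fraction, and exponent rules are a simplified assumption rather than the exact rules in number_scan.rs.

```rust
/// Hypothetical accessor trait: the minimum a lexer must expose to the scan.
trait ByteCursor {
    fn peek(&self) -> Option<u8>;
    fn bump(&mut self);
}

/// Hypothetical lexer over a byte slice, used only for this sketch.
struct SliceLexer<'a> {
    src: &'a [u8],
    pos: usize,
}

impl<'a> ByteCursor for SliceLexer<'a> {
    fn peek(&self) -> Option<u8> {
        self.src.get(self.pos).copied()
    }
    fn bump(&mut self) {
        self.pos += 1;
    }
}

#[derive(Debug, PartialEq, Eq)]
enum ScanError {
    MisplacedUnderscore,
    MissingDigits,
}

/// Scans one run of digits. Assumed rule: an underscore is valid only
/// between two digits (not leading, trailing, or doubled).
#[inline]
fn scan_digit_run<L: ByteCursor>(lexer: &mut L) -> Result<usize, ScanError> {
    let mut digits = 0usize;
    let mut last_was_underscore = false;
    while let Some(b) = lexer.peek() {
        match b {
            b'0'..=b'9' => {
                digits += 1;
                last_was_underscore = false;
            }
            b'_' => {
                if digits == 0 || last_was_underscore {
                    return Err(ScanError::MisplacedUnderscore);
                }
                last_was_underscore = true;
            }
            _ => break,
        }
        lexer.bump();
    }
    if last_was_underscore {
        return Err(ScanError::MisplacedUnderscore);
    }
    if digits == 0 {
        return Err(ScanError::MissingDigits);
    }
    Ok(digits)
}

/// Integer part, optional fraction, optional exponent. Being generic, this
/// is monomorphized once per lexer type that calls it.
#[inline]
fn scan_decimal<L: ByteCursor>(lexer: &mut L) -> Result<(), ScanError> {
    scan_digit_run(lexer)?;
    if lexer.peek() == Some(b'.') {
        lexer.bump();
        scan_digit_run(lexer)?;
    }
    if matches!(lexer.peek(), Some(b'e' | b'E')) {
        lexer.bump();
        if matches!(lexer.peek(), Some(b'+' | b'-')) {
            lexer.bump();
        }
        scan_digit_run(lexer)?;
    }
    Ok(())
}

/// Convenience wrapper: returns how many bytes the literal occupies.
fn scan_str(src: &str) -> Result<usize, ScanError> {
    let mut lexer = SliceLexer { src: src.as_bytes(), pos: 0 };
    scan_decimal(&mut lexer)?;
    Ok(lexer.pos)
}
```

A second lexer type would only need its own ByteCursor impl to reuse scan_decimal unchanged, which is the shape of deduplication the bullet describes.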

Split from #31912 (whole-repo simplification pass, closed in favor of module-scoped splits).

Verification

  • Per-hunk audit against main: each extracted region diffed against every reference copy it replaces.
  • Existing suites pass on the debug build: toml resolve, jsonc/json5/jsonl, transpiler, bundler_string, template literals, macros, bunfig config.
  • A/B against the release build of main: a 37-file corpus of valid and invalid TOML/JSON/JS inputs (error messages and spans included, covering the toml legacy spans and the \x and line-continuation asymmetries), bundler output for json imports, tsconfig paths resolution, and empty sources all produce byte-identical output on both binaries.
  • New characterization tests in toml-parse.test.ts pin the quirks the const parameters preserve: underscore placement, exponent digits, \x allowed in single-line but not multiline basic strings, line continuations in multiline only, unicode escape range checks. They pass unchanged on both this branch and the current release.
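
To make the \x and line-continuation asymmetries concrete, the following is a rough sketch of how two bool const parameters can select per-mode behavior in one decoder body. It borrows the parameter names ALLOW_LINE_CONTINUATIONS and REJECT_HEX_ESCAPE from the description above, but the function names, the error type, the reduced escape set, and the whitespace-skipping rule are all assumptions for illustration, not the code in lexer_log.rs.

```rust
#[derive(Debug, PartialEq, Eq)]
enum EscapeError {
    UnknownEscape(char),
    BadHex,
    UnexpectedEnd,
}

/// Hypothetical decoder: one body, with behavior chosen at compile time.
fn decode_escapes<const ALLOW_LINE_CONTINUATIONS: bool, const REJECT_HEX_ESCAPE: bool>(
    raw: &str,
) -> Result<String, EscapeError> {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        let esc = chars.next().ok_or(EscapeError::UnexpectedEnd)?;
        match esc {
            'n' => out.push('\n'),
            't' => out.push('\t'),
            '"' => out.push('"'),
            '\\' => out.push('\\'),
            // \xHH is decoded only when the mode does not reject it.
            'x' if !REJECT_HEX_ESCAPE => {
                let mut value = 0u32;
                for _ in 0..2 {
                    let c = chars.next().ok_or(EscapeError::UnexpectedEnd)?;
                    let digit = c.to_digit(16).ok_or(EscapeError::BadHex)?;
                    value = value * 16 + digit;
                }
                // Two hex digits always fit in a u8.
                out.push(char::from(value as u8));
            }
            // Backslash-newline drops the newline and any following
            // whitespace (assumed rule), only in modes that allow it.
            '\n' if ALLOW_LINE_CONTINUATIONS => {
                while matches!(chars.peek().copied(), Some(' ' | '\t' | '\n')) {
                    chars.next();
                }
            }
            other => return Err(EscapeError::UnknownEscape(other)),
        }
    }
    Ok(out)
}

/// Single-line mode in this sketch: \x accepted, no line continuations.
fn decode_single_line(raw: &str) -> Result<String, EscapeError> {
    decode_escapes::<false, false>(raw)
}

/// Multiline mode in this sketch: line continuations accepted, \x rejected.
fn decode_multiline(raw: &str) -> Result<String, EscapeError> {
    decode_escapes::<true, true>(raw)
}
```

Because the flags are const parameters, each instantiation compiles to a body with the disabled arms removed, so the shared function need not cost more than the per-lexer copies it replaces.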

@robobun

robobun commented Jun 8, 2026

Collaborator
Updated 7:04 AM PT - Jun 10th, 2026

@robobun, your commit 000af9785d30b3316b95760d52511e17f6a39694 passed in Build #61744! 🎉


🧪   To try this PR locally:

bunx bun-pr 31997

That installs a local version of the PR into your bun-31997 executable, so you can run:

bun-31997 --bun

@alii alii marked this pull request as ready for review June 9, 2026 20:18
@alii

alii commented Jun 9, 2026

Member Author

@robobun adopt

@alii alii force-pushed the claude/split/parsers branch from 6aa3e98 to 4b72a0a Compare June 9, 2026 20:19
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor


Walkthrough

This PR extracts shared escape-sequence and decimal-number scanning utilities into new trait-based infrastructure used by JavaScript, JSON, and TOML lexers; refactors parser initialization patterns; and deduplicates empty-input fast paths across JSON parser entry points.

Changes

Lexer utilities and parser cleanup

| Layer / File(s) | Summary |
|---|---|
| Shared escape-sequence decoding trait and implementation (src/ast/lexer_log.rs) | Module documentation updated; EscapeLexer trait and decode_escape_sequences function added to centralize string escape decoding (CR/LF normalization, standard/legacy/Unicode escapes, line continuations) with per-lexer mode constants and error reporting hooks. |
| Shared decimal-number scanning trait and implementation (src/parsers/lib.rs, src/parsers/number_scan.rs) | New number_scan module with DecimalLexer trait, DecimalScan result struct, and scan_decimal_digits function to unify decimal literal scanning (digits, fractions, exponents, underscore validation, legacy octal detection) across lexers. |
| JavaScript lexer escape-decoder integration (src/js_parser/lexer.rs) | LexerType implements EscapeLexer trait (UTF-16 buffer, codepoint callback); inline escape-decoding logic replaced with call to shared decode_escape_sequences; local hex_digit_value_u32 import removed. |
| JSON lexer decimal-scanner integration (src/parsers/json_lexer.rs) | Lexer implements DecimalLexer trait; numeric scanning in parse_numeric_literal_or_dot refactored to delegate to scan_decimal_digits while preserving underscore filtering and numeric parsing downstream logic. |
| TOML lexer escape and decimal scanner integration (src/parsers/toml/lexer.rs) | Lexer<'a> implements both EscapeLexer and DecimalLexer traits; escape decoding and numeric scanning replaced with calls to shared decode_escape_sequences and scan_decimal_digits functions; local hex_digit_value_u32 import removed. |
| Parser ownership initialization macro (src/js_parser/parse/parse_entry.rs) | New take_and_init_p! macro centralizes unsafe-sensitive lexer/options ownership transfer into P instances; applied to _scan_imports, to_lazy_export_ast, and analyze, replacing duplicated boilerplate. |
| JSON parser empty-input fast-path optimization (src/parsers/json.rs) | empty_source_fast_path helper extracts empty and 2-byte literal mapping to prebuilt Expr values; applied across eight entry points (parse, parse_package_json_utf8, parse_package_json_utf8_with_opts_rt, parse_utf8_impl, parse_for_macro, parse_for_bundling, parse_env_json, parse_ts_config) to eliminate duplicated fast-path logic. |
| TOML parser numeric and escape test coverage (test/js/bun/resolve/toml/toml-parse.test.ts) | New tests validate underscore digit separators, reject misplaced underscores and invalid exponents, distinguish \x escape acceptance in single-line vs. rejection in multiline strings, allow backslash-newline continuations only in multiline strings, and verify Unicode escape decoding. |
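
For readers unfamiliar with the fast-path pattern summarized in the src/parsers/json.rs row, a minimal sketch follows. The Expr variants, the helper's signature, the choice of which inputs short-circuit, and what each one maps to are all assumptions made for illustration; the walkthrough only states that empty and 2-byte literal inputs map to prebuilt Expr values.

```rust
/// Hypothetical stand-in for the parser's expression type.
#[derive(Debug, PartialEq)]
enum Expr {
    EmptyObject,
    EmptyArray,
    EmptyString,
}

/// Hypothetical helper: returns a prebuilt value for trivially small
/// inputs so callers can skip lexer and parser setup entirely.
fn empty_source_shortcut(source: &str) -> Option<Expr> {
    match source {
        // Assumed mapping: an empty source is treated like an empty object.
        "" | "{}" => Some(Expr::EmptyObject),
        "[]" => Some(Expr::EmptyArray),
        "\"\"" => Some(Expr::EmptyString),
        _ => None,
    }
}

/// Every entry point starts with the same one-line check instead of
/// carrying its own copy of the match above.
fn parse(source: &str) -> Result<Expr, String> {
    if let Some(expr) = empty_source_shortcut(source) {
        return Ok(expr);
    }
    // Stand-in for the real parser, which this sketch does not implement.
    Err(format!("full parse not implemented in this sketch: {source:?}"))
}
```

Placing the check before any parser initialization is only safe when initialization has no observable effect on an empty source, which is the claim the PR description makes for parse.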

Possibly related PRs

  • oven-sh/bun#30895: Refactors JS/TOML string escape decoding to route through the same shared bun_ast::lexer_log::decode_escape_sequences codepath; related fix addresses usize error-offset arithmetic saturation.

Suggested reviewers

  • Jarred-Sumner
  • cirospaciari
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main refactoring goal: consolidating duplicate number scanning and escape decoding logic across three lexers (js, json, toml) into shared components. |
| Description check | ✅ Passed | The description comprehensively explains what the PR does (four deduplication areas), how the code was verified (per-hunk audit, test suite passes, A/B binary equivalence testing, characterization tests), and confirms zero intended behavior change. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


The shared number_scan and decode_escape_sequences helpers encode these
behaviors in const parameters; these tests lock in the observable rules:
underscore separator placement, exponent digits, the \x single-line vs
multiline asymmetry, line continuations, and unicode escape range checks.
@robobun robobun changed the title from "Share the decimal-literal scanner between JSON and TOML lexers" to "Deduplicate number scanning and escape decoding across the js, json, and toml lexers" Jun 9, 2026
@robobun

robobun commented Jun 9, 2026

Collaborator

Adopted. Audited every hunk against main, ran the affected suites (toml, jsonc/json5, transpiler, bundler, bunfig) on the debug build, and A/B diffed a corpus of valid and invalid TOML/JSON/JS inputs against the release build: output is byte-identical, error spans included. Added characterization tests to toml-parse.test.ts pinning the quirks the shared decoder preserves.

CI note: the remaining red on build 61508 is unrelated to this diff. bunx.test.ts "should handle package that requires node 24" fails identically on the release build of main: @angular/cli 22.0.0 (published after main last built) requires node ^22.22.3 || ^24.15.0 || >=26.0.0 while bun reports v24.3.0, so Angular exits 3. Fixes are already in flight (#31820 pins the version in that test, #31991 bumps the reported node version). The other two failures (serve-body-leak on asan, node-http-connect on windows) were auto-retried lanes CI itself tags as flaky, in files this PR does not touch.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/ast/lexer_log.rs`:
- Around line 365-399: The variable-length Unicode escape loop in the
EscapeLexer helper currently breaks on iterator.next() EOF and falls through,
allowing inputs like "\u{41" to be accepted; modify the 'variable_length' loop
so that when iterator.next(&mut iter) returns false you set an appropriate error
span via *lexer.end_mut() (using start + iter.i and widths similar to the
existing EOF/brace handling) and immediately return lexer.syntax_error() instead
of breaking; update the handling inside the loop (the branch that currently does
`if !iterator.next(&mut iter) { break 'variable_length; }`) to detect EOF and
call lexer.syntax_error() so malformed `\u{...` is rejected consistently by the
EscapeLexer code path.
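
The fix requested above can be sketched in isolation: a braced code-point parser whose loop returns an error when the input ends before the closing brace, instead of breaking out and accepting the partial escape. The function name, error type, and return shape are hypothetical, and treating surrogates as out of range is a simplification of this sketch, not a statement about the real decoder.

```rust
#[derive(Debug, PartialEq, Eq)]
enum UnicodeEscapeError {
    UnterminatedBrace,
    NoDigits,
    InvalidDigit(char),
    OutOfRange,
}

/// Parses the text that follows `\u{`. On success returns the decoded
/// char and the number of bytes consumed, including the closing `}`.
fn parse_braced_code_point(rest: &str) -> Result<(char, usize), UnicodeEscapeError> {
    let mut value: u32 = 0;
    let mut digits = 0usize;
    for (offset, c) in rest.char_indices() {
        if c == '}' {
            if digits == 0 {
                return Err(UnicodeEscapeError::NoDigits);
            }
            // char::from_u32 also rejects surrogates (simplification).
            let ch = char::from_u32(value).ok_or(UnicodeEscapeError::OutOfRange)?;
            return Ok((ch, offset + 1));
        }
        let digit = c.to_digit(16).ok_or(UnicodeEscapeError::InvalidDigit(c))?;
        // value never exceeds 0x10FFFF before this step, so no u32 overflow.
        value = value * 16 + digit;
        if value > 0x10FFFF {
            return Err(UnicodeEscapeError::OutOfRange);
        }
        digits += 1;
    }
    // Input ended before `}`: report an error rather than falling through.
    Err(UnicodeEscapeError::UnterminatedBrace)
}
```

With this shape, an input such as \u{41 with no closing brace yields UnterminatedBrace, which corresponds to the syntax error the review comment asks for.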

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1971277f-11fd-43e5-bbdb-06c473e171c0

📥 Commits

Reviewing files that changed from the base of the PR and between a988615 and 1c37c64.

📒 Files selected for processing (9)
  • src/ast/lexer_log.rs
  • src/js_parser/lexer.rs
  • src/js_parser/parse/parse_entry.rs
  • src/parsers/json.rs
  • src/parsers/json_lexer.rs
  • src/parsers/lib.rs
  • src/parsers/number_scan.rs
  • src/parsers/toml/lexer.rs
  • test/js/bun/resolve/toml/toml-parse.test.ts
