Deduplicate number scanning and escape decoding across the js, json, and toml lexers by alii · Pull Request #31997 · oven-sh/bun

alii · 2026-06-08T22:50:38Z

What this does

Four mechanical deduplications across the lexers, zero intended behavior change:

src/parsers/number_scan.rs (new): the byte-identical ~80 line decimal digit scan (underscore separator rules, fraction, exponent) that the json and toml lexers each carried, extracted into one #[inline] generic over a small lexer-accessor trait. Monomorphizes per lexer type, so codegen matches the previous inline copies.
src/ast/lexer_log.rs: the ~330 line string escape-sequence decoder that the js, json, and toml lexers each carried, extracted into one generic function (decode_escape_sequences over an EscapeLexer trait). The behavioral differences between the three are encoded in const parameters: IS_JSON (strict JSON escape set), ALLOW_LINE_CONTINUATIONS and REJECT_HEX_ESCAPE (toml multiline vs single-line basic strings), and LEGACY_ERROR_SPANS (toml keeps its historical error span shape; see #31134, "bun-rust diagnostic error message has incorrect location for template strings").
src/parsers/json.rs: seven copies of the empty-source fast path collapse into one helper. In parse this also moves the check ahead of parser init, matching the other six entry points; init on an empty source has no observable effect, so nothing changes.
src/js_parser/parse/parse_entry.rs: the three identical move-lexer-out-and-init-in-place prologues become one macro.
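
To illustrate the first item, here is a minimal sketch of a decimal scan written once as an #[inline] function generic over a small accessor trait. It is not the code in src/parsers/number_scan.rs: the names DigitSource, Scan, scan_decimal, and SliceLexer are hypothetical, and the underscore rule shown (an underscore must sit between two digits) is an assumption made for the example.

```rust
/// Hypothetical accessor trait: the minimum a lexer exposes to the shared scan.
pub trait DigitSource {
    /// Current byte, or None at end of input.
    fn peek(&self) -> Option<u8>;
    /// Advance by one byte.
    fn bump(&mut self);
}

/// What the scan observed (illustrative fields only).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Scan {
    pub has_fraction: bool,
    pub has_exponent: bool,
    pub underscores: usize,
}

/// Consume a run of digits. In this sketch `_` is only valid between two digits.
fn digits<L: DigitSource>(lexer: &mut L, underscores: &mut usize) -> Result<usize, &'static str> {
    let mut count = 0;
    let mut last_was_underscore = false;
    while let Some(b) = lexer.peek() {
        match b {
            b'0'..=b'9' => {
                count += 1;
                last_was_underscore = false;
            }
            b'_' if count > 0 && !last_was_underscore => {
                *underscores += 1;
                last_was_underscore = true;
            }
            b'_' => return Err("misplaced underscore"),
            _ => break,
        }
        lexer.bump();
    }
    if last_was_underscore {
        return Err("trailing underscore");
    }
    Ok(count)
}

/// Shared scan: integer part, optional fraction, optional exponent.
/// Generic over the lexer type, so each lexer gets its own monomorphized copy.
#[inline]
pub fn scan_decimal<L: DigitSource>(lexer: &mut L) -> Result<Scan, &'static str> {
    let mut scan = Scan::default();
    if digits(lexer, &mut scan.underscores)? == 0 {
        return Err("expected digit");
    }
    if lexer.peek() == Some(b'.') {
        lexer.bump();
        scan.has_fraction = true;
        if digits(lexer, &mut scan.underscores)? == 0 {
            return Err("expected digit after '.'");
        }
    }
    if matches!(lexer.peek(), Some(b'e' | b'E')) {
        lexer.bump();
        scan.has_exponent = true;
        if matches!(lexer.peek(), Some(b'+' | b'-')) {
            lexer.bump();
        }
        if digits(lexer, &mut scan.underscores)? == 0 {
            return Err("expected exponent digit");
        }
    }
    Ok(scan)
}

/// Toy lexer over a byte slice, standing in for the json and toml lexers.
pub struct SliceLexer<'a> {
    pub src: &'a [u8],
    pub pos: usize,
}

impl DigitSource for SliceLexer<'_> {
    fn peek(&self) -> Option<u8> {
        self.src.get(self.pos).copied()
    }
    fn bump(&mut self) {
        self.pos += 1;
    }
}
```

Each lexer would implement the accessor trait once and call the shared function; because the function is generic rather than dynamically dispatched, the compiler specializes it per lexer type.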

Split from #31912 (whole-repo simplification pass, closed in favor of module-scoped splits).

Verification

Per-hunk audit against main: each extracted region diffed against every reference copy it replaces.
Existing suites pass on the debug build: toml resolve, jsonc/json5/jsonl, transpiler, bundler_string, template literals, macros, bunfig config.
A/B against the release build of main: a 37-file corpus of valid and invalid TOML/JSON/JS inputs (error messages and spans included, covering the toml legacy spans and the \x and line-continuation asymmetries), bundler output for json imports, tsconfig paths resolution, and empty sources all produce byte-identical output on both binaries.
New characterization tests in toml-parse.test.ts pin the quirks the const parameters preserve: underscore placement, exponent digits, \x allowed in single-line but not multiline basic strings, line continuations in multiline only, unicode escape range checks. They pass unchanged on both this branch and the current release.

robobun · 2026-06-08T22:50:47Z

Updated 7:04 AM PT - Jun 10th, 2026

✅ @robobun, your commit 000af9785d30b3316b95760d52511e17f6a39694 passed in Build #61744! 🎉

🧪 To try this PR locally:

bunx bun-pr 31997

That installs a local version of the PR into your bun-31997 executable, so you can run:

bun-31997 --bun

alii · 2026-06-09T20:18:49Z

@robobun adopt

coderabbitai · 2026-06-09T20:21:07Z

Walkthrough

This PR extracts shared escape-sequence and decimal-number scanning utilities into new trait-based infrastructure used by JavaScript, JSON, and TOML lexers; refactors parser initialization patterns; and deduplicates empty-input fast paths across JSON parser entry points.

Changes

Lexer utilities and parser cleanup

Layer / File(s)	Summary
Shared escape-sequence decoding trait and implementation `src/ast/lexer_log.rs`	Module documentation updated; `EscapeLexer` trait and `decode_escape_sequences` function added to centralize string escape decoding (CR/LF normalization, standard/legacy/Unicode escapes, line continuations) with per-lexer mode constants and error reporting hooks.
Shared decimal-number scanning trait and implementation `src/parsers/lib.rs`, `src/parsers/number_scan.rs`	New `number_scan` module with `DecimalLexer` trait, `DecimalScan` result struct, and `scan_decimal_digits` function to unify decimal literal scanning (digits, fractions, exponents, underscore validation, legacy octal detection) across lexers.
JavaScript lexer escape-decoder integration `src/js_parser/lexer.rs`	`LexerType` implements `EscapeLexer` trait (UTF-16 buffer, codepoint callback); inline escape-decoding logic replaced with call to shared `decode_escape_sequences`; local `hex_digit_value_u32` import removed.
JSON lexer decimal-scanner integration `src/parsers/json_lexer.rs`	`Lexer` implements `DecimalLexer` trait; numeric scanning in `parse_numeric_literal_or_dot` refactored to delegate to `scan_decimal_digits` while preserving underscore filtering and numeric parsing downstream logic.
TOML lexer escape and decimal scanner integration `src/parsers/toml/lexer.rs`	`Lexer<'a>` implements both `EscapeLexer` and `DecimalLexer` traits; escape decoding and numeric scanning replaced with calls to shared `decode_escape_sequences` and `scan_decimal_digits` functions; local `hex_digit_value_u32` import removed.
Parser ownership initialization macro `src/js_parser/parse/parse_entry.rs`	New `take_and_init_p!` macro centralizes unsafe-sensitive lexer/options ownership transfer into `P` instances; applied to `_scan_imports`, `to_lazy_export_ast`, and `analyze`, replacing duplicated boilerplate.
JSON parser empty-input fast-path optimization `src/parsers/json.rs`	`empty_source_fast_path` helper extracts empty and 2-byte literal mapping to prebuilt `Expr` values; applied across eight entry points (`parse`, `parse_package_json_utf8`, `parse_package_json_utf8_with_opts_rt`, `parse_utf8_impl`, `parse_for_macro`, `parse_for_bundling`, `parse_env_json`, `parse_ts_config`) to eliminate duplicated fast-path logic.
TOML parser numeric and escape test coverage `test/js/bun/resolve/toml/toml-parse.test.ts`	New tests validate underscore digit separators, reject misplaced underscores and invalid exponents, distinguish `\x` escape acceptance in single-line vs. rejection in multiline strings, allow backslash-newline continuations only in multiline strings, and verify Unicode escape decoding.
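
As a rough illustration of the empty-input fast path summarized in the table, the sketch below shows one helper that several entry points can consult before doing any parser setup. The Expr enum here is a stand-in, not Bun's Expr type, and the exact mapping (empty input to an empty object, two-byte literals to their empty values) is an assumption for the example rather than something stated in this PR.

```rust
/// Stand-in for the prebuilt expression values; not Bun's real `Expr`.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    EmptyObject,
    EmptyArray,
    EmptyString,
    /// Placeholder for anything produced by the full parser.
    Parsed(String),
}

/// Returns a prebuilt value for trivially small inputs, or None when the
/// caller must run the full parser. The mapping is assumed for illustration.
pub fn empty_source_fast_path(contents: &[u8]) -> Option<Expr> {
    match contents {
        [] => Some(Expr::EmptyObject),
        [b'{', b'}'] => Some(Expr::EmptyObject),
        [b'[', b']'] => Some(Expr::EmptyArray),
        [b'"', b'"'] => Some(Expr::EmptyString),
        _ => None,
    }
}

/// Hypothetical entry point: check the fast path first, then fall back to
/// whatever full parse the caller supplies.
pub fn parse_with<F>(contents: &[u8], full_parse: F) -> Expr
where
    F: FnOnce(&[u8]) -> Expr,
{
    if let Some(expr) = empty_source_fast_path(contents) {
        return expr;
    }
    full_parse(contents)
}
```

Checking before any parser state is built means the slow path, passed in here as a closure, never runs for inputs the helper recognizes.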

Possibly related PRs

oven-sh/bun#30895: Refactors JS/TOML string escape decoding to route through the same shared bun_ast::lexer_log::decode_escape_sequences codepath; related fix addresses usize error-offset arithmetic saturation.

Suggested reviewers

Jarred-Sumner
cirospaciari

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main refactoring goal: consolidating duplicate number scanning and escape decoding logic across three lexers (js, json, toml) into shared components.
Description check	✅ Passed	The description comprehensively explains what the PR does (four deduplication areas), how the code was verified (per-hunk audit, test suite passes, A/B binary equivalence testing, characterization tests), and confirms zero intended behavior change.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


The shared number_scan and decode_escape_sequences helpers encode these behaviors in const parameters; these tests lock in the observable rules: underscore separator placement, exponent digits, the \x single-line vs multiline asymmetry, line continuations, and unicode escape range checks.
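
A small sketch of how const parameters can select per-lexer behavior in a single decoder follows. It borrows the parameter names IS_JSON and ALLOW_LINE_CONTINUATIONS from the PR description, but the function decode_escapes, its signature, and the tiny escape set are hypothetical simplifications; the real decode_escape_sequences works against an EscapeLexer trait and handles far more cases.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum EscapeError {
    /// A backslash followed by a character this mode does not accept.
    Unknown(char),
    /// Input ended right after a backslash.
    UnexpectedEnd,
}

/// Decode a handful of escapes. The two const parameters are resolved at
/// compile time, so each instantiation contains only the branches it needs.
pub fn decode_escapes<const IS_JSON: bool, const ALLOW_LINE_CONTINUATIONS: bool>(
    input: &str,
) -> Result<String, EscapeError> {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next().ok_or(EscapeError::UnexpectedEnd)? {
            'n' => out.push('\n'),
            't' => out.push('\t'),
            '"' => out.push('"'),
            '\\' => out.push('\\'),
            // Assumed for this sketch: accepted only outside the strict JSON set.
            '\'' if !IS_JSON => out.push('\''),
            // Backslash-newline is dropped when continuations are enabled.
            // (Real TOML also trims following whitespace; omitted here.)
            '\n' if ALLOW_LINE_CONTINUATIONS => {}
            other => return Err(EscapeError::Unknown(other)),
        }
    }
    Ok(out)
}
```

A caller picks the mode with a turbofish, for example decode_escapes::<true, false>(s) for a strict JSON-like mode or decode_escapes::<false, true>(s) for a multiline-string-like mode, which is the kind of asymmetry the characterization tests are meant to pin.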

robobun · 2026-06-09T20:41:14Z

Adopted. Audited every hunk against main, ran the affected suites (toml, jsonc/json5, transpiler, bundler, bunfig) on the debug build, and A/B diffed a corpus of valid and invalid TOML/JSON/JS inputs against the release build: output is byte-identical, error spans included. Added characterization tests to toml-parse.test.ts pinning the quirks the shared decoder preserves.

CI note: the remaining red on build 61508 is unrelated to this diff. bunx.test.ts "should handle package that requires node 24" fails identically on the release build of main: @angular/cli 22.0.0 (published after main last built) requires node ^22.22.3 || ^24.15.0 || >=26.0.0 while bun reports v24.3.0, so Angular exits 3. Fixes are already in flight (#31820 pins the version in that test, #31991 bumps the reported node version). The other two failures (serve-body-leak on asan, node-http-connect on windows) were auto-retried lanes CI itself tags as flaky, in files this PR does not touch.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/ast/lexer_log.rs`:
- Around line 365-399: The variable-length Unicode escape loop in the
EscapeLexer helper currently breaks on iterator.next() EOF and falls through,
allowing inputs like "\u{41" to be accepted; modify the 'variable_length' loop
so that when iterator.next(&mut iter) returns false you set an appropriate error
span via *lexer.end_mut() (using start + iter.i and widths similar to the
existing EOF/brace handling) and immediately return lexer.syntax_error() instead
of breaking; update the handling inside the loop (the branch that currently does
`if !iterator.next(&mut iter) { break 'variable_length; }`) to detect EOF and
call lexer.syntax_error() so malformed `\u{...` is rejected consistently by the
EscapeLexer code path.
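
The behavior the comment asks for can be sketched in isolation. This is not the loop in src/ast/lexer_log.rs: parse_braced_code_point and UnicodeEscapeError are hypothetical names, and the function works on a plain &str rather than the lexer's iterator. It shows only the control-flow point at issue, namely that running out of input before the closing brace returns an error instead of falling through. Note that Rust's char also rejects surrogate code points, which may be stricter than what the JS lexer accepts.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum UnicodeEscapeError {
    /// Input ended before the closing `}`.
    Unterminated,
    /// `\u{}` with no digits.
    Empty,
    /// A non-hex character inside the braces.
    InvalidDigit(char),
    /// The value is not a valid Unicode scalar value.
    OutOfRange(u32),
}

/// Parse the text that follows `\u{`. On success returns the decoded
/// character and the number of bytes consumed, including the closing `}`.
pub fn parse_braced_code_point(rest: &str) -> Result<(char, usize), UnicodeEscapeError> {
    let mut value: u32 = 0;
    let mut digits = 0usize;
    for (i, c) in rest.char_indices() {
        if c == '}' {
            if digits == 0 {
                return Err(UnicodeEscapeError::Empty);
            }
            let ch = char::from_u32(value).ok_or(UnicodeEscapeError::OutOfRange(value))?;
            return Ok((ch, i + 1));
        }
        let d = c.to_digit(16).ok_or(UnicodeEscapeError::InvalidDigit(c))?;
        // Saturate so very long digit runs end up out of range instead of wrapping.
        value = value.saturating_mul(16).saturating_add(d);
        digits += 1;
    }
    // End of input without `}`: reject explicitly rather than breaking out
    // of the loop and accepting whatever was accumulated.
    Err(UnicodeEscapeError::Unterminated)
}
```

With this shape, the input from the review comment ("\u{41" with no closing brace) corresponds to calling the function with "41", which yields the Unterminated error.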


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1971277f-11fd-43e5-bbdb-06c473e171c0

📥 Commits

Reviewing files that changed from the base of the PR and between a988615 and 1c37c64.

📒 Files selected for processing (9)

src/ast/lexer_log.rs
src/js_parser/lexer.rs
src/js_parser/parse/parse_entry.rs
src/parsers/json.rs
src/parsers/json_lexer.rs
src/parsers/lib.rs
src/parsers/number_scan.rs
src/parsers/toml/lexer.rs
test/js/bun/resolve/toml/toml-parse.test.ts

alii marked this pull request as ready for review June 9, 2026 20:18

parsers: consolidate lexer logging and number scanning helpers

4b72a0a

alii force-pushed the claude/split/parsers branch from 6aa3e98 to 4b72a0a Compare June 9, 2026 20:19

robobun changed the title ~~Share the decimal-literal scanner between JSON and TOML lexers~~ Deduplicate number scanning and escape decoding across the js, json, and toml lexers Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

Comment thread src/ast/lexer_log.rs

robobun added 2 commits June 9, 2026 18:37

Merge branch 'main' into claude/split/parsers

02b0fcd

ci: retrigger

000af97

alii wants to merge 4 commits into main from claude/split/parsers
