Skip to content

fix(parser): recover from unexpected tokens in expression position#1709

Open
ghaith wants to merge 6 commits into
masterfrom
fix/parser-recovery-1306
Open

fix(parser): recover from unexpected tokens in expression position#1709
ghaith wants to merge 6 commits into
masterfrom
fix/parser-recovery-1306

Conversation

@ghaith
Copy link
Copy Markdown
Collaborator

@ghaith ghaith commented Apr 30, 2026

Summary

parse_atomic_leaf_expression's fallback emitted an E007 unexpected_token_found diagnostic but did not advance the lexer, leaving the bad token in the stream for binary-op parsers higher in the cascade to re-consume. For prog(p := &g) from #1306 this meant the AST became BinaryExpression { And, EmptyStatement, g }; the empty LHS reached codegen and produced a confusing no type hint available for EmptyStatement while compilation reported success. The same shape bit any other binary-only operator misused as a unary prefix (MOD x, OR x, …).

Recovery now:

  • emits a clearer "expected expression" diagnostic (was "Literal");
  • when the bad token is itself an operator, advances past it and retries the leaf so the operand the user wrote is captured (&gg);
  • otherwise leaves the token in place so outer parsers can use it for synchronization (e.g. END_CASE, END_VAR).

The missing-parameter-assignment carve-out (foo(p := )) is narrowed to fire only when the next token actually closes the parameter list () or ,); previously any unrecognised token after := was silently treated as an empty parameter, masking real errors like the one in #1306.

Resolver tests that exercised pointer arithmetic via the dead-code &x path are migrated to REF(x).

Fixes #1306

Behaviour change for users

Before:

$ plc bug.st --check
$ echo $?
0

After:

$ plc bug.st --check
error[E007]: Unexpected token: expected expression but found &
   ┌─ bug.st:13:14
   │
13 │     foo(p := &g);
   │              ^ Unexpected token: expected expression but found &

Compilation aborted due to critical parse errors.
$ echo $?
1

Test plan

  • New inline-snapshot tests in expressions_parser_tests.rs covering prog(p := &g) (the Using the removed & - operator in a formal parameter assignment will cause an error during codegen #1306 regression), bare &y, and MOD y (generality witness).
  • Defensive amp_as_and_test (binary b & c) stays green.
  • 16 existing parser-error snapshots updated to reflect the "Literal""expression" wording. Locations and column markers are unchanged across all of them.
  • qualified_reference_location_test no longer covers &aaa.bbb.ccc since that case was anchoring the buggy EmptyStatement AND expr shape.
  • Two resolver tests migrated from &x to REF(x) (the & path was dead code that fell through the buggy parser recovery).
  • cargo test --workspace clean (2466/2466).
  • cargo xtask lit clean (347/347).
  • cargo fmt --all + cargo clippy --workspace -- -Dwarnings clean.
  • Manual repro: plc <file> --check on the issue's input now exits 1 with E007; full compile aborts before codegen with the same diagnostic.

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 30, 2026

Build Artifacts

🐧 Linux

Artifact Link Size
deb-x86_64 Download 38.5 MB
schema Download 0.0 MB
stdlib Download 32.4 MB
plc-x86_64 Download 43.7 MB
deb-aarch64 Download 31.0 MB
plc-aarch64 Download 43.6 MB

From workflow run

🪟 Windows

Artifact Link Size
stdlib.lib Download 4.3 MB
stdlib.dll Download 0.1 MB
plc.exe Download 38.5 MB

From workflow run

Copy link
Copy Markdown
Member

@mhasel mhasel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this review was generated by Claude (via Claude Code) and posted on my behalf. Treat the suggestions as machine-generated input — verify before acting. — Michael

Overview

Fixes #1306: parse_atomic_leaf_expression's fallback diagnosed E007 but didn't advance the lexer, so binary-op parsers higher in the cascade re-consumed the bad token (e.g. & re-parsed as binary-AND with EmptyStatement LHS). The malformed AST then flowed into codegen, producing a confusing EmptyStatement type-hint error while compilation reported success. The fix advances past binary-only operators used as prefixes and narrows the foo(p := ) empty-parameter carve-out to only fire when the next token is actually ) or ,.

What I like

  • The diagnosis is sharp. The PR body correctly identifies why the AST was malformed: silent token re-consumption, plus an over-broad empty-parameter carve-out. Both root causes are addressed.
  • The carve-out tightening is the cleaner of the two changes. Adding && matches!(lexer.token, KeywordParensClose | KeywordComma) is a one-line invariant that prevents a recovery path from masking real errors. This is a strict improvement.
  • Diagnostic wording. "Literal""expression" is more accurate. The 16 mechanical snapshot updates all preserve location/column markers and only change the wording, which is exactly the right blast radius.
  • Test coverage. Three new tests pin: the #1306 regression (&g in a call arg), bare &y in an assignment, and MOD y (generality witness — proves the fix isn't &-specific). The retained amp_as_and_test confirms b & c still parses correctly. The combination is convincing.
  • Resolver-test migration to REF(x) is the right call given that &x was never first-class syntax — only working via the buggy path. The migrated tests still exercise the underlying pointer-arithmetic resolver logic.

Suggestions

  • Recursion after advance. return parse_atomic_leaf_expression(lexer) is a tail call that recurses for each consecutive bad operator. Pathological input like & & & g would emit one diagnostic per & and recurse N times. The lexer is bounded so this won't OOM, but converting to a loop would (a) avoid burning stack on adversarial input and (b) make the bound visible:
    while to_operator(&lexer.token).is_some() {
        lexer.advance();
    }
    // then fall through to a single retry
    Optional — current form is fine in practice.
  • What counts as "binary-only" is implicit. to_operator(&lexer.token).is_some() returns true for unary-eligible tokens too (-, +, NOT). Those shouldn't reach this fallback because the prefix-operator parser handles them earlier — but the comment frames this as recovery for "binary-only operators misused as a prefix," which doesn't quite match what the predicate actually does. A one-liner clarifying "any token that has an operator interpretation; unary-eligible operators normally don't reach here" would close the gap.
  • A negative test for the narrowed carve-out would help. The fix tightens foo(p := ) recovery to only fire for ) or ,. The new tests cover the case where the carve-out should not fire (foo(p := &g) now produces E007 instead of silent empty-param), but there's no positive test that the carve-out still works for the original foo(p := ) shape. If one already exists, no action; otherwise consider adding one to anchor that the legitimate recovery path is preserved.
  • Behavior-change call-out. The PR body lists the user-visible change (plc --check now exits 1 on previously-silent input). This belongs in release notes too — projects with broken-but-compiling code will see a new build failure, and they should know it's the parser becoming honest, not a regression.

Risks

  • Loud failure for previously-silent code. Any project that had &x (or other binary-only-as-prefix) buried in code now gets E007 and fails compilation. Strictly an improvement, but worth a heads-up in release notes.
  • The resolver-test migration is the most fragile part. Two tests changed semantics from "parse &x and resolve through the buggy AST shape" to "parse REF(x) cleanly." The behavior being asserted (pointer arithmetic resolves to a pointer) is preserved, but if anyone has external lit/integration tests that relied on &x parsing, they'll break — please grep for & followed by an identifier in tests/lit/ and tests/integration/ to confirm no other coverage was leaning on the dead path.
  • to_operator coverage. If to_operator returns None for any token that should be recovered (e.g., a punctuation-only token that intuitively reads as binary), the recovery won't fire and we fall back to leaving the token in the stream. That's the conservative direction (no false advance), so I don't think it's a real risk, just worth noting.

Verdict

LGTM. Well-scoped fix, good test coverage, appropriate snapshot churn, and the carve-out tightening is genuinely valuable on its own. The recursion-vs-loop point is a nit; the comment-clarification and release-notes points are small. Approve once the lit-test grep confirms no external &x usage.

ghaith added a commit that referenced this pull request May 5, 2026
…-out test

Review followups on PR #1709:
- Replace the tail-recursing recovery with a loop drain. On adversarial
  input like `& & & g` the previous form emitted one diagnostic per `&`
  and recursed N times; the loop drains the operator run and retries
  once, producing exactly one diagnostic. Guards against infinite
  retry on non-operator bad tokens via an `advanced` flag — the
  reviewer's bare-while suggestion would have looped forever on
  inputs like `END_VAR` directly.
- Sharpen the recovery comment: `to_operator()` also matches
  unary-eligible operators (`-`, `+`, `NOT`), but those don't reach
  this fallback because the prefix-operator parser handles them
  earlier in the cascade.
- Add `consecutive_bad_operators_emit_a_single_diagnostic` to lock
  the N→1 reduction.
- Add `empty_parameter_assignment_carve_out_still_fires` to anchor
  the positive complement of the narrowed `foo(p := )` carve-out.

Verified during followup that no `.st` file under tests/lit,
tests/integration, or libs relied on the previously-tolerated
`&<ident>` shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/parser/tests/expressions_parser_tests.rs
ghaith and others added 3 commits May 6, 2026 08:59
…1306)

`parse_atomic_leaf_expression`'s fallback emitted an `E007` diagnostic but
left the bad token in the stream, so binary-op parsers higher in the
cascade re-consumed it (e.g. `&` taken as binary AND with an `EmptyStatement`
LHS). The malformed AST flowed into codegen, producing a confusing
"no type hint available for EmptyStatement" while compilation reported
success. `prog(p := &g)` from #1306 was the salient case; any binary-only
operator misused as a unary prefix had the same shape.

Recovery now:
- emits a clearer "expected expression" diagnostic (was "Literal");
- when the bad token is itself an operator, advances past it and retries
  the leaf so the operand the user wrote is captured (`&g` -> `g`);
- otherwise leaves the token in place so outer parsers can use it for
  synchronization (e.g. `END_CASE`, `END_VAR`).

The missing-parameter-assignment carve-out (`foo(p := )`) is narrowed
to fire only when the next token is actually `)` or `,`; previously
any unrecognised token after `:=` was silently treated as an empty
parameter, masking real errors.

Resolver tests that exercised pointer arithmetic via the dead-code
`&x` path are migrated to `REF(x)`.

Fixes #1306

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-out test

Review followups on PR #1709:
- Replace the tail-recursing recovery with a loop drain. On adversarial
  input like `& & & g` the previous form emitted one diagnostic per `&`
  and recursed N times; the loop drains the operator run and retries
  once, producing exactly one diagnostic. Guards against infinite
  retry on non-operator bad tokens via an `advanced` flag — the
  reviewer's bare-while suggestion would have looped forever on
  inputs like `END_VAR` directly.
- Sharpen the recovery comment: `to_operator()` also matches
  unary-eligible operators (`-`, `+`, `NOT`), but those don't reach
  this fallback because the prefix-operator parser handles them
  earlier in the cascade.
- Add `consecutive_bad_operators_emit_a_single_diagnostic` to lock
  the N→1 reduction.
- Add `empty_parameter_assignment_carve_out_still_fires` to anchor
  the positive complement of the narrowed `foo(p := )` carve-out.

Verified during followup that no `.st` file under tests/lit,
tests/integration, or libs relied on the previously-tolerated
`&<ident>` shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…test

Spell out that `p := )` and `p := ,` are explicitly allowed by the parser
(land as `Assignment { right: EmptyStatement }`, no diagnostic) and that
the test only locks in the carve-out — i.e. the new `parse_atomic_leaf_
expression` recovery must not consume the trailing `)` or `,`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ghaith ghaith force-pushed the fix/parser-recovery-1306 branch from 0bba529 to 4b3fd5c Compare May 6, 2026 07:13
@ghaith ghaith added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@ghaith ghaith added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@ghaith ghaith added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@ghaith ghaith enabled auto-merge May 29, 2026 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Using the removed & - operator in a formal parameter assignment will cause an error during codegen

3 participants