Skip to content

Merge upstream v4.3.0 + proposer L1 block pinned sync#138

Merged
seolaoh merged 34 commits into
developfrom
seolaoh/merge-v4.3.0
May 4, 2026
Merged

Merge upstream v4.3.0 + proposer L1 block pinned sync#138
seolaoh merged 34 commits into
developfrom
seolaoh/merge-v4.3.0

Conversation

@seolaoh
Copy link
Copy Markdown
Collaborator

@seolaoh seolaoh commented Apr 27, 2026

Merged upstream v4.3.0 which includes some proposer improvements, sp1 crates bump to v6.1.0, and kona bump to v1.2.13. Confirmed that cost-estimator on Mainnet arbitrary 5 blocks worked.

Related to https://github.com/celo-org/celo-blockchain-planning/issues/1324

Remarkable changes

New configs/metrics

  • add TX_CONFIRMATION_TIMEOUT to challenger/proposer and set to 180

  • add SYNC_L1_CONFIRMATIONS to proposer and set to whatever we want to force as a safety gap from the latest block

  • proposer new metric: op_succinct_fp_canonical_head_game_index, op_succinct_fp_anchor_game_index

    • sustained op_succinct_fp_canonical_head_game_index = -1 after games should exist
    • decreases in op_succinct_fp_canonical_head_game_index, which can indicate canonical head rewind or orphaning
    • the gap between op_succinct_fp_anchor_game_index and op_succinct_fp_canonical_head_game_index
    • op_succinct_fp_finalized_l2_block_number = 0, which indicates that finalized lookup is currently unavailable

fakedev9999 and others added 27 commits March 10, 2026 20:34
…inctlabs#829)

* fix: graceful fallback when eth_getBlockReceipts is unavailable

Restricted RPC endpoints return 403 for eth_getBlockReceipts, crashing
get_l2_block_data_range and losing expensive SP1 execution results.

- fetcher: wrap get_block_receipts in match, return zero fees on failure
- stats: guard empty block_data and prevent division by zero
- cost_estimator: same division-by-zero guard in aggregate_execution_stats

* style: fix rustfmt formatting in fetcher.rs match expression

* fix: log actual error in eth_getBlockReceipts fallback

Address review feedback: split Ok(None) | Err(_) into separate match
arms so the error is logged via `error = %e` when the RPC call fails.
…cinctlabs#843)

* ci: remove PR triggers from heavy CI workflows to reduce costs

E2E sysgo tests and integration tests were running on every PR push,
costing ~$2.5k/mo with 9+ concurrent 64-core runners. Move these to
merge_group + main push only, with /sysgo and /integration comment
triggers for on-demand runs during development.

Refs: succinctlabs/cloud-ops#69

* ci: restore PR triggers with reduced matrix per review feedback

- PR: run reduced e2e matrix (validity/bootstrap + faultproof/bootstrap)
  with path filters for fast feedback
- PR: run reduced integration matrix (sync only) with path filters
- merge_group/main/tags: full matrix as before
- /sysgo and /integration comment triggers: full matrix on-demand
- Long-running tests: merge_group/main/sysgo only (not on PR)
…tlabs#839)

* fix: prevent duplicate game creation via anchor parent index

When a finalized game becomes the anchor, `anchors(GAME_TYPE)` returns
the same data as looking up the game by its factory index. This allows
creating two games with the same (rootClaim, l2BlockNumber, startingOutputRoot)
but different parentIndex values, bypassing the factory's UUID uniqueness check.

Fix: use `getStartingAnchorRoot()` instead of `anchors()` in the
`parentIndex == uint32.max` branch, so it always resolves to the genesis
anchor root rather than the current anchor.

Bumps optimism submodule from op-contracts/v5.0.0 to v6.0.0 to pick up
`getStartingAnchorRoot()` (Optimism PR #18343).

Refs: succinctlabs/cloud-ops#56
Co-authored-by: Roger Bai <roger-bai-coinbase@users.noreply.github.com>

* refactor: extract _finalizeAndClose helper and improve test naming

- Extract duplicated warp→resolve→warp→closeGame into _finalizeAndClose
- Rename testNoDuplicateGameWithAnchorParent to testDifferentStartingRootForIndexVsGenesisParent
- Add assertion that byIndex and byGenesis are distinct contracts

* fix: require parent game ahead of anchor and use current anchor for genesis path

- Parent game's L2 block must be strictly ahead of the anchor state,
  preventing duplicate games when anchor == parent.
- uint32.max path uses getAnchorRoot() (current anchor) instead of
  getStartingAnchorRoot() (genesis), enabling recovery after game type
  switches without resetting to genesis.

* style: fix forge fmt

* refactor: remove redundant IAnchorStateRegistry casts

* test: add edge case coverage for anchor-aware parent validation

- testCannotCreateGenesisPathBelowAnchor: uint32.max with l2Block <= anchor
- testChainFromAnchorPathGame: chain child via parent index from uint32.max game
- testGenesisPathUsesLatestAnchorAfterMultipleAdvances: multiple anchor updates
- testParentOneBlockAheadOfAnchorSucceeds: boundary condition (anchor + 1)

* test: fix wrong comment and remove misleading boundary test

* test: remove redundant testCannotUseAnchorGameAsParentIndex

* fix: proposer uses anchor path when canonical head is the anchor game

The contract now requires parent.l2SeqNum > anchor.l2SeqNum. When the
canonical head IS the anchor game (e.g., after first game finalizes),
the proposer must use uint32.max (anchor path) instead of referencing
it by index. Also reorders contract checks so the cheap CHALLENGER_WINS
SLOAD runs before the getAnchorRoot external call.

* test: fix CI failures from parent-ahead-of-anchor constraint

- Fix clippy: map_or → direct Option comparison (proposer.rs:1937)
- Fix integration test: accept u32::MAX as valid parent when canonical
  head is the anchor game (test_game_chain_validation_anchor_reset)
- Fix sync tests: create all games before setting anchors, since the
  contract now rejects parent.l2SeqNum <= anchor.l2SeqNum

* style: rename dl to deadline in _finalizeAndClose helper

* style: fix rustfmt operator placement

* fix: restore time gaps between game creations in deadline lag test

The two-phase refactor (create all games, then set anchors) removed
the time warps that originally occurred between game creations. This
caused all games to have nearly identical deadlines, preventing the
MAX_GAME_DEADLINE_LAG filter from dropping stale games during sync.

Add explicit time warps between game creations in phase 1 to replicate
the original timing gaps while keeping the two-phase structure.

* style: fix rustfmt line wrap

---------

Co-authored-by: Roger Bai <roger-bai-coinbase@users.noreply.github.com>
…cinctlabs#842)

* feat: cache-first rollup config loading for hardfork transitions

When a node is upgraded before hardfork activation, the node RPC returns
post-hardfork rollup config prematurely. The proposer then computes a
wrong rollup_config_hash, creating invalid dispute games.

Change fetch_and_save_rollup_config to check for a cached config file
before fetching from RPC. On first run the config is fetched and cached;
on subsequent runs the cached file is reused. A best-effort RPC
comparison (5s timeout) warns operators when configs diverge, signaling
that the cache should be refreshed after hardfork activation.

Closes succinctlabs/cloud-ops#67

* docs: remove manual cache deletion instructions

The proposer's on-chain vkey check (on_chain_vkeys_match) already
prevents game creation with a mismatched rollup_config_hash, so
operators never need to manually delete the cached config file.
The WARN log is kept for visibility but no longer prescribes action.

* fix: validate chain ID after RPC fetch, clarify warn-only intent

- Add chain ID validation after initial RPC fetch to catch L2_RPC /
  L2_NODE_RPC misconfiguration on first run (symmetric with cache path)
- Add doc comment on compare_config_with_rpc clarifying that warn-only
  is intentional — on-chain vkey check is the authoritative gate

* style: rustfmt

* fix: use struct init syntax to satisfy clippy field_reassign_with_default

* test: drop stdlib/serde tests per review
…s#844)

* feat(fault-proof): filter games by anchor state registry in fetch_game

During hardfork transitions, old ASR games in the proposer's local DAG
can pollute canonical head selection, causing the proposer to create
new games on top of invalid parents.

Add an anchorStateRegistry() check in fetch_game() after the game type
check and before other on-chain calls. Games with a different ASR are
rejected with UnsupportedAnchorStateRegistry, keeping the DAG clean.

Operators should run one proposer per ASR during transitions: the old
proposer winds down existing games while the new one creates fresh ones.

Upstreams the approach from #118.
Closes succinctlabs/cloud-ops#68

* test(fault-proof): add ASR filtering tests for canonical head selection

Three tests that exercise ProposerState.descendants_of() to verify
the DAG behavior that motivates the ASR filter:

- old_asr_games_pollute_canonical_head: demonstrates the bug where
  old ASR games with higher L2 blocks override the canonical head
- filtered_dag_produces_correct_canonical_head: verifies clean DAG
  after filtering produces correct head selection
- descendants_excludes_unrelated_chains: verifies DAG traversal
  correctly isolates independent game chains

* test: trim test comments

* refactor: address review feedback

- Update fetch_game docstring to include ASR check
- Move ASR filtering tests from inline mod to tests/sync.rs
- Remove old_asr_games_pollute_canonical_head (asserts own literals)
…inctlabs#857)

* ci: skip heavy checks on docs-only PRs

Remove `paths` filters from `pull_request` triggers and add a
lightweight `changes` job using dorny/paths-filter to detect code
changes. Heavy CI jobs (integration tests, e2e tests, lint) now
depend on `changes` and skip when only docs files are modified.

This fixes docs-only PRs being permanently blocked by required
status checks that never trigger. When a job is skipped via `if`
condition, GitHub reports it as "skipped" which satisfies the
required check in rulesets.

* ci: pin dorny/paths-filter to commit SHA

Pin dorny/paths-filter@v3 to de90cc6f to prevent supply-chain
attacks via tag mutation. Also pins existing usage in
e2e-sysgo-network-prover.yml for consistency.

* ci: fix review feedback — preserve non-PR runs and materialize matrix checks

P1: Add always() to lint and integration-tests job conditions so they
    run on push/merge_group even when the changes job is skipped.

P2: E2E matrix job now always runs on PRs (if: always()) so the matrix
    expands and individual check names are reported. On docs-only PRs,
    runs on ubuntu-latest and exits at step level. On code PRs, uses
    the original 64cpu runner.

* ci: fix fp-integration-tests-pr to always report check on docs-only PRs

Apply the same pattern as e2e-sysgo-tests-pr: always run the job on
PRs (so "Fault Proof (sync)" check is reported), use conditional
runs-on for cheap runner on docs-only PRs, and skip steps at step
level instead of job level.

* ci: default to running full job when changes detection fails

If the changes job fails (rather than reporting code=false), treat
it as a code change and run the full CI job. This prevents silently
skipping required checks on real code PRs when path detection errors.

Logic: skip only when changes.result == 'success' AND code != 'true'.
Otherwise (changes failed, skipped, or code == 'true'), run full job.

* ci: add missing paths to change detection filters

- e2e: add contracts/** (e2e tests build and use contracts)
- lint: add bindings/** (cargo check covers bindings workspace)
- all: add rust-toolchain.toml (toolchain changes affect all cargo commands)
Add Swatinem/rust-cache@v2 to the lint CI job. Uses a dedicated
shared-key ("lint") to avoid cross-contamination with stable-toolchain
caches used by other workflows. cache-on-failure is enabled so the
dependency cache is preserved even when clippy or fmt finds issues.

Reduces lint CI time from ~15m to ~5m.

Co-authored-by: Piers Powlesland <pierspowlesland@gmail.com>
…tlabs#845)

* docs: add migration guide from optimistic to ZK fault proofs

Add a step-by-step guide for migrating an existing OP Stack chain from
optimistic fault proofs (e.g., Cannon) to ZK fault proofs using OP
Succinct Lite (game type 42). Covers contract deployment, game type
activation, old system wind-down, withdrawal safety, and rollback.

* docs: extend migration guide to cover both validity and fault proof modes

Move migration.md from fault_proofs/ to advanced/ since it now covers
both OP Succinct (validity, game type 6) and OP Succinct Lite (fault
proofs, game type 42). Phase 1 branches by mode with mode-specific
contract deployment steps; Phases 2-3 and withdrawal safety are shared.

* docs: fix review issues in migration guide

- Fix critical: replace `just deploy-dispute-game-factory` (deploys new
  factory) with manual setImplementation on existing factory for validity path
- Fix broken anchor: #safe-db-configuration → #safedb-configuration
- Trim redundant wasRespectedGameTypeWhenCreated re-explanation in rollback
- Note factory cleanup behavior on rollback

* docs: address Farhad's review feedback on migration guide

1. Scope wasRespectedGameTypeWhenCreated admonish to fault proof mode
   only — OPSuccinctDisputeGame hardcodes the flag to true, so proposer
   start ordering doesn't affect game validity in validity mode.

2. Move migration.md from advanced/ to top-level — migration is a core
   operational workflow for existing customers, not an advanced topic.

3. Update Withdrawal Safety section to explain the behavioral difference
   between validity mode (flag always true) and fault proof mode
   (flag dynamically set based on respected type at creation time).

* chore: trigger check re-evaluation
…tlabs#846)

* fix(fault-proof): remove orphan check from backup validation

The backup orphan check rejects any backup where a game's parent_index
doesn't exist in the backup. This conflicts with anchor-based fetching
(which only keeps recent games) and ASR filtering (succinctlabs#844, which skips
old ASR games) — both produce partial DAGs with legitimate orphans.

Remove the check. The remaining validations (cursor/games consistency,
anchor game existence) are sufficient for detecting real corruption.
Backup writes are atomic (tempfile + rename + fsync), so partial writes
cannot produce orphans.

Fixes succinctlabs#810

* style: condense doc comment, restore inline comments

* style: rustfmt doc comment line wrap
The `tibdex/backport` CI workflow has been non-functional: the label
`backport/v3.x` makes it try `git switch v3.x`, but the branch is
`release/v3.x`. Even with a fixed label, the action fails on any
cherry-pick conflict (ELF binaries, Cargo.lock, diverged code), which
covers most real backports.

Replacing with the `/backport-op-succinct` Claude Code skill, which
handles conflicts intelligently, rebuilds ELFs from the backport
branch, and verifies equivalence against the source PR.
…ccinctlabs#828)

The 64Gi recommendation was based on weight sums (Controller=16 + PlonkWrap=32 = 48)
but didn't account for actual peak memory overhead during co-scheduling:
- Jemalloc fragmentation (~3-5 GB)
- Circuit artifact loading (groth16 + plonk, ~3 GB)
- Splicing worker overhead during large CONTROLLER tasks
- OS/runtime overhead

Actual peak memory during CONTROLLER + PlonkWrap co-scheduling reaches ~60-70 GB,
causing OOMKill with a 64Gi limit. The 96Gi recommendation provides sufficient headroom.
…labs#860)

Previously, hardcoded noise-suppression directives were added AFTER
RUST_LOG parsing, so they always won — making it impossible to debug
specific modules (e.g. RUST_LOG=single_hint_handler=debug had no
effect).

Flip the order: apply defaults first, then layer RUST_LOG directives on
top. Matching targets get overridden; everything else stays suppressed.

Based on succinctlabs#770 by @kien-rise.

Co-authored-by: kien-rise <157339831+kien-rise@users.noreply.github.com>
This cron job has failed every single run since creation (100+
consecutive failures) due to Conduit RPC rate limiting (HTTP 429).
It hits a free-tier L2 node with 15 concurrent requests across 1000
blocks, which always exceeds the rate limit.

The PR-triggered Cost Estimator workflow (cost-estimator.yml) already
provides the same coverage at a smaller range (300 blocks) and passes
consistently.
…ost-Isthmus) (succinctlabs#858)

* perf(fault-proof): use header withdrawals_root for L2 storage root (post-Isthmus)

Skip the eth_getProof RPC call when the L2 block header already carries
the L2ToL1MessagePasser storage root in its withdrawals_root field.

This optimization only applies to post-Isthmus blocks. Pre-Isthmus
blocks (Canyon→Fjord) set withdrawals_root to the empty MPT root
(0x56e81f...), which is NOT the storage root — the code falls back to
eth_getProof for those blocks.

Based on succinctlabs#771 by @kien-rise, with a fix for the Canyon→Isthmus case.
Ref: https://specs.optimism.io/protocol/isthmus/exec-engine.html

* perf: use header withdrawals_root for L2 storage root (post-Isthmus)

Skip eth_getProof RPC calls when the L2 block header already carries
the L2ToL1MessagePasser storage root in its withdrawals_root field.

This optimization only applies to post-Isthmus blocks. Pre-Isthmus
blocks (Canyon→Fjord) set withdrawals_root to the empty MPT root
(EMPTY_ROOT_HASH), which is NOT the storage root — the code falls back
to eth_getProof for those blocks.

Applied to both code paths:
- fault-proof: compute_output_root_at_block (proposer/challenger)
- host-utils: fetcher (validity proof)

Based on succinctlabs#771 by @kien-rise, with a fix for the Canyon→Isthmus case.
Ref: https://specs.optimism.io/protocol/isthmus/exec-engine.html

Co-authored-by: kien-rise <157339831+kien-rise@users.noreply.github.com>

---------

Co-authored-by: kien-rise <157339831+kien-rise@users.noreply.github.com>
…uccinctlabs#861)

* fix(test): fix race condition in proposer recovery integration test

The test checked for a game with the expected parent index only once
after the first new game appeared. If the proposer hadn't completed
recovery by that point (common with eigenda DA layer due to slower
sync cycles), the scan missed the correct game.

Replace the single-pass scan with a polling loop that continuously
checks newly created games until the expected parent is found or
WAIT_TIMEOUT (60s) expires.

* fix: remove unused mut

* refactor: labeled break, conditional sleep, remove dead variable
…uccinctlabs#864)

The test verified proposer recovery AFTER bond claims, which could
advance the anchor to game 1. When canonical head == anchor, the
proposer uses u32::MAX (anchor path) instead of parent_index=1 —
a valid but non-deterministic outcome depending on DA layer speed.

Move recovery check to BEFORE finalization (new Phase 4 → Phase 5
ordering). With anchor still at game 0, canonical head = game 1 is
guaranteed, making parent_index=1 deterministic across all DA layers.

Root cause: succinctlabs#839 added the anchor path logic but didn't update this
test, which was written in succinctlabs#719 (Dec 2025) before anchor paths existed.
…bs#869)

* fix(scripts): honor explicit --batch-size and reject zero

Since succinctlabs#820, `effective_batch_size()` unconditionally returned `end - start`
whenever both `--start` and `--end` were provided, silently ignoring any
explicit `--batch-size`. This broke the documented cost-estimator example
and the recommended workaround for split-aware PGU estimation
(`--start X --end Y --batch-size N`).

Change `batch_size` to `Option<u64>` and rewrite `effective_batch_size()`
with explicit precedence: explicit flag > full-range DX default > fallback.
The PR succinctlabs#820 convenience (single batch when only a range is given) is
preserved when `--batch-size` is omitted.

Additionally reject `--batch-size 0` at the clap parse layer so the range
splitter cannot receive a degenerate `max_range_size = 0` via the CLI.

Covered by a decision-table of unit tests plus parse-layer tests for the
zero-rejection and precedence behavior.

* refactor(scripts): encode --batch-size non-zero invariant with NonZeroU64

Addresses review feedback: `HostExecutorArgs` is a public struct in a
library crate, so parser-only validation is insufficient — any in-tree
or downstream code that constructs the struct programmatically could
still set `Some(0)` and slip through to the splitter.

Move the invariant into the type: change `batch_size` from
`Option<u64>` to `Option<NonZeroU64>`. The clap value parser now
returns `NonZeroU64`, keeping the user-facing error message
("--batch-size must be greater than 0") unchanged. `effective_batch_size`
calls `.get()` on the inner value; precedence behavior is preserved
exactly.

The existing 10 tests are retained; the test helper `args()` keeps its
`Option<u64>` signature and converts via `NonZeroU64::new(..).expect(..)`
so test call sites stay readable and any accidental zero in a test
fixture panics clearly.
…ble (succinctlabs#874)

* feat(fault-proof,validity): make L1 tx confirmation timeout configurable

The signer's hardcoded TIMEOUT_SECONDS=60 is too tight for 3 L1
confirmations on Ethereum mainnet (~36s of pure block time leaves
only ~24s for mempool inclusion). Under any congestion the watcher
gives up while the tx is still in flight, returns an error, and the
caller immediately retries - racing the original tx and, in the
fault-proof proposer's case, producing duplicate sibling games when
the original eventually lands.

Plumb a new `send_transaction_request_with_timeout` through the
signer and add `tx_confirmation_timeout` (env: TX_CONFIRMATION_TIMEOUT)
to all three L1-submitting clients:

- fault-proof proposer (ProposerConfig)
- fault-proof challenger (ChallengerConfig)
- validity proposer (RequesterConfig / EnvironmentConfig)

The default stays at 60s to preserve the historical signer behavior;
operators on congested L1s can raise it (e.g. 180) without a code
change.

Co-authored-by: seolaoh <osa8361@gmail.com>

* docs: add TX_CONFIRMATION_TIMEOUT to env var docs

Document the new TX_CONFIRMATION_TIMEOUT environment variable in
the fault-proof proposer, fault-proof challenger, and validity
proposer documentation pages.

* fix(docs): update Kona link to new monorepo location

The kona project moved from op-rs.github.io/kona to the
ethereum-optimism/optimism monorepo, breaking the CI link checker.

---------

Co-authored-by: seolaoh <osa8361@gmail.com>
* chore: bump sp1 to v6.1.0

Bump SP1 from 6.0.2 to 6.1.0 across all workspace deps, Dockerfiles,
CI workflows, justfile, build utils, and documentation.

Also bump sp1-cluster to a fork with CoreExecute task type support
required for sp1-prover-types 6.1.0 compatibility.

* chore: rebuild ELFs with SP1 v6.1.0 toolchain

* ci: exclude flaky external URLs from link checker

caldera.xyz and op-rs.github.io/kona intermittently fail TLS/connection
checks in CI. Exclude them to prevent false-positive failures.

* fix(docs): update kona link to new optimism monorepo location

op-rs/kona was archived and moved to ethereum-optimism/optimism.
The old GitHub Pages URL is permanently dead.

* chore: switch sp1-cluster to official v2.1.0 release

Replace fork (fakedev9999/sp1-cluster) with succinctlabs/sp1-cluster
v2.1.0 which includes CoreExecute task type support.

Add [patch."https://github.com/succinctlabs/sp1"] to redirect
sp1-cluster's git SP1 deps to crates.io, avoiding duplicate types.

* chore: switch sp1-cluster to official v2.1.0 release

v2.1.2 uses crates.io SP1 deps, so the [patch] workaround for
git/crates.io type duplication is no longer needed.

* chore: update hokulea to SP1 v6.1.0 (fork rev)

Point hokulea deps to fakedev9999/hokulea@af7f05e which has SP1 6.1.0
without the alloy-consensus =1.6.3 exact pin from hokulea master.
Will switch to official tag once Layr-Labs cuts a release.

* chore: update hokulea rev to 7afecb4 (doc fix)

* chore: rebuild ELFs with SP1 v6.1.0 (latest hokulea rev)

* feat: bump kona to v1.2.13 (ethereum-optimism/optimism monorepo)

Major ecosystem upgrade to align with hokulea master:
- kona: op-rs/kona v1.2.7 → ethereum-optimism/optimism v1.2.13
- alloy: 1.4.x → 1.6.3, alloy-primitives 1.4.1 → 1.5.6
- revm: 33.1.0 → 34.0.0, op-revm 14.1.0 → 15.0.0
- op-alloy: 0.22.4 → 0.23.1 (from optimism monorepo)
- hokulea: fork rev → Layr-Labs/hokulea master (22c48b9)
- hana: fork rev with kona v1.2.13 bump (30d72ed)
- Removed reth patch (PartialEq fix upstreamed in reth v1.11.3)

Code changes:
- BlobProvider: IndexedBlobHash → B256
- ZkvmOpEvmFactory: updated for alloy-op-evm 0.26.x API
- Pipeline cursor: added agreed_l2_output_root arg

* chore: rebuild ELFs with kona v1.2.13 + SP1 v6.1.0

* chore: switch hana from fork rev to official v1.6.0-mocha tag

* chore: remove stale comments from Cargo.toml

* chore: remove unnecessary kona op-rs/kona patch redirect

* chore: rebuild celestia ELF with official hana v1.6.0-mocha tag

* chore: switch hokulea from rev to official hokulea-client/v1.1.7 tag

* chore: rebuild ELFs with official hokulea-client/v1.1.7 tag
…ccinctlabs#878)

Update both sp1-contracts submodules (tests/ and contracts/lib/) to
the v6.1.0 tag. Update DeployVerifier.s.sol imports from v6.0.0 to
v6.1.0 to fix WrongVerifierSelector in e2e network prover tests.
* chore: release v4.3.0

* chore: bump elfs
…cy (succinctlabs#865)

* fix(fault-proof): pin L1 block during sync to prevent RPC inconsistency

With load-balanced RPCs, different eth_call requests within the same
sync cycle can hit backends at different block heights. This breaks
atomicity between related state reads (e.g. credit() vs anchorGame()),
which caused Celo's March 26 incident: canonical head cleared, ~470
orphaned games created.

Fix: fetch L1 block number once at the start of sync_state() and pass
it to all downstream eth_call requests via .block(pinned_block). All
state reads within a sync cycle now see a consistent L1 snapshot.

Pinned functions: sync_games, sync_anchor_game, fetch_game,
is_parent_resolved, is_parent_challenger_wins. The challenger passes
BlockId::latest() to maintain existing behavior (separate pinning can
be added later).

Addresses: #132
Tracking: succinctlabs/cloud-ops#98

* fix: add missing BlockId::latest() to fetch_game call in test

* refactor: collapse redundant RPC calls and cache anchor address

- Fetch L1 block once in sync_state (was: get_block_number + get_block_by_number separately)
- Pass timestamp from the pinned block, eliminating a second RPC call in sync_games
- Reuse anchor_address from top of sync_games instead of re-fetching in DEFENDER_WINS loop
- Remove pinned_block.as_u64().unwrap() (latent panic if BlockId variant changes)

* feat(fault-proof): add confirmed block offset and skip-when-unchanged

Two improvements from @palango's review:

1. SYNC_L1_CONFIRMATIONS env var (default: 0): pin reads to
   latest - N blocks for deployments with unstable RPC backends.
   When 0, uses latest directly (no extra RPC call).

2. Skip sync when the pinned L1 block hasn't advanced past the
   last successfully synced block. Avoids redundant RPC calls
   when the sync loop runs faster than L1 block production.

* simplify: single RPC for confirmed block, remove dead unwrap_or

* fix: add sync_l1_confirmations to test ProposerConfig

* fix: single-response snapshot, prune future games, clear stale anchor

P1: Pin from single RPC response.
- confirmations=0: get_block_by_number(Latest) — one call, no split-backend risk.
- confirmations>0: Latest first (for skip check), then confirmed fetch. If
  confirmed block is unavailable (Ok(None)), skip cycle. Err propagated.

P2: Prune backup-restored games above pinned latest index.
- Runs before the early-return on latest_index==None, so all cached future
  games are removed even when the pinned block has no games.
- sync_anchor_game now clears state.anchor_game when the pinned anchor
  is Address::ZERO or not found in cache, preventing compute_canonical_head
  from following a stale subtree.

* test: add backup+prune regression tests, cloud-ops follow-ups

Two integration tests for the prune behavior:
1. test_backup_prune_future_games_with_confirmations: restart from
   backup with large SYNC_L1_CONFIRMATIONS, verify all future games
   are pruned, anchor/canonical head cleared.
2. test_backup_restore_pinned_block_no_games: same but pinned block
   has zero games (latest_index=None), verify full state clear.

Added new_proposer_with_confirmations helper to allow tests to
override sync_l1_confirmations.

Follow-up validation tracked in:
- cloud-ops#102: SYNC_L1_CONFIRMATIONS > 0 on load-balanced RPCs
- cloud-ops#103: skip/retry when confirmed block unavailable

* test: replace full-prune test with deterministic partial-prune

Replace the first backup test (which used huge confirmations to prune
all games — same scenario as the zero-games test) with a proper
partial-prune test:

1. Create game 0, mine an empty L1 block, create game 1.
2. Sync and save backup with confirmations=0 (both games cached).
3. Restart from backup with confirmations=1 — pinned block is between
   game 0 and game 1.
4. Assert: game 0 survives, game 1 pruned, canonical head is game 0.

Also added new_proposer_with_options helper to TestEnvironment for
cleaner test setup.

* fix(test): guarantee backup restore before prune assertions

Both tests now:
1. Construct and save backup via ProposerBackup::new() + save() directly
   (not relying on the proposer run loop to trigger backup).
2. Assert games are restored from backup BEFORE sync_state() prunes them,
   proving the prune path is actually exercised.

Partial-prune test: backup has 2 games, sync with confirmations=1 prunes
game 1 (created after the gap block) while game 0 survives.

Zero-games test: backup has 1 game, sync with huge confirmations prunes
everything.

* fix(test): use real Game structs from proposer cache in backup tests

* fix: reset cursor when pinned snapshot has no games

* test: add rediscovery regression test for cursor reset after prune

New test: test_backup_prune_then_rediscover_after_confirmation

1. Create game 0, save real backup.
2. Restart with confirmations=1 — pinned block is before game 0.
3. First sync: game pruned, cache empty, cursor reset.
4. Mine one empty block — pinned block now includes game 0.
5. Second sync: game 0 rediscovered, canonical head restored.

Directly tests the cursor reset fix: without it, step 5 would skip
the fetch loop and game 0 would stay missing permanently.

* feat(metrics): add canonical_head_index and anchor_game_index gauges

New metrics:
- op_succinct_fp_canonical_head_game_index: -1 when cleared, else game index
- op_succinct_fp_anchor_game_index: -1 when cleared, else game index

Also fixes stale LatestGameL2BlockNumber and AnchorGameL2BlockNumber:
previously these were never reset when canonical head was cleared (only
a warning was logged). Now they are set to 0.0 in the None branch.

Sentinel: -1 for "absent/cleared" since game index 0 is a valid value.

No regression test added — fetch_proposer_metrics is private and
requires a full provider stack + metrics recorder. The new code is
4 lines of unconditional .set() calls with no branching risk.

* fix: clear canonical_head_l2_block when canonical head is None

* fix: reset FinalizedL2BlockNumber when canonical head cleared

* fix(fault-proof): prevent duplicate game creation when pinned cache lags

Bug: With sync_l1_confirmations > 0, the pinned cache may not see a
recently created game for several sync cycles. During that window,
should_create_game() computes the same (next_l2_block, parent_index),
and handle_game_creation() bumps only the L2 block (keeping the same
parent), creating duplicate sibling games.

Fix: Track last_created_game_l2_block in memory. should_create_game()
skips creation when the proposed L2 block <= last created. The guard
is cleared in sync_state() once canonical_head_l2_block catches up.

Test: test_handle_game_creation_bumps_past_existing verifies the
while-loop bump behavior (existing game at target → bumps to target+1).

* fix: stale finalized metric, restart-safe creation guard, proper guard test

1. FinalizedL2BlockNumber stale on None:
   Reset to 0.0 when get_finalized_l2_block_number returns None (e.g.
   Celestia host). Previously kept the old value indefinitely.

2. Duplicate-creation guard not restart-safe:
   Initialize last_created_game_l2_block from the highest L2 block in
   restored backup games. Without this, a restart before pinned sync
   catches up loses the guard and allows duplicate sibling creation.

3. Regression test exercises should_create_game:
   Replaced handle_game_creation-only test with one that calls
   should_create_game() directly (now pub(crate)). Test verifies:
   - Guard blocks creation when cache is stale after handle_game_creation
   - Guard clears after sync catches up
   Previous test only verified the while-loop bump, not the guard.

* fix: scope creation guard to owned games, isolate guard in test

P2: Restart guard over-blocks on foreign games.
  Before: seeded from max(all_restored_games.l2_block), including foreign
  and orphan games. A foreign game with a higher L2 block could permanently
  block creation.
  After: seeded from max(owned_restored_games.l2_block), filtered by
  is_owned() (vkey match). Foreign games can't suppress creation.

P3: Test doesn't prove the guard is the cause.
  Before: should_create_game() could return false from finalization gate,
  not just the guard.
  After: calls should_create_game() before AND after setting the guard
  with the same stale cache state. If baseline returns true and post-guard
  returns false, the guard is proven to be the only cause. If baseline
  returns false (finalization blocks), the test skips gracefully.

* fix: persist creation guard in backup, verify full guard lifecycle in test

P2: Guard over-blocks on non-canonical owned games.
  Before: seeded from max(owned_games.l2_block) which includes orphan
  branches and old siblings — can permanently block creation.
  After: last_created_game_l2_block is persisted directly in the backup
  file (#[serde(default)] for backward compat). On restore, the exact
  guard value is used. On save, the current guard value is written.
  No more derivation from game cache contents.

P3: Test doesn't verify guard clears after sync.
  After: test now calls should_create_game() a third time after sync
  and asserts it matches the baseline (true). This proves the full
  lifecycle: set on creation → blocks while stale → clears after sync.
  The test still skips gracefully if finalization gate blocks baseline.

Files: backup.rs (new field), proposer.rs (save/restore/try_restore),
       sync.rs (test lifecycle assertion).

* fix: schema guard, crash-window backup, deterministic test precondition

P1: Update backup_schema_guard test to include last_created_game_l2_block
    in the expected key list.

P2: Trigger backup() immediately after handle_game_creation succeeds in
    the creation task. Closes the crash window between game creation and
    the next periodic backup — without this, a crash after creation but
    before backup would lose the duplicate-creation guard.

P3: Replace `return Ok(())` skip with `assert!(should_create_before)`
    hard failure. In mock mode + anvil, finalization should not block,
    so this precondition should always hold. If it doesn't, the test
    fails loudly instead of silently passing with zero guard coverage.

* fix: make should_create_game pub (tests are a separate crate)

* fix: remove unused OPSuccinctFaultDisputeGame import

* fix: preserve anchor baseline in canonical_head_l2_block

Revert clearing canonical_head_l2_block when canonical head is None.
This field is seeded from the anchor root in init_state() and serves as
the baseline for should_create_game() to propose the first game
(parent_game_index = u32::MAX). Clearing it blocks initial proposals on
fresh deployments or when the pinned snapshot has no games.

canonical_head_index is still cleared to None. The new index gauge (-1)
provides observability for the "no head" state without breaking the
creation path.

* docs: document fault proof proposer head metrics

* refactor: simplify anchor check, remove dead metric reset branch

- @piersy: simplify anchor lookup to HashMap::contains_key(&index)
  (O(1) instead of O(n) linear scan over values)
- @seolaoh: remove dead metric reset branch. canonical_head_l2_block
  is no longer cleared (preserves anchor baseline), so the else branch
  is unreachable under normal operation. Kept defensive warn log.

* fix: remove catch-up guard clear, add CHALLENGER_WINS guard reset

Per @piersy's review:

1. Remove explicit guard clear in sync_state(). The guard doesn't need
   to be cleared when canonical head catches up — should_create_game()
   naturally bypasses it because next_l2_block = head + interval >
   last_created. Clearing it was actively harmful: the cleared value (0)
   gets persisted in the next backup, and a restart with a lower pinned
   block would lose the protection.

2. Reset guard when CHALLENGER_WINS removes the tracked game's subtree.
   Without this, the guard would permanently block creation after a
   legitimate branch switch. The guard is only cleared if the subtree
   contains a game at the exact L2 block being tracked.

* fix: track created game by index for precise CHALLENGER_WINS guard reset

The CHALLENGER_WINS subtree removal was matching by L2 block number, but
multiple games can share the same L2 block. This could falsely clear the
guard when a different game at the same height was removed.

Now tracks both l2_block (for should_create_game check) and factory index
(for CHALLENGER_WINS reset). The index uniquely identifies the game, so
the guard is only cleared when the exact created game is in the removed
subtree.

Both values are persisted in backup (last_created_game_index with
#[serde(default)] = u64::MAX for old backups).

* refactor: use game address instead of index for creation guard identity

Per review feedback: gameCount()-1 to derive the index is race-prone
if other games are created in the same block. Game address is available
directly from the DisputeGameCreated receipt event, is globally unique,
and requires no extra RPC call.

Changes:
- last_created_game_index → last_created_game_address (Arc<Mutex<Address>>)
- CHALLENGER_WINS subtree removal now matches by address
- Backup field: last_created_game_index → last_created_game_address
  (#[serde(default)] = Address::ZERO for old backups)
- Removed the gameCount() call after creation

* test: add CHALLENGER_WINS guard invalidation regression test

New test: test_challenger_wins_clears_creation_guard

1. Create game 0 (anchor), sync. Verify baseline allows creation.
2. Create game 1 via handle_game_creation (sets guard). Verify blocked.
3. Sync — game 1 enters cache.
4. Challenge game 1, warp, resolve as CHALLENGER_WINS.
5. Sync — RemoveSubtree fires, guard cleared (address match).
6. should_create_game returns true — creation unblocked.

This test would fail without the address-based invalidation in
RemoveSubtree: the guard would remain set at game 1's L2 block,
and after CHALLENGER_WINS reverts canonical head to game 0,
should_create_game would compute the same next_l2_block <= guard
and return false permanently.

* chore: fix stale comment (index→address), remove race-prone gameCount log

* fix(test): resolve parent game before child in CHALLENGER_WINS test
…al head (succinctlabs#736)

* fix: allow games with parent < anchor become canonical

* test: canonical head override logic

* fix: merge cases into test_sync_state_happy_paths

* fix: update compute_canonical_head comment
@seolaoh seolaoh changed the title Merge upstream v4.3.0 Merge upstream v4.3.0 + proposer L1 block pinned sync Apr 27, 2026
@seolaoh seolaoh marked this pull request as ready for review April 27, 2026 16:06
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ecabaa53d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread fault-proof/src/proposer.rs
Distinguish the two skip cases: log at WARN when confirmed_number
moves backwards (load-balanced RPC backend regression or deep L1
reorg) so operators can detect unhealthy backends, and keep DEBUG
for the normal equal case where L1 simply hasn't ticked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fbfcc771cb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread fault-proof/src/proposer.rs
Comment thread fault-proof/src/proposer.rs
@palango
Copy link
Copy Markdown

palango commented Apr 28, 2026

@seolaoh for our config (cc @ezdac):
Do we want to set SYNC_L1_CONFIRMATIONS once this rolls out or are we happy with the current state of things?

Comment thread Cargo.toml
reth-trie-common = { git = "https://github.com/succinctlabs/reth", branch = "v1.9.3-kona" }

[patch."https://github.com/op-rs/kona"]
kona-mpt = { git = "https://github.com/celo-org/kona", rev = "cf790b7" }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe to remove? Doesn't this pull in the sequencer drift change we need?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, with removing these patches, we don't have our max_sequencer_drift override anymore. celo-kona already removed this patch with the anticipation that future kona bump will pull in the configurable max_sequencer_drift. But we'd like to use this version right now, I'll re-add the patch when applying kona security fix in the next PR.

@seolaoh
Copy link
Copy Markdown
Collaborator Author

seolaoh commented Apr 28, 2026

@seolaoh for our config (cc @ezdac): Do we want to set SYNC_L1_CONFIRMATIONS once this rolls out or are we happy with the current state of things?

Yes, it'd be better to set it when using LB L1_RPC since we can get different responses in each check even with pinned_block (e.g. the block is not existing in another backend).

The max difference among mainnet proxyd backends over the last 24 hours was 4 blocks, occurring around 2026-04-27T15:35 KST. Typically the spread stays at 0–1 block, averaging ~0.77. All 4 backends are well synchronized overall. So maybe setting it to 5-10 would be a good start. The downside is that we'll see ~60-120 seconds of staleness on L1 reads.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d994624f49

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread fault-proof/src/proposer.rs Outdated
When sync_games prunes future games (above pinned_latest_index) due to
abnormal cache states like backup restore into a shorter chain or deep
L1 reorg, the duplicate-creation guard could point at a game that no
longer exists on chain. Without resetting, should_create_game blocks
indefinitely because canonical_head_l2_block cannot advance through an
orphaned game.

Reset is gated on the guarded address being among the entries this
prune actually removes (evaluated before the removal loop). Checking
"absent from post-prune cache" would over-clear in the case where the
just-created game has not yet been added to the cache and an unrelated
prune fires, allowing should_create_game to re-submit a duplicate at
the same L2 block before the cache catches up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0019a89fe1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread fault-proof/src/proposer.rs
@palango
Copy link
Copy Markdown

palango commented Apr 28, 2026

The max difference among mainnet proxyd backends over the last 24 hours was 4 blocks, occurring around 2026-04-27T15:35 KST. Typically the spread stays at 0–1 block, averaging ~0.77. All 4 backends are well synchronized overall. So maybe setting it to 5-10 would be a good start. The downside is that we'll see ~60-120 seconds of staleness on L1 reads.

You just mean that the proposer will take longer to react to updates, right? or is there anything else changing?

…ve/claim

With sync_l1_confirmations > 0, the pinned cache lags behind the chain
tip by sync_l1_confirmations × block_time, so a recently confirmed
prove(), resolve(), or claimCredit() tx may not yet be reflected in
should_attempt_* flags. Without a pre-flight check, the proposer would
re-submit duplicate transactions that revert on chain — wasting gas for
resolve/claim, and re-running expensive proof generation for prove.

Each path now does one eth_call at `latest` before submission:
- resolve_games: skip if GameStatus != IN_PROGRESS
- claim_bonds: skip if credit(signer) == 0
- should_skip_proving: skip if ProposalStatus is *ValidProofProvided or
  Resolved (single check covers both already-proven and timeout
  default-loss cases since Resolved is set whenever GameStatus moves
  out of IN_PROGRESS)

On RPC failure the check logs a warn and proceeds, so transient backend
issues don't block legitimate work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@seolaoh
Copy link
Copy Markdown
Collaborator Author

seolaoh commented Apr 28, 2026

You just mean that the proposer will take longer to react to updates, right? or is there anything else changing?

I found that with increasing sync_l1_confirmation makes the proposer see the lagged state of L1 and re-do already done tasks except for game creation which has the guard with last_created_game_*. So I added a safety guard right before submitting resolve, claim tx and starting proving, which will check against the latest L1 state. Also might be optimal to set sync_l1_confirmation to 5, considering the delay of everything.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d76ec6ec9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread fault-proof/src/proposer.rs
Comment thread fault-proof/src/proposer.rs
When SafeDB is active, cost-estimator splits the requested range at every
span batch boundary via split_range_based_on_safe_heads, producing one
zkVM execution per span batch regardless of --batch-size. That mirrors a
hypothetical "split each proposal at span batch boundaries" workload, not
what the proposer actually does (RANGE_SPLIT_COUNT-driven arithmetic split,
default 1 = single execution per proposal interval).

The new --no-safe-head-split flag forces split_range_basic so the range
is partitioned solely by --batch-size, giving a closer estimate of the
per-segment cost the proposer sees on the prover network.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@seolaoh seolaoh merged commit b07ccb8 into develop May 4, 2026
16 checks passed
@seolaoh seolaoh deleted the seolaoh/merge-v4.3.0 branch May 4, 2026 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pin L1 block number during proposer sync loop to prevent RPC inconsistency

4 participants