Merge upstream v4.3.0 + proposer L1 block pinned sync#138
Conversation
…inctlabs#829) * fix: graceful fallback when eth_getBlockReceipts is unavailable Restricted RPC endpoints return 403 for eth_getBlockReceipts, crashing get_l2_block_data_range and losing expensive SP1 execution results. - fetcher: wrap get_block_receipts in match, return zero fees on failure - stats: guard empty block_data and prevent division by zero - cost_estimator: same division-by-zero guard in aggregate_execution_stats * style: fix rustfmt formatting in fetcher.rs match expression * fix: log actual error in eth_getBlockReceipts fallback Address review feedback: split Ok(None) | Err(_) into separate match arms so the error is logged via `error = %e` when the RPC call fails.
…cinctlabs#843) * ci: remove PR triggers from heavy CI workflows to reduce costs E2E sysgo tests and integration tests were running on every PR push, costing ~$2.5k/mo with 9+ concurrent 64-core runners. Move these to merge_group + main push only, with /sysgo and /integration comment triggers for on-demand runs during development. Refs: succinctlabs/cloud-ops#69 * ci: restore PR triggers with reduced matrix per review feedback - PR: run reduced e2e matrix (validity/bootstrap + faultproof/bootstrap) with path filters for fast feedback - PR: run reduced integration matrix (sync only) with path filters - merge_group/main/tags: full matrix as before - /sysgo and /integration comment triggers: full matrix on-demand - Long-running tests: merge_group/main/sysgo only (not on PR)
…tlabs#839) * fix: prevent duplicate game creation via anchor parent index When a finalized game becomes the anchor, `anchors(GAME_TYPE)` returns the same data as looking up the game by its factory index. This allows creating two games with the same (rootClaim, l2BlockNumber, startingOutputRoot) but different parentIndex values, bypassing the factory's UUID uniqueness check. Fix: use `getStartingAnchorRoot()` instead of `anchors()` in the `parentIndex == uint32.max` branch, so it always resolves to the genesis anchor root rather than the current anchor. Bumps optimism submodule from op-contracts/v5.0.0 to v6.0.0 to pick up `getStartingAnchorRoot()` (Optimism PR #18343). Refs: succinctlabs/cloud-ops#56 Co-authored-by: Roger Bai <roger-bai-coinbase@users.noreply.github.com> * refactor: extract _finalizeAndClose helper and improve test naming - Extract duplicated warp→resolve→warp→closeGame into _finalizeAndClose - Rename testNoDuplicateGameWithAnchorParent to testDifferentStartingRootForIndexVsGenesisParent - Add assertion that byIndex and byGenesis are distinct contracts * fix: require parent game ahead of anchor and use current anchor for genesis path - Parent game's L2 block must be strictly ahead of the anchor state, preventing duplicate games when anchor == parent. - uint32.max path uses getAnchorRoot() (current anchor) instead of getStartingAnchorRoot() (genesis), enabling recovery after game type switches without resetting to genesis. * style: fix forge fmt * refactor: remove redundant IAnchorStateRegistry casts * test: add edge case coverage for anchor-aware parent validation - testCannotCreateGenesisPathBelowAnchor: uint32.max with l2Block <= anchor - testChainFromAnchorPathGame: chain child via parent index from uint32.max game - testGenesisPathUsesLatestAnchorAfterMultipleAdvances: multiple anchor updates - testParentOneBlockAheadOfAnchorSucceeds: boundary condition (anchor + 1) * test: fix wrong comment and remove misleading boundary test * test: remove redundant testCannotUseAnchorGameAsParentIndex * fix: proposer uses anchor path when canonical head is the anchor game The contract now requires parent.l2SeqNum > anchor.l2SeqNum. When the canonical head IS the anchor game (e.g., after first game finalizes), the proposer must use uint32.max (anchor path) instead of referencing it by index. Also reorders contract checks so the cheap CHALLENGER_WINS SLOAD runs before the getAnchorRoot external call. * test: fix CI failures from parent-ahead-of-anchor constraint - Fix clippy: map_or → direct Option comparison (proposer.rs:1937) - Fix integration test: accept u32::MAX as valid parent when canonical head is the anchor game (test_game_chain_validation_anchor_reset) - Fix sync tests: create all games before setting anchors, since the contract now rejects parent.l2SeqNum <= anchor.l2SeqNum * style: rename dl to deadline in _finalizeAndClose helper * style: fix rustfmt operator placement * fix: restore time gaps between game creations in deadline lag test The two-phase refactor (create all games, then set anchors) removed the time warps that originally occurred between game creations. This caused all games to have nearly identical deadlines, preventing the MAX_GAME_DEADLINE_LAG filter from dropping stale games during sync. Add explicit time warps between game creations in phase 1 to replicate the original timing gaps while keeping the two-phase structure. * style: fix rustfmt line wrap --------- Co-authored-by: Roger Bai <roger-bai-coinbase@users.noreply.github.com>
…cinctlabs#842) * feat: cache-first rollup config loading for hardfork transitions When a node is upgraded before hardfork activation, the node RPC returns post-hardfork rollup config prematurely. The proposer then computes a wrong rollup_config_hash, creating invalid dispute games. Change fetch_and_save_rollup_config to check for a cached config file before fetching from RPC. On first run the config is fetched and cached; on subsequent runs the cached file is reused. A best-effort RPC comparison (5s timeout) warns operators when configs diverge, signaling that the cache should be refreshed after hardfork activation. Closes succinctlabs/cloud-ops#67 * docs: remove manual cache deletion instructions The proposer's on-chain vkey check (on_chain_vkeys_match) already prevents game creation with a mismatched rollup_config_hash, so operators never need to manually delete the cached config file. The WARN log is kept for visibility but no longer prescribes action. * fix: validate chain ID after RPC fetch, clarify warn-only intent - Add chain ID validation after initial RPC fetch to catch L2_RPC / L2_NODE_RPC misconfiguration on first run (symmetric with cache path) - Add doc comment on compare_config_with_rpc clarifying that warn-only is intentional — on-chain vkey check is the authoritative gate * style: rustfmt * fix: use struct init syntax to satisfy clippy field_reassign_with_default * test: drop stdlib/serde tests per review
…s#844) * feat(fault-proof): filter games by anchor state registry in fetch_game During hardfork transitions, old ASR games in the proposer's local DAG can pollute canonical head selection, causing the proposer to create new games on top of invalid parents. Add an anchorStateRegistry() check in fetch_game() after the game type check and before other on-chain calls. Games with a different ASR are rejected with UnsupportedAnchorStateRegistry, keeping the DAG clean. Operators should run one proposer per ASR during transitions: the old proposer winds down existing games while the new one creates fresh ones. Upstreams the approach from #118. Closes succinctlabs/cloud-ops#68 * test(fault-proof): add ASR filtering tests for canonical head selection Three tests that exercise ProposerState.descendants_of() to verify the DAG behavior that motivates the ASR filter: - old_asr_games_pollute_canonical_head: demonstrates the bug where old ASR games with higher L2 blocks override the canonical head - filtered_dag_produces_correct_canonical_head: verifies clean DAG after filtering produces correct head selection - descendants_excludes_unrelated_chains: verifies DAG traversal correctly isolates independent game chains * test: trim test comments * refactor: address review feedback - Update fetch_game docstring to include ASR check - Move ASR filtering tests from inline mod to tests/sync.rs - Remove old_asr_games_pollute_canonical_head (asserts own literals)
…inctlabs#857) * ci: skip heavy checks on docs-only PRs Remove `paths` filters from `pull_request` triggers and add a lightweight `changes` job using dorny/paths-filter to detect code changes. Heavy CI jobs (integration tests, e2e tests, lint) now depend on `changes` and skip when only docs files are modified. This fixes docs-only PRs being permanently blocked by required status checks that never trigger. When a job is skipped via `if` condition, GitHub reports it as "skipped" which satisfies the required check in rulesets. * ci: pin dorny/paths-filter to commit SHA Pin dorny/paths-filter@v3 to de90cc6f to prevent supply-chain attacks via tag mutation. Also pins existing usage in e2e-sysgo-network-prover.yml for consistency. * ci: fix review feedback — preserve non-PR runs and materialize matrix checks P1: Add always() to lint and integration-tests job conditions so they run on push/merge_group even when the changes job is skipped. P2: E2E matrix job now always runs on PRs (if: always()) so the matrix expands and individual check names are reported. On docs-only PRs, runs on ubuntu-latest and exits at step level. On code PRs, uses the original 64cpu runner. * ci: fix fp-integration-tests-pr to always report check on docs-only PRs Apply the same pattern as e2e-sysgo-tests-pr: always run the job on PRs (so "Fault Proof (sync)" check is reported), use conditional runs-on for cheap runner on docs-only PRs, and skip steps at step level instead of job level. * ci: default to running full job when changes detection fails If the changes job fails (rather than reporting code=false), treat it as a code change and run the full CI job. This prevents silently skipping required checks on real code PRs when path detection errors. Logic: skip only when changes.result == 'success' AND code != 'true'. Otherwise (changes failed, skipped, or code == 'true'), run full job. * ci: add missing paths to change detection filters - e2e: add contracts/** (e2e tests build and use contracts) - lint: add bindings/** (cargo check covers bindings workspace) - all: add rust-toolchain.toml (toolchain changes affect all cargo commands)
Add Swatinem/rust-cache@v2 to the lint CI job. Uses a dedicated
shared-key ("lint") to avoid cross-contamination with stable-toolchain
caches used by other workflows. cache-on-failure is enabled so the
dependency cache is preserved even when clippy or fmt finds issues.
Reduces lint CI time from ~15m to ~5m.
Co-authored-by: Piers Powlesland <pierspowlesland@gmail.com>
…tlabs#845) * docs: add migration guide from optimistic to ZK fault proofs Add a step-by-step guide for migrating an existing OP Stack chain from optimistic fault proofs (e.g., Cannon) to ZK fault proofs using OP Succinct Lite (game type 42). Covers contract deployment, game type activation, old system wind-down, withdrawal safety, and rollback. * docs: extend migration guide to cover both validity and fault proof modes Move migration.md from fault_proofs/ to advanced/ since it now covers both OP Succinct (validity, game type 6) and OP Succinct Lite (fault proofs, game type 42). Phase 1 branches by mode with mode-specific contract deployment steps; Phases 2-3 and withdrawal safety are shared. * docs: fix review issues in migration guide - Fix critical: replace `just deploy-dispute-game-factory` (deploys new factory) with manual setImplementation on existing factory for validity path - Fix broken anchor: #safe-db-configuration → #safedb-configuration - Trim redundant wasRespectedGameTypeWhenCreated re-explanation in rollback - Note factory cleanup behavior on rollback * docs: address Farhad's review feedback on migration guide 1. Scope wasRespectedGameTypeWhenCreated admonish to fault proof mode only — OPSuccinctDisputeGame hardcodes the flag to true, so proposer start ordering doesn't affect game validity in validity mode. 2. Move migration.md from advanced/ to top-level — migration is a core operational workflow for existing customers, not an advanced topic. 3. Update Withdrawal Safety section to explain the behavioral difference between validity mode (flag always true) and fault proof mode (flag dynamically set based on respected type at creation time). * chore: trigger check re-evaluation
…tlabs#846) * fix(fault-proof): remove orphan check from backup validation The backup orphan check rejects any backup where a game's parent_index doesn't exist in the backup. This conflicts with anchor-based fetching (which only keeps recent games) and ASR filtering (succinctlabs#844, which skips old ASR games) — both produce partial DAGs with legitimate orphans. Remove the check. The remaining validations (cursor/games consistency, anchor game existence) are sufficient for detecting real corruption. Backup writes are atomic (tempfile + rename + fsync), so partial writes cannot produce orphans. Fixes succinctlabs#810 * style: condense doc comment, restore inline comments * style: rustfmt doc comment line wrap
The `tibdex/backport` CI workflow has been non-functional: the label `backport/v3.x` makes it try `git switch v3.x`, but the branch is `release/v3.x`. Even with a fixed label, the action fails on any cherry-pick conflict (ELF binaries, Cargo.lock, diverged code), which covers most real backports. Replacing with the `/backport-op-succinct` Claude Code skill, which handles conflicts intelligently, rebuilds ELFs from the backport branch, and verifies equivalence against the source PR.
…ccinctlabs#828) The 64Gi recommendation was based on weight sums (Controller=16 + PlonkWrap=32 = 48) but didn't account for actual peak memory overhead during co-scheduling: - Jemalloc fragmentation (~3-5 GB) - Circuit artifact loading (groth16 + plonk, ~3 GB) - Splicing worker overhead during large CONTROLLER tasks - OS/runtime overhead Actual peak memory during CONTROLLER + PlonkWrap co-scheduling reaches ~60-70 GB, causing OOMKill with a 64Gi limit. The 96Gi recommendation provides sufficient headroom.
…labs#860) Previously, hardcoded noise-suppression directives were added AFTER RUST_LOG parsing, so they always won — making it impossible to debug specific modules (e.g. RUST_LOG=single_hint_handler=debug had no effect). Flip the order: apply defaults first, then layer RUST_LOG directives on top. Matching targets get overridden; everything else stays suppressed. Based on succinctlabs#770 by @kien-rise. Co-authored-by: kien-rise <157339831+kien-rise@users.noreply.github.com>
This cron job has failed every single run since creation (100+ consecutive failures) due to Conduit RPC rate limiting (HTTP 429). It hits a free-tier L2 node with 15 concurrent requests across 1000 blocks, which always exceeds the rate limit. The PR-triggered Cost Estimator workflow (cost-estimator.yml) already provides the same coverage at a smaller range (300 blocks) and passes consistently.
…ost-Isthmus) (succinctlabs#858) * perf(fault-proof): use header withdrawals_root for L2 storage root (post-Isthmus) Skip the eth_getProof RPC call when the L2 block header already carries the L2ToL1MessagePasser storage root in its withdrawals_root field. This optimization only applies to post-Isthmus blocks. Pre-Isthmus blocks (Canyon→Fjord) set withdrawals_root to the empty MPT root (0x56e81f...), which is NOT the storage root — the code falls back to eth_getProof for those blocks. Based on succinctlabs#771 by @kien-rise, with a fix for the Canyon→Isthmus case. Ref: https://specs.optimism.io/protocol/isthmus/exec-engine.html * perf: use header withdrawals_root for L2 storage root (post-Isthmus) Skip eth_getProof RPC calls when the L2 block header already carries the L2ToL1MessagePasser storage root in its withdrawals_root field. This optimization only applies to post-Isthmus blocks. Pre-Isthmus blocks (Canyon→Fjord) set withdrawals_root to the empty MPT root (EMPTY_ROOT_HASH), which is NOT the storage root — the code falls back to eth_getProof for those blocks. Applied to both code paths: - fault-proof: compute_output_root_at_block (proposer/challenger) - host-utils: fetcher (validity proof) Based on succinctlabs#771 by @kien-rise, with a fix for the Canyon→Isthmus case. Ref: https://specs.optimism.io/protocol/isthmus/exec-engine.html Co-authored-by: kien-rise <157339831+kien-rise@users.noreply.github.com> --------- Co-authored-by: kien-rise <157339831+kien-rise@users.noreply.github.com>
…uccinctlabs#861) * fix(test): fix race condition in proposer recovery integration test The test checked for a game with the expected parent index only once after the first new game appeared. If the proposer hadn't completed recovery by that point (common with eigenda DA layer due to slower sync cycles), the scan missed the correct game. Replace the single-pass scan with a polling loop that continuously checks newly created games until the expected parent is found or WAIT_TIMEOUT (60s) expires. * fix: remove unused mut * refactor: labeled break, conditional sleep, remove dead variable
…uccinctlabs#864) The test verified proposer recovery AFTER bond claims, which could advance the anchor to game 1. When canonical head == anchor, the proposer uses u32::MAX (anchor path) instead of parent_index=1 — a valid but non-deterministic outcome depending on DA layer speed. Move recovery check to BEFORE finalization (new Phase 4 → Phase 5 ordering). With anchor still at game 0, canonical head = game 1 is guaranteed, making parent_index=1 deterministic across all DA layers. Root cause: succinctlabs#839 added the anchor path logic but didn't update this test, which was written in succinctlabs#719 (Dec 2025) before anchor paths existed.
…bs#869) * fix(scripts): honor explicit --batch-size and reject zero Since succinctlabs#820, `effective_batch_size()` unconditionally returned `end - start` whenever both `--start` and `--end` were provided, silently ignoring any explicit `--batch-size`. This broke the documented cost-estimator example and the recommended workaround for split-aware PGU estimation (`--start X --end Y --batch-size N`). Change `batch_size` to `Option<u64>` and rewrite `effective_batch_size()` with explicit precedence: explicit flag > full-range DX default > fallback. The PR succinctlabs#820 convenience (single batch when only a range is given) is preserved when `--batch-size` is omitted. Additionally reject `--batch-size 0` at the clap parse layer so the range splitter cannot receive a degenerate `max_range_size = 0` via the CLI. Covered by a decision-table of unit tests plus parse-layer tests for the zero-rejection and precedence behavior. * refactor(scripts): encode --batch-size non-zero invariant with NonZeroU64 Addresses review feedback: `HostExecutorArgs` is a public struct in a library crate, so parser-only validation is insufficient — any in-tree or downstream code that constructs the struct programmatically could still set `Some(0)` and slip through to the splitter. Move the invariant into the type: change `batch_size` from `Option<u64>` to `Option<NonZeroU64>`. The clap value parser now returns `NonZeroU64`, keeping the user-facing error message ("--batch-size must be greater than 0") unchanged. `effective_batch_size` calls `.get()` on the inner value; precedence behavior is preserved exactly. The existing 10 tests are retained; the test helper `args()` keeps its `Option<u64>` signature and converts via `NonZeroU64::new(..).expect(..)` so test call sites stay readable and any accidental zero in a test fixture panics clearly.
…ble (succinctlabs#874) * feat(fault-proof,validity): make L1 tx confirmation timeout configurable The signer's hardcoded TIMEOUT_SECONDS=60 is too tight for 3 L1 confirmations on Ethereum mainnet (~36s of pure block time leaves only ~24s for mempool inclusion). Under any congestion the watcher gives up while the tx is still in flight, returns an error, and the caller immediately retries - racing the original tx and, in the fault-proof proposer's case, producing duplicate sibling games when the original eventually lands. Plumb a new `send_transaction_request_with_timeout` through the signer and add `tx_confirmation_timeout` (env: TX_CONFIRMATION_TIMEOUT) to all three L1-submitting clients: - fault-proof proposer (ProposerConfig) - fault-proof challenger (ChallengerConfig) - validity proposer (RequesterConfig / EnvironmentConfig) The default stays at 60s to preserve the historical signer behavior; operators on congested L1s can raise it (e.g. 180) without a code change. Co-authored-by: seolaoh <osa8361@gmail.com> * docs: add TX_CONFIRMATION_TIMEOUT to env var docs Document the new TX_CONFIRMATION_TIMEOUT environment variable in the fault-proof proposer, fault-proof challenger, and validity proposer documentation pages. * fix(docs): update Kona link to new monorepo location The kona project moved from op-rs.github.io/kona to the ethereum-optimism/optimism monorepo, breaking the CI link checker. --------- Co-authored-by: seolaoh <osa8361@gmail.com>
* chore: bump sp1 to v6.1.0 Bump SP1 from 6.0.2 to 6.1.0 across all workspace deps, Dockerfiles, CI workflows, justfile, build utils, and documentation. Also bump sp1-cluster to a fork with CoreExecute task type support required for sp1-prover-types 6.1.0 compatibility. * chore: rebuild ELFs with SP1 v6.1.0 toolchain * ci: exclude flaky external URLs from link checker caldera.xyz and op-rs.github.io/kona intermittently fail TLS/connection checks in CI. Exclude them to prevent false-positive failures. * fix(docs): update kona link to new optimism monorepo location op-rs/kona was archived and moved to ethereum-optimism/optimism. The old GitHub Pages URL is permanently dead. * chore: switch sp1-cluster to official v2.1.0 release Replace fork (fakedev9999/sp1-cluster) with succinctlabs/sp1-cluster v2.1.0 which includes CoreExecute task type support. Add [patch."https://github.com/succinctlabs/sp1"] to redirect sp1-cluster's git SP1 deps to crates.io, avoiding duplicate types. * chore: switch sp1-cluster to official v2.1.0 release v2.1.2 uses crates.io SP1 deps, so the [patch] workaround for git/crates.io type duplication is no longer needed. * chore: update hokulea to SP1 v6.1.0 (fork rev) Point hokulea deps to fakedev9999/hokulea@af7f05e which has SP1 6.1.0 without the alloy-consensus =1.6.3 exact pin from hokulea master. Will switch to official tag once Layr-Labs cuts a release. * chore: update hokulea rev to 7afecb4 (doc fix) * chore: rebuild ELFs with SP1 v6.1.0 (latest hokulea rev) * feat: bump kona to v1.2.13 (ethereum-optimism/optimism monorepo) Major ecosystem upgrade to align with hokulea master: - kona: op-rs/kona v1.2.7 → ethereum-optimism/optimism v1.2.13 - alloy: 1.4.x → 1.6.3, alloy-primitives 1.4.1 → 1.5.6 - revm: 33.1.0 → 34.0.0, op-revm 14.1.0 → 15.0.0 - op-alloy: 0.22.4 → 0.23.1 (from optimism monorepo) - hokulea: fork rev → Layr-Labs/hokulea master (22c48b9) - hana: fork rev with kona v1.2.13 bump (30d72ed) - Removed reth patch (PartialEq fix upstreamed in reth v1.11.3) Code changes: - BlobProvider: IndexedBlobHash → B256 - ZkvmOpEvmFactory: updated for alloy-op-evm 0.26.x API - Pipeline cursor: added agreed_l2_output_root arg * chore: rebuild ELFs with kona v1.2.13 + SP1 v6.1.0 * chore: switch hana from fork rev to official v1.6.0-mocha tag * chore: remove stale comments from Cargo.toml * chore: remove unnecessary kona op-rs/kona patch redirect * chore: rebuild celestia ELF with official hana v1.6.0-mocha tag * chore: switch hokulea from rev to official hokulea-client/v1.1.7 tag * chore: rebuild ELFs with official hokulea-client/v1.1.7 tag
…ccinctlabs#878) Update both sp1-contracts submodules (tests/ and contracts/lib/) to the v6.1.0 tag. Update DeployVerifier.s.sol imports from v6.0.0 to v6.1.0 to fix WrongVerifierSelector in e2e network prover tests.
* chore: release v4.3.0 * chore: bump elfs
…cy (succinctlabs#865) * fix(fault-proof): pin L1 block during sync to prevent RPC inconsistency With load-balanced RPCs, different eth_call requests within the same sync cycle can hit backends at different block heights. This breaks atomicity between related state reads (e.g. credit() vs anchorGame()), which caused Celo's March 26 incident: canonical head cleared, ~470 orphaned games created. Fix: fetch L1 block number once at the start of sync_state() and pass it to all downstream eth_call requests via .block(pinned_block). All state reads within a sync cycle now see a consistent L1 snapshot. Pinned functions: sync_games, sync_anchor_game, fetch_game, is_parent_resolved, is_parent_challenger_wins. The challenger passes BlockId::latest() to maintain existing behavior (separate pinning can be added later). Addresses: #132 Tracking: succinctlabs/cloud-ops#98 * fix: add missing BlockId::latest() to fetch_game call in test * refactor: collapse redundant RPC calls and cache anchor address - Fetch L1 block once in sync_state (was: get_block_number + get_block_by_number separately) - Pass timestamp from the pinned block, eliminating a second RPC call in sync_games - Reuse anchor_address from top of sync_games instead of re-fetching in DEFENDER_WINS loop - Remove pinned_block.as_u64().unwrap() (latent panic if BlockId variant changes) * feat(fault-proof): add confirmed block offset and skip-when-unchanged Two improvements from @palango's review: 1. SYNC_L1_CONFIRMATIONS env var (default: 0): pin reads to latest - N blocks for deployments with unstable RPC backends. When 0, uses latest directly (no extra RPC call). 2. Skip sync when the pinned L1 block hasn't advanced past the last successfully synced block. Avoids redundant RPC calls when the sync loop runs faster than L1 block production. * simplify: single RPC for confirmed block, remove dead unwrap_or * fix: add sync_l1_confirmations to test ProposerConfig * fix: single-response snapshot, prune future games, clear stale anchor P1: Pin from single RPC response. - confirmations=0: get_block_by_number(Latest) — one call, no split-backend risk. - confirmations>0: Latest first (for skip check), then confirmed fetch. If confirmed block is unavailable (Ok(None)), skip cycle. Err propagated. P2: Prune backup-restored games above pinned latest index. - Runs before the early-return on latest_index==None, so all cached future games are removed even when the pinned block has no games. - sync_anchor_game now clears state.anchor_game when the pinned anchor is Address::ZERO or not found in cache, preventing compute_canonical_head from following a stale subtree. * test: add backup+prune regression tests, cloud-ops follow-ups Two integration tests for the prune behavior: 1. test_backup_prune_future_games_with_confirmations: restart from backup with large SYNC_L1_CONFIRMATIONS, verify all future games are pruned, anchor/canonical head cleared. 2. test_backup_restore_pinned_block_no_games: same but pinned block has zero games (latest_index=None), verify full state clear. Added new_proposer_with_confirmations helper to allow tests to override sync_l1_confirmations. Follow-up validation tracked in: - cloud-ops#102: SYNC_L1_CONFIRMATIONS > 0 on load-balanced RPCs - cloud-ops#103: skip/retry when confirmed block unavailable * test: replace full-prune test with deterministic partial-prune Replace the first backup test (which used huge confirmations to prune all games — same scenario as the zero-games test) with a proper partial-prune test: 1. Create game 0, mine an empty L1 block, create game 1. 2. Sync and save backup with confirmations=0 (both games cached). 3. Restart from backup with confirmations=1 — pinned block is between game 0 and game 1. 4. Assert: game 0 survives, game 1 pruned, canonical head is game 0. Also added new_proposer_with_options helper to TestEnvironment for cleaner test setup. * fix(test): guarantee backup restore before prune assertions Both tests now: 1. Construct and save backup via ProposerBackup::new() + save() directly (not relying on the proposer run loop to trigger backup). 2. Assert games are restored from backup BEFORE sync_state() prunes them, proving the prune path is actually exercised. Partial-prune test: backup has 2 games, sync with confirmations=1 prunes game 1 (created after the gap block) while game 0 survives. Zero-games test: backup has 1 game, sync with huge confirmations prunes everything. * fix(test): use real Game structs from proposer cache in backup tests * fix: reset cursor when pinned snapshot has no games * test: add rediscovery regression test for cursor reset after prune New test: test_backup_prune_then_rediscover_after_confirmation 1. Create game 0, save real backup. 2. Restart with confirmations=1 — pinned block is before game 0. 3. First sync: game pruned, cache empty, cursor reset. 4. Mine one empty block — pinned block now includes game 0. 5. Second sync: game 0 rediscovered, canonical head restored. Directly tests the cursor reset fix: without it, step 5 would skip the fetch loop and game 0 would stay missing permanently. * feat(metrics): add canonical_head_index and anchor_game_index gauges New metrics: - op_succinct_fp_canonical_head_game_index: -1 when cleared, else game index - op_succinct_fp_anchor_game_index: -1 when cleared, else game index Also fixes stale LatestGameL2BlockNumber and AnchorGameL2BlockNumber: previously these were never reset when canonical head was cleared (only a warning was logged). Now they are set to 0.0 in the None branch. Sentinel: -1 for "absent/cleared" since game index 0 is a valid value. No regression test added — fetch_proposer_metrics is private and requires a full provider stack + metrics recorder. The new code is 4 lines of unconditional .set() calls with no branching risk. * fix: clear canonical_head_l2_block when canonical head is None * fix: reset FinalizedL2BlockNumber when canonical head cleared * fix(fault-proof): prevent duplicate game creation when pinned cache lags Bug: With sync_l1_confirmations > 0, the pinned cache may not see a recently created game for several sync cycles. During that window, should_create_game() computes the same (next_l2_block, parent_index), and handle_game_creation() bumps only the L2 block (keeping the same parent), creating duplicate sibling games. Fix: Track last_created_game_l2_block in memory. should_create_game() skips creation when the proposed L2 block <= last created. The guard is cleared in sync_state() once canonical_head_l2_block catches up. Test: test_handle_game_creation_bumps_past_existing verifies the while-loop bump behavior (existing game at target → bumps to target+1). * fix: stale finalized metric, restart-safe creation guard, proper guard test 1. FinalizedL2BlockNumber stale on None: Reset to 0.0 when get_finalized_l2_block_number returns None (e.g. Celestia host). Previously kept the old value indefinitely. 2. Duplicate-creation guard not restart-safe: Initialize last_created_game_l2_block from the highest L2 block in restored backup games. Without this, a restart before pinned sync catches up loses the guard and allows duplicate sibling creation. 3. Regression test exercises should_create_game: Replaced handle_game_creation-only test with one that calls should_create_game() directly (now pub(crate)). Test verifies: - Guard blocks creation when cache is stale after handle_game_creation - Guard clears after sync catches up Previous test only verified the while-loop bump, not the guard. * fix: scope creation guard to owned games, isolate guard in test P2: Restart guard over-blocks on foreign games. Before: seeded from max(all_restored_games.l2_block), including foreign and orphan games. A foreign game with a higher L2 block could permanently block creation. After: seeded from max(owned_restored_games.l2_block), filtered by is_owned() (vkey match). Foreign games can't suppress creation. P3: Test doesn't prove the guard is the cause. Before: should_create_game() could return false from finalization gate, not just the guard. After: calls should_create_game() before AND after setting the guard with the same stale cache state. If baseline returns true and post-guard returns false, the guard is proven to be the only cause. If baseline returns false (finalization blocks), the test skips gracefully. * fix: persist creation guard in backup, verify full guard lifecycle in test P2: Guard over-blocks on non-canonical owned games. Before: seeded from max(owned_games.l2_block) which includes orphan branches and old siblings — can permanently block creation. After: last_created_game_l2_block is persisted directly in the backup file (#[serde(default)] for backward compat). On restore, the exact guard value is used. On save, the current guard value is written. No more derivation from game cache contents. P3: Test doesn't verify guard clears after sync. After: test now calls should_create_game() a third time after sync and asserts it matches the baseline (true). This proves the full lifecycle: set on creation → blocks while stale → clears after sync. The test still skips gracefully if finalization gate blocks baseline. Files: backup.rs (new field), proposer.rs (save/restore/try_restore), sync.rs (test lifecycle assertion). * fix: schema guard, crash-window backup, deterministic test precondition P1: Update backup_schema_guard test to include last_created_game_l2_block in the expected key list. P2: Trigger backup() immediately after handle_game_creation succeeds in the creation task. Closes the crash window between game creation and the next periodic backup — without this, a crash after creation but before backup would lose the duplicate-creation guard. P3: Replace `return Ok(())` skip with `assert!(should_create_before)` hard failure. In mock mode + anvil, finalization should not block, so this precondition should always hold. If it doesn't, the test fails loudly instead of silently passing with zero guard coverage. * fix: make should_create_game pub (tests are a separate crate) * fix: remove unused OPSuccinctFaultDisputeGame import * fix: preserve anchor baseline in canonical_head_l2_block Revert clearing canonical_head_l2_block when canonical head is None. This field is seeded from the anchor root in init_state() and serves as the baseline for should_create_game() to propose the first game (parent_game_index = u32::MAX). Clearing it blocks initial proposals on fresh deployments or when the pinned snapshot has no games. canonical_head_index is still cleared to None. The new index gauge (-1) provides observability for the "no head" state without breaking the creation path. * docs: document fault proof proposer head metrics * refactor: simplify anchor check, remove dead metric reset branch - @piersy: simplify anchor lookup to HashMap::contains_key(&index) (O(1) instead of O(n) linear scan over values) - @seolaoh: remove dead metric reset branch. canonical_head_l2_block is no longer cleared (preserves anchor baseline), so the else branch is unreachable under normal operation. Kept defensive warn log. * fix: remove catch-up guard clear, add CHALLENGER_WINS guard reset Per @piersy's review: 1. Remove explicit guard clear in sync_state(). The guard doesn't need to be cleared when canonical head catches up — should_create_game() naturally bypasses it because next_l2_block = head + interval > last_created. Clearing it was actively harmful: the cleared value (0) gets persisted in the next backup, and a restart with a lower pinned block would lose the protection. 2. Reset guard when CHALLENGER_WINS removes the tracked game's subtree. Without this, the guard would permanently block creation after a legitimate branch switch. The guard is only cleared if the subtree contains a game at the exact L2 block being tracked. * fix: track created game by index for precise CHALLENGER_WINS guard reset The CHALLENGER_WINS subtree removal was matching by L2 block number, but multiple games can share the same L2 block. This could falsely clear the guard when a different game at the same height was removed. Now tracks both l2_block (for should_create_game check) and factory index (for CHALLENGER_WINS reset). The index uniquely identifies the game, so the guard is only cleared when the exact created game is in the removed subtree. Both values are persisted in backup (last_created_game_index with #[serde(default)] = u64::MAX for old backups). * refactor: use game address instead of index for creation guard identity Per review feedback: gameCount()-1 to derive the index is race-prone if other games are created in the same block. Game address is available directly from the DisputeGameCreated receipt event, is globally unique, and requires no extra RPC call. Changes: - last_created_game_index → last_created_game_address (Arc<Mutex<Address>>) - CHALLENGER_WINS subtree removal now matches by address - Backup field: last_created_game_index → last_created_game_address (#[serde(default)] = Address::ZERO for old backups) - Removed the gameCount() call after creation * test: add CHALLENGER_WINS guard invalidation regression test New test: test_challenger_wins_clears_creation_guard 1. Create game 0 (anchor), sync. Verify baseline allows creation. 2. Create game 1 via handle_game_creation (sets guard). Verify blocked. 3. Sync — game 1 enters cache. 4. Challenge game 1, warp, resolve as CHALLENGER_WINS. 5. Sync — RemoveSubtree fires, guard cleared (address match). 6. should_create_game returns true — creation unblocked. This test would fail without the address-based invalidation in RemoveSubtree: the guard would remain set at game 1's L2 block, and after CHALLENGER_WINS reverts canonical head to game 0, should_create_game would compute the same next_l2_block <= guard and return false permanently. * chore: fix stale comment (index→address), remove race-prone gameCount log * fix(test): resolve parent game before child in CHALLENGER_WINS test
…al head (succinctlabs#736) * fix: allow games with parent < anchor become canonical * test: canonical head override logic * fix: merge cases into test_sync_state_happy_paths * fix: update compute_canonical_head comment
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ecabaa53d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Distinguish the two skip cases: log at WARN when confirmed_number moves backwards (load-balanced RPC backend regression or deep L1 reorg) so operators can detect unhealthy backends, and keep DEBUG for the normal equal case where L1 simply hasn't ticked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fbfcc771cb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| reth-trie-common = { git = "https://github.com/succinctlabs/reth", branch = "v1.9.3-kona" } | ||
|
|
||
| [patch."https://github.com/op-rs/kona"] | ||
| kona-mpt = { git = "https://github.com/celo-org/kona", rev = "cf790b7" } |
There was a problem hiding this comment.
Is this safe to remove? Doesn't this pull in the sequencer drift change we need?
There was a problem hiding this comment.
Yes, with removing these patches, we don't have our max_sequencer_drift override anymore. celo-kona already removed this patch with the anticipation that future kona bump will pull in the configurable max_sequencer_drift. But we'd like to use this version right now, I'll re-add the patch when applying kona security fix in the next PR.
Yes, it'd be better to set it when using LB L1_RPC since we can get different responses in each check even with The max difference among mainnet proxyd backends over the last 24 hours was 4 blocks, occurring around 2026-04-27T15:35 KST. Typically the spread stays at 0–1 block, averaging ~0.77. All 4 backends are well synchronized overall. So maybe setting it to 5-10 would be a good start. The downside is that we'll see ~60-120 seconds of staleness on L1 reads. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d994624f49
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
d994624 to
5d4abe2
Compare
When sync_games prunes future games (above pinned_latest_index) due to abnormal cache states like backup restore into a shorter chain or deep L1 reorg, the duplicate-creation guard could point at a game that no longer exists on chain. Without resetting, should_create_game blocks indefinitely because canonical_head_l2_block cannot advance through an orphaned game. Reset is gated on the guarded address being among the entries this prune actually removes (evaluated before the removal loop). Checking "absent from post-prune cache" would over-clear in the case where the just-created game has not yet been added to the cache and an unrelated prune fires, allowing should_create_game to re-submit a duplicate at the same L2 block before the cache catches up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5d4abe2 to
0019a89
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0019a89fe1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
You just mean that the proposer will take longer to react to updates, right? or is there anything else changing? |
…ve/claim With sync_l1_confirmations > 0, the pinned cache lags behind the chain tip by sync_l1_confirmations × block_time, so a recently confirmed prove(), resolve(), or claimCredit() tx may not yet be reflected in should_attempt_* flags. Without a pre-flight check, the proposer would re-submit duplicate transactions that revert on chain — wasting gas for resolve/claim, and re-running expensive proof generation for prove. Each path now does one eth_call at `latest` before submission: - resolve_games: skip if GameStatus != IN_PROGRESS - claim_bonds: skip if credit(signer) == 0 - should_skip_proving: skip if ProposalStatus is *ValidProofProvided or Resolved (single check covers both already-proven and timeout default-loss cases since Resolved is set whenever GameStatus moves out of IN_PROGRESS) On RPC failure the check logs a warn and proceeds, so transient backend issues don't block legitimate work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I found that with increasing |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3d76ec6ec9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
When SafeDB is active, cost-estimator splits the requested range at every span batch boundary via split_range_based_on_safe_heads, producing one zkVM execution per span batch regardless of --batch-size. That mirrors a hypothetical "split each proposal at span batch boundaries" workload, not what the proposer actually does (RANGE_SPLIT_COUNT-driven arithmetic split, default 1 = single execution per proposal interval). The new --no-safe-head-split flag forces split_range_basic so the range is partitioned solely by --batch-size, giving a closer estimate of the per-segment cost the proposer sees on the prover network. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merged upstream v4.3.0 which includes some proposer improvements, sp1 crates bump to v6.1.0, and kona bump to v1.2.13. Confirmed that cost-estimator on Mainnet arbitrary 5 blocks worked.
Related to https://github.com/celo-org/celo-blockchain-planning/issues/1324
Remarkable changes
u32::MAXas parent index is regarded as building on top of current anchor--batch-sizeto cost-estimator, we can estimate split-aware PGUNew configs/metrics
add
TX_CONFIRMATION_TIMEOUTto challenger/proposer and set to 180add
SYNC_L1_CONFIRMATIONSto proposer and set to whatever we want to force as a safety gap from the latest blockproposer new metric:
op_succinct_fp_canonical_head_game_index,op_succinct_fp_anchor_game_index