Succinct game monitor by piersy · Pull Request #141 · celo-org/op-succinct

piersy · 2026-05-14T13:18:17Z

Summary

Overhaul the game monitor with a two-tier retry system, progress persistence, operational tooling, and a slimmed-down Docker image.

Retry system

Primary retries with configurable count (--cost-estimator-retries, default 1) and linear backoff (delay * 2 * retry_number). Previously hardcoded to 3 immediate retries which overloaded the monitor.
Background retries for games that exhaust primary retries. These run only when spare execution slots are available (primary always wins), use exponential 4x backoff, and are evicted after --background-retry-max-age-secs (default 3.5 days). Games are marked "completed" when entering the background queue so last_contiguous can advance past them.
AttemptKind enum (Primary/Background) replaces bare retries: u32, enabling distinct scheduling and log-file naming (-bg-retry<n>).

Progress persistence

New ProgressState (serialised to progress.json) stores last_contiguous and the background_retries queue. The daemon resumes from persisted progress on restart.
SequenceTracker (new utils/common crate) tracks out-of-order game completions to compute last_contiguous correctly.
Both progress.json and completion_history.json are now written atomically (write-to-tmp + rename).

Operational tooling

game-monitor run / game-monitor gen-background-retry subcommand split via clap. The daemon moves under run; gen-background-retry reads an existing progress.json, appends new background entries for given game indexes (fetching game_created_at from L1), and prints the complete state to stdout. Existing k8s manifests need command: [game-monitor, run].
rerun-cost-estimator.sh — replays a failed game from the log-file header (command + env).
fetch-game-logs.sh — tars all log files for given game indexes to stdout for local extraction.

Docker image

Removes Rust toolchain and full workspace copy from the runtime image. Previously needed for cargo_metadata::MetadataCommand; now unnecessary since cost-estimator runs with --log-only.
Copies the two operational scripts into the image.

Other improvements

gameCount() calls now use BlockId::finalized to avoid node-desync issues.
Log-size kill message fixed: was showing 0.0 MB for both current and median (dividing bytes-per-block by 1024^2); now shows actual total MB and bytes/block.
--batch-size flag caps per-execution chunk size (default 200) instead of using the full block range.
Pending games are scheduled by absolute executable_at with retry backoff, instead of immediate retry.
Comprehensive doc comments throughout game_monitor.rs.

Previously the cost estimator log would be moved to a ...-failure.log or ...success.log after printting the log when cleaning up the process, this meant that the logged log no longer existed.

The previous hardcoded value of 3 was causing the monitor to get overloaded, so made it configurable and set the default to 1.

Make the cost-estimator run in log only mode This removes the need for the cost estimator to run in a working cargo git project. Remove un-needed rust and cargo workspace Now that the cost estimator is run with the --log-only flag there is no need for rust and the workspace to exist in the built docker image.

PendingGame now stores an absolute executable_at instant rather than a discovered_at + global delay. New discoveries and retries both push to the front; the inner loop scans for the first entry whose executable_at has passed and runs that one. Newer games are more relevant than older ones, so push_front for discoveries gives them priority while older entries stay in place to be picked up afterwards. Retries previously ran immediately, which meant a transient infrastructure issue (RPC blip, finalized-block lag, etc) that caused the original failure would cause the retry to hit the same problem and fail again. Retries now back off linearly (initial_game_delay * 2 * retries), giving the infrastructure time to recover before re-running.

Switch to write-to-temp + rename so a crash mid-write can no longer leave a torn progress file. The blob will grow as more state is added to ProgressState in subsequent commits, making this more important.

Read game_info.timestamp returned by gameAtIndex and convert from Unix seconds into a SystemTime. Wired into the existing 'Game {} covers ...' info log so the field is used in this commit; subsequent commits use it as the anchor for background-retry age eviction.

Replace the bare retries: u32 field with an AttemptKind enum so the upcoming background-retry variant can carry its own counter. The enum currently has only the Primary variant, so behaviour is unchanged. LogFile::new takes AttemptKind directly and pattern-matches to choose between the bare and -retry<n> filename forms.

Introduce a persistent BackgroundRetry record carrying the game index, the L1 game-creation timestamp, the next attempt time, the most recent wait duration, and the attempt counter. ProgressState gains a #[serde(default)] background_retries field so existing progress.json files (containing only last_contiguous) parse unchanged. MonitorState carries an in-memory VecDeque populated from the persisted list at startup, and save_progress now serialises both fields. No scheduling logic uses the queue yet.

When a game's primary retry budget is exhausted maybe_requeue now inserts a BackgroundRetry into the queue (with first wait of initial_game_delay * 2 * max_retries * 4) and persists, instead of silently giving up. mark_game_completed is still called so last_contiguous keeps advancing past flaky games. RunningEstimator carries game_created_at so the timestamp is available at requeue time without re-fetching from L1. The background entries are not yet scheduled for execution; that follows in the next commit.

Add the AttemptKind::Background variant and a second-phase spawn loop that drains background_retries only when no primary entry is currently executable, so primary scheduling always wins on contention. Failed background attempts update the existing entry in place (4x backoff, attempts++) without re-marking the game completed in SequenceTracker. Successful attempts remove the entry without pushing to completion_history (delayed retries would skew anomaly medians). Background entries remain in background_retries while a process is running, with the spawn loop skipping any game already in running_processes; the in-place update on failure relies on this. Log files for background attempts use the new bg-retry<n> infix: cost-estimator-<idx>-<addr>-bg-retry<n>.log. extract_game_index keeps working because it parses up to the first dash after the index.

New --background-retry-max-age-secs CLI flag (default 302400 / 3.5 days) caps how long a game stays on the background-retry queue. The age is measured against the L1 game creation timestamp, so calendar-time limits hold across long process downtimes. evict_aged_background_retries runs at startup and at the top of every loop iteration. It skips entries whose process is currently running to avoid invalidating an in-flight maybe_requeue update.

Extend the per-iteration status line with the size of the background queue and the age of its oldest entry, so the backlog is visible alongside running and pending counts in pod logs.

Lead primary-spawn info logs with the game index (the primary key in pending_games/running_processes/background_retries) rather than the address; address still appears as a secondary field. Simplify the periodic status line by dropping the oldest-age figure and renaming Background to 'Background pending'. Several rustfmt-style line collapses to match the project convention.

Extract the common fetch + log + spawn + insert flow into a new MonitorState::spawn_game method that takes a game index, an AttemptKind, and a SpawnContext bundling per-spawn config. Queue- specific cleanup on both WrongGameType and successful spawn is dispatched on the kind, so each outer loop body shrinks to candidate selection plus a single spawn_game call. The primary spawn log line and the background spawn log line are unified into a single 'Starting cost estimator [<kind>]' format with a kind_descr helper distinguishing primary, primary retry N, and background attempt N. spawn_cost_estimator now takes &Path instead of &PathBuf so the context can hold a borrowed Path slice. No behavioural change.

Generates custom background retry data for manual seeding of background retries.

Previously add(index) inserted into the pending HashSet unconditionally, so indices <= end were stashed forever (the while-loop only scans upwards from end + 1). Early-return in that case so pending never grows without bound.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6844555153

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-14T13:23:01Z

            max_history_length,
+            max_retries,
            history_file,
+            sequence_tracker: SequenceTracker::new(next_game_index),


Start contiguous tracking before first pending index

sequence_tracker is initialized with next_game_index, which marks that index as already contiguous before it is actually processed. If --start-index is used (or any resume path where that index is still pending), later successful games can advance last_contiguous past an unprocessed first index, and a restart (next_game_index = last_contiguous + 1) will skip it permanently. This corrupts persisted progress and can lose required game executions.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fc6cf8a64

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-14T14:05:05Z

+        let primary_has_ready =
+            state.pending_games.iter().any(|p| p.executable_at <= Instant::now());
+        if !primary_has_ready {
+            while state.can_spawn_new(args.max_concurrent) {


Re-check primary readiness before each background spawn

The primary_has_ready gate is computed once before entering the background loop, so a primary game that becomes ready while spawn_game(...).await is running will not preempt subsequent background spawns in that same iteration. In practice, if a primary is due shortly and there are many ready background retries, this loop can fill all slots with background work and delay the newly-ready primary until a long-running background process exits, which contradicts the intended “primary always wins” scheduling guarantee.

Useful? React with 👍 / 👎.

seolaoh · 2026-05-15T13:56:07Z

This PR supersedes #127, right?

piersy added 28 commits May 14, 2026 12:55

Ensure that we use finalized L1 blocks for the game count lookup

d2dae82

Ensure logged log paths are correct

0ab7b8e

Previously the cost estimator log would be moved to a ...-failure.log or ...success.log after printting the log when cleaning up the process, this meant that the logged log no longer existed.

Make retries configurable

3c3ad10

The previous hardcoded value of 3 was causing the monitor to get overloaded, so made it configurable and set the default to 1.

Ensure both calls to gameCount use finalized block

da72b81

Add feature to resume from last contiguously completed game

11756d6

Limit max size of cost estimator batches

f0b0bdc

Add script to make it easy to manually re-run failed games

9eb24ff

Atomic write for save_progress

18c6484

Switch to write-to-temp + rename so a crash mid-write can no longer leave a torn progress file. The blob will grow as more state is added to ProgressState in subsequent commits, making this more important.

Include background queue stats in periodic log line

cc97312

Extend the per-iteration status line with the size of the background queue and the age of its oldest entry, so the backlog is visible alongside running and pending counts in pod logs.

Standardize logging approach between primary and background games

1666875

Add gen-background-retry subcommand to game-monitor

5b5aaf4

Generates custom background retry data for manual seeding of background retries.

Drop SequenceTracker adds at or below end

575bb94

Previously add(index) inserted into the pending HashSet unconditionally, so indices <= end were stashed forever (the while-loop only scans upwards from end + 1). Early-return in that case so pending never grows without bound.

Add --progress-file to gen-background-retry to read/merge existing state

c34a764

Fix log-size kill message to show actual MB and bytes/block

45fc7d1

Make history save atomic

0fc58df

Add a script to simplify fetching logs

91ad93c

Update comments in game monitor

5145817

Update Cargo.lock

6844555

piersy temporarily deployed to e2e May 14, 2026 13:18 — with GitHub Actions Inactive

chatgpt-codex-connector Bot reviewed May 14, 2026

View reviewed changes

Fix formatting

8fc6cf8

piersy temporarily deployed to e2e May 14, 2026 14:01 — with GitHub Actions Inactive

chatgpt-codex-connector Bot reviewed May 14, 2026

View reviewed changes

seolaoh approved these changes May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Succinct game monitor#141

Succinct game monitor#141
piersy wants to merge 29 commits into
developfrom
piersy/game-monitor

piersy commented May 14, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Uh oh!

seolaoh commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

piersy commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Retry system

Progress persistence

Operational tooling

Docker image

Other improvements

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

seolaoh commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

piersy commented May 14, 2026 •

edited

Loading