Skip to content

Succinct game monitor#141

Open
piersy wants to merge 29 commits into
developfrom
piersy/game-monitor
Open

Succinct game monitor#141
piersy wants to merge 29 commits into
developfrom
piersy/game-monitor

Conversation

@piersy
Copy link
Copy Markdown

@piersy piersy commented May 14, 2026

Summary

Overhaul the game monitor with a two-tier retry system, progress persistence, operational tooling, and a slimmed-down Docker image.

Retry system

  • Primary retries with configurable count (--cost-estimator-retries, default 1) and linear backoff (delay * 2 * retry_number). Previously hardcoded to 3 immediate retries which overloaded the monitor.
  • Background retries for games that exhaust primary retries. These run only when spare execution slots are available (primary always wins), use exponential 4x backoff, and are evicted after --background-retry-max-age-secs (default 3.5 days). Games are marked "completed" when entering the background queue so last_contiguous can advance past them.
  • AttemptKind enum (Primary/Background) replaces bare retries: u32, enabling distinct scheduling and log-file naming (-bg-retry<n>).

Progress persistence

  • New ProgressState (serialised to progress.json) stores last_contiguous and the background_retries queue. The daemon resumes from persisted progress on restart.
  • SequenceTracker (new utils/common crate) tracks out-of-order game completions to compute last_contiguous correctly.
  • Both progress.json and completion_history.json are now written atomically (write-to-tmp + rename).

Operational tooling

  • game-monitor run / game-monitor gen-background-retry subcommand split via clap. The daemon moves under run; gen-background-retry reads an existing progress.json, appends new background entries for given game indexes (fetching game_created_at from L1), and prints the complete state to stdout. Existing k8s manifests need command: [game-monitor, run].
  • rerun-cost-estimator.sh — replays a failed game from the log-file header (command + env).
  • fetch-game-logs.sh — tars all log files for given game indexes to stdout for local extraction.

Docker image

  • Removes Rust toolchain and full workspace copy from the runtime image. Previously needed for cargo_metadata::MetadataCommand; now unnecessary since cost-estimator runs with --log-only.
  • Copies the two operational scripts into the image.

Other improvements

  • gameCount() calls now use BlockId::finalized to avoid node-desync issues.
  • Log-size kill message fixed: was showing 0.0 MB for both current and median (dividing bytes-per-block by 1024^2); now shows actual total MB and bytes/block.
  • --batch-size flag caps per-execution chunk size (default 200) instead of using the full block range.
  • Pending games are scheduled by absolute executable_at with retry backoff, instead of immediate retry.
  • Comprehensive doc comments throughout game_monitor.rs.

piersy added 28 commits May 14, 2026 12:55
Previously the cost estimator log would be moved to a ...-failure.log or
...success.log after printting the log when cleaning up the process,
this meant that the logged log no longer existed.
The previous hardcoded value of 3 was causing the monitor to get
overloaded, so made it configurable and set the default to 1.
Make the cost-estimator run in log only mode

This removes the need for the cost estimator to run in a working cargo
git project.

Remove un-needed rust and cargo workspace

Now that the cost estimator is run with the --log-only flag there is no
need for rust and the workspace to exist in the built docker image.
PendingGame now stores an absolute executable_at instant rather than a
discovered_at + global delay. New discoveries and retries both push to
the front; the inner loop scans for the first entry whose executable_at
has passed and runs that one.

Newer games are more relevant than older ones, so push_front for
discoveries gives them priority while older entries stay in place to be
picked up afterwards. Retries previously ran immediately, which meant a
transient infrastructure issue (RPC blip, finalized-block lag, etc) that
caused the original failure would cause the retry to hit the same
problem and fail again. Retries now back off linearly
(initial_game_delay * 2 * retries), giving the infrastructure time to
recover before re-running.
Switch to write-to-temp + rename so a crash mid-write can no longer leave
a torn progress file. The blob will grow as more state is added to
ProgressState in subsequent commits, making this more important.
Read game_info.timestamp returned by gameAtIndex and convert from Unix
seconds into a SystemTime. Wired into the existing 'Game {} covers ...'
info log so the field is used in this commit; subsequent commits use it
as the anchor for background-retry age eviction.
Replace the bare retries: u32 field with an AttemptKind enum so the
upcoming background-retry variant can carry its own counter. The enum
currently has only the Primary variant, so behaviour is unchanged.

LogFile::new takes AttemptKind directly and pattern-matches to choose
between the bare and -retry<n> filename forms.
Introduce a persistent BackgroundRetry record carrying the game index,
the L1 game-creation timestamp, the next attempt time, the most recent
wait duration, and the attempt counter. ProgressState gains a
#[serde(default)] background_retries field so existing progress.json
files (containing only last_contiguous) parse unchanged.

MonitorState carries an in-memory VecDeque populated from the persisted
list at startup, and save_progress now serialises both fields. No
scheduling logic uses the queue yet.
When a game's primary retry budget is exhausted maybe_requeue now
inserts a BackgroundRetry into the queue (with first wait of
initial_game_delay * 2 * max_retries * 4) and persists, instead of
silently giving up. mark_game_completed is still called so
last_contiguous keeps advancing past flaky games.

RunningEstimator carries game_created_at so the timestamp is available
at requeue time without re-fetching from L1. The background entries
are not yet scheduled for execution; that follows in the next commit.
Add the AttemptKind::Background variant and a second-phase spawn loop
that drains background_retries only when no primary entry is currently
executable, so primary scheduling always wins on contention.

Failed background attempts update the existing entry in place (4x
backoff, attempts++) without re-marking the game completed in
SequenceTracker. Successful attempts remove the entry without pushing
to completion_history (delayed retries would skew anomaly medians).

Background entries remain in background_retries while a process is
running, with the spawn loop skipping any game already in
running_processes; the in-place update on failure relies on this.

Log files for background attempts use the new bg-retry<n> infix:
cost-estimator-<idx>-<addr>-bg-retry<n>.log. extract_game_index keeps
working because it parses up to the first dash after the index.
New --background-retry-max-age-secs CLI flag (default 302400 / 3.5 days)
caps how long a game stays on the background-retry queue. The age is
measured against the L1 game creation timestamp, so calendar-time
limits hold across long process downtimes.

evict_aged_background_retries runs at startup and at the top of every
loop iteration. It skips entries whose process is currently running
to avoid invalidating an in-flight maybe_requeue update.
Extend the per-iteration status line with the size of the background
queue and the age of its oldest entry, so the backlog is visible
alongside running and pending counts in pod logs.
Lead primary-spawn info logs with the game index (the primary key in
pending_games/running_processes/background_retries) rather than the
address; address still appears as a secondary field. Simplify the
periodic status line by dropping the oldest-age figure and renaming
Background to 'Background pending'. Several rustfmt-style line
collapses to match the project convention.
Extract the common fetch + log + spawn + insert flow into a new
MonitorState::spawn_game method that takes a game index, an
AttemptKind, and a SpawnContext bundling per-spawn config. Queue-
specific cleanup on both WrongGameType and successful spawn is
dispatched on the kind, so each outer loop body shrinks to candidate
selection plus a single spawn_game call.

The primary spawn log line and the background spawn log line are
unified into a single 'Starting cost estimator [<kind>]' format with
a kind_descr helper distinguishing primary, primary retry N, and
background attempt N.

spawn_cost_estimator now takes &Path instead of &PathBuf so the
context can hold a borrowed Path slice. No behavioural change.
Generates custom background retry data for manual seeding of background
retries.
Previously add(index) inserted into the pending HashSet unconditionally,
so indices <= end were stashed forever (the while-loop only scans
upwards from end + 1). Early-return in that case so pending never grows
without bound.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6844555153

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

max_history_length,
max_retries,
history_file,
sequence_tracker: SequenceTracker::new(next_game_index),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Start contiguous tracking before first pending index

sequence_tracker is initialized with next_game_index, which marks that index as already contiguous before it is actually processed. If --start-index is used (or any resume path where that index is still pending), later successful games can advance last_contiguous past an unprocessed first index, and a restart (next_game_index = last_contiguous + 1) will skip it permanently. This corrupts persisted progress and can lose required game executions.

Useful? React with 👍 / 👎.

Comment thread scripts/utils/bin/game_monitor.rs
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fc6cf8a64

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1361 to +1364
let primary_has_ready =
state.pending_games.iter().any(|p| p.executable_at <= Instant::now());
if !primary_has_ready {
while state.can_spawn_new(args.max_concurrent) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Re-check primary readiness before each background spawn

The primary_has_ready gate is computed once before entering the background loop, so a primary game that becomes ready while spawn_game(...).await is running will not preempt subsequent background spawns in that same iteration. In practice, if a primary is due shortly and there are many ready background retries, this loop can fill all slots with background work and delay the newly-ready primary until a long-running background process exits, which contradicts the intended “primary always wins” scheduling guarantee.

Useful? React with 👍 / 👎.

@seolaoh
Copy link
Copy Markdown
Collaborator

seolaoh commented May 15, 2026

This PR supersedes #127, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants