Succinct game monitor#141
Conversation
Previously the cost estimator log would be moved to a ...-failure.log or ...success.log after printting the log when cleaning up the process, this meant that the logged log no longer existed.
The previous hardcoded value of 3 was causing the monitor to get overloaded, so made it configurable and set the default to 1.
Make the cost-estimator run in log only mode This removes the need for the cost estimator to run in a working cargo git project. Remove un-needed rust and cargo workspace Now that the cost estimator is run with the --log-only flag there is no need for rust and the workspace to exist in the built docker image.
PendingGame now stores an absolute executable_at instant rather than a discovered_at + global delay. New discoveries and retries both push to the front; the inner loop scans for the first entry whose executable_at has passed and runs that one. Newer games are more relevant than older ones, so push_front for discoveries gives them priority while older entries stay in place to be picked up afterwards. Retries previously ran immediately, which meant a transient infrastructure issue (RPC blip, finalized-block lag, etc) that caused the original failure would cause the retry to hit the same problem and fail again. Retries now back off linearly (initial_game_delay * 2 * retries), giving the infrastructure time to recover before re-running.
Switch to write-to-temp + rename so a crash mid-write can no longer leave a torn progress file. The blob will grow as more state is added to ProgressState in subsequent commits, making this more important.
Read game_info.timestamp returned by gameAtIndex and convert from Unix
seconds into a SystemTime. Wired into the existing 'Game {} covers ...'
info log so the field is used in this commit; subsequent commits use it
as the anchor for background-retry age eviction.
Replace the bare retries: u32 field with an AttemptKind enum so the upcoming background-retry variant can carry its own counter. The enum currently has only the Primary variant, so behaviour is unchanged. LogFile::new takes AttemptKind directly and pattern-matches to choose between the bare and -retry<n> filename forms.
Introduce a persistent BackgroundRetry record carrying the game index, the L1 game-creation timestamp, the next attempt time, the most recent wait duration, and the attempt counter. ProgressState gains a #[serde(default)] background_retries field so existing progress.json files (containing only last_contiguous) parse unchanged. MonitorState carries an in-memory VecDeque populated from the persisted list at startup, and save_progress now serialises both fields. No scheduling logic uses the queue yet.
When a game's primary retry budget is exhausted maybe_requeue now inserts a BackgroundRetry into the queue (with first wait of initial_game_delay * 2 * max_retries * 4) and persists, instead of silently giving up. mark_game_completed is still called so last_contiguous keeps advancing past flaky games. RunningEstimator carries game_created_at so the timestamp is available at requeue time without re-fetching from L1. The background entries are not yet scheduled for execution; that follows in the next commit.
Add the AttemptKind::Background variant and a second-phase spawn loop that drains background_retries only when no primary entry is currently executable, so primary scheduling always wins on contention. Failed background attempts update the existing entry in place (4x backoff, attempts++) without re-marking the game completed in SequenceTracker. Successful attempts remove the entry without pushing to completion_history (delayed retries would skew anomaly medians). Background entries remain in background_retries while a process is running, with the spawn loop skipping any game already in running_processes; the in-place update on failure relies on this. Log files for background attempts use the new bg-retry<n> infix: cost-estimator-<idx>-<addr>-bg-retry<n>.log. extract_game_index keeps working because it parses up to the first dash after the index.
New --background-retry-max-age-secs CLI flag (default 302400 / 3.5 days) caps how long a game stays on the background-retry queue. The age is measured against the L1 game creation timestamp, so calendar-time limits hold across long process downtimes. evict_aged_background_retries runs at startup and at the top of every loop iteration. It skips entries whose process is currently running to avoid invalidating an in-flight maybe_requeue update.
Extend the per-iteration status line with the size of the background queue and the age of its oldest entry, so the backlog is visible alongside running and pending counts in pod logs.
Lead primary-spawn info logs with the game index (the primary key in pending_games/running_processes/background_retries) rather than the address; address still appears as a secondary field. Simplify the periodic status line by dropping the oldest-age figure and renaming Background to 'Background pending'. Several rustfmt-style line collapses to match the project convention.
Extract the common fetch + log + spawn + insert flow into a new MonitorState::spawn_game method that takes a game index, an AttemptKind, and a SpawnContext bundling per-spawn config. Queue- specific cleanup on both WrongGameType and successful spawn is dispatched on the kind, so each outer loop body shrinks to candidate selection plus a single spawn_game call. The primary spawn log line and the background spawn log line are unified into a single 'Starting cost estimator [<kind>]' format with a kind_descr helper distinguishing primary, primary retry N, and background attempt N. spawn_cost_estimator now takes &Path instead of &PathBuf so the context can hold a borrowed Path slice. No behavioural change.
Generates custom background retry data for manual seeding of background retries.
Previously add(index) inserted into the pending HashSet unconditionally, so indices <= end were stashed forever (the while-loop only scans upwards from end + 1). Early-return in that case so pending never grows without bound.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6844555153
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| max_history_length, | ||
| max_retries, | ||
| history_file, | ||
| sequence_tracker: SequenceTracker::new(next_game_index), |
There was a problem hiding this comment.
Start contiguous tracking before first pending index
sequence_tracker is initialized with next_game_index, which marks that index as already contiguous before it is actually processed. If --start-index is used (or any resume path where that index is still pending), later successful games can advance last_contiguous past an unprocessed first index, and a restart (next_game_index = last_contiguous + 1) will skip it permanently. This corrupts persisted progress and can lose required game executions.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fc6cf8a64
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| let primary_has_ready = | ||
| state.pending_games.iter().any(|p| p.executable_at <= Instant::now()); | ||
| if !primary_has_ready { | ||
| while state.can_spawn_new(args.max_concurrent) { |
There was a problem hiding this comment.
Re-check primary readiness before each background spawn
The primary_has_ready gate is computed once before entering the background loop, so a primary game that becomes ready while spawn_game(...).await is running will not preempt subsequent background spawns in that same iteration. In practice, if a primary is due shortly and there are many ready background retries, this loop can fill all slots with background work and delay the newly-ready primary until a long-running background process exits, which contradicts the intended “primary always wins” scheduling guarantee.
Useful? React with 👍 / 👎.
|
This PR supersedes #127, right? |
Summary
Overhaul the game monitor with a two-tier retry system, progress persistence, operational tooling, and a slimmed-down Docker image.
Retry system
--cost-estimator-retries, default 1) and linear backoff (delay * 2 * retry_number). Previously hardcoded to 3 immediate retries which overloaded the monitor.--background-retry-max-age-secs(default 3.5 days). Games are marked "completed" when entering the background queue solast_contiguouscan advance past them.AttemptKindenum (Primary/Background) replaces bareretries: u32, enabling distinct scheduling and log-file naming (-bg-retry<n>).Progress persistence
ProgressState(serialised toprogress.json) storeslast_contiguousand thebackground_retriesqueue. The daemon resumes from persisted progress on restart.SequenceTracker(newutils/commoncrate) tracks out-of-order game completions to computelast_contiguouscorrectly.progress.jsonandcompletion_history.jsonare now written atomically (write-to-tmp + rename).Operational tooling
game-monitor run/game-monitor gen-background-retrysubcommand split via clap. The daemon moves underrun;gen-background-retryreads an existingprogress.json, appends new background entries for given game indexes (fetchinggame_created_atfrom L1), and prints the complete state to stdout. Existing k8s manifests needcommand: [game-monitor, run].rerun-cost-estimator.sh— replays a failed game from the log-file header (command + env).fetch-game-logs.sh— tars all log files for given game indexes to stdout for local extraction.Docker image
cargo_metadata::MetadataCommand; now unnecessary since cost-estimator runs with--log-only.Other improvements
gameCount()calls now useBlockId::finalizedto avoid node-desync issues.0.0 MBfor both current and median (dividing bytes-per-block by 1024^2); now shows actual total MB and bytes/block.--batch-sizeflag caps per-execution chunk size (default 200) instead of using the full block range.executable_atwith retry backoff, instead of immediate retry.game_monitor.rs.