Skip to content

fix: bound auto-update network load so a bad release can't spam GitHub/gateways#4593

Merged
sanity merged 11 commits into
mainfrom
fix/auto-update-load-bounding
Jun 27, 2026
Merged

fix: bound auto-update network load so a bad release can't spam GitHub/gateways#4593
sanity merged 11 commits into
mainfrom
fix/auto-update-load-bounding

Conversation

@sanity

@sanity sanity commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator

Problem

An aggregate-load audit of the auto-updater found an unbounded failure loop. When a release's install can never succeed (for example a checksum/signature that does not verify, so the fail-closed gate added in #4586 refuses every attempt), a node can loop indefinitely:

  1. the node detects the newer version and exits 42 (update needed),
  2. the supervisor runs freenet update, which fails the verification gate, so nothing is installed,
  3. that failure is classified as "no change" (not a crash, not a lockout), so nothing counts it,
  4. the supervisor restarts the node, which re-detects the same version — back to step 1.

Each iteration costs GitHub REST polls and a gateway reconnect, and the on-crash freenet update is a fresh process with no persistent rate limit, so the loop is unbounded fleet-wide. The bad-binary crash loop is already handled by #4591 (auto-rollback); this is the distinct install-failure loop, which #4586's fail-closed gate converted from a self-terminating path into a non-terminating one.

Solution

Three layered bounds (none touch the #4591 rollback state machine, #4586 checksum verify, #4587 signing, or #4588 exit-code machinery):

  1. Persistent GitHub-poll rate limit (token buckets). Every GitHub release poll goes through an on-disk token bucket in state_dir(), so the limit holds across the fresh one-shot freenet update process and restarts. Two independent buckets — one for the node's get_latest_version, one for the supervisor-side get_latest_release — because the node always polls first in a restart cycle and a shared bucket would let it starve the installer. Capacity 8, refill 1 token / 10 min ⇒ ≤ ~6 polls/hour/node per path. Missing bucket = full (first boot works); corrupt or unpersistable = deny (fail closed); time-injectable for tests.

  2. Per-target-version install-failure gate (mirrors the feat: crash-loop auto-rollback for the auto-updater (#4073) #4591 known-bad pin, but for install failures). After 3 deterministic install failures of the same version (checksum mismatch or invalid signature), the node's update detection and the installer stop acting on that version until a strictly-newer release appears. This is what actually stops the restart loop. Transient failures (download/network outage, a still-propagating manifest, extraction errors) are deliberately not counted, so a brief outage can't permanently suppress a good release. Atomic writes, degrade-safe reads.

  3. Parity gaps. The macOS launchd wrapper gains a consecutive-failure cap (exits the loop; targets a tight crash loop via a healthy-runtime reset; treats a benign "nothing to do" updater exit as not-a-failure; exits 0 so launchd does not respawn). The systemd units gain RestartSteps + RestartMaxDelaySec restart backoff (systemd ≥ 254; ignored gracefully on older), reducing reconnect churn during any residual loop.

Testing

New unit/regression tests cover: the token bucket (capacity, refill, refill cap, deny-on-corrupt, fail-closed-on-unpersistable, missing-is-full, backwards-clock and dip-then-recover give no credit, the two buckets are independent); the install-failure gate (engages after N, allows a strictly-newer version, resets on target change, clears on success, degrade-safe, atomic, and an end-to-end "loop is bounded by the gate" test); the verification-vs-transient classification; and source-scrape pins that get_latest_version/get_latest_release consult their buckets, that RateLimited and PinnedKnownBad arms never exit 42, that a rate-limited poll records no check-time/backoff, that the install gate only counts verification failures, and the macOS wrapper cap/healthy-reset/exit-2 behavior. cargo fmt, cargo clippy --locked -- -D warnings, and cargo test -p freenet --bin freenet (307 tests) are green.

External review: codex review --base origin/main was run iteratively; all findings were addressed (separate buckets, fail-closed persistence, backwards-clock timestamp, macOS exit-0/healthy-reset/exit-2, rate-limited result not feeding the exit-42 fallback, check-time not recorded for denied polls, transient-vs-verification gating) and the final pass reported no actionable regressions.

Scope note

This per-node load-bounding targets the per-node auto-update crash-loop path (#4073). It intentionally does not cover the gateway-only release-agent's own GitHub fetch (crates/release-agent/src/github.rs, GET /repos/{repo}/releases/latest): that is the release-workflow gateway-push path, invoked once per release, not the per-node restart loop, so it is not an aggregate-load amplification vector and is out of scope here.

Refs #4073.

[AI-assisted - Claude]

sanity and others added 10 commits June 26, 2026 19:38
…b/gateways

Auto-update had an unbounded failure loop: a release whose install always
fails (bad manifest/checksum/signature) makes a node re-detect the newer
version, exit 42, run `freenet update`, fail the fail-closed install gate
(no install, classified NoChange so no lockout), restart, and re-detect —
forever. Result: sustained GitHub REST polling and a gateway reconnect per
restart, fleet-wide. The on-crash `freenet update` is a fresh process with
no persistent rate limit.

Three layered bounds:

1. Global persistent GitHub-poll token bucket (state_dir), checked at the
   single choke point of each GitHub fetch (get_latest_release +
   get_latest_version), so every path (startup, re-poll, peer-signal, and the
   fresh on-crash `freenet update`) shares one limit across restarts.
   Capacity 8, refill 1/10min => ~6 REST/hr/node max. Missing bucket = full
   (first boot works); corrupt = deny (fail closed). Time-injectable.

2. Per-target-version install-failure gate (mirrors the rollback known-bad
   pin): after 3 failed installs of the SAME version, the node stops emitting
   exit 42 for it and the installer refuses it, until a strictly-newer release
   appears. This is what actually stops the restart loop. Atomic writes,
   degrade-safe reads.

3. Parity: macOS wrapper gains a consecutive-failure cap (exits the loop,
   mirroring the in-process wrapper's 50); systemd units gain RestartSteps +
   RestartMaxDelaySec restart backoff (>=254, ignored gracefully on older).

Refs #4073.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…respawns non-zero)

Codex review: the macOS plist sets KeepAlive.SuccessfulExit=false, so launchd
RESPAWNS the job on a non-zero exit and only treats a clean exit 0 as an
intentional terminal stop (same reasoning as the existing exit-0/exit-43 paths).
The consecutive-failure cap therefore must exit 0, not 1 — a non-zero exit would
relaunch a fresh wrapper with the counter reset and defeat the cap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…tics)

- Rate-limited GitHub polls now return a distinct UpdateCheckResult::RateLimited
  instead of Skipped, so they no longer feed the legacy "max-backoff + 0
  connections -> exit 42" gateway-trust fallback (the supervisor's freenet update
  shares the same empty bucket and would no-op, making local rate-limiting itself
  a restart loop). Handled in all three update-trigger arms; pinned by a
  source-scrape test that no RateLimited arm sends update_tx.

- macOS wrapper failure cap now targets a crash LOOP, not occasional crashes: a
  child that ran healthily for >= MIN_HEALTHY_RUNTIME (300s) resets the
  consecutive-failure streak, so a node that crashes once every few days never
  accumulates to the cap and gets permanently stopped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Codex review: the backwards-clock guard prevented negative refill THIS step but
still wrote the earlier timestamp back, so a clock that dipped and recovered
later measured elapsed from the rewound timestamp and granted undeserved refill
credit. Anchor the stored timestamp on max(now, prev) so elapsed is only ever
measured from the highest timestamp seen; a dip-and-recover yields zero net
credit. Added a dip-then-recover regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…ation)

Codex P1: with a single shared token bucket, the node always runs before the
supervisor's freenet update in a restart cycle (detect -> exit 42 -> installer),
so the node wins every token race and can starve the installer — a low/refilling
shared bucket could leave a legitimate update unable to fetch+install while the
node keeps spending the lone refill token re-confirming it.

Split into two independent on-disk buckets (NODE for get_latest_version, INSTALL
for get_latest_release), each bounded the same way (capacity 8, refill 1/10min).
Each path is rate-limited independently; the node can no longer drain the
installer's budget. Added an independence test and an install-path source pin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…nign no-op

Codex P2 x2:
- The token bucket now FAILS CLOSED: a consume is only allowed if the
  post-consume state was actually persisted. Previously a read-only/full state
  dir made every restart see a missing bucket (=full) and grant another burst,
  failing open. write now returns io::Result and the consume returns
  `allowed && persisted`. Added a unix read-only-dir regression test.
- The macOS wrapper's exit-42 path now treats the updater's
  EXIT_CODE_ALREADY_UP_TO_DATE (2) — already current / rate-limited / pinned /
  install-gated — as a benign no-op that restarts WITHOUT counting toward the
  consecutive-failure cap, so repeated rate-limited no-ops can't stop a healthy
  node. The crash path still counts (a real crash is a failure).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…re gate

Codex P2: on macOS, download_and_install returns Ok(()) when a DMG-swap fails
(deliberately, to avoid corrupting the signed bundle). My gate logic read that
as a successful install and CLEARED the gate, so repeated DMG checksum/sig/
download/staging failures would retry the same version forever.

Distinguish the two Ok cases with an InstallOutcome enum: Installed clears the
gate; BundleSkipped (the swallowed bundle failure) records an install failure so
the version is gated after the threshold even though the process exits 0. Added
a source-scrape pin (macOS path isn't exercised on Linux CI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…ited result)

Codex P2: record_check_time() ran before the rate-limit denial, so the backoff
gate would short-circuit the next tick to a plain Skipped before reaching the
RateLimited arm — and a Skipped at max backoff + 0 connections can still trigger
the gateway-trust exit 42 (the loop the limiter prevents), and the stagger path
could wrongly enter its 24h cooldown.

Move record_check_time() out of the unconditional pre-poll position into the
arms: record it only when a GitHub poll was actually attempted (Ok / non-rate-
limit Err). A token-denied poll records nothing and grows no backoff, so every
tick returns RateLimited until the bucket refills. Added a source-scrape pin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…ones

Codex P2: recording the install-failure gate for every download_and_install
Err(_) would permanently gate a perfectly good release after 3 TRANSIENT
failures (GitHub/download outage, still-propagating manifest, extraction/disk
error) — auto-update could stay stuck until a newer release or manual --force.

Introduce a typed ReleaseVerificationError for the DETERMINISTIC "bad release"
signals (checksum mismatch, invalid/wrong-length signature) and gate ONLY on
those; transient errors leave the gate unchanged (neither record nor clear). The
macOS bundle-skip outcome carries the same deterministic-vs-transient bit. Added
a direct classification unit test plus a run_async match source-scrape pin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Codex flagged that missing SHA256SUMS.txt / missing entry / missing .sig fall
through to the non-gated transient arm. This is intentional and the opposite of
gating them is the safer choice: a MISSING manifest/entry is the canonical
release-propagation race (the tag goes live before assets finish uploading), so
it self-resolves; gating it risks permanently suppressing a good release a node
polled seconds early — a worse regression than the already rate-limit-bounded
retry cadence. Only a PRESENT-but-WRONG artifact (checksum mismatch / invalid
signature) is the deterministic bad-release signal that gates. Made the
distinction explicit at both the run_async match and required_checksum.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
@github-actions

github-actions Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

I have all the information needed to complete the review. Let me write up the findings.


Rule Review: No blocking issues — thorough change with excellent test coverage

Rules checked: git-workflow.md, code-style.md, testing.md
Files reviewed: 7 (auto_update.rs, rollback.rs, service.rs, service/linux.rs, service/macos.rs, update.rs, freenet.rs)

Warnings

None.


Key findings on each checked category:

fix: regression tests (testing.md): All three loop-bounding mechanisms have direct regression tests:

  • Token bucket: 9 tests covering burst, refill, cap-at-capacity, corrupt/NaN/negative files, backwards clock, clock dip, first-boot (missing = full), and fail-closed on unpersistable write.
  • Install-failure gate: 7 tests covering threshold engagement, version isolation, version-change reset, success clear, degrade-safe on corrupt/missing, atomic-write invariant, and end-to-end loop bounding.
  • RateLimited result routing: rate_limited_never_triggers_exit_42 source-scrapes all three match sites to assert no update_tx.send in any RateLimited arm.

Edge-case coverage (testing.md): Excellent — boundary conditions (zero/capacity), error paths (corrupt, unpersistable), clock anomalies (backwards, dip-then-recover), and first-boot edge case are all covered explicitly.

Unwrap in production code (code-style.md): None found. All new production code uses unwrap_or(0), is_ok(), is_err(), and if let Err patterns.

Cleanup exemptions without TTL (AGENTS.md): The install-failure gate is not a GC exemption — it clears on success or version change. Does not fall under the AGENTS.md "Cleanup exemptions MUST be time-bounded" rule.

Retry loops / jitter (code-style.md): No new Rust retry loops were added. The token bucket is a gate, not a loop.

Fire-and-forget spawns (code-style.md): No new async spawns.

Commit messages (git-workflow.md): All commits (fix: and docs:) follow conventional commits and are appropriately scoped. Each fix: is a focused, single-concern change.

Info

  • crates/core/src/bin/commands/update.rs (install outcome match) — The Err(_) => {} arm is a necessary catch-all (anyhow errors can't be exhaustively matched) and is clearly documented with a multi-line comment explaining the transient-vs-deterministic distinction. Not a style error, just flagging the pattern.

Rule review against .claude/rules/. WARNING findings block merge.

…units

Full-review Should: the new RestartSteps/RestartMaxDelaySec backoff can space
medium-runtime crashes past the StartLimitBurst=5/120s window, softening #4551's
terminal "failed-state stop" into a slow-flap (<=300s) for those — while a true
sub-second fast crash still clusters its first ~5 restarts inside 120s and still
terminal-stops. Document this deliberate tradeoff honestly in both unit
templates (comment-only; directives unchanged). Behavior is acceptable: fast
crashes still terminal-stop, medium crashes slow-flap gently and keep trying for
a fix; a bad update is still disarmed by #4073/#4591 rollback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
@sanity

sanity commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator Author

Reviewed and approving after a full review (this bounds the auto-update network-load failure mode an aggregate-load audit surfaced — a bad-manifest/sig release could otherwise loop forever, ~100-170 GitHub req/hr/node + a gateway reconnect/min fleet-wide, made infinite by #4586's correct fail-closed change).

Review: external Codex (8 iterative rounds, final pass clean) + 3 adversarial lenses — rate-limit bypass/amplification, install-gate false-positives, and end-to-end loop-termination + cross-platform parity + no-regression. No Must-fix. Key things verified:

307 tests pass (incl. token-bucket edge cases, gate classification, end-to-end loop-bound, source-scrape pins). CI green on 3d770e3. Merging.

[AI-assisted - Claude]

@sanity sanity added this pull request to the merge queue Jun 27, 2026
Merged via the queue into main with commit 3dc6e71 Jun 27, 2026
16 checks passed
@sanity sanity deleted the fix/auto-update-load-bounding branch June 27, 2026 02:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant