fix: bound auto-update network load so a bad release can't spam GitHub/gateways#4593
Conversation
…b/gateways Auto-update had an unbounded failure loop: a release whose install always fails (bad manifest/checksum/signature) makes a node re-detect the newer version, exit 42, run `freenet update`, fail the fail-closed install gate (no install, classified NoChange so no lockout), restart, and re-detect — forever. Result: sustained GitHub REST polling and a gateway reconnect per restart, fleet-wide. The on-crash `freenet update` is a fresh process with no persistent rate limit. Three layered bounds: 1. Global persistent GitHub-poll token bucket (state_dir), checked at the single choke point of each GitHub fetch (get_latest_release + get_latest_version), so every path (startup, re-poll, peer-signal, and the fresh on-crash `freenet update`) shares one limit across restarts. Capacity 8, refill 1/10min => ~6 REST/hr/node max. Missing bucket = full (first boot works); corrupt = deny (fail closed). Time-injectable. 2. Per-target-version install-failure gate (mirrors the rollback known-bad pin): after 3 failed installs of the SAME version, the node stops emitting exit 42 for it and the installer refuses it, until a strictly-newer release appears. This is what actually stops the restart loop. Atomic writes, degrade-safe reads. 3. Parity: macOS wrapper gains a consecutive-failure cap (exits the loop, mirroring the in-process wrapper's 50); systemd units gain RestartSteps + RestartMaxDelaySec restart backoff (>=254, ignored gracefully on older). Refs #4073. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…respawns non-zero) Codex review: the macOS plist sets KeepAlive.SuccessfulExit=false, so launchd RESPAWNS the job on a non-zero exit and only treats a clean exit 0 as an intentional terminal stop (same reasoning as the existing exit-0/exit-43 paths). The consecutive-failure cap therefore must exit 0, not 1 — a non-zero exit would relaunch a fresh wrapper with the counter reset and defeat the cap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…tics) - Rate-limited GitHub polls now return a distinct UpdateCheckResult::RateLimited instead of Skipped, so they no longer feed the legacy "max-backoff + 0 connections -> exit 42" gateway-trust fallback (the supervisor's freenet update shares the same empty bucket and would no-op, making local rate-limiting itself a restart loop). Handled in all three update-trigger arms; pinned by a source-scrape test that no RateLimited arm sends update_tx. - macOS wrapper failure cap now targets a crash LOOP, not occasional crashes: a child that ran healthily for >= MIN_HEALTHY_RUNTIME (300s) resets the consecutive-failure streak, so a node that crashes once every few days never accumulates to the cap and gets permanently stopped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Codex review: the backwards-clock guard prevented negative refill THIS step but still wrote the earlier timestamp back, so a clock that dipped and recovered later measured elapsed from the rewound timestamp and granted undeserved refill credit. Anchor the stored timestamp on max(now, prev) so elapsed is only ever measured from the highest timestamp seen; a dip-and-recover yields zero net credit. Added a dip-then-recover regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…ation) Codex P1: with a single shared token bucket, the node always runs before the supervisor's freenet update in a restart cycle (detect -> exit 42 -> installer), so the node wins every token race and can starve the installer — a low/refilling shared bucket could leave a legitimate update unable to fetch+install while the node keeps spending the lone refill token re-confirming it. Split into two independent on-disk buckets (NODE for get_latest_version, INSTALL for get_latest_release), each bounded the same way (capacity 8, refill 1/10min). Each path is rate-limited independently; the node can no longer drain the installer's budget. Added an independence test and an install-path source pin. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…nign no-op Codex P2 x2: - The token bucket now FAILS CLOSED: a consume is only allowed if the post-consume state was actually persisted. Previously a read-only/full state dir made every restart see a missing bucket (=full) and grant another burst, failing open. write now returns io::Result and the consume returns `allowed && persisted`. Added a unix read-only-dir regression test. - The macOS wrapper's exit-42 path now treats the updater's EXIT_CODE_ALREADY_UP_TO_DATE (2) — already current / rate-limited / pinned / install-gated — as a benign no-op that restarts WITHOUT counting toward the consecutive-failure cap, so repeated rate-limited no-ops can't stop a healthy node. The crash path still counts (a real crash is a failure). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…re gate Codex P2: on macOS, download_and_install returns Ok(()) when a DMG-swap fails (deliberately, to avoid corrupting the signed bundle). My gate logic read that as a successful install and CLEARED the gate, so repeated DMG checksum/sig/ download/staging failures would retry the same version forever. Distinguish the two Ok cases with an InstallOutcome enum: Installed clears the gate; BundleSkipped (the swallowed bundle failure) records an install failure so the version is gated after the threshold even though the process exits 0. Added a source-scrape pin (macOS path isn't exercised on Linux CI). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…ited result) Codex P2: record_check_time() ran before the rate-limit denial, so the backoff gate would short-circuit the next tick to a plain Skipped before reaching the RateLimited arm — and a Skipped at max backoff + 0 connections can still trigger the gateway-trust exit 42 (the loop the limiter prevents), and the stagger path could wrongly enter its 24h cooldown. Move record_check_time() out of the unconditional pre-poll position into the arms: record it only when a GitHub poll was actually attempted (Ok / non-rate- limit Err). A token-denied poll records nothing and grows no backoff, so every tick returns RateLimited until the bucket refills. Added a source-scrape pin. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…ones Codex P2: recording the install-failure gate for every download_and_install Err(_) would permanently gate a perfectly good release after 3 TRANSIENT failures (GitHub/download outage, still-propagating manifest, extraction/disk error) — auto-update could stay stuck until a newer release or manual --force. Introduce a typed ReleaseVerificationError for the DETERMINISTIC "bad release" signals (checksum mismatch, invalid/wrong-length signature) and gate ONLY on those; transient errors leave the gate unchanged (neither record nor clear). The macOS bundle-skip outcome carries the same deterministic-vs-transient bit. Added a direct classification unit test plus a run_async match source-scrape pin. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Codex flagged that missing SHA256SUMS.txt / missing entry / missing .sig fall through to the non-gated transient arm. This is intentional and the opposite of gating them is the safer choice: a MISSING manifest/entry is the canonical release-propagation race (the tag goes live before assets finish uploading), so it self-resolves; gating it risks permanently suppressing a good release a node polled seconds early — a worse regression than the already rate-limit-bounded retry cadence. Only a PRESENT-but-WRONG artifact (checksum mismatch / invalid signature) is the deterministic bad-release signal that gates. Made the distinction explicit at both the run_async match and required_checksum. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
|
I have all the information needed to complete the review. Let me write up the findings. Rule Review: No blocking issues — thorough change with excellent test coverageRules checked: WarningsNone. Key findings on each checked category: fix: regression tests (testing.md): All three loop-bounding mechanisms have direct regression tests:
Edge-case coverage (testing.md): Excellent — boundary conditions (zero/capacity), error paths (corrupt, unpersistable), clock anomalies (backwards, dip-then-recover), and first-boot edge case are all covered explicitly. Unwrap in production code (code-style.md): None found. All new production code uses Cleanup exemptions without TTL (AGENTS.md): The install-failure gate is not a GC exemption — it clears on success or version change. Does not fall under the AGENTS.md "Cleanup exemptions MUST be time-bounded" rule. Retry loops / jitter (code-style.md): No new Rust retry loops were added. The token bucket is a gate, not a loop. Fire-and-forget spawns (code-style.md): No new async spawns. Commit messages (git-workflow.md): All commits ( Info
Rule review against |
…units Full-review Should: the new RestartSteps/RestartMaxDelaySec backoff can space medium-runtime crashes past the StartLimitBurst=5/120s window, softening #4551's terminal "failed-state stop" into a slow-flap (<=300s) for those — while a true sub-second fast crash still clusters its first ~5 restarts inside 120s and still terminal-stops. Document this deliberate tradeoff honestly in both unit templates (comment-only; directives unchanged). Behavior is acceptable: fast crashes still terminal-stop, medium crashes slow-flap gently and keep trying for a fix; a bad update is still disarmed by #4073/#4591 rollback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
|
Reviewed and approving after a full review (this bounds the auto-update network-load failure mode an aggregate-load audit surfaced — a bad-manifest/sig release could otherwise loop forever, ~100-170 GitHub req/hr/node + a gateway reconnect/min fleet-wide, made infinite by #4586's correct fail-closed change). Review: external Codex (8 iterative rounds, final pass clean) + 3 adversarial lenses — rate-limit bypass/amplification, install-gate false-positives, and end-to-end loop-termination + cross-platform parity + no-regression. No Must-fix. Key things verified:
307 tests pass (incl. token-bucket edge cases, gate classification, end-to-end loop-bound, source-scrape pins). CI green on 3d770e3. Merging. [AI-assisted - Claude] |
Problem
An aggregate-load audit of the auto-updater found an unbounded failure loop. When a release's install can never succeed (for example a checksum/signature that does not verify, so the fail-closed gate added in #4586 refuses every attempt), a node can loop indefinitely:
freenet update, which fails the verification gate, so nothing is installed,Each iteration costs GitHub REST polls and a gateway reconnect, and the on-crash
freenet updateis a fresh process with no persistent rate limit, so the loop is unbounded fleet-wide. The bad-binary crash loop is already handled by #4591 (auto-rollback); this is the distinct install-failure loop, which #4586's fail-closed gate converted from a self-terminating path into a non-terminating one.Solution
Three layered bounds (none touch the #4591 rollback state machine, #4586 checksum verify, #4587 signing, or #4588 exit-code machinery):
Persistent GitHub-poll rate limit (token buckets). Every GitHub release poll goes through an on-disk token bucket in
state_dir(), so the limit holds across the fresh one-shotfreenet updateprocess and restarts. Two independent buckets — one for the node'sget_latest_version, one for the supervisor-sideget_latest_release— because the node always polls first in a restart cycle and a shared bucket would let it starve the installer. Capacity 8, refill 1 token / 10 min ⇒ ≤ ~6 polls/hour/node per path. Missing bucket = full (first boot works); corrupt or unpersistable = deny (fail closed); time-injectable for tests.Per-target-version install-failure gate (mirrors the feat: crash-loop auto-rollback for the auto-updater (#4073) #4591 known-bad pin, but for install failures). After 3 deterministic install failures of the same version (checksum mismatch or invalid signature), the node's update detection and the installer stop acting on that version until a strictly-newer release appears. This is what actually stops the restart loop. Transient failures (download/network outage, a still-propagating manifest, extraction errors) are deliberately not counted, so a brief outage can't permanently suppress a good release. Atomic writes, degrade-safe reads.
Parity gaps. The macOS launchd wrapper gains a consecutive-failure cap (exits the loop; targets a tight crash loop via a healthy-runtime reset; treats a benign "nothing to do" updater exit as not-a-failure; exits 0 so launchd does not respawn). The systemd units gain
RestartSteps+RestartMaxDelaySecrestart backoff (systemd ≥ 254; ignored gracefully on older), reducing reconnect churn during any residual loop.Testing
New unit/regression tests cover: the token bucket (capacity, refill, refill cap, deny-on-corrupt, fail-closed-on-unpersistable, missing-is-full, backwards-clock and dip-then-recover give no credit, the two buckets are independent); the install-failure gate (engages after N, allows a strictly-newer version, resets on target change, clears on success, degrade-safe, atomic, and an end-to-end "loop is bounded by the gate" test); the verification-vs-transient classification; and source-scrape pins that
get_latest_version/get_latest_releaseconsult their buckets, thatRateLimitedandPinnedKnownBadarms never exit 42, that a rate-limited poll records no check-time/backoff, that the install gate only counts verification failures, and the macOS wrapper cap/healthy-reset/exit-2 behavior.cargo fmt,cargo clippy --locked -- -D warnings, andcargo test -p freenet --bin freenet(307 tests) are green.External review:
codex review --base origin/mainwas run iteratively; all findings were addressed (separate buckets, fail-closed persistence, backwards-clock timestamp, macOS exit-0/healthy-reset/exit-2, rate-limited result not feeding the exit-42 fallback, check-time not recorded for denied polls, transient-vs-verification gating) and the final pass reported no actionable regressions.Scope note
This per-node load-bounding targets the per-node auto-update crash-loop path (#4073). It intentionally does not cover the gateway-only release-agent's own GitHub fetch (
crates/release-agent/src/github.rs,GET /repos/{repo}/releases/latest): that is the release-workflow gateway-push path, invoked once per release, not the per-node restart loop, so it is not an aggregate-load amplification vector and is out of scope here.Refs #4073.
[AI-assisted - Claude]