fix: bound auto-update network load so a bad release can't spam GitHub/gateways by sanity · Pull Request #4593 · freenet/freenet-core

sanity · 2026-06-27T02:00:54Z

Problem

An aggregate-load audit of the auto-updater found an unbounded failure loop. When a release's install can never succeed (for example a checksum/signature that does not verify, so the fail-closed gate added in #4586 refuses every attempt), a node can loop indefinitely:

the node detects the newer version and exits 42 (update needed),
the supervisor runs freenet update, which fails the verification gate, so nothing is installed,
that failure is classified as "no change" (not a crash, not a lockout), so nothing counts it,
the supervisor restarts the node, which re-detects the same version — back to step 1.

Each iteration costs GitHub REST polls and a gateway reconnect, and the on-crash freenet update is a fresh process with no persistent rate limit, so the loop is unbounded fleet-wide. The bad-binary crash loop is already handled by #4591 (auto-rollback); this is the distinct install-failure loop, which #4586's fail-closed gate converted from a self-terminating path into a non-terminating one.

Solution

Three layered bounds (none touch the #4591 rollback state machine, #4586 checksum verify, #4587 signing, or #4588 exit-code machinery):

Persistent GitHub-poll rate limit (token buckets). Every GitHub release poll goes through an on-disk token bucket in state_dir(), so the limit holds across the fresh one-shot freenet update process and restarts. Two independent buckets — one for the node's get_latest_version, one for the supervisor-side get_latest_release — because the node always polls first in a restart cycle and a shared bucket would let it starve the installer. Capacity 8, refill 1 token / 10 min ⇒ ≤ ~6 polls/hour/node per path. Missing bucket = full (first boot works); corrupt or unpersistable = deny (fail closed); time-injectable for tests.
Per-target-version install-failure gate (mirrors the feat: crash-loop auto-rollback for the auto-updater (#4073) #4591 known-bad pin, but for install failures). After 3 deterministic install failures of the same version (checksum mismatch or invalid signature), the node's update detection and the installer stop acting on that version until a strictly-newer release appears. This is what actually stops the restart loop. Transient failures (download/network outage, a still-propagating manifest, extraction errors) are deliberately not counted, so a brief outage can't permanently suppress a good release. Atomic writes, degrade-safe reads.
Parity gaps. The macOS launchd wrapper gains a consecutive-failure cap (exits the loop; targets a tight crash loop via a healthy-runtime reset; treats a benign "nothing to do" updater exit as not-a-failure; exits 0 so launchd does not respawn). The systemd units gain RestartSteps + RestartMaxDelaySec restart backoff (systemd ≥ 254; ignored gracefully on older), reducing reconnect churn during any residual loop.

Testing

New unit/regression tests cover: the token bucket (capacity, refill, refill cap, deny-on-corrupt, fail-closed-on-unpersistable, missing-is-full, backwards-clock and dip-then-recover give no credit, the two buckets are independent); the install-failure gate (engages after N, allows a strictly-newer version, resets on target change, clears on success, degrade-safe, atomic, and an end-to-end "loop is bounded by the gate" test); the verification-vs-transient classification; and source-scrape pins that get_latest_version/get_latest_release consult their buckets, that RateLimited and PinnedKnownBad arms never exit 42, that a rate-limited poll records no check-time/backoff, that the install gate only counts verification failures, and the macOS wrapper cap/healthy-reset/exit-2 behavior. cargo fmt, cargo clippy --locked -- -D warnings, and cargo test -p freenet --bin freenet (307 tests) are green.

External review: codex review --base origin/main was run iteratively; all findings were addressed (separate buckets, fail-closed persistence, backwards-clock timestamp, macOS exit-0/healthy-reset/exit-2, rate-limited result not feeding the exit-42 fallback, check-time not recorded for denied polls, transient-vs-verification gating) and the final pass reported no actionable regressions.

Scope note

This per-node load-bounding targets the per-node auto-update crash-loop path (#4073). It intentionally does not cover the gateway-only release-agent's own GitHub fetch (crates/release-agent/src/github.rs, GET /repos/{repo}/releases/latest): that is the release-workflow gateway-push path, invoked once per release, not the per-node restart loop, so it is not an aggregate-load amplification vector and is out of scope here.

Refs #4073.

[AI-assisted - Claude]

…b/gateways Auto-update had an unbounded failure loop: a release whose install always fails (bad manifest/checksum/signature) makes a node re-detect the newer version, exit 42, run `freenet update`, fail the fail-closed install gate (no install, classified NoChange so no lockout), restart, and re-detect — forever. Result: sustained GitHub REST polling and a gateway reconnect per restart, fleet-wide. The on-crash `freenet update` is a fresh process with no persistent rate limit. Three layered bounds: 1. Global persistent GitHub-poll token bucket (state_dir), checked at the single choke point of each GitHub fetch (get_latest_release + get_latest_version), so every path (startup, re-poll, peer-signal, and the fresh on-crash `freenet update`) shares one limit across restarts. Capacity 8, refill 1/10min => ~6 REST/hr/node max. Missing bucket = full (first boot works); corrupt = deny (fail closed). Time-injectable. 2. Per-target-version install-failure gate (mirrors the rollback known-bad pin): after 3 failed installs of the SAME version, the node stops emitting exit 42 for it and the installer refuses it, until a strictly-newer release appears. This is what actually stops the restart loop. Atomic writes, degrade-safe reads. 3. Parity: macOS wrapper gains a consecutive-failure cap (exits the loop, mirroring the in-process wrapper's 50); systemd units gain RestartSteps + RestartMaxDelaySec restart backoff (>=254, ignored gracefully on older). Refs #4073. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

…respawns non-zero) Codex review: the macOS plist sets KeepAlive.SuccessfulExit=false, so launchd RESPAWNS the job on a non-zero exit and only treats a clean exit 0 as an intentional terminal stop (same reasoning as the existing exit-0/exit-43 paths). The consecutive-failure cap therefore must exit 0, not 1 — a non-zero exit would relaunch a fresh wrapper with the counter reset and defeat the cap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

…tics) - Rate-limited GitHub polls now return a distinct UpdateCheckResult::RateLimited instead of Skipped, so they no longer feed the legacy "max-backoff + 0 connections -> exit 42" gateway-trust fallback (the supervisor's freenet update shares the same empty bucket and would no-op, making local rate-limiting itself a restart loop). Handled in all three update-trigger arms; pinned by a source-scrape test that no RateLimited arm sends update_tx. - macOS wrapper failure cap now targets a crash LOOP, not occasional crashes: a child that ran healthily for >= MIN_HEALTHY_RUNTIME (300s) resets the consecutive-failure streak, so a node that crashes once every few days never accumulates to the cap and gets permanently stopped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

Codex review: the backwards-clock guard prevented negative refill THIS step but still wrote the earlier timestamp back, so a clock that dipped and recovered later measured elapsed from the rewound timestamp and granted undeserved refill credit. Anchor the stored timestamp on max(now, prev) so elapsed is only ever measured from the highest timestamp seen; a dip-and-recover yields zero net credit. Added a dip-then-recover regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

…ation) Codex P1: with a single shared token bucket, the node always runs before the supervisor's freenet update in a restart cycle (detect -> exit 42 -> installer), so the node wins every token race and can starve the installer — a low/refilling shared bucket could leave a legitimate update unable to fetch+install while the node keeps spending the lone refill token re-confirming it. Split into two independent on-disk buckets (NODE for get_latest_version, INSTALL for get_latest_release), each bounded the same way (capacity 8, refill 1/10min). Each path is rate-limited independently; the node can no longer drain the installer's budget. Added an independence test and an install-path source pin. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

…nign no-op Codex P2 x2: - The token bucket now FAILS CLOSED: a consume is only allowed if the post-consume state was actually persisted. Previously a read-only/full state dir made every restart see a missing bucket (=full) and grant another burst, failing open. write now returns io::Result and the consume returns `allowed && persisted`. Added a unix read-only-dir regression test. - The macOS wrapper's exit-42 path now treats the updater's EXIT_CODE_ALREADY_UP_TO_DATE (2) — already current / rate-limited / pinned / install-gated — as a benign no-op that restarts WITHOUT counting toward the consecutive-failure cap, so repeated rate-limited no-ops can't stop a healthy node. The crash path still counts (a real crash is a failure). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

…re gate Codex P2: on macOS, download_and_install returns Ok(()) when a DMG-swap fails (deliberately, to avoid corrupting the signed bundle). My gate logic read that as a successful install and CLEARED the gate, so repeated DMG checksum/sig/ download/staging failures would retry the same version forever. Distinguish the two Ok cases with an InstallOutcome enum: Installed clears the gate; BundleSkipped (the swallowed bundle failure) records an install failure so the version is gated after the threshold even though the process exits 0. Added a source-scrape pin (macOS path isn't exercised on Linux CI). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

…ited result) Codex P2: record_check_time() ran before the rate-limit denial, so the backoff gate would short-circuit the next tick to a plain Skipped before reaching the RateLimited arm — and a Skipped at max backoff + 0 connections can still trigger the gateway-trust exit 42 (the loop the limiter prevents), and the stagger path could wrongly enter its 24h cooldown. Move record_check_time() out of the unconditional pre-poll position into the arms: record it only when a GitHub poll was actually attempted (Ok / non-rate- limit Err). A token-denied poll records nothing and grows no backoff, so every tick returns RateLimited until the bucket refills. Added a source-scrape pin. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

…ones Codex P2: recording the install-failure gate for every download_and_install Err(_) would permanently gate a perfectly good release after 3 TRANSIENT failures (GitHub/download outage, still-propagating manifest, extraction/disk error) — auto-update could stay stuck until a newer release or manual --force. Introduce a typed ReleaseVerificationError for the DETERMINISTIC "bad release" signals (checksum mismatch, invalid/wrong-length signature) and gate ONLY on those; transient errors leave the gate unchanged (neither record nor clear). The macOS bundle-skip outcome carries the same deterministic-vs-transient bit. Added a direct classification unit test plus a run_async match source-scrape pin. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

Codex flagged that missing SHA256SUMS.txt / missing entry / missing .sig fall through to the non-gated transient arm. This is intentional and the opposite of gating them is the safer choice: a MISSING manifest/entry is the canonical release-propagation race (the tag goes live before assets finish uploading), so it self-resolves; gating it risks permanently suppressing a good release a node polled seconds early — a worse regression than the already rate-limit-bounded retry cadence. Only a PRESENT-but-WRONG artifact (checksum mismatch / invalid signature) is the deterministic bad-release signal that gates. Made the distinction explicit at both the run_async match and required_checksum. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

github-actions · 2026-06-27T02:04:10Z

I have all the information needed to complete the review. Let me write up the findings.

Rule Review: No blocking issues — thorough change with excellent test coverage

Rules checked: git-workflow.md, code-style.md, testing.md
Files reviewed: 7 (auto_update.rs, rollback.rs, service.rs, service/linux.rs, service/macos.rs, update.rs, freenet.rs)

Warnings

None.

Key findings on each checked category:

fix: regression tests (testing.md): All three loop-bounding mechanisms have direct regression tests:

Token bucket: 9 tests covering burst, refill, cap-at-capacity, corrupt/NaN/negative files, backwards clock, clock dip, first-boot (missing = full), and fail-closed on unpersistable write.
Install-failure gate: 7 tests covering threshold engagement, version isolation, version-change reset, success clear, degrade-safe on corrupt/missing, atomic-write invariant, and end-to-end loop bounding.
RateLimited result routing: rate_limited_never_triggers_exit_42 source-scrapes all three match sites to assert no update_tx.send in any RateLimited arm.

Edge-case coverage (testing.md): Excellent — boundary conditions (zero/capacity), error paths (corrupt, unpersistable), clock anomalies (backwards, dip-then-recover), and first-boot edge case are all covered explicitly.

Unwrap in production code (code-style.md): None found. All new production code uses unwrap_or(0), is_ok(), is_err(), and if let Err patterns.

Cleanup exemptions without TTL (AGENTS.md): The install-failure gate is not a GC exemption — it clears on success or version change. Does not fall under the AGENTS.md "Cleanup exemptions MUST be time-bounded" rule.

Retry loops / jitter (code-style.md): No new Rust retry loops were added. The token bucket is a gate, not a loop.

Fire-and-forget spawns (code-style.md): No new async spawns.

Commit messages (git-workflow.md): All commits (fix: and docs:) follow conventional commits and are appropriately scoped. Each fix: is a focused, single-concern change.

Info

crates/core/src/bin/commands/update.rs (install outcome match) — The Err(_) => {} arm is a necessary catch-all (anyhow errors can't be exhaustively matched) and is clearly documented with a multi-line comment explaining the transient-vs-deterministic distinction. Not a style error, just flagging the pattern.

Rule review against .claude/rules/. WARNING findings block merge.

…units Full-review Should: the new RestartSteps/RestartMaxDelaySec backoff can space medium-runtime crashes past the StartLimitBurst=5/120s window, softening #4551's terminal "failed-state stop" into a slow-flap (<=300s) for those — while a true sub-second fast crash still clusters its first ~5 restarts inside 120s and still terminal-stops. Document this deliberate tradeoff honestly in both unit templates (comment-only; directives unchanged). Behavior is acceptable: fast crashes still terminal-stop, medium crashes slow-flap gently and keep trying for a fix; a bad update is still disarmed by #4073/#4591 rollback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

sanity · 2026-06-27T02:43:33Z

Reviewed and approving after a full review (this bounds the auto-update network-load failure mode an aggregate-load audit surfaced — a bad-manifest/sig release could otherwise loop forever, ~100-170 GitHub req/hr/node + a gateway reconnect/min fleet-wide, made infinite by #4586's correct fail-closed change).

Review: external Codex (8 iterative rounds, final pass clean) + 3 adversarial lenses — rate-limit bypass/amplification, install-gate false-positives, and end-to-end loop-termination + cross-platform parity + no-regression. No Must-fix. Key things verified:

Rate-limit: on-disk token bucket (cap 8, refill 1/10min) at both GitHub-fetch choke points; fail-closed on corrupt/unpersistable; the fresh on-crash freenet update honors it; no bypass path; sound refill math (clock-rewind safe); normal ops have comfortable headroom.
Install-gate (the loop-stopper): only DETERMINISTIC verification failures (checksum/sig) count — transient download/IO/missing-manifest errors deliberately do NOT, so a brief outage can't wrongly gate a good release; a strictly-newer version supersedes (no permanent freeze); degrade-safe.
Parity + no-regression: macOS wrapper failure cap works; systemd RestartSteps backoff documented honestly re: the systemd unit: StartLimitBurst/StartLimitIntervalSec are in [Service], where systemd ignores them #4551 tradeoff (fast crashes still terminal-stop via StartLimit, confirmed in the live-systemd verify; medium crashes slow-flap gently); feat: crash-loop auto-rollback for the auto-updater (#4073) #4591 rollback / fix: make auto-update checksum verification fail-closed on Linux/Windows #4586 checksum / feat: add ed25519 release-artifact signing and verification to auto-update #4587 signing / fix: distinguish fast-crash from update exit to bound #4551 crash loops #4588 exit-codes all intact (separate counters, no collision).

307 tests pass (incl. token-bucket edge cases, gate classification, end-to-end loop-bound, source-scrape pins). CI green on 3d770e3. Merging.

[AI-assisted - Claude]

sanity and others added 10 commits June 26, 2026 19:38

sanity added this pull request to the merge queue Jun 27, 2026

Merged via the queue into main with commit 3dc6e71 Jun 27, 2026
16 checks passed

sanity deleted the fix/auto-update-load-bounding branch June 27, 2026 02:59

This was referenced Jun 27, 2026

feat: periodically re-poll GitHub for new releases so long-running nodes auto-update #4589

Merged

feat: default Freenet install to a supervised, auto-updating service #4590

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix: bound auto-update network load so a bad release can't spam GitHub/gateways#4593

fix: bound auto-update network load so a bad release can't spam GitHub/gateways#4593
sanity merged 11 commits into
mainfrom
fix/auto-update-load-bounding

sanity commented Jun 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 27, 2026 •

edited

Loading

Uh oh!

sanity commented Jun 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Uh oh!

Conversation

sanity commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Testing

Scope note

Uh oh!

github-actions Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rule Review: No blocking issues — thorough change with excellent test coverage

Warnings

Info

Uh oh!

sanity commented Jun 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sanity commented Jun 27, 2026 •

edited

Loading

github-actions Bot commented Jun 27, 2026 •

edited

Loading