fix: distinguish fast-crash from update exit to bound #4551 crash loops #4588
Rule Review: No issues found
Rules checked: git-workflow.md, code-style.md, testing.md
No rule violations detected.
Builds on #4570 (which moved StartLimitBurst/StartLimitIntervalSec into the [Unit] section so systemd actually honours the crash-loop limiter). This adds a crash-vs-update distinction so the now-effective limiter bounds real boot-wedge loops WITHOUT losing the auto-update self-heal, on every supervision path.

systemd path:
- StartLimitAction=none (default, made explicit) in both generated units: a tripped crash loop STOPS the unit rather than rebooting the host.
- A distinct FAST_CRASH_EXIT_CODE (45), kept OUT of SuccessExitStatus, plus a min-uptime guard: a fatal listener exit under 60s uptime exits 45 (counted toward StartLimitBurst, so a tight loop trips the limiter and the unit stops); >=60s exits 42 (burst-exempt).
- ExecStopPost runs `freenet update` on exit 42 OR 45 (via a `case`, avoiding &&/|| precedence pitfalls). Counting (StartLimit) and self-heal (ExecStopPost) are independent, so a boot-crash that a newer release fixes still self-heals (#4549) even though exit 45 is rate-limited; `freenet update` is a separate process that can succeed even when `freenet network` boot-crashes.

cross-platform / cross-version safety (Codex P2):
- Exit 45 is emitted ONLY when the binary opted in via a new enable_fast_crash_exit_code() flag (mirrors enable_abort_on_fatal_listener_exit).
- The entry point opts in ONLY when the supervising unit advertises 45 support via a marker env var (FREENET_SYSTEMD_FAST_CRASH) that the REGENERATED systemd unit sets. So the runtime behavior is tied to the on-disk unit's actual capability: a node running this binary under an OLD/custom systemd unit (e.g. auto-updated but not reinstalled), under the macOS/Windows in-process run-wrapper (understands only 42; self-heals + bounds via its own backoff + 50-failure cap), or unsupervised keeps emitting the self-healing exit 42.
- The marker is referenced via the SYSTEMD_FAST_CRASH_ENV_VAR const in both the unit template and the entry-point check (single source of truth, no drift).

Tests: p2p_impl behavioral tests (fast-crash code distinct + counted; uptime split when enabled; always 42 when disabled; both abort and fast-crash flags are opt-in/off-by-default); service.rs assertions for StartLimitAction placement, exit-45 out of SuccessExitStatus, ExecStopPost firing update on 42|45, and the fast-crash marker present (via the const). StartLimitBurst/IntervalSec [Unit]-placement is left to #4570's linux.rs tests (not duplicated); those helpers are also hardened to find section headers by line, not substring, so a legitimate "[Service]" reference in a [Unit] comment isn't misparsed. Both generated units pass `systemd-analyze verify`.

Refs #4551 (section-placement landed in #4570); completes the crash-loop bounding. Builds on #4570.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
d894b68 to
d511da5
Reviewed and approving. Verified end to end: fast-crash exit-45 is emitted ONLY when the freshly-generated systemd unit sets the [AI-assisted - Claude]
Problem
-------
A typical Linux Freenet install ends up unsupervised, so it never auto-updates. scripts/install.sh defaulted the "install as a service?" prompt to No, and a node with no service manager catching its exit-42 "update needed" signal exits to update and never restarts on the new version: it silently stops updating. With unsupervised being the dominant default on Linux, much of the network freezes on old releases.

Solution
--------
Make a supervised install the DEFAULT (issue #4073):
- install.sh now sets up supervision unless the user explicitly opts out (FREENET_NO_SERVICE=1). The interactive prompt defaults to Yes ([Y/n]); a non-interactive curl|sh run sets up supervision automatically.
- On Linux it prefers a SYSTEM service when it can elevate (already root, or sudo), which is most reliable on the servers/VPS that dominate the node population (runs at boot, survives logout). When it cannot elevate it falls back to a USER service. A node is only left unsupervised as a last resort, with a loud warning explaining it will not auto-update.
- The binary's user-service install now enables systemd lingering (`loginctl enable-linger <user>`) by default so a --user service runs without an active login session (the headless-server footgun: without linger it stops at logout and never auto-updates). New `--no-linger` flag opts out. System services are unaffected (they start at boot).
- The decision honors an existing install so a re-run refreshes the same service type instead of creating a duplicate (idempotent + safe).

The generated systemd units are unchanged: this reuses the existing unit generation, so the StartLimit/exit-45 work from #4570/#4588 is preserved.

NOTE: this changes default install behavior (unsupervised -> supervised).

Testing
-------
- New scripts/test-install-sh.sh smoke-tests the system-vs-user decision (root / sudo / existing-unit / interactive permutations) by sourcing install.sh and overriding the environment probes. Wired into CI along with shellcheck on install.sh/uninstall.sh and the existing (previously unwired) uninstall smoke test.
- New linux.rs unit test pins the lingering policy (system never lingers; user lingers unless --no-linger).
- shellcheck clean; cargo fmt / clippy -D warnings / service tests green.

Refs #4073

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
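As a rough illustration of the install policy this commit message describes, the service-type decision and the lingering rule can be written as two small pure functions. This is only a sketch: the real decision lives in install.sh (shell) and the lingering policy in linux.rs, and the names `ServiceScope`, `choose_scope`, and `should_enable_linger` are hypothetical. The unsupervised last-resort case and the FREENET_NO_SERVICE opt-out are deliberately left out.

```rust
/// Hypothetical type: which kind of systemd service is installed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ServiceScope {
    System,
    User,
}

/// Hypothetical helper mirroring the described decision:
/// an existing install wins (so a re-run refreshes the same service type),
/// otherwise prefer a system service when elevation is possible,
/// otherwise fall back to a user service.
fn choose_scope(existing: Option<ServiceScope>, can_elevate: bool) -> ServiceScope {
    match existing {
        Some(scope) => scope,
        None if can_elevate => ServiceScope::System,
        None => ServiceScope::User,
    }
}

/// Hypothetical helper for the lingering policy:
/// system services never linger; user services linger unless `--no-linger` was passed.
fn should_enable_linger(scope: ServiceScope, no_linger: bool) -> bool {
    match scope {
        ServiceScope::System => false,
        ServiceScope::User => !no_linger,
    }
}
```

Keeping both rules as pure functions of their inputs is what makes the root / sudo / existing-unit permutations cheap to enumerate in a test.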
Update detection was effectively once-at-boot. After the one-shot startup GitHub check (#3864), the update-check loop's three triggers are all driven by peer signals (urgent / highest-seen-version / version-mismatch). If an entire network sits on the same old version, no node re-checks GitHub directly, so a freshly published release is never picked up until a node happens to restart. A node up for weeks never re-checks.

Add a fourth, peer-signal-independent trigger to the existing update-check loop: a periodic direct GitHub re-poll that re-runs the same startup_update_check on a recurring ~6h schedule, jittered +/-25%. A discovered update feeds the SAME update_tx -> graceful-shutdown -> exit-42 -> `freenet update` path as every existing trigger, so all merged apply/verify/signing safety (checksum #4586, signature #4587, crash-loop bounding #4551/#4588) applies unchanged.

WHY 6h +/-25%: far under GitHub's unauthenticated 60 req/hr/IP limit even at the jittered minimum (~4.5h), while still propagating a release within hours; the jitter decorrelates nodes that booted together so they do not re-poll (and restart) in lockstep, preserving the load-spreading intent of the existing 0-60s startup jitter and 0-4h decentralized stagger.

The re-poll is gated on should_attempt_update() so it honors the persistent auto-update failure lockout (#3934): a locked-out node (e.g. non-writable binary path) must not exit-42 once per interval and drive the supervisor to rerun the same failing install.

The jitter math is a pure, injectable function (jittered_repoll_interval) with unit tests; the lockout gate is pinned by a source-scrape test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
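To make the "6h +/-25%" arithmetic concrete, here is a minimal sketch of what a pure, injectable jitter function could look like. The function name `jittered_repoll_interval` comes from the commit message, but its signature here is an assumption: the random sample is passed in as a value in [0.0, 1.0] so the math can be tested without a random number generator. The constant names `REPOLL_BASE` and `JITTER_FRACTION` are hypothetical.

```rust
use std::time::Duration;

/// Base re-poll interval (~6h, from the commit message). Name is hypothetical.
const REPOLL_BASE: Duration = Duration::from_secs(6 * 60 * 60);
/// Jitter fraction (+/-25%, from the commit message). Name is hypothetical.
const JITTER_FRACTION: f64 = 0.25;

/// Assumed signature: `unit_sample` is a caller-supplied random value in [0.0, 1.0].
/// 0.0 maps to base * 0.75, 0.5 to base, and 1.0 to base * 1.25.
fn jittered_repoll_interval(base: Duration, unit_sample: f64) -> Duration {
    // Treat NaN as "no jitter" and clamp everything else into [0.0, 1.0].
    let sample = if unit_sample.is_nan() {
        0.5
    } else {
        unit_sample.clamp(0.0, 1.0)
    };
    // Map [0, 1] linearly onto [1 - fraction, 1 + fraction].
    let factor = 1.0 - JITTER_FRACTION + 2.0 * JITTER_FRACTION * sample;
    base.mul_f64(factor)
}
```

With a 6h base the result ranges from 4.5h to 7.5h, so even a node that always draws the minimum polls GitHub only a handful of times per day.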
Problem
#4570 moved `StartLimitBurst`/`StartLimitIntervalSec` into the `[Unit]` section so systemd now actually honours the crash-loop limiter (they were silently ignored in `[Service]`). With the limiter effective, gaps remained:
- `SuccessExitStatus=42 43` whitelists the node's fatal-listener exit code (42, from #4549, "Gateway death-spirals at ~100% CPU in libunwind FDE-cache invalidation under contract-executor backpressure; network listener dies, node wedges unrecoverable (0.2.81)"). A node stuck in a boot-wedge loop exits 42 every ~10s and never counts toward `StartLimitBurst`, so the limiter wouldn't bound the one case it most needs to.

Solution
Builds on #4570. Adds a crash-vs-update distinction so the now-effective limiter bounds real boot-wedge loops without losing the auto-update self-heal, on every supervision path.
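A minimal sketch of that crash-vs-update distinction, assuming the values this PR names (exit 45 for a fast crash, exit 42 otherwise, a 60s min-uptime guard). `FAST_CRASH_EXIT_CODE` and `fatal_listener_exit_code(uptime, fast_crash_enabled)` are named in the PR; the parameter types, the return type, and the constant names `UPDATE_EXIT_CODE` and `MIN_UPTIME` are assumptions made for illustration.

```rust
use std::time::Duration;

/// Burst-exempt exit code (listed in SuccessExitStatus). Name is hypothetical; value from the PR.
const UPDATE_EXIT_CODE: i32 = 42;
/// Fast-crash exit code, kept out of SuccessExitStatus so it counts toward StartLimitBurst.
const FAST_CRASH_EXIT_CODE: i32 = 45;
/// Min-uptime guard (60s, from the PR). Name is hypothetical.
const MIN_UPTIME: Duration = Duration::from_secs(60);

/// Assumed signature: picks the exit code for a fatal listener exit.
/// Exit 45 is returned only when fast-crash was opted into AND uptime is under the guard;
/// every other case keeps the self-healing exit 42.
fn fatal_listener_exit_code(uptime: Duration, fast_crash_enabled: bool) -> i32 {
    if fast_crash_enabled && uptime < MIN_UPTIME {
        FAST_CRASH_EXIT_CODE
    } else {
        UPDATE_EXIT_CODE
    }
}
```

Because the flag defaults to off, a caller that never opts in gets 42 for any uptime, which matches the "always 42 when disabled" behavior the tests pin.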
systemd path
- `StartLimitAction=none` (default, made explicit) in both units: a tripped loop STOPS the unit rather than rebooting the host. Recover with `systemctl reset-failed freenet && systemctl start freenet`.
- `FAST_CRASH_EXIT_CODE = 45` (kept OUT of `SuccessExitStatus`) + a min-uptime guard, `fatal_listener_exit_code(uptime, fast_crash_enabled)`:
  - under 60s uptime exits 45 (counted toward `StartLimitBurst`, so a tight loop trips the limiter and the unit stops);
  - >=60s exits 42 (burst-exempt).
- `ExecStopPost` runs `freenet update` on exit 42 OR 45 (via a `case`, avoiding `&&`/`||` precedence pitfalls). Counting (StartLimit) and self-heal (ExecStopPost) are independent, so a boot-crash that a newer release fixes still self-heals (#4549) even though exit 45 is rate-limited; `freenet update` is a separate process that can succeed even when `freenet network` boot-crashes.

cross-platform / cross-version safety
Exit 45 is meaningful only under a unit that handles it, so it is emitted only when the binary opts in (`enable_fast_crash_exit_code()`, mirroring the existing `enable_abort_on_fatal_listener_exit()`), and the entry point opts in only when the supervising unit advertises 45 support via a marker env var (`FREENET_SYSTEMD_FAST_CRASH`) that the regenerated systemd unit sets. This ties runtime behavior to the on-disk unit's actual capability: a node running this binary under an old/custom systemd unit (e.g. auto-updated but not reinstalled), under the macOS/Windows in-process run-wrapper (understands only 42; self-heals and bounds via its own backoff + 50-failure cap), or unsupervised keeps emitting the self-healing exit 42.

The marker is referenced through the `SYSTEMD_FAST_CRASH_ENV_VAR` const in both the unit template and the entry-point check (single source of truth, no literal drift).

Testing
- `node::p2p_impl`: fast-crash code distinct + counted; uptime split when enabled; always 42 when disabled (run-wrapper / old-unit / unsupervised); both the abort and fast-crash flags are opt-in / off-by-default.
- `service.rs`: `StartLimitAction` in `[Unit]`; exit-45 out of `SuccessExitStatus`; `ExecStopPost` fires update on `42|45`; fast-crash marker present (asserted via the const).
- `StartLimitBurst`/`IntervalSec` `[Unit]`-placement is covered by the `linux.rs` tests from #4570 ("fix: move StartLimitBurst/StartLimitIntervalSec into [Unit] section of generated systemd units"), not duplicated here. Those #4570 helpers are also hardened to find section headers by line, not substring, so the legitimate `[Service]` reference now in a `[Unit]` comment isn't misparsed.
- `cargo fmt`; `cargo clippy --locked -- -D warnings` and `--features trace-ot` both clean. Both generated units pass `systemd-analyze verify` (exit 0; only the pre-existing ExecStartPre shell-escape notice).

Refs #4551 (the section-placement fix landed in #4570); this completes the crash-loop bounding. Builds on #4570.
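The marker-gated opt-in described under "cross-platform / cross-version safety" could look roughly like the sketch below. The const name `SYSTEMD_FAST_CRASH_ENV_VAR` and its value `FREENET_SYSTEMD_FAST_CRASH` come from this PR; the helper `unit_advertises_fast_crash`, the injected lookup closure, and the rule that any non-empty value counts as opting in are assumptions, since the PR text does not say what value the unit sets.

```rust
use std::ffi::OsString;

/// Marker env var name from the PR; shared by the unit template and the entry-point check.
const SYSTEMD_FAST_CRASH_ENV_VAR: &str = "FREENET_SYSTEMD_FAST_CRASH";

/// Hypothetical helper: true when the supervising unit advertises exit-45 support.
/// The environment lookup is injected so tests do not have to mutate the process environment.
/// Assumption: any non-empty value means "supported".
fn unit_advertises_fast_crash<F>(lookup: F) -> bool
where
    F: Fn(&str) -> Option<OsString>,
{
    lookup(SYSTEMD_FAST_CRASH_ENV_VAR).map_or(false, |value| !value.is_empty())
}
```

At the entry point this would be called as `unit_advertises_fast_crash(|key| std::env::var_os(key))`, and only a `true` result would lead to `enable_fast_crash_exit_code()` being called; an old unit, the run-wrapper, or an unsupervised node never sets the marker and so stays on exit 42.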
Note
The systemd burst-behavior change (counting exit 45 toward `StartLimitBurst`, `StartLimitAction=none`, marker-gated opt-in) wants a smoke-test on a live systemd host before the next release, per `.claude/rules/deployment.md` (a `cfg(target_os = "linux")` service path CI does not exercise end-to-end).

[AI-assisted - Claude]
🤖 Generated with Claude Code