fix: distinguish fast-crash from update exit to bound #4551 crash loops by sanity · Pull Request #4588 · freenet/freenet-core

sanity · 2026-06-26T18:03:26Z

Problem

#4570 moved StartLimitBurst/StartLimitIntervalSec into the [Unit] section so systemd now actually honours the crash-loop limiter (they were silently ignored in [Service]). With the limiter effective, two gaps remained:

- The unit's SuccessExitStatus=42 43 whitelists the node's fatal-listener exit code (42, from #4549, "Gateway death-spirals at ~100% CPU in libunwind FDE-cache invalidation under contract-executor backpressure; network listener dies, node wedges unrecoverable (0.2.81)"). A node stuck in a boot-wedge loop exits 42 every ~10s and never counts toward StartLimitBurst, so the limiter would not bound the one case it most needs to.
- The limiter's terminal action was implicit (a tripped loop should stop the unit, not reboot the host).

Solution

Builds on #4570. Adds a crash-vs-update distinction so the now-effective limiter bounds real boot-wedge loops without losing the auto-update self-heal, on every supervision path.

systemd path

- StartLimitAction=none (default, made explicit) in both units: a tripped loop STOPS the unit rather than rebooting the host. Recover with `systemctl reset-failed freenet && systemctl start freenet`.
- FAST_CRASH_EXIT_CODE = 45 (kept OUT of SuccessExitStatus) plus a min-uptime guard, fatal_listener_exit_code(uptime, fast_crash_enabled):
  - fatal listener exit at < 60s uptime → exit 45 (counted, so a tight loop trips StartLimitBurst and the unit stops);
  - >= 60s → exit 42 (burst-exempt).
- ExecStopPost runs `freenet update` on exit 42 OR 45 (via a `case`, avoiding &&/|| precedence pitfalls). Counting (StartLimit) and self-heal (ExecStopPost) are independent, so a boot-crash that a newer release fixes still self-heals (#4549) even though exit 45 is rate-limited: `freenet update` is a separate process that can succeed even when `freenet network` boot-crashes.
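
The exit-code selection described above can be sketched as a small pure function. This is a minimal illustration, not the merged code: the function name, FAST_CRASH_EXIT_CODE, MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT, and the 42/45/60s values come from this PR, while `UPDATE_EXIT_CODE` and the use of `std::time::Duration` for `uptime` are assumptions.

```rust
use std::time::Duration;

/// Exit code listed in SuccessExitStatus, so systemd does not count it toward
/// StartLimitBurst. The constant name is hypothetical; the value is from the PR.
const UPDATE_EXIT_CODE: i32 = 42;

/// Exit code kept OUT of SuccessExitStatus, so systemd counts it as a failure.
const FAST_CRASH_EXIT_CODE: i32 = 45;

/// Uptime below which a fatal listener exit is treated as a boot crash.
const MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT: Duration = Duration::from_secs(60);

/// Picks the process exit code for a fatal listener exit.
/// With the opt-in disabled, the result is always 42 regardless of uptime.
fn fatal_listener_exit_code(uptime: Duration, fast_crash_enabled: bool) -> i32 {
    if fast_crash_enabled && uptime < MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT {
        FAST_CRASH_EXIT_CODE
    } else {
        UPDATE_EXIT_CODE
    }
}
```

Taking `fast_crash_enabled` as a parameter rather than reading global state keeps the decision testable at the exact 60s boundary.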

cross-platform / cross-version safety

Exit 45 is meaningful only under a unit that handles it, so it is emitted only when the binary opts in (enable_fast_crash_exit_code(), mirroring the existing enable_abort_on_fatal_listener_exit()), and the entry point opts in only when the supervising unit advertises 45 support via a marker env var (FREENET_SYSTEMD_FAST_CRASH) that the regenerated systemd unit sets. This ties runtime behavior to the on-disk unit's actual capability:

- macOS/Windows in-process run-wrapper (knows only 42; self-heals on 42 and bounds the loop with its own backoff + 50-failure cap) → no marker → keeps emitting 42.
- Old/custom systemd unit, including a node auto-updated to this binary whose unit file was never regenerated → no marker → keeps emitting 42 (its existing exit-42 self-heal works).
- Unsupervised → no marker → 42 (moot).

The marker is referenced through the SYSTEMD_FAST_CRASH_ENV_VAR const in both the unit template and the entry-point check (single source of truth, no literal drift).
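
A rough sketch of the marker-gated opt-in follows. The names SYSTEMD_FAST_CRASH_ENV_VAR, FREENET_SYSTEMD_FAST_CRASH, EMIT_FAST_CRASH_EXIT_CODE, and enable_fast_crash_exit_code() appear in this PR; the helpers `unit_advertises_fast_crash`, `fast_crash_exit_code_enabled`, and `opt_in_if_unit_supports_fast_crash` are hypothetical, and treating any non-empty marker value as "supported" is an assumption about how the check is done.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Single source of truth for the marker name, shared by the unit template
/// and the entry-point check.
const SYSTEMD_FAST_CRASH_ENV_VAR: &str = "FREENET_SYSTEMD_FAST_CRASH";

/// Off by default; set once at startup, read when choosing the exit code.
static EMIT_FAST_CRASH_EXIT_CODE: AtomicBool = AtomicBool::new(false);

/// Opts the process in to emitting exit 45 on a fast fatal-listener crash.
fn enable_fast_crash_exit_code() {
    EMIT_FAST_CRASH_EXIT_CODE.store(true, Ordering::Relaxed);
}

/// Hypothetical accessor for the flag.
fn fast_crash_exit_code_enabled() -> bool {
    EMIT_FAST_CRASH_EXIT_CODE.load(Ordering::Relaxed)
}

/// Hypothetical pure check on the marker value.
/// Assumption: any non-empty value means the unit handles exit 45.
fn unit_advertises_fast_crash(marker: Option<&str>) -> bool {
    marker.map_or(false, |value| !value.is_empty())
}

/// Hypothetical entry-point step: opt in only under a unit that sets the marker.
fn opt_in_if_unit_supports_fast_crash() {
    let marker = std::env::var(SYSTEMD_FAST_CRASH_ENV_VAR).ok();
    if unit_advertises_fast_crash(marker.as_deref()) {
        enable_fast_crash_exit_code();
    }
}
```

Under the run-wrapper, an old unit, or no supervisor the variable is absent, so the flag stays false and the node keeps exiting 42.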

Testing

- node::p2p_impl: fast-crash code distinct + counted; uptime split when enabled; always 42 when disabled (run-wrapper / old-unit / unsupervised); both the abort and fast-crash flags are opt-in / off-by-default.
- service.rs: StartLimitAction in [Unit]; exit-45 out of SuccessExitStatus; ExecStopPost fires update on 42|45; fast-crash marker present (asserted via the const).
- StartLimitBurst/IntervalSec [Unit]-placement is covered by #4570's linux.rs tests (not duplicated). Those #4570 helpers are also hardened to find section headers by line, not substring, so the legitimate [Service] reference now in a [Unit] comment isn't misparsed.
- cargo fmt; cargo clippy --locked -- -D warnings and --features trace-ot both clean. Both generated units pass systemd-analyze verify (exit 0; only the pre-existing ExecStartPre shell-escape notice).
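
To illustrate why line-based header matching avoids the misparse, here is a simplified sketch of a section lookup. The helper name section_of_directive is mentioned in the review of this PR, but the signature, return type, and body below are assumptions, not the project's implementation.

```rust
/// Returns the name of the section (for example "Unit") that holds the first
/// `directive=` line, or None if the directive is absent.
/// A line counts as a header only if the whole trimmed line is `[Name]`,
/// so "[Service]" inside a comment is never mistaken for a header.
fn section_of_directive<'a>(unit: &'a str, directive: &str) -> Option<&'a str> {
    let mut current: Option<&'a str> = None;
    for raw in unit.lines() {
        let line = raw.trim();
        // Skip systemd comment lines entirely.
        if line.starts_with('#') || line.starts_with(';') {
            continue;
        }
        // Whole-line match for a section header.
        if let Some(name) = line.strip_prefix('[').and_then(|rest| rest.strip_suffix(']')) {
            current = Some(name);
            continue;
        }
        // Compare the key to the left of the first '='.
        if let Some((key, _value)) = line.split_once('=') {
            if key.trim() == directive {
                return current;
            }
        }
    }
    None
}
```

With a unit whose [Unit] section carries a comment mentioning [Service], a lookup for StartLimitBurst still reports "Unit".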

Refs #4551 (the section-placement fix landed in #4570); this completes the crash-loop bounding. Builds on #4570.

Note

The systemd burst-behavior change (counting exit 45 toward StartLimitBurst, StartLimitAction=none, marker-gated opt-in) wants a smoke-test on a live systemd host before the next release, per .claude/rules/deployment.md (a cfg(target_os = "linux") service path CI does not exercise end-to-end).

[AI-assisted - Claude]

🤖 Generated with Claude Code

github-actions · 2026-06-26T18:07:18Z

Rule Review: No issues found

Rules checked: git-workflow.md, code-style.md, testing.md
Files reviewed: 7

Summary of changes reviewed:

auto_update.rs — new SYSTEMD_FAST_CRASH_ENV_VAR constant
service.rs — new test assertions and helper functions for section-placement verification
service/linux.rs — systemd unit template changes + test helper hardening
freenet.rs — opt-in call to enable_fast_crash_exit_code
lib.rs / node.rs — re-exports
node/p2p_impl.rs — core logic: FAST_CRASH_EXIT_CODE, MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT, fatal_listener_exit_code(), EMIT_FAST_CRASH_EXIT_CODE, tests

Checks performed:

Commit message: fix: distinguish fast-crash from update exit to bound #4551 crash loops — 71 chars, valid fix: prefix. ✓
Regression tests: fatal_listener_exit_code_distinguishes_fast_crash_from_healthy_uptime, fatal_listener_exit_code_uses_update_code_when_fast_crash_disabled, and fast_crash_exit_code_is_distinct_and_counted directly test the new pure function for the exact bug being fixed. ✓
Boundary conditions: [0,1,10,30,59] (fast crash), exact threshold 60s, [60,120,3600,86400] (healthy uptime), and the fully-disabled path all covered. ✓
No .unwrap() in production code — panic!() calls are confined to mod tests helpers. ✓
start_time.elapsed() at call site — start_time is a pre-existing tokio::time::Instant (line 378, not introduced by this PR); this is Rule Lint territory, not new. ✓
Platform guards: section_of_directive helper and its consumer tests are consistently #[cfg(target_os = "linux")]. ✓
EMIT_FAST_CRASH_EXIT_CODE global atomic: set-once at startup, read in hot path — standard pattern; fatal_listener_exit_code takes it as a parameter for testability, bypassing global state in unit tests. ✓
fatal_listener_exit_code extracted as pure function: matches the project pattern for testable decision logic (references deployment.md in its own docstring). ✓
section helper in linux.rs: correctly hardened from find("\n[") substring search to line-by-line matching to avoid false matches on section names appearing in inline comments. ✓

No rule violations detected.

Rule review against .claude/rules/. WARNING findings block merge.

Builds on #4570 (which moved StartLimitBurst/StartLimitIntervalSec into the [Unit] section so systemd actually honours the crash-loop limiter). This adds a crash-vs-update distinction so the now-effective limiter bounds real boot-wedge loops WITHOUT losing the auto-update self-heal, on every supervision path.

systemd path:
- StartLimitAction=none (default, made explicit) in both generated units: a tripped crash loop STOPS the unit rather than rebooting the host.
- A distinct FAST_CRASH_EXIT_CODE (45), kept OUT of SuccessExitStatus, plus a min-uptime guard: a fatal listener exit under 60s uptime exits 45 (counted toward StartLimitBurst, so a tight loop trips the limiter and the unit stops); >=60s exits 42 (burst-exempt).
- ExecStopPost runs `freenet update` on exit 42 OR 45 (via a `case`, avoiding &&/|| precedence pitfalls). Counting (StartLimit) and self-heal (ExecStopPost) are independent, so a boot-crash that a newer release fixes still self-heals (#4549) even though exit 45 is rate-limited; `freenet update` is a separate process that can succeed even when `freenet network` boot-crashes.

cross-platform / cross-version safety (Codex P2):
- Exit 45 is emitted ONLY when the binary opted in via a new enable_fast_crash_exit_code() flag (mirrors enable_abort_on_fatal_listener_exit).
- The entry point opts in ONLY when the supervising unit advertises 45 support via a marker env var (FREENET_SYSTEMD_FAST_CRASH) that the REGENERATED systemd unit sets. So the runtime behavior is tied to the on-disk unit's actual capability: a node running this binary under an OLD/custom systemd unit (e.g. auto-updated but not reinstalled), under the macOS/Windows in-process run-wrapper (understands only 42; self-heals + bounds via its own backoff + 50-failure cap), or unsupervised keeps emitting the self-healing exit 42.
- The marker is referenced via the SYSTEMD_FAST_CRASH_ENV_VAR const in both the unit template and the entry-point check (single source of truth, no drift).

Tests: p2p_impl behavioral tests (fast-crash code distinct + counted; uptime split when enabled; always 42 when disabled; both abort and fast-crash flags are opt-in/off-by-default); service.rs assertions for StartLimitAction placement, exit-45 out of SuccessExitStatus, ExecStopPost firing update on 42|45, and the fast-crash marker present (via the const). StartLimitBurst/IntervalSec [Unit]-placement is left to #4570's linux.rs tests (not duplicated); those helpers are also hardened to find section headers by line, not substring, so a legitimate "[Service]" reference in a [Unit] comment isn't misparsed. Both generated units pass `systemd-analyze verify`.

Refs #4551 (section-placement landed in #4570); completes the crash-loop bounding. Builds on #4570.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

sanity · 2026-06-26T19:37:29Z

Reviewed and approving. Verified end to end: fast-crash exit-45 is emitted ONLY when the freshly-generated systemd unit sets the SYSTEMD_FAST_CRASH_ENV_VAR marker, so the macOS/Windows wrapper, unsupervised nodes, and a node that auto-updated the binary but not yet its old systemd unit all keep emitting exit-42 (existing self-heal + loop-bounding intact) — no regression on any path. The new unit fires freenet update on exit 42 OR 45 (boot-crash self-heal) while keeping 45 out of SuccessExitStatus so StartLimit counts it and a tight loop is bounded. Comprehensive tests (marker gating, ExecStopPost 42|45, exit-code selector, systemd-analyze), CI green. Completes #4551 with #4570. Merging.

[AI-assisted - Claude]

Problem
-------
A typical Linux Freenet install ends up unsupervised, so it never auto-updates. scripts/install.sh defaulted the "install as a service?" prompt to No, and a node with no service manager catching its exit-42 "update needed" signal exits to update and never restarts on the new version - it silently stops updating. With unsupervised being the dominant default on Linux, much of the network freezes on old releases.

Solution
--------
Make a supervised install the DEFAULT (issue #4073):
- install.sh now sets up supervision unless the user explicitly opts out (FREENET_NO_SERVICE=1). The interactive prompt defaults to Yes ([Y/n]); a non-interactive curl|sh run sets up supervision automatically.
- On Linux it prefers a SYSTEM service when it can elevate (already root, or sudo) - most reliable on the servers/VPS that dominate the node population (runs at boot, survives logout). When it cannot elevate it falls back to a USER service. A node is only left unsupervised as a last resort, with a loud warning explaining it will not auto-update.
- The binary's user-service install now enables systemd lingering (`loginctl enable-linger <user>`) by default so a --user service runs without an active login session (the headless-server footgun: without linger it stops at logout and never auto-updates). New `--no-linger` flag opts out. System services are unaffected (they start at boot).
- The decision honors an existing install so a re-run refreshes the same service type instead of creating a duplicate (idempotent + safe).

The generated systemd units are unchanged - this reuses the existing unit generation, so the StartLimit/exit-45 work from #4570/#4588 is preserved.

NOTE: this changes default install behavior (unsupervised -> supervised).

Testing
-------
- New scripts/test-install-sh.sh smoke-tests the system-vs-user decision (root / sudo / existing-unit / interactive permutations) by sourcing install.sh and overriding the environment probes. Wired into CI along with shellcheck on install.sh/uninstall.sh and the existing (previously unwired) uninstall smoke test.
- New linux.rs unit test pins the lingering policy (system never lingers; user lingers unless --no-linger).
- shellcheck clean; cargo fmt / clippy -D warnings / service tests green.

Refs #4073

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
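
The lingering policy stated in that commit message (system never lingers; user lingers unless --no-linger) reduces to a one-line predicate. The sketch below is illustrative only: `ServiceScope` and `should_enable_linger` are hypothetical names, not taken from linux.rs.

```rust
/// Hypothetical model of the two systemd install targets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ServiceScope {
    System,
    User,
}

/// Decides whether the installer should run `loginctl enable-linger <user>`.
/// System services start at boot and never need lingering; user services
/// linger by default unless the caller passed --no-linger.
fn should_enable_linger(scope: ServiceScope, no_linger: bool) -> bool {
    scope == ServiceScope::User && !no_linger
}
```

Keeping the policy in a pure function makes all four scope/flag combinations easy to pin in a unit test.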

Update detection was effectively once-at-boot. After the one-shot startup GitHub check (#3864), the update-check loop's three triggers are all driven by peer signals (urgent / highest-seen-version / version-mismatch). If an entire network sits on the same old version, no node re-checks GitHub directly, so a freshly published release is never picked up until a node happens to restart. A node up for weeks never re-checks.

Add a fourth, peer-signal-independent trigger to the existing update-check loop: a periodic direct GitHub re-poll that re-runs the same startup_update_check on a recurring ~6h schedule, jittered +/-25%. A discovered update feeds the SAME update_tx -> graceful-shutdown -> exit-42 -> `freenet update` path as every existing trigger, so all merged apply/verify/signing safety (checksum #4586, signature #4587, crash-loop bounding #4551/#4588) applies unchanged.

WHY 6h +/-25%: far under GitHub's unauthenticated 60 req/hr/IP limit even at the jittered minimum (~4.5h), while still propagating a release within hours; the jitter decorrelates nodes that booted together so they do not re-poll (and restart) in lockstep, preserving the load-spreading intent of the existing 0-60s startup jitter and 0-4h decentralized stagger.

The re-poll is gated on should_attempt_update() so it honors the persistent auto-update failure lockout (#3934): a locked-out node (e.g. non-writable binary path) must not exit-42 once per interval and drive the supervisor to rerun the same failing install.

The jitter math is a pure, injectable function (jittered_repoll_interval) with unit tests; the lockout gate is pinned by a source-scrape test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
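
The 6h +/-25% arithmetic can be sketched as follows. Only the function name jittered_repoll_interval and the 6h / 25% figures come from the commit message; the signature, the injected `unit_random` sample, and the constant names are assumptions made for illustration.

```rust
use std::time::Duration;

/// Base re-poll period of 6 hours (constant name is hypothetical).
const REPOLL_BASE_INTERVAL: Duration = Duration::from_secs(6 * 60 * 60);

/// Jitter of +/-25% around the base (constant name is hypothetical).
const REPOLL_JITTER_FRACTION: f64 = 0.25;

/// Maps an injected sample in [0.0, 1.0] to an interval in [0.75, 1.25] x base.
/// The caller supplies the randomness, so the math stays deterministic in tests.
fn jittered_repoll_interval(base: Duration, unit_random: f64) -> Duration {
    // Fall back to the midpoint on NaN and clamp out-of-range samples.
    let r = if unit_random.is_nan() {
        0.5
    } else {
        unit_random.clamp(0.0, 1.0)
    };
    let factor = 1.0 - REPOLL_JITTER_FRACTION + 2.0 * REPOLL_JITTER_FRACTION * r;
    base.mul_f64(factor)
}
```

At the jittered minimum the interval is 4.5h, which matches the "~4.5h" bound cited above; the maximum is 7.5h.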

sanity force-pushed the fix/4551-crash-loop-startlimit branch from d894b68 to d511da5 Compare June 26, 2026 19:04

sanity added this pull request to the merge queue Jun 26, 2026

Merged via the queue into main with commit 07a179c Jun 26, 2026
16 checks passed

sanity deleted the fix/4551-crash-loop-startlimit branch June 26, 2026 19:51

sanity mentioned this pull request Jun 28, 2026

fix: exit-for-restart on redb poison instead of bricking forever (#4604) #4609

Open

sanity merged 1 commit into main from fix/4551-crash-loop-startlimit
