
fix: distinguish fast-crash from update exit to bound #4551 crash loops #4588

Merged
sanity merged 1 commit into main from fix/4551-crash-loop-startlimit
Jun 26, 2026

@sanity sanity commented Jun 26, 2026

Collaborator

Problem

#4570 moved StartLimitBurst/StartLimitIntervalSec into the [Unit] section so systemd now actually honours the crash-loop limiter (they were silently ignored in [Service]). With the limiter effective, two gaps remained:

  1. The unit's SuccessExitStatus=42 43 whitelists the node's fatal-listener exit code (42, from #4549, "Gateway death-spirals at ~100% CPU in libunwind FDE-cache invalidation under contract-executor backpressure; network listener dies, node wedges unrecoverable (0.2.81)"). A node stuck in a boot-wedge loop exits 42 every ~10s and never counts toward StartLimitBurst, so the limiter wouldn't bound the one case it most needs to.
  2. The limiter's terminal action was implicit (a tripped loop should stop the unit, not reboot the host).

Solution

Builds on #4570. Adds a crash-vs-update distinction so the now-effective limiter bounds real boot-wedge loops without losing the auto-update self-heal, on every supervision path.

systemd path

  • StartLimitAction=none (default, made explicit) in both units: a tripped loop STOPS the unit rather than rebooting the host. Recover with systemctl reset-failed freenet && systemctl start freenet.
  • FAST_CRASH_EXIT_CODE = 45 (kept OUT of SuccessExitStatus) + a min-uptime guard, fatal_listener_exit_code(uptime, fast_crash_enabled):
    • fatal listener exit < 60s uptime → exit 45 (counted, so a tight loop trips StartLimitBurst and the unit stops);
    • >= 60s → exit 42 (burst-exempt).
  • ExecStopPost runs freenet update on exit 42 OR 45 (via a case, avoiding &&/|| precedence pitfalls). Counting (StartLimit) and self-heal (ExecStopPost) are independent, so a boot-crash that a newer release fixes still self-heals (#4549) even though exit 45 is rate-limited: freenet update is a separate process that can succeed even when freenet network boot-crashes.
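
The exit-code split described above can be sketched as a small pure function. This is a minimal illustration rather than the merged code: `FAST_CRASH_EXIT_CODE`, `MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT`, and `fatal_listener_exit_code` are names from this PR, while `UPDATE_EXIT_CODE` and the exact parameter types are assumptions made for the example.

```rust
use std::time::Duration;

/// "Update needed" exit code, whitelisted in SuccessExitStatus (burst-exempt).
/// The constant name is hypothetical; the value 42 comes from the PR.
const UPDATE_EXIT_CODE: i32 = 42;
/// Fast-crash exit code, kept out of SuccessExitStatus so systemd counts it.
const FAST_CRASH_EXIT_CODE: i32 = 45;
/// Uptime below which a fatal listener exit is treated as a boot crash.
const MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT: Duration = Duration::from_secs(60);

/// Chooses the exit code for a fatal listener exit.
/// Signature is an assumption: uptime as a Duration, opt-in flag as a bool.
fn fatal_listener_exit_code(uptime: Duration, fast_crash_enabled: bool) -> i32 {
    if fast_crash_enabled && uptime < MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT {
        FAST_CRASH_EXIT_CODE
    } else {
        UPDATE_EXIT_CODE
    }
}

fn main() {
    for secs in [0u64, 59, 60, 3600] {
        let uptime = Duration::from_secs(secs);
        println!(
            "uptime={}s enabled={} disabled={}",
            secs,
            fatal_listener_exit_code(uptime, true),
            fatal_listener_exit_code(uptime, false)
        );
    }
}
```

Passing the flag as a parameter keeps the decision testable without touching global state, which matches the testing approach described in the review further down.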

cross-platform / cross-version safety

Exit 45 is meaningful only under a unit that handles it, so it is emitted only when the binary opts in (enable_fast_crash_exit_code(), mirroring the existing enable_abort_on_fatal_listener_exit()), and the entry point opts in only when the supervising unit advertises 45 support via a marker env var (FREENET_SYSTEMD_FAST_CRASH) that the regenerated systemd unit sets. This ties runtime behavior to the on-disk unit's actual capability:

  • macOS/Windows in-process run-wrapper (knows only 42; self-heals on 42 and bounds the loop with its own backoff + 50-failure cap) → no marker → keeps emitting 42.
  • Old/custom systemd unit — including a node auto-updated to this binary but whose unit file was never regenerated → no marker → keeps emitting 42 (its existing exit-42 self-heal works).
  • Unsupervised → no marker → 42 (moot).

The marker is referenced through the SYSTEMD_FAST_CRASH_ENV_VAR const in both the unit template and the entry-point check (single source of truth, no literal drift).
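
A rough sketch of the marker-gated opt-in follows. `SYSTEMD_FAST_CRASH_ENV_VAR`, `EMIT_FAST_CRASH_EXIT_CODE`, and `enable_fast_crash_exit_code()` are names from this PR; the helpers `unit_advertises_fast_crash` and `fast_crash_exit_code_enabled` are hypothetical, and treating any non-empty value as "supported" is an assumption, since the PR text does not say what value the unit sets.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Marker env var set by the regenerated systemd unit (name from the PR).
const SYSTEMD_FAST_CRASH_ENV_VAR: &str = "FREENET_SYSTEMD_FAST_CRASH";

/// Off by default; set once at startup and read when choosing the exit code.
static EMIT_FAST_CRASH_EXIT_CODE: AtomicBool = AtomicBool::new(false);

fn enable_fast_crash_exit_code() {
    EMIT_FAST_CRASH_EXIT_CODE.store(true, Ordering::Relaxed);
}

/// Hypothetical accessor used only by this example.
fn fast_crash_exit_code_enabled() -> bool {
    EMIT_FAST_CRASH_EXIT_CODE.load(Ordering::Relaxed)
}

/// Hypothetical helper. Assumption: any non-empty marker value means the
/// supervising unit understands exit 45.
fn unit_advertises_fast_crash(marker: Option<&str>) -> bool {
    matches!(marker, Some(value) if !value.is_empty())
}

fn main() {
    // Entry point: opt in only when the supervising unit set the marker.
    let marker = std::env::var(SYSTEMD_FAST_CRASH_ENV_VAR).ok();
    if unit_advertises_fast_crash(marker.as_deref()) {
        enable_fast_crash_exit_code();
    }
    println!("fast-crash exit code enabled: {}", fast_crash_exit_code_enabled());
}
```

Without the marker (run-wrapper, old unit, or unsupervised) the flag stays off and the node keeps emitting 42.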

Testing

  • node::p2p_impl: fast-crash code distinct + counted; uptime split when enabled; always 42 when disabled (run-wrapper / old-unit / unsupervised); both the abort and fast-crash flags are opt-in / off-by-default.
  • service.rs: StartLimitAction in [Unit]; exit-45 out of SuccessExitStatus; ExecStopPost fires update on 42|45; fast-crash marker present (asserted via the const).
  • StartLimitBurst/IntervalSec [Unit]-placement is covered by #4570's linux.rs tests (not duplicated). Those #4570 helpers are also hardened to find section headers by line, not substring, so the legitimate [Service] reference now in a [Unit] comment isn't misparsed.
  • cargo fmt; cargo clippy --locked -- -D warnings and --features trace-ot both clean. Both generated units pass systemd-analyze verify (exit 0; only the pre-existing ExecStartPre shell-escape notice).
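
The line-based section lookup mentioned in the tests above might look roughly like this. It is a sketch under assumptions: `section_of_directive` is the helper name cited in the bot review on this PR, but its real signature is not shown here, and `SAMPLE_UNIT` is an illustrative fragment, not the generated unit.

```rust
/// Illustrative unit fragment (not the generated unit). The comment inside
/// [Unit] mentions "[Service]", which a substring search could mistake for a header.
const SAMPLE_UNIT: &str = "\
[Unit]
# StartLimit* directives are ignored in [Service], so they live here.
StartLimitAction=none

[Service]
SuccessExitStatus=42 43
";

/// Returns the section a directive appears in, matching headers by whole line.
/// Signature is an assumption made for this example.
fn section_of_directive<'a>(unit: &'a str, directive: &str) -> Option<&'a str> {
    let mut current: Option<&'a str> = None;
    for raw in unit.lines() {
        let line = raw.trim();
        if line.starts_with('#') || line.starts_with(';') {
            continue; // comments never open a section or define a directive
        }
        if line.starts_with('[') && line.ends_with(']') {
            current = Some(&line[1..line.len() - 1]);
            continue;
        }
        if let Some((key, _value)) = line.split_once('=') {
            if key.trim() == directive {
                return current;
            }
        }
    }
    None
}

fn main() {
    println!("{:?}", section_of_directive(SAMPLE_UNIT, "StartLimitAction"));
    println!("{:?}", section_of_directive(SAMPLE_UNIT, "SuccessExitStatus"));
}
```

Because headers are recognised only when a whole trimmed line is bracketed, the `[Service]` mention inside the comment cannot end the `[Unit]` section early.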

Refs #4551 (the section-placement fix landed in #4570); this completes the crash-loop bounding. Builds on #4570.

Note

The systemd burst-behavior change (counting exit 45 toward StartLimitBurst, StartLimitAction=none, marker-gated opt-in) wants a smoke-test on a live systemd host before the next release, per .claude/rules/deployment.md (a cfg(target_os = "linux") service path CI does not exercise end-to-end).

[AI-assisted - Claude]

🤖 Generated with Claude Code


github-actions Bot commented Jun 26, 2026

Contributor

Rule Review: No issues found

Rules checked: git-workflow.md, code-style.md, testing.md
Files reviewed: 7

Summary of changes reviewed:

  • auto_update.rs — new SYSTEMD_FAST_CRASH_ENV_VAR constant
  • service.rs — new test assertions and helper functions for section-placement verification
  • service/linux.rs — systemd unit template changes + test helper hardening
  • freenet.rs — opt-in call to enable_fast_crash_exit_code
  • lib.rs / node.rs — re-exports
  • node/p2p_impl.rs — core logic: FAST_CRASH_EXIT_CODE, MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT, fatal_listener_exit_code(), EMIT_FAST_CRASH_EXIT_CODE, tests

Checks performed:

  • Commit message: fix: distinguish fast-crash from update exit to bound #4551 crash loops — 71 chars, valid fix: prefix. ✓
  • Regression tests: fatal_listener_exit_code_distinguishes_fast_crash_from_healthy_uptime, fatal_listener_exit_code_uses_update_code_when_fast_crash_disabled, and fast_crash_exit_code_is_distinct_and_counted directly test the new pure function for the exact bug being fixed. ✓
  • Boundary conditions: [0,1,10,30,59] (fast crash), exact threshold 60s, [60,120,3600,86400] (healthy uptime), and the fully-disabled path all covered. ✓
  • No .unwrap() in production code: panic!() calls are confined to mod tests helpers. ✓
  • start_time.elapsed() at call site: start_time is a pre-existing tokio::time::Instant (line 378, not introduced by this PR); this is Rule Lint territory, not new. ✓
  • Platform guards: section_of_directive helper and its consumer tests are consistently #[cfg(target_os = "linux")]. ✓
  • EMIT_FAST_CRASH_EXIT_CODE global atomic: set-once at startup, read in hot path — standard pattern; fatal_listener_exit_code takes it as a parameter for testability, bypassing global state in unit tests. ✓
  • fatal_listener_exit_code extracted as pure function: matches the project pattern for testable decision logic (references deployment.md in its own docstring). ✓
  • section helper in linux.rs: correctly hardened from find("\n[") substring search to line-by-line matching to avoid false matches on section names appearing in inline comments. ✓

No rule violations detected.


Rule review against .claude/rules/. WARNING findings block merge.

Builds on #4570 (which moved StartLimitBurst/StartLimitIntervalSec into the
[Unit] section so systemd actually honours the crash-loop limiter). This adds a
crash-vs-update distinction so the now-effective limiter bounds real boot-wedge
loops WITHOUT losing the auto-update self-heal, on every supervision path.

systemd path:
- StartLimitAction=none (default, made explicit) in both generated units: a
  tripped crash loop STOPS the unit rather than rebooting the host.
- A distinct FAST_CRASH_EXIT_CODE (45), kept OUT of SuccessExitStatus, plus a
  min-uptime guard: a fatal listener exit under 60s uptime exits 45 (counted
  toward StartLimitBurst, so a tight loop trips the limiter and the unit stops);
  >=60s exits 42 (burst-exempt).
- ExecStopPost runs `freenet update` on exit 42 OR 45 (via a `case`, avoiding
  &&/|| precedence pitfalls). Counting (StartLimit) and self-heal (ExecStopPost)
  are independent, so a boot-crash that a newer release fixes still self-heals
  (#4549) even though exit 45 is rate-limited; `freenet update` is a separate
  process that can succeed even when `freenet network` boot-crashes.

cross-platform / cross-version safety (Codex P2):
- Exit 45 is emitted ONLY when the binary opted in via a new
  enable_fast_crash_exit_code() flag (mirrors enable_abort_on_fatal_listener_exit).
- The entry point opts in ONLY when the supervising unit advertises 45 support
  via a marker env var (FREENET_SYSTEMD_FAST_CRASH) that the REGENERATED systemd
  unit sets. So the runtime behavior is tied to the on-disk unit's actual
  capability: a node running this binary under an OLD/custom systemd unit (e.g.
  auto-updated but not reinstalled), under the macOS/Windows in-process
  run-wrapper (understands only 42; self-heals + bounds via its own backoff +
  50-failure cap), or unsupervised keeps emitting the self-healing exit 42.
- The marker is referenced via the SYSTEMD_FAST_CRASH_ENV_VAR const in both the
  unit template and the entry-point check (single source of truth, no drift).

Tests: p2p_impl behavioral tests (fast-crash code distinct + counted; uptime
split when enabled; always 42 when disabled; both abort and fast-crash flags are
opt-in/off-by-default); service.rs assertions for StartLimitAction placement,
exit-45 out of SuccessExitStatus, ExecStopPost firing update on 42|45, and the
fast-crash marker present (via the const). StartLimitBurst/IntervalSec
[Unit]-placement is left to #4570's linux.rs tests (not duplicated); those
helpers are also hardened to find section headers by line, not substring, so a
legitimate "[Service]" reference in a [Unit] comment isn't misparsed. Both
generated units pass `systemd-analyze verify`.

Refs #4551 (section-placement landed in #4570); completes the crash-loop
bounding. Builds on #4570.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
sanity force-pushed the fix/4551-crash-loop-startlimit branch from d894b68 to d511da5 June 26, 2026 19:04

sanity commented Jun 26, 2026

Collaborator Author

Reviewed and approving. Verified end to end: fast-crash exit-45 is emitted ONLY when the freshly-generated systemd unit sets the SYSTEMD_FAST_CRASH_ENV_VAR marker, so the macOS/Windows wrapper, unsupervised nodes, and a node that auto-updated the binary but not yet its old systemd unit all keep emitting exit-42 (existing self-heal + loop-bounding intact) — no regression on any path. The new unit fires freenet update on exit 42 OR 45 (boot-crash self-heal) while keeping 45 out of SuccessExitStatus so StartLimit counts it and a tight loop is bounded. Comprehensive tests (marker gating, ExecStopPost 42|45, exit-code selector, systemd-analyze), CI green. Completes #4551 with #4570. Merging.

[AI-assisted - Claude]

sanity added this pull request to the merge queue Jun 26, 2026
Merged via the queue into main with commit 07a179c Jun 26, 2026
16 checks passed
sanity deleted the fix/4551-crash-loop-startlimit branch June 26, 2026 19:51
sanity added a commit that referenced this pull request Jun 27, 2026
Problem
-------
A typical Linux Freenet install ends up unsupervised, so it never
auto-updates. scripts/install.sh defaulted the "install as a service?"
prompt to No, and a node with no service manager catching its exit-42
"update needed" signal exits to update and never restarts on the new
version - it silently stops updating. With unsupervised being the
dominant default on Linux, much of the network freezes on old releases.

Solution
--------
Make a supervised install the DEFAULT (issue #4073):

- install.sh now sets up supervision unless the user explicitly opts out
  (FREENET_NO_SERVICE=1). The interactive prompt defaults to Yes ([Y/n]);
  a non-interactive curl|sh run sets up supervision automatically.
- On Linux it prefers a SYSTEM service when it can elevate (already root,
  or sudo) - most reliable on the servers/VPS that dominate the node
  population (runs at boot, survives logout). When it cannot elevate it
  falls back to a USER service. A node is only left unsupervised as a
  last resort, with a loud warning explaining it will not auto-update.
- The binary's user-service install now enables systemd lingering
  (`loginctl enable-linger <user>`) by default so a --user service runs
  without an active login session (the headless-server footgun: without
  linger it stops at logout and never auto-updates). New `--no-linger`
  flag opts out. System services are unaffected (they start at boot).
- The decision honors an existing install so a re-run refreshes the same
  service type instead of creating a duplicate (idempotent + safe).

The generated systemd units are unchanged - this reuses the existing
unit generation, so the StartLimit/exit-45 work from #4570/#4588 is
preserved.

NOTE: this changes default install behavior (unsupervised -> supervised).

Testing
-------
- New scripts/test-install-sh.sh smoke-tests the system-vs-user decision
  (root / sudo / existing-unit / interactive permutations) by sourcing
  install.sh and overriding the environment probes. Wired into CI along
  with shellcheck on install.sh/uninstall.sh and the existing (previously
  unwired) uninstall smoke test.
- New linux.rs unit test pins the lingering policy (system never lingers;
  user lingers unless --no-linger).
- shellcheck clean; cargo fmt / clippy -D warnings / service tests green.
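
The lingering policy pinned by the unit test above (system never lingers; user lingers unless --no-linger) can be written as a one-line decision. This is only a sketch: `ServiceScope` and `should_enable_linger` are hypothetical names, not taken from the commit.

```rust
/// Hypothetical type for the two service kinds the installer can set up.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ServiceScope {
    System,
    User,
}

/// Hypothetical helper: a user service lingers unless --no-linger was passed;
/// a system service starts at boot and never needs lingering.
fn should_enable_linger(scope: ServiceScope, no_linger: bool) -> bool {
    matches!(scope, ServiceScope::User) && !no_linger
}

fn main() {
    for scope in [ServiceScope::System, ServiceScope::User] {
        for no_linger in [false, true] {
            println!(
                "{:?} no_linger={} -> linger={}",
                scope,
                no_linger,
                should_enable_linger(scope, no_linger)
            );
        }
    }
}
```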

Refs #4073

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
sanity added a commit that referenced this pull request Jun 27, 2026
Update detection was effectively once-at-boot. After the one-shot startup
GitHub check (#3864), the update-check loop's three triggers are all driven
by peer signals (urgent / highest-seen-version / version-mismatch). If an
entire network sits on the same old version, no node re-checks GitHub
directly, so a freshly published release is never picked up until a node
happens to restart. A node up for weeks never re-checks.

Add a fourth, peer-signal-independent trigger to the existing update-check
loop: a periodic direct GitHub re-poll that re-runs the same
startup_update_check on a recurring ~6h schedule, jittered +/-25%. A
discovered update feeds the SAME update_tx -> graceful-shutdown -> exit-42 ->
`freenet update` path as every existing trigger, so all merged
apply/verify/signing safety (checksum #4586, signature #4587, crash-loop
bounding #4551/#4588) applies unchanged.

WHY 6h +/-25%: far under GitHub's unauthenticated 60 req/hr/IP limit even at
the jittered minimum (~4.5h), while still propagating a release within hours;
the jitter decorrelates nodes that booted together so they do not re-poll
(and restart) in lockstep, preserving the load-spreading intent of the
existing 0-60s startup jitter and 0-4h decentralized stagger.

The re-poll is gated on should_attempt_update() so it honors the persistent
auto-update failure lockout (#3934): a locked-out node (e.g. non-writable
binary path) must not exit-42 once per interval and drive the supervisor to
rerun the same failing install. The jitter math is a pure, injectable
function (jittered_repoll_interval) with unit tests; the lockout gate is
pinned by a source-scrape test.
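
A sketch of the jitter math, assuming the random sample is injected by the caller as a value in [0, 1]. `jittered_repoll_interval` is the function name from this commit message; its real signature is not shown here, and `REPOLL_BASE` and `REPOLL_JITTER_FRACTION` are hypothetical constant names.

```rust
use std::time::Duration;

/// Base re-poll interval of 6h (constant name is hypothetical).
const REPOLL_BASE: Duration = Duration::from_secs(6 * 60 * 60);
/// Jitter of +/-25% around the base (constant name is hypothetical).
const REPOLL_JITTER_FRACTION: f64 = 0.25;

/// Maps an injected sample in [0, 1] onto [base * 0.75, base * 1.25].
/// Taking the sample as a parameter keeps the function pure and testable.
fn jittered_repoll_interval(unit_sample: f64) -> Duration {
    let sample = unit_sample.clamp(0.0, 1.0);
    let factor = 1.0 + REPOLL_JITTER_FRACTION * (2.0 * sample - 1.0);
    REPOLL_BASE.mul_f64(factor)
}

fn main() {
    for sample in [0.0, 0.5, 1.0] {
        let interval = jittered_repoll_interval(sample);
        println!("sample={} -> {}s", sample, interval.as_secs());
    }
}
```

With these assumptions the shortest interval is 4.5h and the longest 7.5h, so a node issues at most one direct GitHub check every 4.5h, well under the 60 requests per hour limit cited above.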

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU