fix: move StartLimitBurst/StartLimitIntervalSec into [Unit] section of generated systemd units#4570
Conversation
|
Now I have all the context I need to write up the review. Rule Review: Clean fix with good regression testsRules checked: git-workflow.md, code-style.md, testing.md WarningsNone. Info
Summary: The fix correctly relocates Rule review against |
|
Thanks @mvanhorn — this is exactly the fix for #4551 (the [AI-assisted - Claude] |
…f generated systemd units
dae1c4c to
d22f03d
Compare
|
Reviewed and approving. The section move is correct (StartLimit* belong in [Unit]; systemd-analyze confirms the old [Service] placement reproduces [AI-assisted - Claude] |
Builds on #4570 (which moved StartLimitBurst/StartLimitIntervalSec into the [Unit] section so systemd actually honours the crash-loop limiter). This adds the crash-vs-update distinction so the now-effective limiter only bounds real boot-wedge loops without disabling the auto-update self-heal path. - StartLimitAction=none (the default, made explicit) in both generated units: a tripped crash loop STOPS the unit rather than rebooting the host. - A distinct FAST_CRASH_EXIT_CODE (45), deliberately NOT in the unit's SuccessExitStatus, plus a min-uptime guard fatal_listener_exit_code(uptime): a fatal listener exit under 60s uptime exits 45 (counted toward the burst, so a tight loop trips StartLimitBurst and the unit stops); >=60s exits 42 (burst-exempt, keeps the freenet-update self-heal hook for #4549). Tests: new p2p_impl behavioral tests (fast-crash code is distinct + counted; fatal_listener_exit_code distinguishes fast-crash from healthy uptime) and service.rs assertions for StartLimitAction placement + exit-45 staying out of SuccessExitStatus. StartLimitBurst/IntervalSec [Unit]-placement assertions are left to #4570's linux.rs tests (not duplicated here). Also hardens those #4570 test helpers to detect section headers by line rather than substring, so a legitimate "[Service]" reference inside a [Unit] comment is no longer mistaken for the section header. Refs #4551 (section-placement landed in #4570); completes the crash-loop bounding so a boot-wedge loop is actually rate-limited. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Builds on #4570 (which moved StartLimitBurst/StartLimitIntervalSec into the [Unit] section so systemd actually honours the crash-loop limiter). This adds a crash-vs-update distinction so the now-effective limiter bounds real boot-wedge loops WITHOUT losing the auto-update self-heal, on every supervision path. systemd path: - StartLimitAction=none (default, made explicit) in both generated units: a tripped crash loop STOPS the unit rather than rebooting the host. - A distinct FAST_CRASH_EXIT_CODE (45), kept OUT of SuccessExitStatus, plus a min-uptime guard: a fatal listener exit under 60s uptime exits 45 (counted toward StartLimitBurst, so a tight loop trips the limiter and the unit stops); >=60s exits 42 (burst-exempt). - ExecStopPost runs `freenet update` on exit 42 OR 45 (via a `case`, avoiding &&/|| precedence pitfalls). Counting (StartLimit) and self-heal (ExecStopPost) are independent, so a boot-crash that a newer release fixes still self-heals (#4549) even though exit 45 is rate-limited; `freenet update` is a separate process that can succeed even when `freenet network` boot-crashes. cross-platform / cross-version safety (Codex P2): - Exit 45 is emitted ONLY when the binary opted in via a new enable_fast_crash_exit_code() flag (mirrors enable_abort_on_fatal_listener_exit). - The entry point opts in ONLY when the supervising unit advertises 45 support via a marker env var (FREENET_SYSTEMD_FAST_CRASH) that the REGENERATED systemd unit sets. So the runtime behavior is tied to the on-disk unit's actual capability: a node running this binary under an OLD/custom systemd unit (e.g. auto-updated but not reinstalled), under the macOS/Windows in-process run-wrapper (understands only 42; self-heals + bounds via its own backoff + 50-failure cap), or unsupervised keeps emitting the self-healing exit 42. - The marker is referenced via the SYSTEMD_FAST_CRASH_ENV_VAR const in both the unit template and the entry-point check (single source of truth, no drift). Tests: p2p_impl behavioral tests (fast-crash code distinct + counted; uptime split when enabled; always 42 when disabled; both abort and fast-crash flags are opt-in/off-by-default); service.rs assertions for StartLimitAction placement, exit-45 out of SuccessExitStatus, ExecStopPost firing update on 42|45, and the fast-crash marker present (via the const). StartLimitBurst/IntervalSec [Unit]-placement is left to #4570's linux.rs tests (not duplicated); those helpers are also hardened to find section headers by line, not substring, so a legitimate "[Service]" reference in a [Unit] comment isn't misparsed. Both generated units pass `systemd-analyze verify`. Refs #4551 (section-placement landed in #4570); completes the crash-loop bounding. Builds on #4570. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Problem ------- A typical Linux Freenet install ends up unsupervised, so it never auto-updates. scripts/install.sh defaulted the "install as a service?" prompt to No, and a node with no service manager catching its exit-42 "update needed" signal exits to update and never restarts on the new version - it silently stops updating. With unsupervised being the dominant default on Linux, much of the network freezes on old releases. Solution -------- Make a supervised install the DEFAULT (issue #4073): - install.sh now sets up supervision unless the user explicitly opts out (FREENET_NO_SERVICE=1). The interactive prompt defaults to Yes ([Y/n]); a non-interactive curl|sh run sets up supervision automatically. - On Linux it prefers a SYSTEM service when it can elevate (already root, or sudo) - most reliable on the servers/VPS that dominate the node population (runs at boot, survives logout). When it cannot elevate it falls back to a USER service. A node is only left unsupervised as a last resort, with a loud warning explaining it will not auto-update. - The binary's user-service install now enables systemd lingering (`loginctl enable-linger <user>`) by default so a --user service runs without an active login session (the headless-server footgun: without linger it stops at logout and never auto-updates). New `--no-linger` flag opts out. System services are unaffected (they start at boot). - The decision honors an existing install so a re-run refreshes the same service type instead of creating a duplicate (idempotent + safe). The generated systemd units are unchanged - this reuses the existing unit generation, so the StartLimit/exit-45 work from #4570/#4588 is preserved. NOTE: this changes default install behavior (unsupervised -> supervised). Testing ------- - New scripts/test-install-sh.sh smoke-tests the system-vs-user decision (root / sudo / existing-unit / interactive permutations) by sourcing install.sh and overriding the environment probes. Wired into CI along with shellcheck on install.sh/uninstall.sh and the existing (previously unwired) uninstall smoke test. - New linux.rs unit test pins the lingering policy (system never lingers; user lingers unless --no-linger). - shellcheck clean; cargo fmt / clippy -D warnings / service tests green. Refs #4073 [AI-assisted - Claude] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Summary
The generated systemd unit files now place
StartLimitIntervalSecandStartLimitBurstin the[Unit]section, where systemd actually reads them, instead of[Service]. A regression test covers the generated unit content for both unit variants.Why this matters
systemd only honors the start-limit directives when they appear in
[Unit]; with them in[Service], systemd logs a warning and the intended restart rate-limiting never takes effect, so a crash-looping service is not throttled as configured. Moving the directives to the correct section makes the limit apply. Reported in #4551.Testing
Added regression tests asserting the generated unit files carry the start-limit directives in the
[Unit]section for both generated unit variants.Closes #4551
AI was used for assistance.