Skip to content

fix: move StartLimitBurst/StartLimitIntervalSec into [Unit] section of generated systemd units#4570

Merged
sanity merged 1 commit into
freenet:mainfrom
mvanhorn:fix/4551-2026-06-24-204-bug-systemd-startlimit-section
Jun 26, 2026
Merged

fix: move StartLimitBurst/StartLimitIntervalSec into [Unit] section of generated systemd units#4570
sanity merged 1 commit into
freenet:mainfrom
mvanhorn:fix/4551-2026-06-24-204-bug-systemd-startlimit-section

Conversation

@mvanhorn

Copy link
Copy Markdown
Contributor

Summary

The generated systemd unit files now place StartLimitIntervalSec and StartLimitBurst in the [Unit] section, where systemd actually reads them, instead of [Service]. A regression test covers the generated unit content for both unit variants.

Why this matters

systemd only honors the start-limit directives when they appear in [Unit]; with them in [Service], systemd logs a warning and the intended restart rate-limiting never takes effect, so a crash-looping service is not throttled as configured. Moving the directives to the correct section makes the limit apply. Reported in #4551.

Testing

Added regression tests asserting the generated unit files carry the start-limit directives in the [Unit] section for both generated unit variants.

Closes #4551

AI was used for assistance.

@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Now I have all the context I need to write up the review.

Rule Review: Clean fix with good regression tests

Rules checked: git-workflow.md, code-style.md, testing.md
Files reviewed: 1 (crates/core/src/bin/commands/service/linux.rs)

Warnings

None.

Info

  • commits.txt:1 — Commit subject is ~94 characters, exceeding the 72-character limit in git-workflow.md. The subject fix: move StartLimitBurst/StartLimitIntervalSec into [Unit] section of generated systemd units is valid conventional-commits format but slightly long. (rule: git-workflow.md)

Summary: The fix correctly relocates StartLimitBurst/StartLimitIntervalSec from [Service] to [Unit] (where systemd actually reads them since v229). Three regression tests were added and are appropriately scoped to target_os = "linux". Tests check section placement, uniqueness, and the SuccessExitStatus=42 invariant — solid coverage of the bug and adjacent behavior. No production-code issues.


Rule review against .claude/rules/. WARNING findings block merge.

@sanity sanity changed the title Move StartLimitBurst/StartLimitIntervalSec into [Unit] section of generated systemd units fix: move StartLimitBurst/StartLimitIntervalSec into [Unit] section of generated systemd units Jun 26, 2026
@sanity

sanity commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Thanks @mvanhorn — this is exactly the fix for #4551 (the StartLimit* directives were being silently ignored in [Service]). I added the fix: conventional-commit prefix to the title to unblock CI; the change itself is clean (correct section move + good regression tests). Reviewing now as part of the 0.2.84 auto-update-safety work. The complementary crash-loop auto-rollback and a min-uptime guard will come as a separate follow-up so this focused fix can land on its own.

[AI-assisted - Claude]

@sanity sanity force-pushed the fix/4551-2026-06-24-204-bug-systemd-startlimit-section branch from dae1c4c to d22f03d Compare June 26, 2026 16:25
@sanity

sanity commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Reviewed and approving. The section move is correct (StartLimit* belong in [Unit]; systemd-analyze confirms the old [Service] placement reproduces Unknown key name 'StartLimitIntervalSec' in section 'Service', ignoring), the regression tests assert placement by section, and an external Codex pass found no introduced issues. CI green. This is the P0 for #4551 — it makes the crash-loop rate-limit actually take effect. The complementary crash-vs-update guard (distinct fast-crash exit code + min-uptime) lands as a separate follow-up. Thanks @mvanhorn. Merging.

[AI-assisted - Claude]

@sanity sanity added this pull request to the merge queue Jun 26, 2026
Merged via the queue into freenet:main with commit fc07db0 Jun 26, 2026
16 checks passed
sanity added a commit that referenced this pull request Jun 26, 2026
Builds on #4570 (which moved StartLimitBurst/StartLimitIntervalSec into the
[Unit] section so systemd actually honours the crash-loop limiter). This adds
the crash-vs-update distinction so the now-effective limiter only bounds real
boot-wedge loops without disabling the auto-update self-heal path.

- StartLimitAction=none (the default, made explicit) in both generated units:
  a tripped crash loop STOPS the unit rather than rebooting the host.
- A distinct FAST_CRASH_EXIT_CODE (45), deliberately NOT in the unit's
  SuccessExitStatus, plus a min-uptime guard fatal_listener_exit_code(uptime):
  a fatal listener exit under 60s uptime exits 45 (counted toward the burst, so
  a tight loop trips StartLimitBurst and the unit stops); >=60s exits 42
  (burst-exempt, keeps the freenet-update self-heal hook for #4549).

Tests: new p2p_impl behavioral tests (fast-crash code is distinct + counted;
fatal_listener_exit_code distinguishes fast-crash from healthy uptime) and
service.rs assertions for StartLimitAction placement + exit-45 staying out of
SuccessExitStatus. StartLimitBurst/IntervalSec [Unit]-placement assertions are
left to #4570's linux.rs tests (not duplicated here). Also hardens those #4570
test helpers to detect section headers by line rather than substring, so a
legitimate "[Service]" reference inside a [Unit] comment is no longer mistaken
for the section header.

Refs #4551 (section-placement landed in #4570); completes the crash-loop
bounding so a boot-wedge loop is actually rate-limited.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
sanity added a commit that referenced this pull request Jun 26, 2026
Builds on #4570 (which moved StartLimitBurst/StartLimitIntervalSec into the
[Unit] section so systemd actually honours the crash-loop limiter). This adds a
crash-vs-update distinction so the now-effective limiter bounds real boot-wedge
loops WITHOUT losing the auto-update self-heal, on every supervision path.

systemd path:
- StartLimitAction=none (default, made explicit) in both generated units: a
  tripped crash loop STOPS the unit rather than rebooting the host.
- A distinct FAST_CRASH_EXIT_CODE (45), kept OUT of SuccessExitStatus, plus a
  min-uptime guard: a fatal listener exit under 60s uptime exits 45 (counted
  toward StartLimitBurst, so a tight loop trips the limiter and the unit stops);
  >=60s exits 42 (burst-exempt).
- ExecStopPost runs `freenet update` on exit 42 OR 45 (via a `case`, avoiding
  &&/|| precedence pitfalls). Counting (StartLimit) and self-heal (ExecStopPost)
  are independent, so a boot-crash that a newer release fixes still self-heals
  (#4549) even though exit 45 is rate-limited; `freenet update` is a separate
  process that can succeed even when `freenet network` boot-crashes.

cross-platform / cross-version safety (Codex P2):
- Exit 45 is emitted ONLY when the binary opted in via a new
  enable_fast_crash_exit_code() flag (mirrors enable_abort_on_fatal_listener_exit).
- The entry point opts in ONLY when the supervising unit advertises 45 support
  via a marker env var (FREENET_SYSTEMD_FAST_CRASH) that the REGENERATED systemd
  unit sets. So the runtime behavior is tied to the on-disk unit's actual
  capability: a node running this binary under an OLD/custom systemd unit (e.g.
  auto-updated but not reinstalled), under the macOS/Windows in-process
  run-wrapper (understands only 42; self-heals + bounds via its own backoff +
  50-failure cap), or unsupervised keeps emitting the self-healing exit 42.
- The marker is referenced via the SYSTEMD_FAST_CRASH_ENV_VAR const in both the
  unit template and the entry-point check (single source of truth, no drift).

Tests: p2p_impl behavioral tests (fast-crash code distinct + counted; uptime
split when enabled; always 42 when disabled; both abort and fast-crash flags are
opt-in/off-by-default); service.rs assertions for StartLimitAction placement,
exit-45 out of SuccessExitStatus, ExecStopPost firing update on 42|45, and the
fast-crash marker present (via the const). StartLimitBurst/IntervalSec
[Unit]-placement is left to #4570's linux.rs tests (not duplicated); those
helpers are also hardened to find section headers by line, not substring, so a
legitimate "[Service]" reference in a [Unit] comment isn't misparsed. Both
generated units pass `systemd-analyze verify`.

Refs #4551 (section-placement landed in #4570); completes the crash-loop
bounding. Builds on #4570.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
sanity added a commit that referenced this pull request Jun 27, 2026
Problem
-------
A typical Linux Freenet install ends up unsupervised, so it never
auto-updates. scripts/install.sh defaulted the "install as a service?"
prompt to No, and a node with no service manager catching its exit-42
"update needed" signal exits to update and never restarts on the new
version - it silently stops updating. With unsupervised being the
dominant default on Linux, much of the network freezes on old releases.

Solution
--------
Make a supervised install the DEFAULT (issue #4073):

- install.sh now sets up supervision unless the user explicitly opts out
  (FREENET_NO_SERVICE=1). The interactive prompt defaults to Yes ([Y/n]);
  a non-interactive curl|sh run sets up supervision automatically.
- On Linux it prefers a SYSTEM service when it can elevate (already root,
  or sudo) - most reliable on the servers/VPS that dominate the node
  population (runs at boot, survives logout). When it cannot elevate it
  falls back to a USER service. A node is only left unsupervised as a
  last resort, with a loud warning explaining it will not auto-update.
- The binary's user-service install now enables systemd lingering
  (`loginctl enable-linger <user>`) by default so a --user service runs
  without an active login session (the headless-server footgun: without
  linger it stops at logout and never auto-updates). New `--no-linger`
  flag opts out. System services are unaffected (they start at boot).
- The decision honors an existing install so a re-run refreshes the same
  service type instead of creating a duplicate (idempotent + safe).

The generated systemd units are unchanged - this reuses the existing
unit generation, so the StartLimit/exit-45 work from #4570/#4588 is
preserved.

NOTE: this changes default install behavior (unsupervised -> supervised).

Testing
-------
- New scripts/test-install-sh.sh smoke-tests the system-vs-user decision
  (root / sudo / existing-unit / interactive permutations) by sourcing
  install.sh and overriding the environment probes. Wired into CI along
  with shellcheck on install.sh/uninstall.sh and the existing (previously
  unwired) uninstall smoke test.
- New linux.rs unit test pins the lingering policy (system never lingers;
  user lingers unless --no-linger).
- shellcheck clean; cargo fmt / clippy -D warnings / service tests green.

Refs #4073

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

systemd unit: StartLimitBurst/StartLimitIntervalSec are in [Service], where systemd ignores them

2 participants