fix: exit-for-restart on redb poison instead of bricking forever (#4604)#4609
fix: exit-for-restart on redb poison instead of bricking forever (#4604)#4609sanity wants to merge 4 commits into
Conversation
Problem
-------
A single transient redb I/O error permanently and silently bricks a node's
contract operations. redb latches an in-memory poison flag (`io_failed`) after
any backend read/write error; from then on every `begin_write` returns
`StorageError::PreviousIo` ("Previous I/O error occurred. Please close and
re-open the database."), so every contract PUT/UPDATE (and the hosting-metadata
write on essentially every access) fails forever. Because the process never
exits, systemd never restarts it and the exit-42 auto-update hook never fires —
the node stays "running" while 100% of contract ops fail. Observed on 0.2.84
(failing non-ECC RAM → btrfs csum EIO → redb poison), but any transient disk
EIO / fs hiccup triggers it on healthy hardware too.
Solution
--------
Option (b): detect the poison at the storage-op layer and EXIT the process so
the supervisor recycles the node with a fresh, un-poisoned database handle (and
the health/update hooks fire). Chosen over in-process close+reopen (option a)
because reopen does NOT interact with the crash-loop protection: on persistent
corruption it would churn-reopen forever, recreating the exact "silently broken,
supervisor never engages" failure this issue is about.
- Precise detection: `storage_error_is_poison` matches only the I/O-poison class
(`PreviousIo`, `Io`, `LockPoisoned`, `DatabaseClosed`) by typed variant, never
the message string. Benign not-found (`Ok(None)`), `TableDoesNotExist`,
`Corrupted`, `ValueTooLarge`, etc. are NOT treated as poison.
- Choke point: `ReDb::begin_write` / `begin_read` wrappers route a poison error
to the recovery path. `begin_write` is the reliable trigger — redb checks the
poison flag on every `begin_write`, and the node writes on essentially every op.
- The exit reuses the existing `fatal_listener_exit_code` decision (#4549/#4551),
so it inherits the crash-loop protection unchanged: a poison after a healthy
uptime exits 42 (burst-exempt, fires the update self-heal); a poison within the
60s healthy window (under a supervisor that understands it) exits the COUNTED
code 45, so a database that re-poisons on every boot (persistent corruption) is
bounded by `StartLimitBurst` and the #4591 auto-rollback probation instead of
restart-looping forever. No new exit code is introduced.
- Opt-in (`enable_abort_on_redb_poison`, called only by the real binary) so
simulation/integration tests and library embedders never have a storage error
kill their host process.
Testing
-------
- `redb_poison_classifier_is_precise`: produces REAL redb errors via a
fault-injecting `StorageBackend` (genuine `Io` then `PreviousIo`) and asserts
they classify as poison, while a benign table error does not.
- `poisoned_redb_takes_recovery_path_benign_does_not`: a poisoned database routes
a write op to the recovery path (observed via a test counter; abort is off in
tests) while a benign not-found does not.
- `redb_poison_exit_reuses_crash_loop_bounded_codes` / `redb_poison_abort_is_opt_in`:
pin the counted-vs-burst-exempt exit-code interaction and the opt-in default.
Needs a release to deploy.
Closes #4604
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Rule Review: No issues foundRules checked: SummaryThis PR adds opt-in process-exit-for-restart when redb's in-memory poison flag is tripped by a transient I/O error, and is a clean Regression tests present (
No
Commit messages: All follow conventional commits format. All explain the why in commit bodies. Contracts module ( No rule violations detected. Rule review against |
Codex review of the initial fix flagged that the begin_* wrappers only catch poison at transaction start. redb's `begin_read` does NOT check the poison flag (it serves the last committed snapshot from cache), so a poisoned read fails later at `open_table`/`get`/iteration — a read-only workload could keep erroring until some later write hit `begin_write`. Close the gap: a `read_guarded` helper wraps every read method's body and routes any poison error (surfacing after begin_read) to the same recovery path. Crucially, the umbrella read-path classifier (`redb_error_is_poison`) matches `PreviousIo | LockPoisoned | DatabaseClosed` but NOT `Io` — several read methods SYNTHESIZE `redb::Error::Io(InvalidData)` for a benign malformed-data row (e.g. a wrong-length CodeHash), and treating that as poison would exit-and-restart the node on a single bad row (a crash loop). A genuine backend I/O poison is still caught: redb latches its flag on the first backend Io, so the next backend read returns `PreviousIo` (caught here) and the next `begin_write` returns `PreviousIo` (caught by the transaction-path classifier, which keeps matching `Io` since redb never synthesizes Io there). The classifier test pins both directions, including that a synthetic `Io(InvalidData)` is NOT treated as poison. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…review) Codex's second pass: the FIRST backend I/O failure that poisons redb usually surfaces from commit() as CommitError::Storage(Io), which the begin_* wrappers don't see — so the node stayed running on the poisoning write until a later op hit begin_write and tripped PreviousIo. Add `commit_guarded`, used by every write method's commit, which routes a poison commit error to the recovery path on the very op that poisons the handle. The commit path comes straight from redb (never the synthetic Io(InvalidData) of a malformed row), so it matches `Io` via `storage_error_is_poison`. The end-to-end test now asserts the poisoning write triggers recovery on the same op. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…#4604) Addresses the rule-review findings on the PR: - load_all_user_secrets_index: reindent the read_guarded closure body to match the other converted methods (rustfmt had silently skipped reformatting this longer fn, so cargo fmt --check passed despite the inconsistent indentation). - Add a deterministic read-path assertion: feed read_guarded a real PreviousIo (from the poisoned handle) and confirm it routes to the recovery path, covering the read_guarded -> route_redb_error wiring in the end-to-end test (previously only the begin_write/commit write path was observed via the counter). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
Problem
A single transient redb I/O error permanently and silently bricks a node's contract operations, with no recovery path.
redb latches an in-memory poison flag (
io_failed) after any backend read/write error. From then on everybegin_write()returnsStorageError::PreviousIo("Previous I/O error occurred. Please close and re-open the database."), so every contract PUT/UPDATE — and the hosting-metadata write performed on essentially every access — fails forever. Because the process never exits, systemd never restarts it and the exit-42 auto-update hook never fires: the node stays "running" while 100% of its contract ops fail. Blast radius is total, the failure is silent.Observed on a 0.2.84 node whose failing non-ECC RAM produced a btrfs csum failure →
EIO→ redb poison. The bad RAM is the trigger; the defect is the absence of any recovery from a transient I/O error — the same permanent brick follows from any transient diskEIO/ fs hiccup on healthy hardware.Solution
Option (b): detect the poison at the storage-op layer and EXIT the process so the supervisor recycles the node with a fresh, un-poisoned database handle (and the health/update hooks fire).
Chosen over in-process close+reopen (option a) because reopen does not interact with the crash-loop protection: on persistent corruption (the real-world case) it would churn-reopen forever, re-creating the exact "silently broken, supervisor never engages" failure this issue is about. Exit-for-restart routes persistent corruption into the existing crash-loop limiter; in-process reopen can't.
Detection coverage (all three storage entry points)
Poison is routed to the recovery path wherever it can surface:
begin_write/begin_readwrappers — catch a poison at transaction start.begin_writeis the steady-state choke point: redb checks the poison flag on everybegin_write(check_io_errors), and the node writes hosting metadata on essentially every contract access.commit_guarded(every write commit) — the FIRST backend I/O error usually surfaces here asCommitError::Storage(Io), on the very op that poisons the handle, so recovery fires immediately rather than waiting for the nextbegin_write.read_guarded(every read method body) — redb'sbegin_readdoes NOT check the poison flag (it serves the cached snapshot), so a poisoned read fails later atopen_table/get/iteration; this gives read-only workloads the same prompt exit-for-restart.Precise detection — two classifiers, one deliberate asymmetry
storage_error_is_poison, onStorageError): matchesPreviousIo | Io | LockPoisoned | DatabaseClosed. These errors come straight from redb and are never synthesized, so matchingIois safe.redb_error_is_poison, on the umbrellaredb::Error): matchesPreviousIo | LockPoisoned | DatabaseClosedbut NOTIo— several read methods SYNTHESIZEredb::Error::Io(InvalidData)for a benign malformed-data row (e.g. a wrong-lengthCodeHash), and treating that as poison would exit-and-restart the node on a single bad row (a crash loop). A genuine backend I/O poison is still caught: redb latches its flag on the first backendIo, so the next backend read returnsPreviousIo(caught here) and the nextbegin_writereturnsPreviousIo(caught by the transaction-path classifier).Benign not-found (
Ok(None)),TableDoesNotExist,Corrupted,ValueTooLarge, etc. are never treated as poison, so a normal missing-contract never restarts the node.The whole mechanism is opt-in (
enable_abort_on_redb_poison, called only by the realfreenetbinary) so simulation/integration tests and library embedders never have a storage error kill their host process.Interaction with crash-loop protection (#4591 / #4588 / systemd StartLimit)
The poison exit reuses the existing
fatal_listener_exit_code(uptime, fast_crash_enabled)decision (#4549/#4551) — no new exit code is introduced:SuccessExitStatus=42), fires thefreenet updateself-heal hook. A genuinely transient I/O blip (the Node silently bricked (all contract ops fail forever) after a single transient redb I/O error — no poison recovery #4604 case: a node that ran healthily for hours) restarts cleanly without trippingStartLimitBurst.SuccessExitStatus, so it counts towardStartLimitBurst, and the systemd unit runsfreenet updateon 42 OR 45 so the feat: crash-loop auto-rollback for the auto-updater (#4073) #4591 auto-rollback probation counts it viaFREENET_POST_STOP_EXIT_CODE. A database that re-poisons on every boot (persistent disk/RAM corruption) therefore trips StartLimit / rolls back instead of looping forever.Net: a transient poison self-heals with one restart; a persistent one is bounded by the same protection the fatal-listener path already uses.
Testing
redb_poison_classifier_is_precise— produces real redb errors via a fault-injectingStorageBackend(a genuineStorageError::Io, thenPreviousIoonce redb latches the poison flag) and asserts they classify as poison; asserts the syntheticIo(InvalidData)malformed-row error does not (the false-positive guard); asserts a benign table-open error is not poison. Variant-based, so it can't drift with redb wording.poisoned_redb_takes_recovery_path_benign_does_not— end-to-end: a benign not-found does not trigger recovery; the poisoning write (commit-time I/O) takes the recovery path on the same op; a subsequent poisoned write does too (observed via a test counter; abort is opt-in and off in tests, so it returns instead of exiting).redb_poison_exit_reuses_crash_loop_bounded_codes/redb_poison_abort_is_opt_in— pin the counted-vs-burst-exempt exit-code interaction (a re-poison within 60s uses the counted code) and the opt-in default.cargo fmt,cargo clippy --locked -- -D warnings, and the redb + p2p_impl test suites are green. Codex review (--base origin/main) run across three iterations; its two findings (read-path and commit-time coverage) are fixed and the final pass is clean.Deploying
This needs a release to reach live nodes; build + PR + review only here. It interacts with the crash-loop protection as described above, so it should ride a normal release (no special rollout).
Closes #4604
🤖 Generated with Claude Code
[AI-assisted - Claude]