
fix: exit-for-restart on redb poison instead of bricking forever (#4604) #4609

Open
sanity wants to merge 4 commits into main from fix/4604-redb-poison-recovery

Conversation

@sanity

@sanity sanity commented Jun 28, 2026

Collaborator

Problem

A single transient redb I/O error permanently and silently bricks a node's contract operations, with no recovery path.

redb latches an in-memory poison flag (io_failed) after any backend read/write error. From then on every begin_write() returns StorageError::PreviousIo ("Previous I/O error occurred. Please close and re-open the database."), so every contract PUT/UPDATE, and the hosting-metadata write performed on essentially every access, fails forever. Because the process never exits, systemd never restarts it and the exit-42 auto-update hook never fires: the node stays "running" while 100% of its contract ops fail. The blast radius is total and the failure is silent.

Observed on a 0.2.84 node whose failing non-ECC RAM produced a btrfs csum failure → EIO → redb poison. The bad RAM is the trigger; the defect is the absence of any recovery from a transient I/O error — the same permanent brick follows from any transient disk EIO / fs hiccup on healthy hardware.
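The latch behaviour described above can be modelled in a few lines. This is a toy stand-in, not redb's code: `ToyDb`, `ToyError`, and `backend_write` are hypothetical names; only the `io_failed` flag and the `PreviousIo` semantics are taken from the description.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the two error kinds discussed above.
#[derive(Debug, PartialEq)]
pub enum ToyError {
    /// The first, genuine backend failure.
    Io,
    /// Every later attempt to start a write once the flag is latched.
    PreviousIo,
}

/// Toy database handle that latches a poison flag, as redb is described to do.
#[derive(Default)]
pub struct ToyDb {
    io_failed: AtomicBool,
}

impl ToyDb {
    /// Simulates one backend write; `fail` injects a one-off I/O error.
    pub fn backend_write(&self, fail: bool) -> Result<(), ToyError> {
        if fail {
            // The flag is set once and never cleared for this handle.
            self.io_failed.store(true, Ordering::SeqCst);
            return Err(ToyError::Io);
        }
        Ok(())
    }

    /// Starting a write transaction checks the latched flag first.
    pub fn begin_write(&self) -> Result<(), ToyError> {
        if self.io_failed.load(Ordering::SeqCst) {
            return Err(ToyError::PreviousIo);
        }
        Ok(())
    }
}
```

In this model a later successful `backend_write(false)` does not clear the flag, which is why a transient error leaves the handle unusable until it is dropped and reopened.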

Solution

Option (b): detect the poison at the storage-op layer and EXIT the process so the supervisor recycles the node with a fresh, un-poisoned database handle (and the health/update hooks fire).

Chosen over in-process close+reopen (option a) because reopen does not interact with the crash-loop protection: on persistent corruption (the real-world case) it would churn-reopen forever, re-creating the exact "silently broken, supervisor never engages" failure this issue is about. Exit-for-restart routes persistent corruption into the existing crash-loop limiter; in-process reopen can't.

Detection coverage (all three storage entry points)

Poison is routed to the recovery path wherever it can surface:

  1. begin_write / begin_read wrappers — catch a poison at transaction start. begin_write is the steady-state choke point: redb checks the poison flag on every begin_write (check_io_errors), and the node writes hosting metadata on essentially every contract access.
  2. commit_guarded (every write commit) — the FIRST backend I/O error usually surfaces here as CommitError::Storage(Io), on the very op that poisons the handle, so recovery fires immediately rather than waiting for the next begin_write.
  3. read_guarded (every read method body) — redb's begin_read does NOT check the poison flag (it serves the cached snapshot), so a poisoned read fails later at open_table/get/iteration; this gives read-only workloads the same prompt exit-for-restart.
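A minimal sketch of how the commit and read entry points might route errors through a shared recovery hook. Everything here is illustrative: `Guard`, `ToyError`, and the counter are hypothetical, and the two small classifiers are simplified placeholders for the real ones. The method names `commit_guarded` and `read_guarded` follow the list above, but their real signatures are not shown in this PR text.

```rust
use std::cell::Cell;

/// Hypothetical error type standing in for redb's errors.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyError {
    Io,
    PreviousIo,
    NotFound,
}

/// Simplified transaction/commit-path classifier (assumption).
fn txn_is_poison(e: &ToyError) -> bool {
    matches!(e, ToyError::Io | ToyError::PreviousIo)
}

/// Simplified read-body classifier: deliberately does not match `Io`.
fn read_is_poison(e: &ToyError) -> bool {
    matches!(e, ToyError::PreviousIo)
}

/// Counts visits to the recovery path instead of exiting the process.
pub struct Guard {
    pub recoveries: Cell<u32>,
}

impl Guard {
    /// Records a recovery if the error is poison, then hands the error back.
    fn route(&self, err: ToyError, is_poison: fn(&ToyError) -> bool) -> ToyError {
        if is_poison(&err) {
            self.recoveries.set(self.recoveries.get() + 1);
        }
        err
    }

    /// Wraps a write commit so the poisoning op itself triggers recovery.
    pub fn commit_guarded(
        &self,
        commit: impl FnOnce() -> Result<(), ToyError>,
    ) -> Result<(), ToyError> {
        commit().map_err(|e| self.route(e, txn_is_poison))
    }

    /// Wraps a read method body, where poison surfaces after begin_read.
    pub fn read_guarded<T>(
        &self,
        body: impl FnOnce() -> Result<T, ToyError>,
    ) -> Result<T, ToyError> {
        body().map_err(|e| self.route(e, read_is_poison))
    }
}
```

The wrappers always return the original error to the caller; the only side effect is the recovery hook, which in this sketch is a counter rather than a process exit.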

Precise detection — two classifiers, one deliberate asymmetry

  • Transaction/commit path (storage_error_is_poison, on StorageError): matches PreviousIo | Io | LockPoisoned | DatabaseClosed. These errors come straight from redb and are never synthesized, so matching Io is safe.
  • Read-method-body path (redb_error_is_poison, on the umbrella redb::Error): matches PreviousIo | LockPoisoned | DatabaseClosed but NOT Io — several read methods SYNTHESIZE redb::Error::Io(InvalidData) for a benign malformed-data row (e.g. a wrong-length CodeHash), and treating that as poison would exit-and-restart the node on a single bad row (a crash loop). A genuine backend I/O poison is still caught: redb latches its flag on the first backend Io, so the next backend read returns PreviousIo (caught here) and the next begin_write returns PreviousIo (caught by the transaction-path classifier).

Benign not-found (Ok(None)), TableDoesNotExist, Corrupted, ValueTooLarge, etc. are never treated as poison, so a normal missing-contract never restarts the node.
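The asymmetry between the two classifiers can be sketched as follows. The enums are local stand-ins so the example compiles without the redb crate; they mirror the variant names mentioned above but simplify the payloads, so treat the exact shapes as assumptions rather than redb's real definitions.

```rust
use std::io;

/// Stand-in for the transaction/commit-path error type (simplified payloads).
#[derive(Debug)]
pub enum ToyStorageError {
    Io(io::Error),
    PreviousIo,
    LockPoisoned,
    DatabaseClosed,
    Corrupted(String),
    ValueTooLarge(usize),
}

/// Stand-in for the umbrella error seen inside read method bodies.
#[derive(Debug)]
pub enum ToyRedbError {
    Io(io::Error),
    PreviousIo,
    LockPoisoned,
    DatabaseClosed,
    TableDoesNotExist(String),
    Corrupted(String),
}

/// Transaction/commit path: `Io` is never synthesized here, so it counts as poison.
pub fn storage_error_is_poison(e: &ToyStorageError) -> bool {
    matches!(
        e,
        ToyStorageError::PreviousIo
            | ToyStorageError::Io(_)
            | ToyStorageError::LockPoisoned
            | ToyStorageError::DatabaseClosed
    )
}

/// Read-body path: `Io` is excluded because a malformed row may be reported
/// as a synthetic `Io(InvalidData)`.
pub fn redb_error_is_poison(e: &ToyRedbError) -> bool {
    matches!(
        e,
        ToyRedbError::PreviousIo | ToyRedbError::LockPoisoned | ToyRedbError::DatabaseClosed
    )
}
```

Matching on variants rather than on message strings keeps the classification stable if the error wording changes.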

The whole mechanism is opt-in (enable_abort_on_redb_poison, called only by the real freenet binary) so simulation/integration tests and library embedders never have a storage error kill their host process.
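One way such an opt-in switch could look, as a rough sketch. Only the name `enable_abort_on_redb_poison` comes from this PR; the statics, `PoisonAction`, `on_poison_detected`, and `recovery_path_hits` are hypothetical, and the sketch returns an action instead of calling `std::process::exit` so it stays testable.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Off by default, so tests and library embedders are never killed.
static ABORT_ON_POISON: AtomicBool = AtomicBool::new(false);

/// Counts recovery-path visits so tests can observe them without exiting.
static RECOVERY_PATH_HITS: AtomicU64 = AtomicU64::new(0);

/// Intended to be called once by the real binary at startup.
pub fn enable_abort_on_redb_poison() {
    ABORT_ON_POISON.store(true, Ordering::SeqCst);
}

/// What the caller should do after a poison error (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum PoisonAction {
    /// Abort disabled: surface the error to the caller as usual.
    ReturnError,
    /// Abort enabled: the caller should exit so the supervisor restarts it.
    ExitProcess,
}

/// Recovery hook: always counts the hit, exits only when opted in.
pub fn on_poison_detected() -> PoisonAction {
    RECOVERY_PATH_HITS.fetch_add(1, Ordering::SeqCst);
    if ABORT_ON_POISON.load(Ordering::SeqCst) {
        PoisonAction::ExitProcess
    } else {
        PoisonAction::ReturnError
    }
}

/// Number of times the recovery path has been taken so far.
pub fn recovery_path_hits() -> u64 {
    RECOVERY_PATH_HITS.load(Ordering::SeqCst)
}
```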

Interaction with crash-loop protection (#4591 / #4588 / systemd StartLimit)

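The poison exit reuses the existing fatal_listener_exit_code(uptime, fast_crash_enabled) decision (#4549/#4551), so no new exit code is introduced:

  • Poison after a healthy uptime: exits 42 (burst-exempt; fires the update self-heal).
  • Poison within the 60s healthy window, under a supervisor that understands it: exits the counted code 45, so a database that re-poisons on every boot is bounded by StartLimitBurst and the #4591 auto-rollback probation.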

Net: a transient poison self-heals with one restart; a persistent one is bounded by the same protection the fatal-listener path already uses.
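The exit-code choice can be sketched as a small pure function. The codes 42 and 45 and the 60s window are taken from the commit message in this PR; the function and constant names are hypothetical, and the behaviour when `fast_crash_enabled` is false (falling back to 42) is an assumption, not a description of `fatal_listener_exit_code` itself.

```rust
use std::time::Duration;

/// Burst-exempt restart code (fires the update self-heal).
pub const EXIT_BURST_EXEMPT: i32 = 42;
/// Counted fast-crash code, bounded by the supervisor's start limit.
pub const EXIT_COUNTED: i32 = 45;
/// Uptime below this is treated as a fast crash (60s per the PR text).
pub const MIN_HEALTHY_UPTIME: Duration = Duration::from_secs(60);

/// Hypothetical sketch of the decision the poison exit reuses.
pub fn poison_exit_code(uptime: Duration, fast_crash_enabled: bool) -> i32 {
    if fast_crash_enabled && uptime < MIN_HEALTHY_UPTIME {
        // Re-poisoning shortly after boot suggests persistent corruption.
        EXIT_COUNTED
    } else {
        // Healthy uptime, or a supervisor that does not understand code 45
        // (assumption): use the burst-exempt restart code.
        EXIT_BURST_EXEMPT
    }
}
```

With this shape, `poison_exit_code(Duration::from_secs(59), true)` yields the counted code and an uptime of exactly 60s yields the burst-exempt one.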

Testing

  • redb_poison_classifier_is_precise — produces real redb errors via a fault-injecting StorageBackend (a genuine StorageError::Io, then PreviousIo once redb latches the poison flag) and asserts they classify as poison; asserts the synthetic Io(InvalidData) malformed-row error does not (the false-positive guard); asserts a benign table-open error is not poison. Variant-based, so it can't drift with redb wording.
  • poisoned_redb_takes_recovery_path_benign_does_not — end-to-end: a benign not-found does not trigger recovery; the poisoning write (commit-time I/O) takes the recovery path on the same op; a subsequent poisoned write does too (observed via a test counter; abort is opt-in and off in tests, so it returns instead of exiting).
  • redb_poison_exit_reuses_crash_loop_bounded_codes / redb_poison_abort_is_opt_in — pin the counted-vs-burst-exempt exit-code interaction (a re-poison within 60s uses the counted code) and the opt-in default.
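For readers unfamiliar with fault injection, the idea behind the first test can be illustrated with a toy in-memory backend that fails exactly one write. This is not redb's `StorageBackend` trait and not the test's actual backend; `FaultyBackend` and its methods are hypothetical.

```rust
use std::io;

/// Toy in-memory backend that injects a single transient write failure.
pub struct FaultyBackend {
    data: Vec<u8>,
    writes_seen: usize,
    fail_on_write: usize,
}

impl FaultyBackend {
    /// `fail_on_write` is the zero-based index of the write that will fail.
    pub fn new(fail_on_write: usize) -> Self {
        FaultyBackend {
            data: Vec::new(),
            writes_seen: 0,
            fail_on_write,
        }
    }

    /// Writes `bytes` at `offset`, growing the buffer as needed.
    pub fn write(&mut self, offset: usize, bytes: &[u8]) -> io::Result<()> {
        let index = self.writes_seen;
        self.writes_seen += 1;
        if index == self.fail_on_write {
            // Transient fault: only this one call fails, nothing is written.
            return Err(io::Error::new(io::ErrorKind::Other, "injected EIO"));
        }
        let end = offset + bytes.len();
        if self.data.len() < end {
            self.data.resize(end, 0);
        }
        self.data[offset..end].copy_from_slice(bytes);
        Ok(())
    }

    /// Current buffer contents.
    pub fn data(&self) -> &[u8] {
        &self.data
    }
}
```

Because the backend recovers after one failure, any error that persists afterwards must come from the layer above it, which is the latch the tests are meant to exercise.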

cargo fmt, cargo clippy --locked -- -D warnings, and the redb + p2p_impl test suites are green. Codex review (--base origin/main) was run across three iterations; its two findings (read-path and commit-time coverage) are fixed and the final pass is clean.

Deploying

This needs a release to reach live nodes; only the build, PR, and review are done here. It interacts with the crash-loop protection as described above, so it should ride a normal release (no special rollout).

Closes #4604

🤖 Generated with Claude Code

[AI-assisted - Claude]

Problem
-------
A single transient redb I/O error permanently and silently bricks a node's
contract operations. redb latches an in-memory poison flag (`io_failed`) after
any backend read/write error; from then on every `begin_write` returns
`StorageError::PreviousIo` ("Previous I/O error occurred. Please close and
re-open the database."), so every contract PUT/UPDATE (and the hosting-metadata
write on essentially every access) fails forever. Because the process never
exits, systemd never restarts it and the exit-42 auto-update hook never fires —
the node stays "running" while 100% of contract ops fail. Observed on 0.2.84
(failing non-ECC RAM → btrfs csum EIO → redb poison), but any transient disk
EIO / fs hiccup triggers it on healthy hardware too.

Solution
--------
Option (b): detect the poison at the storage-op layer and EXIT the process so
the supervisor recycles the node with a fresh, un-poisoned database handle (and
the health/update hooks fire). Chosen over in-process close+reopen (option a)
because reopen does NOT interact with the crash-loop protection: on persistent
corruption it would churn-reopen forever, recreating the exact "silently broken,
supervisor never engages" failure this issue is about.

- Precise detection: `storage_error_is_poison` matches only the I/O-poison class
  (`PreviousIo`, `Io`, `LockPoisoned`, `DatabaseClosed`) by typed variant, never
  the message string. Benign not-found (`Ok(None)`), `TableDoesNotExist`,
  `Corrupted`, `ValueTooLarge`, etc. are NOT treated as poison.
- Choke point: `ReDb::begin_write` / `begin_read` wrappers route a poison error
  to the recovery path. `begin_write` is the reliable trigger — redb checks the
  poison flag on every `begin_write`, and the node writes on essentially every op.
- The exit reuses the existing `fatal_listener_exit_code` decision (#4549/#4551),
  so it inherits the crash-loop protection unchanged: a poison after a healthy
  uptime exits 42 (burst-exempt, fires the update self-heal); a poison within the
  60s healthy window (under a supervisor that understands it) exits the COUNTED
  code 45, so a database that re-poisons on every boot (persistent corruption) is
  bounded by `StartLimitBurst` and the #4591 auto-rollback probation instead of
  restart-looping forever. No new exit code is introduced.
- Opt-in (`enable_abort_on_redb_poison`, called only by the real binary) so
  simulation/integration tests and library embedders never have a storage error
  kill their host process.

Testing
-------
- `redb_poison_classifier_is_precise`: produces REAL redb errors via a
  fault-injecting `StorageBackend` (genuine `Io` then `PreviousIo`) and asserts
  they classify as poison, while a benign table error does not.
- `poisoned_redb_takes_recovery_path_benign_does_not`: a poisoned database routes
  a write op to the recovery path (observed via a test counter; abort is off in
  tests) while a benign not-found does not.
- `redb_poison_exit_reuses_crash_loop_bounded_codes` / `redb_poison_abort_is_opt_in`:
  pin the counted-vs-burst-exempt exit-code interaction and the opt-in default.

Needs a release to deploy.

Closes #4604

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
@github-actions

github-actions Bot commented Jun 28, 2026

Contributor

Rule Review: No issues found

Rules checked: git-workflow.md, code-style.md, testing.md, contracts.md
Files reviewed: 5

Summary

This PR adds opt-in process-exit-for-restart when redb's in-memory poison flag is tripped by a transient I/O error, and is a clean "fix:" PR with comprehensive regression tests.

Regression tests present (testing.md requirement): Four new tests are added:

  • redb_poison_classifier_is_precise — tests all three classifier functions against real redb error variants (via fault-injecting backend), plus the synthetic Io(InvalidData) false-positive case
  • poisoned_redb_takes_recovery_path_benign_does_not — end-to-end: benign not-found does NOT trigger recovery; commit-time poison DOES; subsequent begin_write poison DOES; read-path poison via read_guarded DOES
  • redb_poison_exit_reuses_crash_loop_bounded_codes — boundary conditions at 0/1/30/59s (fast-crash) and MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT (burst-exempt)
  • redb_poison_abort_is_opt_in — default-off behavior

No .unwrap() in new production code (code-style.md): All new infrastructure methods (begin_write, begin_read, route_txn_error, route_redb_error, read_guarded, commit_guarded) use explicit match/map_err/?. The pre-existing .unwrap() calls inside load_all_delegate_index / load_all_secrets_index are only reached after a preceding length-check guard and are unchanged from before this PR.

Instant::now usage: The std::time::Instant::now at p2p_impl.rs:168 measures real process uptime for crash-loop exit-code selection — the correct tool for this purpose and consistent with the existing fatal_listener_exit_code pattern in the same file. Per review instructions, banned-API patterns are enforced by the Rule Lint CI grep and not re-reported here.

Commit messages: All follow conventional commits format. All explain the why in commit bodies.

Contracts module (contracts.md): Changes are confined to the storage backend (redb.rs) and do not touch WASM execution, state merge logic, or host-function registration.

No rule violations detected.


Rule review against .claude/rules/. WARNING findings block merge.

sanity and others added 3 commits June 27, 2026 19:32
Codex review of the initial fix flagged that the begin_* wrappers only catch
poison at transaction start. redb's `begin_read` does NOT check the poison flag
(it serves the last committed snapshot from cache), so a poisoned read fails
later at `open_table`/`get`/iteration — a read-only workload could keep erroring
until some later write hit `begin_write`.

Close the gap: a `read_guarded` helper wraps every read method's body and routes
any poison error (surfacing after begin_read) to the same recovery path.

Crucially, the umbrella read-path classifier (`redb_error_is_poison`) matches
`PreviousIo | LockPoisoned | DatabaseClosed` but NOT `Io` — several read methods
SYNTHESIZE `redb::Error::Io(InvalidData)` for a benign malformed-data row (e.g. a
wrong-length CodeHash), and treating that as poison would exit-and-restart the
node on a single bad row (a crash loop). A genuine backend I/O poison is still
caught: redb latches its flag on the first backend Io, so the next backend read
returns `PreviousIo` (caught here) and the next `begin_write` returns `PreviousIo`
(caught by the transaction-path classifier, which keeps matching `Io` since redb
never synthesizes Io there). The classifier test pins both directions, including
that a synthetic `Io(InvalidData)` is NOT treated as poison.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…review)

Codex's second pass: the FIRST backend I/O failure that poisons redb usually
surfaces from commit() as CommitError::Storage(Io), which the begin_* wrappers
don't see — so the node stayed running on the poisoning write until a later op
hit begin_write and tripped PreviousIo.

Add `commit_guarded`, used by every write method's commit, which routes a poison
commit error to the recovery path on the very op that poisons the handle. The
commit path comes straight from redb (never the synthetic Io(InvalidData) of a
malformed row), so it matches `Io` via `storage_error_is_poison`. The end-to-end
test now asserts the poisoning write triggers recovery on the same op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU
…#4604)

Addresses the rule-review findings on the PR:
- load_all_user_secrets_index: reindent the read_guarded closure body to match
  the other converted methods (rustfmt had silently skipped reformatting this
  longer fn, so cargo fmt --check passed despite the inconsistent indentation).
- Add a deterministic read-path assertion: feed read_guarded a real PreviousIo
  (from the poisoned handle) and confirm it routes to the recovery path, covering
  the read_guarded -> route_redb_error wiring in the end-to-end test (previously
  only the begin_write/commit write path was observed via the counter).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

Development

Successfully merging this pull request may close these issues.

Node silently bricked (all contract ops fail forever) after a single transient redb I/O error — no poison recovery