fix: exit-for-restart on redb poison instead of bricking forever (#4604) by sanity · Pull Request #4609 · freenet/freenet-core

sanity · 2026-06-28T00:16:04Z

Problem

A single transient redb I/O error permanently and silently bricks a node's contract operations, with no recovery path.

redb latches an in-memory poison flag (io_failed) after any backend read/write error. From then on every begin_write() returns StorageError::PreviousIo ("Previous I/O error occurred. Please close and re-open the database."), so every contract PUT/UPDATE, along with the hosting-metadata write performed on essentially every access, fails forever. Because the process never exits, systemd never restarts it and the exit-42 auto-update hook never fires: the node stays "running" while 100% of its contract ops fail. The blast radius is total and the failure is silent.

Observed on a 0.2.84 node whose failing non-ECC RAM produced a btrfs csum failure → EIO → redb poison. The bad RAM is the trigger; the defect is the absence of any recovery from a transient I/O error — the same permanent brick follows from any transient disk EIO / fs hiccup on healthy hardware.
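To make the latch behavior concrete, here is a minimal stand-alone model of it. This is a sketch, not redb code: `Handle`, its `write` method, and the `backend_ok` parameter are hypothetical, and the two-variant `StorageError` is a simplified local stand-in for the redb variants named above.

```rust
use std::io;

/// Simplified local stand-in for the two redb error variants discussed above.
#[derive(Debug)]
enum StorageError {
    /// The backend I/O error that trips the latch.
    Io(io::Error),
    /// Returned by every later operation once the latch is set.
    PreviousIo,
}

/// Hypothetical model of a database handle with a latched poison flag.
struct Handle {
    io_failed: bool,
}

impl Handle {
    fn new() -> Self {
        Handle { io_failed: false }
    }

    /// One write operation. `backend_ok = false` simulates a single transient EIO.
    fn write(&mut self, backend_ok: bool) -> Result<(), StorageError> {
        // The flag is checked first, so a healthy backend no longer helps.
        if self.io_failed {
            return Err(StorageError::PreviousIo);
        }
        if !backend_ok {
            // Latch the flag; nothing in this handle ever clears it.
            self.io_failed = true;
            return Err(StorageError::Io(io::Error::new(
                io::ErrorKind::Other,
                "simulated EIO",
            )));
        }
        Ok(())
    }
}
```

In this model a single `write(false)` makes every later `write(true)` fail with `PreviousIo`, while a freshly constructed `Handle` works again. That is the property exit-for-restart relies on: a new process gets a new, un-poisoned handle.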

Solution

Option (b): detect the poison at the storage-op layer and EXIT the process so the supervisor recycles the node with a fresh, un-poisoned database handle (and the health/update hooks fire).

Chosen over in-process close+reopen (option a) because reopen does not interact with the crash-loop protection: on persistent corruption (the real-world case) it would churn-reopen forever, re-creating the exact "silently broken, supervisor never engages" failure this issue is about. Exit-for-restart routes persistent corruption into the existing crash-loop limiter; in-process reopen can't.

Detection coverage (all three storage entry points)

Poison is routed to the recovery path wherever it can surface:

begin_write / begin_read wrappers — catch a poison at transaction start. begin_write is the steady-state choke point: redb checks the poison flag on every begin_write (check_io_errors), and the node writes hosting metadata on essentially every contract access.
commit_guarded (every write commit) — the FIRST backend I/O error usually surfaces here as CommitError::Storage(Io), on the very op that poisons the handle, so recovery fires immediately rather than waiting for the next begin_write.
read_guarded (every read method body) — redb's begin_read does NOT check the poison flag (it serves the cached snapshot), so a poisoned read fails later at open_table/get/iteration; this gives read-only workloads the same prompt exit-for-restart.
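The routing pattern behind these three entry points can be sketched as follows. The names read_guarded and commit_guarded come from this PR, but the signatures, the `OpError` enum, and the `on_poison` callback are hypothetical simplifications for illustration; the real helpers live on the ReDb wrapper and work with redb's own error types.

```rust
/// Hypothetical, simplified error type: one poison variant, one benign variant.
#[derive(Debug, PartialEq)]
enum OpError {
    PreviousIo,
    NotFound,
}

/// Pass the result through unchanged, but notify the recovery path on poison.
fn route_error<T>(result: Result<T, OpError>, mut on_poison: impl FnMut()) -> Result<T, OpError> {
    if let Err(OpError::PreviousIo) = &result {
        on_poison();
    }
    result
}

/// Wraps a whole read-method body, so poison that surfaces after the
/// transaction has started (table open, get, iteration) is still routed.
fn read_guarded<T>(
    body: impl FnOnce() -> Result<T, OpError>,
    on_poison: impl FnMut(),
) -> Result<T, OpError> {
    route_error(body(), on_poison)
}

/// Wraps the commit step of a write, where the first I/O failure usually surfaces.
fn commit_guarded(
    commit: impl FnOnce() -> Result<(), OpError>,
    on_poison: impl FnMut(),
) -> Result<(), OpError> {
    route_error(commit(), on_poison)
}
```

The caller still receives the original error in every case; the guard only adds the side effect of reaching the recovery path. A benign `NotFound` passes through without touching `on_poison`.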

Precise detection — two classifiers, one deliberate asymmetry

Transaction/commit path (storage_error_is_poison, on StorageError): matches PreviousIo | Io | LockPoisoned | DatabaseClosed. These errors come straight from redb and are never synthesized, so matching Io is safe.
Read-method-body path (redb_error_is_poison, on the umbrella redb::Error): matches PreviousIo | LockPoisoned | DatabaseClosed but NOT Io — several read methods SYNTHESIZE redb::Error::Io(InvalidData) for a benign malformed-data row (e.g. a wrong-length CodeHash), and treating that as poison would exit-and-restart the node on a single bad row (a crash loop). A genuine backend I/O poison is still caught: redb latches its flag on the first backend Io, so the next backend read returns PreviousIo (caught here) and the next begin_write returns PreviousIo (caught by the transaction-path classifier).

Benign not-found (Ok(None)), TableDoesNotExist, Corrupted, ValueTooLarge, etc. are never treated as poison, so a normal missing-contract never restarts the node.
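The asymmetry between the two classifiers can be illustrated with a short sketch. The function names are the ones used in this PR, but the bodies are a reconstruction from the description above, and both enums are simplified local stand-ins rather than the real redb types (the real variants may carry different payloads and there are more of them).

```rust
use std::io;

/// Simplified stand-in for redb's StorageError (transaction/commit path).
#[derive(Debug)]
enum StorageError {
    PreviousIo,
    Io(io::Error),
    LockPoisoned,
    DatabaseClosed,
    Corrupted(String),
    ValueTooLarge(usize),
}

/// Simplified stand-in for the umbrella redb::Error (read-method-body path).
#[derive(Debug)]
enum Error {
    PreviousIo,
    Io(io::Error),
    LockPoisoned,
    DatabaseClosed,
    TableDoesNotExist(String),
    Corrupted(String),
}

/// Transaction/commit path: errors come straight from the storage layer,
/// so a bare Io is treated as poison.
fn storage_error_is_poison(e: &StorageError) -> bool {
    matches!(
        e,
        StorageError::PreviousIo
            | StorageError::Io(_)
            | StorageError::LockPoisoned
            | StorageError::DatabaseClosed
    )
}

/// Read-method-body path: Io is deliberately excluded, because read methods
/// may synthesize Io(InvalidData) for a benign malformed row.
fn redb_error_is_poison(e: &Error) -> bool {
    matches!(
        e,
        Error::PreviousIo | Error::LockPoisoned | Error::DatabaseClosed
    )
}
```

Matching on variants rather than message strings keeps the classification stable if the error wording changes. The cost of excluding `Io` on the read path is at most one operation of delay, since the latched flag turns the next backend access into `PreviousIo`.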

The whole mechanism is opt-in (enable_abort_on_redb_poison, called only by the real freenet binary) so simulation/integration tests and library embedders never have a storage error kill their host process.
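One way to implement such an opt-in is a process-wide flag that defaults to off. The sketch below assumes that shape; only the name enable_abort_on_redb_poison comes from this PR, while `on_redb_poison`, the two statics, and the counter are hypothetical.

```rust
use std::process;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Off by default: embedders and tests never opt in.
static ABORT_ON_POISON: AtomicBool = AtomicBool::new(false);

/// Hypothetical counter so tests can observe that the recovery path was reached.
static RECOVERY_HITS: AtomicUsize = AtomicUsize::new(0);

/// Assumed to be called once at startup, and only by the real node binary.
fn enable_abort_on_redb_poison() {
    ABORT_ON_POISON.store(true, Ordering::SeqCst);
}

/// Hypothetical recovery entry point reached when a poison error is detected.
fn on_redb_poison(exit_code: i32) {
    RECOVERY_HITS.fetch_add(1, Ordering::SeqCst);
    if ABORT_ON_POISON.load(Ordering::SeqCst) {
        // Hand the process back to the supervisor for a clean restart.
        process::exit(exit_code);
    }
    // Not opted in: return, and let the caller propagate the storage error.
}
```

With the flag left at its default, `on_redb_poison` only increments the counter and returns, which matches the test description below where recovery is observed through a counter instead of a process exit.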

Interaction with crash-loop protection (#4591 / #4588 / systemd StartLimit)

The poison exit reuses the existing fatal_listener_exit_code(uptime, fast_crash_enabled) decision (#4549/#4551) — no new exit code is introduced:

Poison after a healthy uptime (>= 60s) → exit 42: burst-exempt (SuccessExitStatus=42), fires the freenet update self-heal hook. A genuinely transient I/O blip (the #4604 case: a node that ran healthily for hours) restarts cleanly without tripping StartLimitBurst.
Poison within the 60s healthy window, under a supervisor that understands it → exit 45 (the counted fast-crash code): not in SuccessExitStatus, so it counts toward StartLimitBurst, and the systemd unit runs freenet update on 42 OR 45 so the #4591 auto-rollback probation counts it via FREENET_POST_STOP_EXIT_CODE. A database that re-poisons on every boot (persistent disk/RAM corruption) therefore trips StartLimit / rolls back instead of looping forever.

Net: a transient poison self-heals with one restart; a persistent one is bounded by the same protection the fatal-listener path already uses.
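The decision described above can be sketched as a small pure function. This is not the real fatal_listener_exit_code; `exit_code_sketch` and the constant names are hypothetical, and the fallback to 42 when the supervisor does not understand the counted code is an assumption, since the PR text only names the 42 and 45 cases.

```rust
use std::time::Duration;

/// Burst-exempt code: listed in SuccessExitStatus and fires the update hook.
const EXIT_BURST_EXEMPT: i32 = 42;
/// Counted fast-crash code: counts toward StartLimitBurst.
const EXIT_COUNTED_FAST_CRASH: i32 = 45;
/// Healthy-uptime threshold (60s, per the description above).
const MIN_HEALTHY_UPTIME: Duration = Duration::from_secs(60);

/// Pick the exit code for a poison-triggered exit.
fn exit_code_sketch(uptime: Duration, fast_crash_enabled: bool) -> i32 {
    if fast_crash_enabled && uptime < MIN_HEALTHY_UPTIME {
        // Re-poisoned shortly after boot: let the supervisor count it.
        EXIT_COUNTED_FAST_CRASH
    } else {
        // Healthy uptime reached. ASSUMPTION: also used when the supervisor
        // does not understand the counted code.
        EXIT_BURST_EXEMPT
    }
}
```

Under these assumptions, `exit_code_sketch(Duration::from_secs(59), true)` yields the counted code and `exit_code_sketch(Duration::from_secs(60), true)` yields the burst-exempt one, which is the boundary the exit-code test in this PR pins.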

Testing

redb_poison_classifier_is_precise — produces real redb errors via a fault-injecting StorageBackend (a genuine StorageError::Io, then PreviousIo once redb latches the poison flag) and asserts they classify as poison; asserts the synthetic Io(InvalidData) malformed-row error does not (the false-positive guard); asserts a benign table-open error is not poison. Variant-based, so it can't drift with redb wording.
poisoned_redb_takes_recovery_path_benign_does_not — end-to-end: a benign not-found does not trigger recovery; the poisoning write (commit-time I/O) takes the recovery path on the same op; a subsequent poisoned write does too (observed via a test counter; abort is opt-in and off in tests, so it returns instead of exiting).
redb_poison_exit_reuses_crash_loop_bounded_codes / redb_poison_abort_is_opt_in — pin the counted-vs-burst-exempt exit-code interaction (a re-poison within 60s uses the counted code) and the opt-in default.

cargo fmt, cargo clippy --locked -- -D warnings, and the redb + p2p_impl test suites are green. Codex review (--base origin/main) run across three iterations; its two findings (read-path and commit-time coverage) are fixed and the final pass is clean.

Deploying

This needs a release to reach live nodes; only the build, PR, and review happen here. It interacts with the crash-loop protection as described above, so it should ride a normal release (no special rollout).

Closes #4604

🤖 Generated with Claude Code

[AI-assisted - Claude]

Problem
-------
A single transient redb I/O error permanently and silently bricks a node's contract operations. redb latches an in-memory poison flag (`io_failed`) after any backend read/write error; from then on every `begin_write` returns `StorageError::PreviousIo` ("Previous I/O error occurred. Please close and re-open the database."), so every contract PUT/UPDATE (and the hosting-metadata write on essentially every access) fails forever. Because the process never exits, systemd never restarts it and the exit-42 auto-update hook never fires: the node stays "running" while 100% of contract ops fail. Observed on 0.2.84 (failing non-ECC RAM → btrfs csum EIO → redb poison), but any transient disk EIO / fs hiccup triggers it on healthy hardware too.

Solution
--------
Option (b): detect the poison at the storage-op layer and EXIT the process so the supervisor recycles the node with a fresh, un-poisoned database handle (and the health/update hooks fire). Chosen over in-process close+reopen (option a) because reopen does NOT interact with the crash-loop protection: on persistent corruption it would churn-reopen forever, recreating the exact "silently broken, supervisor never engages" failure this issue is about.

- Precise detection: `storage_error_is_poison` matches only the I/O-poison class (`PreviousIo`, `Io`, `LockPoisoned`, `DatabaseClosed`) by typed variant, never the message string. Benign not-found (`Ok(None)`), `TableDoesNotExist`, `Corrupted`, `ValueTooLarge`, etc. are NOT treated as poison.
- Choke point: `ReDb::begin_write` / `begin_read` wrappers route a poison error to the recovery path. `begin_write` is the reliable trigger: redb checks the poison flag on every `begin_write`, and the node writes on essentially every op.
- The exit reuses the existing `fatal_listener_exit_code` decision (#4549/#4551), so it inherits the crash-loop protection unchanged: a poison after a healthy uptime exits 42 (burst-exempt, fires the update self-heal); a poison within the 60s healthy window (under a supervisor that understands it) exits the COUNTED code 45, so a database that re-poisons on every boot (persistent corruption) is bounded by `StartLimitBurst` and the #4591 auto-rollback probation instead of restart-looping forever. No new exit code is introduced.
- Opt-in (`enable_abort_on_redb_poison`, called only by the real binary) so simulation/integration tests and library embedders never have a storage error kill their host process.

Testing
-------
- `redb_poison_classifier_is_precise`: produces REAL redb errors via a fault-injecting `StorageBackend` (genuine `Io` then `PreviousIo`) and asserts they classify as poison, while a benign table error does not.
- `poisoned_redb_takes_recovery_path_benign_does_not`: a poisoned database routes a write op to the recovery path (observed via a test counter; abort is off in tests) while a benign not-found does not.
- `redb_poison_exit_reuses_crash_loop_bounded_codes` / `redb_poison_abort_is_opt_in`: pin the counted-vs-burst-exempt exit-code interaction and the opt-in default.

Needs a release to deploy.

Closes #4604

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

github-actions · 2026-06-28T00:21:37Z

Rule Review: No issues found

Rules checked: git-workflow.md, code-style.md, testing.md, contracts.md
Files reviewed: 5

Summary

This PR adds an opt-in process exit-for-restart when redb's in-memory poison flag is tripped by a transient I/O error. It is a clean "fix:" PR with comprehensive regression tests.

Regression tests present (testing.md requirement): Four new tests are added:

redb_poison_classifier_is_precise — tests all three classifier functions against real redb error variants (via fault-injecting backend), plus the synthetic Io(InvalidData) false-positive case
poisoned_redb_takes_recovery_path_benign_does_not — end-to-end: benign not-found does NOT trigger recovery; commit-time poison DOES; subsequent begin_write poison DOES; read-path poison via read_guarded DOES
redb_poison_exit_reuses_crash_loop_bounded_codes — boundary conditions at 0/1/30/59s (fast-crash) and MIN_HEALTHY_UPTIME_FOR_UPDATE_EXIT (burst-exempt)
redb_poison_abort_is_opt_in — default-off behavior

No .unwrap() in new production code (code-style.md): All new infrastructure methods (begin_write, begin_read, route_txn_error, route_redb_error, read_guarded, commit_guarded) use explicit match/map_err/?. The pre-existing .unwrap() calls inside load_all_delegate_index / load_all_secrets_index are only reached after a preceding length-check guard and are unchanged from before this PR.

Instant::now usage: The std::time::Instant::now at p2p_impl.rs:168 measures real process uptime for crash-loop exit-code selection — the correct tool for this purpose and consistent with the existing fatal_listener_exit_code pattern in the same file. Per review instructions, banned-API patterns are enforced by the Rule Lint CI grep and not re-reported here.

Commit messages: All follow conventional commits format. All explain the why in commit bodies.

Contracts module (contracts.md): Changes are confined to the storage backend (redb.rs) and do not touch WASM execution, state merge logic, or host-function registration.

No rule violations detected.

Rule review against .claude/rules/. WARNING findings block merge.

Codex review of the initial fix flagged that the begin_* wrappers only catch poison at transaction start. redb's `begin_read` does NOT check the poison flag (it serves the last committed snapshot from cache), so a poisoned read fails later at `open_table`/`get`/iteration; a read-only workload could keep erroring until some later write hit `begin_write`.

Close the gap: a `read_guarded` helper wraps every read method's body and routes any poison error (surfacing after begin_read) to the same recovery path.

Crucially, the umbrella read-path classifier (`redb_error_is_poison`) matches `PreviousIo | LockPoisoned | DatabaseClosed` but NOT `Io`: several read methods SYNTHESIZE `redb::Error::Io(InvalidData)` for a benign malformed-data row (e.g. a wrong-length CodeHash), and treating that as poison would exit-and-restart the node on a single bad row (a crash loop). A genuine backend I/O poison is still caught: redb latches its flag on the first backend Io, so the next backend read returns `PreviousIo` (caught here) and the next `begin_write` returns `PreviousIo` (caught by the transaction-path classifier, which keeps matching `Io` since redb never synthesizes Io there).

The classifier test pins both directions, including that a synthetic `Io(InvalidData)` is NOT treated as poison.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

…review)

Codex's second pass: the FIRST backend I/O failure that poisons redb usually surfaces from commit() as CommitError::Storage(Io), which the begin_* wrappers don't see, so the node stayed running on the poisoning write until a later op hit begin_write and tripped PreviousIo.

Add `commit_guarded`, used by every write method's commit, which routes a poison commit error to the recovery path on the very op that poisons the handle. The commit path comes straight from redb (never the synthetic Io(InvalidData) of a malformed row), so it matches `Io` via `storage_error_is_poison`.

The end-to-end test now asserts the poisoning write triggers recovery on the same op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

…#4604)

Addresses the rule-review findings on the PR:

- load_all_user_secrets_index: reindent the read_guarded closure body to match the other converted methods (rustfmt had silently skipped reformatting this longer fn, so cargo fmt --check passed despite the inconsistent indentation).
- Add a deterministic read-path assertion: feed read_guarded a real PreviousIo (from the poisoned handle) and confirm it routes to the recovery path, covering the read_guarded -> route_redb_error wiring in the end-to-end test (previously only the begin_write/commit write path was observed via the counter).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014dGjU1Q6Vpk2dm4sUf4pdU

sanity and others added 3 commits June 27, 2026 19:32

sanity wants to merge 4 commits into main from fix/4604-redb-poison-recovery
