rawdb: add freezer safety margin to prevent data loss on corruption #637

Draft
joshuacolvin0 wants to merge 1 commit into master from freezer-safety-margin

Conversation

@joshuacolvin0
Member

@joshuacolvin0 joshuacolvin0 commented Mar 15, 2026

pulled in by OffchainLabs/nitro#4506
related to NIT-4663

After an unclean shutdown, repair() may truncate the freezer head to
restore cross-table consistency. Previously, blocks were deleted from
the key-value store immediately after freezing, so truncated blocks
could end up missing from both stores — making the node unable to
start (especially for L2 nodes that cannot re-sync pruned blocks from
peers).

Introduce a safety margin (freezerCleanupMargin = freezerBatchLimit)
that retains the most recently frozen blocks in the key-value store.
Since freezeRange reads via nofreezedb (which bypasses the ancient
store), retained blocks can be re-frozen after repair() truncation.
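
A minimal sketch of how the margin bounds what may be deleted from the key-value store. The helper name `cleanupLimit` and the margin value of 30000 (chosen to match the ~30K blocks figure in this description) are illustrative assumptions, not the code in this PR:

```go
package main

import "fmt"

// cleanupLimit returns the exclusive upper bound of block numbers that may
// be deleted from the key-value store. The most recent `margin` frozen
// blocks stay duplicated there so they can be re-frozen after a truncation.
// Hypothetical helper for illustration only.
func cleanupLimit(frozen, margin uint64) uint64 {
	if frozen <= margin {
		return 0 // everything frozen so far is still inside the margin
	}
	return frozen - margin
}

func main() {
	const margin = 30000 // assumed value for the sketch
	for _, frozen := range []uint64{10000, 30000, 100000} {
		fmt.Printf("frozen=%d cleanupLimit=%d\n", frozen, cleanupLimit(frozen, margin))
	}
}
```

With this shape, a truncation that removes fewer than `margin` items from the freezer head only touches blocks that are still present in the key-value store.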

Key changes:

  • Add cleanupMargin field on chainFreezer with persisted cleanup tail
    (freezerCleanupTailKey) so progress resumes across restarts
  • Replace immediate post-freeze deletion with incremental cleanup over
    [cleanupStart, cleanupLimit) using Has()+Get() to distinguish missing
    keys from I/O errors, with backoff on failure
  • Add startup validation in Open(): detect unrecoverable data gaps
    where the freezer has been truncated below the cleanup tail
  • Handle upgrade path (skip-ahead when no tail but frozen >
    FullImmutabilityThreshold) and fresh installs (clean from block 1)
  • Cap per-cycle cleanup to freezerBatchLimit to prevent stalling
  • Bound dangling side chain chase to freezerBatchLimit iterations
  • Add ReadFreezerCleanupTail/WriteFreezerCleanupTail accessors and a
    strict variant for startup/runtime error propagation
  • Surface cleanup tail in ReadChainMetadata diagnostics
  • Add comprehensive test suite (21 tests) covering margin behavior,
    crash recovery, side chain cleanup, boundary conditions, corruption
    detection, upgrade path, and regression guard
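
A rough sketch of three of the items above: the Has()+Get() read, the capped [cleanupStart, cleanupLimit) range, and the startup check. All names (`kvReader`, `readIfPresent`, `cleanupRange`, `validateTail`) and the in-memory stores are hypothetical stand-ins, and the `frozen < tail` condition is one reading of "truncated below the cleanup tail", not the PR's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// kvReader is a minimal stand-in for a key-value store (assumption).
type kvReader interface {
	Has(key string) (bool, error)
	Get(key string) ([]byte, error)
}

// memKV is an in-memory store that never fails.
type memKV map[string][]byte

func (m memKV) Has(k string) (bool, error) { _, ok := m[k]; return ok, nil }
func (m memKV) Get(k string) ([]byte, error) {
	v, ok := m[k]
	if !ok {
		return nil, errors.New("not found")
	}
	return v, nil
}

// brokenKV simulates a store whose reads fail with an I/O error.
type brokenKV struct{}

func (brokenKV) Has(string) (bool, error)   { return false, errors.New("simulated I/O error") }
func (brokenKV) Get(string) ([]byte, error) { return nil, errors.New("simulated I/O error") }

// readIfPresent calls Has before Get so that a missing key is reported as
// (nil, false, nil) while a failing store is reported as an error.
func readIfPresent(db kvReader, key string) ([]byte, bool, error) {
	ok, err := db.Has(key)
	if err != nil {
		return nil, false, err
	}
	if !ok {
		return nil, false, nil
	}
	val, err := db.Get(key)
	if err != nil {
		return nil, false, err
	}
	return val, true, nil
}

// cleanupRange computes the half-open range [start, limit) for one cycle,
// keeping `margin` blocks behind the freezer head and capping at `batch`.
func cleanupRange(tail, frozen, margin, batch uint64) (start, limit uint64) {
	start = tail
	if frozen <= margin {
		return start, start // nothing outside the margin yet
	}
	limit = frozen - margin
	if limit < start {
		limit = start // already cleaned up to the margin
	}
	if limit-start > batch {
		limit = start + batch // cap the work done per cycle
	}
	return start, limit
}

// validateTail reports a gap when the freezer head sits below the tail:
// blocks in [frozen, tail) would then be absent from both stores.
func validateTail(frozen, tail uint64) error {
	if frozen < tail {
		return fmt.Errorf("freezer head %d below cleanup tail %d", frozen, tail)
	}
	return nil
}

func main() {
	db := memKV{"block-1": []byte("body")}
	_, found, err := readIfPresent(db, "block-2")
	fmt.Println("missing key:", found, err)
	_, _, err = readIfPresent(brokenKV{}, "block-1")
	fmt.Println("broken store:", err)
	start, limit := cleanupRange(0, 100000, 30000, 30000)
	fmt.Println("cleanup range:", start, limit)
	fmt.Println("validation:", validateTail(50, 60))
}
```

Checking Has() first keeps a legitimately absent key (for example, one already cleaned in an earlier cycle) from being confused with a read failure, which is the case where backing off and retrying makes sense.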

Disk overhead: ~30K blocks duplicated temporarily (~30-600 MB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@joshuacolvin0 joshuacolvin0 force-pushed the freezer-safety-margin branch from 62ab752 to 07c22fc Compare March 16, 2026 20:35
@joshuacolvin0 joshuacolvin0 marked this pull request as draft March 16, 2026 20:40
@@ -220,14 +246,90 @@ func (f *chainFreezer) freeze(db ethdb.KeyValueStore) {
if err := f.SyncAncient(); err != nil {
Contributor


I can totally understand the rationale for adding this margin.

Something I don't understand is:

    // After an unclean shutdown, repair() may truncate the freezer head to
    // restore cross-table consistency.

If a chain segment has been moved to the freezer, the freezer is explicitly synced before the corresponding items are deleted from the key-value store. Specifically, once a chain segment is migrated, one of two conditions applies:

  • It has been fully synced, with all tables aligned via f.SyncAncient(), or
  • It has not yet been properly flushed to the freezer, in which case it can be reverted on the next startup due to an unclean shutdown.

In either scenario, it is guaranteed that the chain segment exists in at least one location, either in the freezer or in the key-value store.
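
The ordering argument above can be checked with a toy model. This is only a sketch of the write, sync, delete sequence for a single block; the type `toyNode` and its fields are invented for illustration. It deliberately assumes that synced freezer data is never truncated afterwards, which is exactly the assumption the PR description questions:

```go
package main

import "fmt"

// toyNode tracks where one block lives during migration (hypothetical model).
type toyNode struct {
	inKV            bool // present in the key-value store
	inFreezerBuffer bool // appended to the freezer but not yet synced
	inFreezerDisk   bool // durably synced in the freezer
}

// step applies the i-th stage of the migration: append, sync, then delete.
func (n *toyNode) step(i int) {
	switch i {
	case 0:
		n.inFreezerBuffer = true // append the block to the freezer
	case 1:
		if n.inFreezerBuffer {
			n.inFreezerDisk = true // models the explicit sync before deletion
		}
	case 2:
		n.inKV = false // delete from the key-value store only after the sync
	}
}

// crash models an unclean shutdown: unsynced freezer data is lost.
func (n *toyNode) crash() { n.inFreezerBuffer = false }

// survives runs the first crashAfter stages, crashes, and reports whether
// the block still exists in at least one store.
func survives(crashAfter int) bool {
	n := &toyNode{inKV: true}
	for i := 0; i < crashAfter; i++ {
		n.step(i)
	}
	n.crash()
	return n.inKV || n.inFreezerDisk
}

func main() {
	for c := 0; c <= 3; c++ {
		fmt.Printf("crash after %d stage(s): block survives = %v\n", c, survives(c))
	}
}
```

Under that assumption the block survives a crash at every stage. The model does not cover a repair pass that truncates already-synced items to realign tables.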
