Skip to content

[FIXED] Encrypted partial purge recovery#8042

Merged
neilalexander merged 1 commit intomainfrom
daniele/filestore-encrypted-purge-recovery
Apr 20, 2026
Merged

[FIXED] Encrypted partial purge recovery#8042
neilalexander merged 1 commit intomainfrom
daniele/filestore-encrypted-purge-recovery

Conversation

@sciascid
Copy link
Copy Markdown
Contributor

A purge operation involves moving a tombstone .block and its corresponding key file into new_msgs.
If the server crashed in after moving the key file, before moving the block file, then a stranded block file would be left behind without its key.
On restart, recover would find the tombstone block, without its key file, and would go throuhg the plaintext conversion path. However the blocks was actually encrypted, eventually causing the recovery to fail due to finding the block corrupt.

Treat a new_msgs directory without any block file as a partial "uncommitted" purge. In that case remove the matching tombstone block still left in msgs/ and clear new_msgs.

Signed-off-by: Daniele Sciascia daniele@nats.io

@sciascid sciascid requested a review from a team as a code owner April 16, 2026 16:14
@sciascid
Copy link
Copy Markdown
Contributor Author

@claude review

@synadia-claude-reviewer
Copy link
Copy Markdown

synadia-claude-reviewer Bot commented Apr 16, 2026

Claude finished @sciascid's task in 1m 38s —— View job


Code Review

  • Gather context and read changed files
  • Review server/filestore.go changes
  • Review server/filestore_test.go changes
  • Post comprehensive review feedback

Summary

This fix correctly addresses a narrow but real crash-recovery bug in the encrypted purge path. The purge operation moves the tombstone's key file before its block file (by design — the block being last is the commit signal). If the server crashes in that window, the orphaned encrypted block in msgs/ would be read without its key during recovery, triggering a corruption error.


server/filestore.go

The logic is correct. The critical ordering invariant is in the purge() function at line 10084:

for _, mbf := range []string{fmt.Sprintf(keyScan, lmb.index), fmt.Sprintf(blkScan, lmb.index)} {

Key is moved first, block second. __new_msgs__ containing a key but no block unambiguously means the crash happened between step 1 and step 2. The fix handles all three cases cleanly:

State of __new_msgs__ Meaning Handling
Does not exist No purge in progress No-op (existing behavior)
Has .blk Purge committed, needs final rename Old path: remove mdir, rename ndirmdir
Has .key, no .blk Purge uncommitted (this fix) Remove orphaned .blk from mdir, remove ndir

The early return nil on line 10175 is correct. In the crash-before-block-move scenario, pdir (__msgs__/) was never created (that rename happens after the block move), so the pdir stat/cleanup at lines 10178–10181 is rightly skipped.

Error handling is sound. os.IsNotExist is correctly tolerated when removing the block from mdir — if mdir or the block itself is already gone, it's not an error worth surfacing.

Minor observation — comment verbosity: The comments in the else block describe what the code does rather than why. They're not wrong, but could be trimmed. For example, the 3-line comment block:

// If the tombstone is encrypted, and the server crashed
// after it moved the key file, remove the corresponding
// block file as well. We don't want to leave the encrypted
// block without its key file.

could be just: // Remove orphaned encrypted block; its key was already moved out. — but this is purely a style preference, not a bug.


server/filestore_test.go

The test accurately reproduces the crash window:

  1. Stores 1024 messages, snapshots msgs/ (pre-purge)
  2. Runs a real purge to generate a valid encrypted tombstone + key
  3. Restores the pre-purge msgs/ directory
  4. Places the encrypted tombstone block back into msgs/ (simulating it wasn't moved yet)
  5. Places only the key file in __new_msgs__/ (simulating the partial move)
  6. Reopens the store and asserts state matches the pre-purge baseline
  7. Asserts the orphaned tombstone block was cleaned up
  8. Asserts __new_msgs__/ was removed

Removing index.db (line with streamStreamStateFile) is a good detail — it forces the recovery code path through block scanning rather than the full-state fast path, ensuring the actual fix is exercised.

One small note: the test only covers the case where the tombstone block exists in msgs/. The code also handles the case where it doesn't (via os.IsNotExist), but that case isn't tested explicitly. This isn't a gap worth fixing now — it's a defensive code path that is straightforward.


Overall

The fix is minimal, correct, and well-tested. No issues blocking merge.

@sciascid sciascid force-pushed the daniele/filestore-encrypted-purge-recovery branch from b61ae85 to 749a478 Compare April 16, 2026 17:52
Copy link
Copy Markdown
Member

@MauriceVanVeen MauriceVanVeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

A purge operation involves moving a tombstone .block and
its corresponding key file into __new_msgs__.
If the server crashed in after moving the key file,
before moving the block file, then a stranded block file
would be left behind without its key.
On restart, recover would find the tombstone block, without
its key file, and would go throuhg the plaintext conversion
path. However the blocks was actually encrypted, eventually
causing the recovery to fail due to finding the block corrupt.

Treat a __new_msgs__ directory without any block file as a
partial "uncommitted" purge. In that case remove the matching
tombstone block still left in msgs/ and clear __new_msgs__.

Signed-off-by: Daniele Sciascia <daniele@nats.io>
@sciascid sciascid force-pushed the daniele/filestore-encrypted-purge-recovery branch from 749a478 to 24487ae Compare April 17, 2026 14:31
@neilalexander neilalexander merged commit a236aeb into main Apr 20, 2026
50 checks passed
@neilalexander neilalexander deleted the daniele/filestore-encrypted-purge-recovery branch April 20, 2026 08:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants