[FIXED] Encrypted partial purge recovery#8042
Conversation
|
@claude review |
|
Claude finished @sciascid's task in 1m 38s —— View job Code Review
SummaryThis fix correctly addresses a narrow but real crash-recovery bug in the encrypted purge path. The purge operation moves the tombstone's key file before its block file (by design — the block being last is the commit signal). If the server crashes in that window, the orphaned encrypted block in
|
State of __new_msgs__ |
Meaning | Handling |
|---|---|---|
| Does not exist | No purge in progress | No-op (existing behavior) |
Has .blk |
Purge committed, needs final rename | Old path: remove mdir, rename ndir → mdir |
Has .key, no .blk |
Purge uncommitted (this fix) | Remove orphaned .blk from mdir, remove ndir |
The early return nil on line 10175 is correct. In the crash-before-block-move scenario, pdir (__msgs__/) was never created (that rename happens after the block move), so the pdir stat/cleanup at lines 10178–10181 is rightly skipped.
Error handling is sound. os.IsNotExist is correctly tolerated when removing the block from mdir — if mdir or the block itself is already gone, it's not an error worth surfacing.
Minor observation — comment verbosity: The comments in the else block describe what the code does rather than why. They're not wrong, but could be trimmed. For example, the 3-line comment block:
// If the tombstone is encrypted, and the server crashed
// after it moved the key file, remove the corresponding
// block file as well. We don't want to leave the encrypted
// block without its key file.could be just: // Remove orphaned encrypted block; its key was already moved out. — but this is purely a style preference, not a bug.
server/filestore_test.go
The test accurately reproduces the crash window:
- Stores 1024 messages, snapshots
msgs/(pre-purge) - Runs a real purge to generate a valid encrypted tombstone + key
- Restores the pre-purge
msgs/directory - Places the encrypted tombstone block back into
msgs/(simulating it wasn't moved yet) - Places only the key file in
__new_msgs__/(simulating the partial move) - Reopens the store and asserts state matches the pre-purge baseline
- Asserts the orphaned tombstone block was cleaned up
- Asserts
__new_msgs__/was removed
Removing index.db (line with streamStreamStateFile) is a good detail — it forces the recovery code path through block scanning rather than the full-state fast path, ensuring the actual fix is exercised.
One small note: the test only covers the case where the tombstone block exists in msgs/. The code also handles the case where it doesn't (via os.IsNotExist), but that case isn't tested explicitly. This isn't a gap worth fixing now — it's a defensive code path that is straightforward.
Overall
The fix is minimal, correct, and well-tested. No issues blocking merge.
b61ae85 to
749a478
Compare
A purge operation involves moving a tombstone .block and its corresponding key file into __new_msgs__. If the server crashed in after moving the key file, before moving the block file, then a stranded block file would be left behind without its key. On restart, recover would find the tombstone block, without its key file, and would go throuhg the plaintext conversion path. However the blocks was actually encrypted, eventually causing the recovery to fail due to finding the block corrupt. Treat a __new_msgs__ directory without any block file as a partial "uncommitted" purge. In that case remove the matching tombstone block still left in msgs/ and clear __new_msgs__. Signed-off-by: Daniele Sciascia <daniele@nats.io>
749a478 to
24487ae
Compare
A purge operation involves moving a tombstone .block and its corresponding key file into new_msgs.
If the server crashed in after moving the key file, before moving the block file, then a stranded block file would be left behind without its key.
On restart, recover would find the tombstone block, without its key file, and would go throuhg the plaintext conversion path. However the blocks was actually encrypted, eventually causing the recovery to fail due to finding the block corrupt.
Treat a new_msgs directory without any block file as a partial "uncommitted" purge. In that case remove the matching tombstone block still left in msgs/ and clear new_msgs.
Signed-off-by: Daniele Sciascia daniele@nats.io