[P0] Make handling of warm keys deterministic #986
[P0] Make handling of warm keys deterministic #986Brgndy25 wants to merge 10 commits intoupgrade-v0.2.12from
Conversation
Problem: Recovering hosts need a deterministic way to decide which nonces are BFT-committed (2/3+ signatures) so warm key bindings can be trusted without re-querying mainnet. Solution: Introduce subnet/host/finality.go with computeFinalizedNonce and helper functions (bitmapSlotWeight, twoThirdsWeight) to compute the highest nonce finalized by a 2/3+ signer threshold based on stored per-nonce signatures.
Problem: The /diffs HTTP path and gossip recovery path only carried bare types.Diff values, so state_hash and warm_key_delta from each stored record were dropped on fetch. Recovery could not align with full per-nonce metadata needed for warm-key determinism. Solution: Return types.DiffRecord from HTTPClient.GetDiffs (decode diff, state_hash, and warm_key_delta), mirror that in HandleGetDiffs JSON, and thread DiffRecord through gossip DiffFetcher/StateUpdater and Host.ApplyRecoveredDiffs with updated tests.
problem: Gossip recovery replays DiffRecords from peers. Warm key bindings carried in WarmKeyDelta should not be treated as authoritative for every nonce; only nonces that are effectively BFT-committed (enough signatures) should skip mainnet ResolveWarmKey. solution: ApplyRecoveredDiffs computes a finalized nonce watermark F from stored signatures (computeFinalizedNonce). For each recovered record with nonce ≤ F and a non-empty WarmKeyDelta, it calls InjectWarmKeys before applyAndPersist. For nonce > F (or when there is no storage), replay keeps using the existing ResolveWarmKey path.
There was a problem hiding this comment.
Pull request overview
This PR extends gossip recovery to transport additional metadata alongside diffs (notably WarmKeyDelta and StateHash) to support more deterministic warm-key handling during recovery, and introduces a local-signature–based “finalized nonce” computation to decide when it’s safe to trust warm-key deltas from the wire.
Changes:
- Extend the
/diffsHTTP response to includewarm_key_deltaand update the HTTP client to return[]types.DiffRecordinstead of[]types.Diff. - Update gossip interfaces/tests and host recovery (
ApplyRecoveredDiffs) to consumeDiffRecordand inject warm keys when allowed. - Add
subnet/host/finality.goto compute a finalized nonce from locally stored signatures.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| subnet/transport/server.go | Adds warm_key_delta to diff JSON response payload. |
| subnet/transport/client.go | Changes GetDiffs to return []types.DiffRecord and decodes state_hash + warm_key_delta. |
| subnet/transport/client_test.go | Updates test assertions for new diff record shape. |
| subnet/host/host.go | Updates recovery path to accept DiffRecord, compute finalized nonce, and inject warm keys conditionally. |
| subnet/host/finality.go | New finalized-nonce computation based on stored signatures. |
| subnet/gossip/interface.go | Updates interfaces to use []types.DiffRecord. |
| subnet/gossip/gossip_test.go | Updates mocks/tests for the new diff record type. |
Comments suppressed due to low confidence (1)
subnet/host/host.go:860
ApplyRecoveredDiffsnow receivesDiffRecordincludingStateHash, but the recoveredStateHashis never validated against the state root computed after applying the diff. That means a peer could return corrupted/mismatched diffs (or an incorrectWarmKeyDelta) and the host would still apply/persist them without detecting the divergence at recovery time. SincesignIfAcceptedalready returns the computedroot, compare it torec.StateHash(when provided) and fail recovery on mismatch.
// Sign state with acceptance check (same path as HandleRequest).
stateSig, root, nonce, err := h.signIfAccepted(nil)
if err != nil {
return sigs, fmt.Errorf("sign recovered state: %w", err)
}
subnet/host/finality.go
Outdated
| threshold := twoThirdsWeight(group) | ||
| for f := latestNonce; f > 0; f-- { | ||
| if bitmapSlotWeight(confirmedBy[f], group) >= threshold { | ||
| return f |
There was a problem hiding this comment.
computeFinalizedNonce uses twoThirdsWeight(group) as the threshold, but twoThirdsWeight currently computes ceil(2/3 * n). This is weaker than the finalization threshold used elsewhere (e.g. Host.checkFinalization uses 2*len(group)/3 + 1, i.e. >2/3). As a result, finalized may be overestimated (e.g. group size 3 => threshold 2 here vs 3 in checkFinalization), causing ApplyRecoveredDiffs to trust WarmKeyDelta from the wire before the nonce is actually finalized. Align the threshold calculation with the existing finalization rule (>2/3).
| var finalized uint64 | ||
| if h.store != nil { | ||
| finalized = computeFinalizedNonce(h.store, h.escrowID, latestNonce, h.group) | ||
| } | ||
|
|
There was a problem hiding this comment.
New finality/warm-key trust behavior is introduced via computeFinalizedNonce and the rec.Nonce <= finalized gate, but there are no unit tests covering (a) the finalized threshold semantics across group sizes and (b) that WarmKeyDelta is only trusted up to the finalized nonce. Adding focused tests for computeFinalizedNonce and ApplyRecoveredDiffs would help prevent regressions in recovery correctness.
Problem: Finalized nonce logic (2/3 slot thresholds, transitivity, partial later nonces) was not covered by direct tests, so regressions in escrow finality behavior would be easy to miss. Solution: Add subnet/host/finality_test.go with helpers for in-memory escrow sessions and signatures, and tests for insufficient signers (F=0), partial high nonces not raising F, transitivity from a single high-nonce supermajority, 2/3 thresholds for groups of 3 and 6, and latestNonce==0.
Problem: Recovery replay needs to combine BFT-finality (computeFinalizedNonce) with whether WarmKeyDelta on the wire is trusted (inject) or warm keys must be resolved via the bridge. That split across F, nonce ordering, and empty deltas was not covered by tests, and reproducing pre-replay signatures without diff rows is awkward with a plain Memory store. Solution: Add host_recovery_test.go with a recoverySigStore wrapper that overlays pre-replay slot signatures for computeFinalizedNonce, plus helpers to build hosts with a counting warm-key resolver. Tests assert: finalized nonce + WarmKeyDelta uses inject (no resolver); nonce above F uses resolver; F=0 uses resolver; finalized but nil WarmKeyDelta still resolves ConfirmStart.
subnet/host/finality.go
Outdated
| sigs, err := store.GetSignatures(escrowID, n) | ||
| if err != nil { | ||
| continue | ||
| } |
There was a problem hiding this comment.
We should add log at error level here if err != nil
subnet/host/finality.go
Outdated
| confirmedBy[prev] = bm | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Here is O(N^2) cycle when O(N) is intended to be.
We just don't need inner cycle
for prev := uint64(1); prev <= n; prev++It could be:
confirmedBy := make(map[uint64]types.Bitmap128)
// running = slots that have signed at any nonce >= current n.
var running types.Bitmap128
for n := latestNonce; n > 0; n-- {
sigs, err := store.GetSignatures(escrowID, n)
if err != nil {
// keep previous running set; just skip adding new signers for this nonce
confirmedBy[n] = running
continue
}
// A slot that signed at n confirms all nonces <= n.
// In reverse traversal, that means it belongs to running for n and all lower nonces.
for slotID := range sigs {
running.Set(slotID)
}
confirmedBy[n] = running
}O(N) cycle
There was a problem hiding this comment.
Moreover we don't need next cycle, we can here add
...
confirmedBy[n] = running
if bitmapSlotWeight(confirmedBy[f], group) >= threshold {
return n
}|
|
||
| var finalized uint64 | ||
| if h.store != nil { | ||
| finalized = computeFinalizedNonce(h.store, h.escrowID, latestNonce, h.group) |
There was a problem hiding this comment.
We are calling here computeFinalizedNonce that is using stored values to calculate finalized
sigs, err := store.GetSignatures(escrowID, n)but it seams store is empty as peer signatures written after catch-up gossip exchange
Am I right or missed something? where do we fill the store?
There was a problem hiding this comment.
computeFinalizedNonce runs before we applyAndPersist the recovered diffs. Peer sigs only go into storage via AccumulateGossipSig, which requires a stored diff for that nonce first, so for nonces in the recovery batch, peer sigs usually aren’t there yet at that line. Recovery also gossips our sigs after apply, so others’ sigs for that gap often arrive later. So finalized is a pre-recovery snapshot; that’s conservative for trusting WarmKeyDelta during catch-up.
There was a problem hiding this comment.
I mean that at the next lines we inject warm keys only for finalized nonce:
if rec.Nonce <= finalized && len(rec.WarmKeyDelta) > 0 {
h.sm.InjectWarmKeys(rec.WarmKeyDelta)
}So we have following situation:
- We are trying to catch up
- We got the diff and empty storage
- We are trying to
computeFinalizedNonce, but because of empty storage it is 0 - We do not
InjectWarmKeysas conditionrec.Nonce <= finalizedis never met - We fail on
applyAndPersistif there was warm key changes
I think we need somehow get signatures at computeFinalizedNonce, maybe not from store ( store.GetSignatures(escrowID, n) ) but from diffs
| } | ||
| if err := h.applyAndPersist(rec.Diff); err != nil { | ||
| return sigs, fmt.Errorf("apply recovered diff nonce %d: %w", rec.Nonce, err) | ||
| } |
There was a problem hiding this comment.
Here is the possible attack from peer:
peer provides wrong warm key diffs and they are stored at h.sm.InjectWarmKeys(rec.WarmKeyDelta)
Then h.applyAndPersist fails
If attacker provides change in warm key where there was no change, we will not be able to catch up and recover at all even on retry as warm key is already broken.
We should make a transaction here and rollback if h.applyAndPersist fails
problem: GetSignatures failures in computeFinalizedNonce were ignored with only a silent continue, so operators had no visibility when signature reads failed during finality computation. solution: Emit a logging.Error with subsystem host, escrow_id, nonce, and error before continuing.
problem: computeFinalizedNonce built confirmedBy with a nested loop over prev for each signature at nonce n, so work scaled like O(nonce * signatures per height) and could be quadratic. A separate pass then scanned for the highest finalized nonce. solution: Iterate nonces from latest down to 1 with a single running Bitmap128 (slots that signed at any height >= current n), check the 2/3 slot-weight threshold in the same loop, and return the first matching nonce from the top. Removes the inner prev loop and the extra scan. Also fix host_recovery_test to handle NewStateMachine’s (*StateMachine, error) return so ./host tests compile.
problem: Malicious or bad peers can attach WarmKeyDelta to gossip recovery; we InjectWarmKeys before applyAndPersist. If ApplyDiff then fails, bindings were already written and InjectWarmKeys does not overwrite, leaving poisoned state so retries cannot recover. Injecting when the nonce was already applied (no-op apply) could also pollute warm keys. solution: Snapshot warm keys before inject; on apply failure, if LatestNonce is unchanged, RestoreWarmKeys to the snapshot. Only inject when the nonce is still ahead of LatestNonce. Add StateMachine.RestoreWarmKeys and mirror the pattern in user session replay. Tests cover rollback and restore.
Implements #955