Skip to content

[P0] Make handling of warm keys deterministic #986

Open
Brgndy25 wants to merge 10 commits intoupgrade-v0.2.12from
brgndy25/#913-impl-warm-key-determinisim
Open

[P0] Make handling of warm keys deterministic #986
Brgndy25 wants to merge 10 commits intoupgrade-v0.2.12from
brgndy25/#913-impl-warm-key-determinisim

Conversation

@Brgndy25
Copy link
Copy Markdown
Collaborator

@Brgndy25 Brgndy25 commented Mar 31, 2026

Implements #955

Problem: Recovering hosts need a deterministic way to decide which nonces are BFT-committed (2/3+ signatures) so warm key bindings can be trusted without re-querying mainnet.

Solution: Introduce subnet/host/finality.go with computeFinalizedNonce and helper functions (bitmapSlotWeight, twoThirdsWeight) to compute the highest nonce finalized by a 2/3+ signer threshold based on stored per-nonce signatures.
Problem: The /diffs HTTP path and gossip recovery path only carried bare
types.Diff values, so state_hash and warm_key_delta from each stored record
were dropped on fetch. Recovery could not align with full per-nonce metadata
needed for warm-key determinism.

Solution: Return types.DiffRecord from HTTPClient.GetDiffs (decode diff,
state_hash, and warm_key_delta), mirror that in HandleGetDiffs JSON, and
thread DiffRecord through gossip DiffFetcher/StateUpdater and
Host.ApplyRecoveredDiffs with updated tests.
problem:
Gossip recovery replays DiffRecords from peers. Warm key bindings carried in WarmKeyDelta should not be treated as authoritative for every nonce; only nonces that are effectively BFT-committed (enough signatures) should skip mainnet ResolveWarmKey.

solution:
ApplyRecoveredDiffs computes a finalized nonce watermark F from stored signatures (computeFinalizedNonce). For each recovered record with nonce ≤ F and a non-empty WarmKeyDelta, it calls InjectWarmKeys before applyAndPersist. For nonce > F (or when there is no storage), replay keeps using the existing ResolveWarmKey path.
Copilot AI review requested due to automatic review settings March 31, 2026 17:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends gossip recovery to transport additional metadata alongside diffs (notably WarmKeyDelta and StateHash) to support more deterministic warm-key handling during recovery, and introduces a local-signature–based “finalized nonce” computation to decide when it’s safe to trust warm-key deltas from the wire.

Changes:

  • Extend the /diffs HTTP response to include warm_key_delta and update the HTTP client to return []types.DiffRecord instead of []types.Diff.
  • Update gossip interfaces/tests and host recovery (ApplyRecoveredDiffs) to consume DiffRecord and inject warm keys when allowed.
  • Add subnet/host/finality.go to compute a finalized nonce from locally stored signatures.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
subnet/transport/server.go Adds warm_key_delta to diff JSON response payload.
subnet/transport/client.go Changes GetDiffs to return []types.DiffRecord and decodes state_hash + warm_key_delta.
subnet/transport/client_test.go Updates test assertions for new diff record shape.
subnet/host/host.go Updates recovery path to accept DiffRecord, compute finalized nonce, and inject warm keys conditionally.
subnet/host/finality.go New finalized-nonce computation based on stored signatures.
subnet/gossip/interface.go Updates interfaces to use []types.DiffRecord.
subnet/gossip/gossip_test.go Updates mocks/tests for the new diff record type.
Comments suppressed due to low confidence (1)

subnet/host/host.go:860

  • ApplyRecoveredDiffs now receives DiffRecord including StateHash, but the recovered StateHash is never validated against the state root computed after applying the diff. That means a peer could return corrupted/mismatched diffs (or an incorrect WarmKeyDelta) and the host would still apply/persist them without detecting the divergence at recovery time. Since signIfAccepted already returns the computed root, compare it to rec.StateHash (when provided) and fail recovery on mismatch.
		// Sign state with acceptance check (same path as HandleRequest).
		stateSig, root, nonce, err := h.signIfAccepted(nil)
		if err != nil {
			return sigs, fmt.Errorf("sign recovered state: %w", err)
		}

Comment on lines +33 to +36
threshold := twoThirdsWeight(group)
for f := latestNonce; f > 0; f-- {
if bitmapSlotWeight(confirmedBy[f], group) >= threshold {
return f
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

computeFinalizedNonce uses twoThirdsWeight(group) as the threshold, but twoThirdsWeight currently computes ceil(2/3 * n). This is weaker than the finalization threshold used elsewhere (e.g. Host.checkFinalization uses 2*len(group)/3 + 1, i.e. >2/3). As a result, finalized may be overestimated (e.g. group size 3 => threshold 2 here vs 3 in checkFinalization), causing ApplyRecoveredDiffs to trust WarmKeyDelta from the wire before the nonce is actually finalized. Align the threshold calculation with the existing finalization rule (>2/3).

Copilot uses AI. Check for mistakes.
Comment on lines +841 to +845
var finalized uint64
if h.store != nil {
finalized = computeFinalizedNonce(h.store, h.escrowID, latestNonce, h.group)
}

Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New finality/warm-key trust behavior is introduced via computeFinalizedNonce and the rec.Nonce <= finalized gate, but there are no unit tests covering (a) the finalized threshold semantics across group sizes and (b) that WarmKeyDelta is only trusted up to the finalized nonce. Adding focused tests for computeFinalizedNonce and ApplyRecoveredDiffs would help prevent regressions in recovery correctness.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

@heitor-lassarote heitor-lassarote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Brgndy25 and others added 3 commits April 6, 2026 16:31
Problem:
Finalized nonce logic (2/3 slot thresholds, transitivity, partial later nonces) was not covered by direct tests, so regressions in escrow finality behavior would be easy to miss.

Solution:
Add subnet/host/finality_test.go with helpers for in-memory escrow sessions and signatures, and tests for insufficient signers (F=0), partial high nonces not raising F, transitivity from a single high-nonce supermajority, 2/3 thresholds for groups of 3 and 6, and latestNonce==0.
Problem:
Recovery replay needs to combine BFT-finality (computeFinalizedNonce) with whether WarmKeyDelta on the wire is trusted (inject) or warm keys must be resolved via the bridge. That split across F, nonce ordering, and empty deltas was not covered by tests, and reproducing pre-replay signatures without diff rows is awkward with a plain Memory store.

Solution:
Add host_recovery_test.go with a recoverySigStore wrapper that overlays pre-replay slot signatures for computeFinalizedNonce, plus helpers to build hosts with a counting warm-key resolver. Tests assert: finalized nonce + WarmKeyDelta uses inject (no resolver); nonce above F uses resolver; F=0 uses resolver; finalized but nil WarmKeyDelta still resolves ConfirmStart.
@Brgndy25 Brgndy25 marked this pull request as ready for review April 6, 2026 13:46
@Brgndy25 Brgndy25 changed the title Brgndy25/#913 impl warm key determinisim [P0] Make handling of warm keys deterministic Apr 6, 2026
@tcharchian tcharchian linked an issue Apr 7, 2026 that may be closed by this pull request
sigs, err := store.GetSignatures(escrowID, n)
if err != nil {
continue
}
Copy link
Copy Markdown
Collaborator

@akup akup Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add log at error level here if err != nil

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirmedBy[prev] = bm
}
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is O(N^2) cycle when O(N) is intended to be.
We just don't need inner cycle

for prev := uint64(1); prev <= n; prev++

It could be:

confirmedBy := make(map[uint64]types.Bitmap128)
// running = slots that have signed at any nonce >= current n.
var running types.Bitmap128
for n := latestNonce; n > 0; n-- {
	sigs, err := store.GetSignatures(escrowID, n)
	if err != nil {
		// keep previous running set; just skip adding new signers for this nonce
		confirmedBy[n] = running
		continue
	}
	// A slot that signed at n confirms all nonces <= n.
	// In reverse traversal, that means it belongs to running for n and all lower nonces.
	for slotID := range sigs {
		running.Set(slotID)
	}
	confirmedBy[n] = running
}

O(N) cycle

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover we don't need next cycle, we can here add

...
confirmedBy[n] = running
if bitmapSlotWeight(confirmedBy[f], group) >= threshold {
  return n
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


var finalized uint64
if h.store != nil {
finalized = computeFinalizedNonce(h.store, h.escrowID, latestNonce, h.group)
Copy link
Copy Markdown
Collaborator

@akup akup Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calling here computeFinalizedNonce that is using stored values to calculate finalized

sigs, err := store.GetSignatures(escrowID, n)

but it seams store is empty as peer signatures written after catch-up gossip exchange

Am I right or missed something? where do we fill the store?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

computeFinalizedNonce runs before we applyAndPersist the recovered diffs. Peer sigs only go into storage via AccumulateGossipSig, which requires a stored diff for that nonce first, so for nonces in the recovery batch, peer sigs usually aren’t there yet at that line. Recovery also gossips our sigs after apply, so others’ sigs for that gap often arrive later. So finalized is a pre-recovery snapshot; that’s conservative for trusting WarmKeyDelta during catch-up.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that at the next lines we inject warm keys only for finalized nonce:

if rec.Nonce <= finalized && len(rec.WarmKeyDelta) > 0 {
	h.sm.InjectWarmKeys(rec.WarmKeyDelta)
}

So we have following situation:

  1. We are trying to catch up
  2. We got the diff and empty storage
  3. We are trying to computeFinalizedNonce, but because of empty storage it is 0
  4. We do not InjectWarmKeys as condition rec.Nonce <= finalized is never met
  5. We fail on applyAndPersist if there was warm key changes

I think we need somehow get signatures at computeFinalizedNonce, maybe not from store ( store.GetSignatures(escrowID, n) ) but from diffs

}
if err := h.applyAndPersist(rec.Diff); err != nil {
return sigs, fmt.Errorf("apply recovered diff nonce %d: %w", rec.Nonce, err)
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the possible attack from peer:
peer provides wrong warm key diffs and they are stored at h.sm.InjectWarmKeys(rec.WarmKeyDelta)

Then h.applyAndPersist fails

If attacker provides change in warm key where there was no change, we will not be able to catch up and recover at all even on retry as warm key is already broken.

We should make a transaction here and rollback if h.applyAndPersist fails

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brgndy25 added 3 commits April 9, 2026 00:30
problem:
GetSignatures failures in computeFinalizedNonce were ignored with only a silent continue, so operators had no visibility when signature reads failed during finality computation.

solution:
Emit a logging.Error with subsystem host, escrow_id, nonce, and error before continuing.
problem:
computeFinalizedNonce built confirmedBy with a nested loop over prev for each
signature at nonce n, so work scaled like O(nonce * signatures per height) and
could be quadratic. A separate pass then scanned for the highest finalized nonce.

solution:
Iterate nonces from latest down to 1 with a single running Bitmap128 (slots
that signed at any height >= current n), check the 2/3 slot-weight threshold in
the same loop, and return the first matching nonce from the top. Removes the
inner prev loop and the extra scan. Also fix host_recovery_test to handle
NewStateMachine’s (*StateMachine, error) return so ./host tests compile.
problem:
Malicious or bad peers can attach WarmKeyDelta to gossip recovery; we
InjectWarmKeys before applyAndPersist. If ApplyDiff then fails, bindings
were already written and InjectWarmKeys does not overwrite, leaving
poisoned state so retries cannot recover. Injecting when the nonce was
already applied (no-op apply) could also pollute warm keys.

solution:
Snapshot warm keys before inject; on apply failure, if LatestNonce is
unchanged, RestoreWarmKeys to the snapshot. Only inject when the nonce is
still ahead of LatestNonce. Add StateMachine.RestoreWarmKeys and mirror
the pattern in user session replay. Tests cover rollback and restore.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Needs reviewer

Development

Successfully merging this pull request may close these issues.

[P0] Make handling of warm keys deterministic (implementation)

5 participants