feat: intra-epoch circuit breaker with reputation-adjusted executor selection #984
mingles-agent wants to merge 17 commits into gonka-ai:main
…research docs: research notes for health-aware executor selection
Add CalculateSelectionWeight helper to the epochgroup package that scales raw staking weight by a node's reputation score (0-100) using a 1% floor to ensure new/low-reputation nodes still receive occasional selection. Apply this at executor selection time inside GetRandomMember by building an in-memory selectionWeights map from the stored ValidationWeights (which carry both Weight and Reputation from epoch start). The map is passed through to selectRandomParticipant via the updated computeCumulativeArray helper. Design decision: the cosmos-sdk group member weights (raw stake) are left unchanged so that block-production power (GetComputeResults) and PoC voting weights (GetValidationWeights) are unaffected. Selection weighting is applied only at lottery time, computed from already-persisted reputation data — no protobuf schema changes required. Addresses issue #4
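The weighting described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names, the integer rounding, and the clamp to at least 1 are assumptions; only the 0-100 reputation scale, the 1% floor, and the cumulative-array lottery come from the commit message.

```go
package main

import "fmt"

// minReputationPct is the 1% floor named in the commit message.
const minReputationPct int64 = 1

// calculateSelectionWeight is a hypothetical stand-in for CalculateSelectionWeight:
// it scales raw stake weight by a reputation percentage, never below the floor.
func calculateSelectionWeight(weight, reputation int64) int64 {
	if weight <= 0 {
		return 0
	}
	pct := reputation
	if pct < minReputationPct {
		pct = minReputationPct
	}
	if pct > 100 {
		pct = 100
	}
	scaled := weight * pct / 100
	if scaled < 1 {
		scaled = 1 // assumption: tiny stakes stay selectable after integer division
	}
	return scaled
}

// cumulativeWeights builds the running totals used by a weighted lottery.
func cumulativeWeights(addrs []string, selection map[string]int64) []int64 {
	cum := make([]int64, len(addrs))
	var total int64
	for i, a := range addrs {
		total += selection[a]
		cum[i] = total
	}
	return cum
}

// pick returns the address whose bucket contains ticket, where ticket is in [0, total).
func pick(addrs []string, cum []int64, ticket int64) string {
	for i, c := range cum {
		if ticket < c {
			return addrs[i]
		}
	}
	return ""
}

func main() {
	addrs := []string{"nodeA", "nodeB"}
	selection := map[string]int64{
		"nodeA": calculateSelectionWeight(1000, 100), // full reputation
		"nodeB": calculateSelectionWeight(1000, 0),   // zero reputation, floored to 1%
	}
	cum := cumulativeWeights(addrs, selection)
	fmt.Println(selection["nodeA"], selection["nodeB"], cum) // 1000 10 [1000 1010]
	fmt.Println(pick(addrs, cum, 1005))                      // nodeB
}
```

Because the map is built in memory at selection time, the stored group weights are never touched, which matches the design decision stated in the commit.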
…eight feat: reputation-adjusted executor selection weight at epoch start
…#5) Implements a threshold-based circuit breaker in createHealthFilterFn that complements the existing SPRT-based deactivation (getInactiveStatus). Where SPRT is accurate but slow (10–50+ inferences), this CB catches degraded nodes in 4+ samples.

State machine: HEALTHY → EXCLUDED → PROBE → HEALTHY/re-EXCLUDED

- New keeper/circuit_breaker.go: CBState type, CircuitBreakerEntry struct, Get/Set/Delete/Clear/Exclude/PromoteToProbe/RecordCBResult methods. State persisted as JSON in prefix store (no proto gen required).
- keeper/query_get_random_executor.go: createHealthFilterFn composed into createFilterFn for both inference-phase and PoC-phase paths. Safety fallback returns full member list if all candidates are excluded.
- module/module.go: RecordCBResult(miss=false) called after expired inference penalty; ClearAllCBState on epoch boundary (epoch = final backstop).
- msg_server_finish_inference.go: RecordCBResult(success=true) called after successful inference completion to resolve PROBE state.
- types/keys.go: CircuitBreakerStatePrefix (53), CircuitBreakerStateKey.

CB parameters (all configurable via constants, promotable to proto params): MissThresholdPct=25%, MinSamples=4, InitialCooldown=50 blocks (~5 min), MaxCooldown=500 blocks (~50 min), backoff=2x per re-exclusion.

Unit tests cover: exclusion, probe promotion, probe success/fail + backoff, all-degraded fallback, epoch reset.

Addresses issue #5
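To make the parameters concrete, here is a minimal sketch of the exclusion check and the cooldown backoff. The function names are hypothetical, and two details are assumptions not stated in the commit: the threshold comparison is treated as inclusive (>=), and the backoff is modeled as doubling once per prior re-exclusion, capped at MaxCooldown.

```go
package main

import "fmt"

// Values taken from the commit message.
const (
	missThresholdPct = 25
	minSamples       = 4
	initialCooldown  = 50  // blocks
	maxCooldown      = 500 // blocks
)

// shouldExclude reports whether a node's miss rate trips the breaker.
// Assumption: at least minSamples results are needed and the comparison is inclusive.
func shouldExclude(misses, total uint64) bool {
	if total < minSamples {
		return false
	}
	// Integer form of misses/total >= missThresholdPct/100.
	return misses*100 >= total*missThresholdPct
}

// cooldownFor returns the cooldown in blocks after the given number of
// prior re-exclusions, doubling each time and capping at maxCooldown.
func cooldownFor(reExclusions int) int64 {
	cooldown := int64(initialCooldown)
	for i := 0; i < reExclusions; i++ {
		cooldown *= 2
		if cooldown >= maxCooldown {
			return maxCooldown
		}
	}
	return cooldown
}

func main() {
	fmt.Println(shouldExclude(1, 4), shouldExclude(1, 3)) // true false
	for n := 0; n <= 4; n++ {
		fmt.Print(cooldownFor(n), " ") // 50 100 200 400 500
	}
	fmt.Println()
}
```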
…own-probe feat: intra-epoch fast circuit breaker with cooldown + probe traffic recovery
…nts-height-zero fix: skip proof query when CreatedAtBlockHeight is 0 for old epochs
…om query context (#19)
- Remove k.PromoteCBEntryToProbe() call from CBStateExcluded cooldown-expiry branch
- Remove k.ExcludeCBEntry() call from CBStateHealthy high-miss-rate branch
- Filter now purely classifies include/exclude without writing any state
- EndBlock (PR 2 of 3) will handle all CB state transitions
- Update tests: assert CB state is NOT written by filter (stays Healthy/Excluded)
- Update function comment to document read-only contract
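A read-only filter with the safety fallback might look like the sketch below. The types and names are hypothetical, and the choice to let an EXCLUDED node through once its cooldown has expired (without promoting it) is an assumption about how the filter classifies that branch.

```go
package main

import "fmt"

type cbState int

const (
	healthy cbState = iota
	excluded
	probe
)

type cbEntry struct {
	State         cbState
	CooldownUntil int64 // first block height at which the cooldown has expired
}

// healthFilter classifies members without writing any state. Entries are passed
// by value inside the map, so the function cannot mutate stored CB state.
// If every candidate would be filtered out, it falls back to the full list.
func healthFilter(members []string, entries map[string]cbEntry, height int64) []string {
	kept := make([]string, 0, len(members))
	for _, m := range members {
		e, ok := entries[m]
		if ok && e.State == excluded && height < e.CooldownUntil {
			continue // still cooling down: skip, but do not change its state
		}
		kept = append(kept, m)
	}
	if len(kept) == 0 {
		return members // safety fallback: never return an empty candidate set
	}
	return kept
}

func main() {
	entries := map[string]cbEntry{"nodeB": {State: excluded, CooldownUntil: 120}}
	fmt.Println(healthFilter([]string{"nodeA", "nodeB"}, entries, 100)) // [nodeA]
	fmt.Println(healthFilter([]string{"nodeB"}, entries, 100))          // [nodeB]
}
```

The second call shows why single-node tests can trip the fallback: with only one candidate, excluding it would leave nothing to select.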
- Add GetAllCBEntries iterator to scan the full CB store
- Add UpdateCBStateForBlock with two-pass logic:
  - Pass 1: promote EXCLUDED entries with expired cooldowns to PROBE
  - Pass 2: exclude HEALTHY participants crossing the miss-rate threshold
- Wire UpdateCBStateForBlock into EndBlock after inference expiry handling
- Add 5 EndBlock CB tests covering all acceptance criteria

Addresses issue #20
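The two-pass update can be sketched over an in-memory map as below. This is an illustration under assumptions: the real code iterates a keeper store rather than a map, the entry fields and the fixed cooldown are simplified, and iteration over sorted addresses is added here only because Go map order is random and consensus code generally needs a deterministic order.

```go
package main

import (
	"fmt"
	"sort"
)

type cbState int

const (
	healthy cbState = iota
	excluded
	probe
)

type cbEntry struct {
	State         cbState
	Misses, Total uint64
	CooldownUntil int64
}

const (
	missThresholdPct = 25
	minSamples       = 4
	cooldownBlocks   = 50 // simplified: no backoff in this sketch
)

// updateCBStateForBlock applies both passes for one block height.
func updateCBStateForBlock(entries map[string]*cbEntry, height int64) {
	addrs := make([]string, 0, len(entries))
	for a := range entries {
		addrs = append(addrs, a)
	}
	sort.Strings(addrs) // deterministic iteration order

	// Pass 1: EXCLUDED entries whose cooldown has expired move to PROBE.
	for _, a := range addrs {
		e := entries[a]
		if e.State == excluded && height >= e.CooldownUntil {
			e.State = probe
		}
	}
	// Pass 2: HEALTHY entries at or above the miss-rate threshold are excluded.
	for _, a := range addrs {
		e := entries[a]
		if e.State != healthy || e.Total < minSamples {
			continue
		}
		if e.Misses*100 >= e.Total*missThresholdPct {
			e.State = excluded
			e.CooldownUntil = height + cooldownBlocks
		}
	}
}

func main() {
	entries := map[string]*cbEntry{
		"cooled":  {State: excluded, CooldownUntil: 100},
		"failing": {State: healthy, Misses: 2, Total: 4},
		"fine":    {State: healthy, Misses: 0, Total: 10},
	}
	updateCBStateForBlock(entries, 100)
	fmt.Println(entries["cooled"].State == probe, entries["failing"].State == excluded, entries["fine"].State == healthy)
}
```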
refactor: make createHealthFilterFn read-only (remove state writes from query context)
feat: Add UpdateCBStateForBlock to EndBlock — move CB writes out of query context
- Add cb_miss_threshold_pct (field 26), cb_min_samples (field 27), cb_initial_cooldown_blocks (field 28), cb_max_cooldown_blocks (field 29) to ValidationParams proto message
- Update params.pb.go: struct fields, getters, Equal, MarshalToSizedBuffer, Size, and Unmarshal for the four new varint fields
- Set defaults in DefaultValidationParams() matching Go constant values: 25%, 4 samples, 50 blocks, 500 blocks
- Add getCBParams() keeper helper that reads from ValidationParams with fallback to compile-time constants (zero value → constant default)
- Update ExcludeCBEntry to use getCBParams for initial and max cooldown
- Update UpdateCBStateForBlock to use getCBParams for miss threshold and min-samples checks
- Update createHealthFilterFn to use getCBParams for miss-rate filtering

Governance proposals can now adjust CB tuning without a chain upgrade. Existing tests unaffected: DefaultParams includes the same values as the previous hardcoded constants.

Addresses issue #21
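The zero-value fallback in getCBParams could be sketched like this. The struct and field names below are hypothetical stand-ins for the generated ValidationParams type; only the four parameters, their defaults, and the "zero means use the constant" rule come from the commit message.

```go
package main

import "fmt"

// Compile-time defaults, matching the values in the commit message.
const (
	defaultMissThresholdPct uint64 = 25
	defaultMinSamples       uint64 = 4
	defaultInitialCooldown  uint64 = 50
	defaultMaxCooldown      uint64 = 500
)

// validationParams is a hypothetical stand-in for the proto-generated type.
type validationParams struct {
	CbMissThresholdPct      uint64
	CbMinSamples            uint64
	CbInitialCooldownBlocks uint64
	CbMaxCooldownBlocks     uint64
}

type cbParams struct {
	MissThresholdPct, MinSamples, InitialCooldown, MaxCooldown uint64
}

// orDefault treats the proto zero value as "unset".
func orDefault(v, def uint64) uint64 {
	if v == 0 {
		return def
	}
	return v
}

// getCBParams resolves effective CB parameters, falling back per field.
func getCBParams(p *validationParams) cbParams {
	if p == nil {
		p = &validationParams{}
	}
	return cbParams{
		MissThresholdPct: orDefault(p.CbMissThresholdPct, defaultMissThresholdPct),
		MinSamples:       orDefault(p.CbMinSamples, defaultMinSamples),
		InitialCooldown:  orDefault(p.CbInitialCooldownBlocks, defaultInitialCooldown),
		MaxCooldown:      orDefault(p.CbMaxCooldownBlocks, defaultMaxCooldown),
	}
}

func main() {
	fmt.Printf("%+v\n", getCBParams(nil))
	fmt.Printf("%+v\n", getCBParams(&validationParams{CbMissThresholdPct: 40}))
}
```

One consequence of this scheme is that governance cannot set a parameter to exactly zero, since zero is indistinguishable from "unset".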
feat: promote circuit breaker params to ValidationParams proto
When RecordCBResult(success=true) fires in block N, the node transitions from PROBE → HEALTHY. However, UpdateCBStateForBlock running in EndBlock of the same block would see a HEALTHY node with stale high miss-rate stats and immediately re-exclude it, undoing the recovery.

Fix: instead of deleting the CB entry on probe success, set State=CBStateHealthy + LastRestoredBlock=blockHeight + ProbeRestored=true. In UpdateCBStateForBlock Pass 2, skip nodes where ProbeRestored==true && LastRestoredBlock==blockHeight (one-block grace period).

Also fixes pre-existing test failures:
- TestUpdateCBStateForBlock_ExcludesHighMissRate: zero-value LastRestoredBlock==0 collided with blockHeight==0 in test context; fixed by adding ProbeRestored bool guard
- TestHealthFilterExcludesHighMissRate / TestHealthFilterExcludedNodeStillInCooldown: single-node tests triggered the safety fallback; fixed by adding a second healthy node

Addresses issues #25 and #28
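The grace-period guard is small enough to show in isolation. This is a sketch with hypothetical type and function names; the field names and the skip condition follow the commit message. It also shows why the bool guard matters: a zero-value entry at block height 0 would otherwise look freshly restored.

```go
package main

import "fmt"

type cbState int

const (
	healthy cbState = iota
	excluded
	probe
)

type cbEntry struct {
	State             cbState
	ProbeRestored     bool
	LastRestoredBlock int64
}

// onProbeSuccess restores a node instead of deleting its entry,
// recording the block at which the restoration happened.
func onProbeSuccess(e *cbEntry, height int64) {
	e.State = healthy
	e.ProbeRestored = true
	e.LastRestoredBlock = height
}

// inGracePeriod reports whether Pass 2 should skip this entry at this height.
func inGracePeriod(e cbEntry, height int64) bool {
	return e.ProbeRestored && e.LastRestoredBlock == height
}

func main() {
	var fresh cbEntry // zero value: LastRestoredBlock == 0
	fmt.Println(inGracePeriod(fresh, 0)) // false, thanks to the ProbeRestored guard

	e := cbEntry{State: probe}
	onProbeSuccess(&e, 7)
	fmt.Println(inGracePeriod(e, 7), inGracePeriod(e, 8)) // true false
}
```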
Core blocker from #974 is still present here. This PR is based on a wrong premise.
Doog-bot534 left a comment
Review: feat: intra-epoch circuit breaker with reputation-adjusted executor selection
Approve with suggestions ✅
Substantial, well-architected feature. The reputation-weighted selection (1% floor prevents starvation) and CB state machine (EXCLUDED → PROBE → HEALTHY) are sound. Excellent test coverage including probabilistic traffic distribution tests.
Key strengths
- Same-block grace period (`ProbeRestored` + `LastRestoredBlock`) prevents immediate re-exclusion after probe
- Exponential backoff cooldowns prevent CB flapping
- 1% weight floor ensures zero-reputation nodes still get occasional traffic for recovery
Potential issues
- JSON for CB state persistence: `json.Marshal`/`Unmarshal` is deterministic for simple structs, but fragile compared to protobuf. If field ordering or float serialization diverges across Go versions, this could cause consensus divergence. Recommend migrating to protobuf.
- `ClearAllCBState` at epoch boundary: Referenced in comments but not visible in this diff. Verify it's wired into the epoch transition handler.
- `selectRandomParticipant` fallback: If `selectionWeights` is missing an address, it falls back to raw group weight, potentially creating inconsistency if `buildSelectionWeightsMap` silently skips a nil `ValidationWeight`.
- Governance param validation: `cb_miss_threshold_pct` should be validated to stay in [1, 100].
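The last suggestion could be implemented along these lines. This is a hedged sketch: the function name is hypothetical, and where such a check would be called from (for example, a params validation method) is not shown in this PR excerpt.

```go
package main

import "fmt"

// validateCBMissThresholdPct rejects percentages outside [1, 100].
// Hypothetical helper; the range comes from the review comment.
func validateCBMissThresholdPct(v uint64) error {
	if v < 1 || v > 100 {
		return fmt.Errorf("cb_miss_threshold_pct must be in [1, 100], got %d", v)
	}
	return nil
}

func main() {
	for _, v := range []uint64{0, 25, 100, 101} {
		fmt.Println(v, validateCBMissThresholdPct(v))
	}
}
```

If a zero value keeps meaning "use the compile-time default", as the getCBParams fallback describes, a check like this would need to allow 0 explicitly or run only on the resolved value.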
Payout address: gonka10zaal553duxp05nvfpqtsqrm2g0j6j34r8nan7
Closes #975
Summary
Two complementary features to prevent malfunctioning nodes from continuing to receive client inference requests for the rest of an epoch.
1. Reputation-adjusted executor selection
Stake weight is scaled by reputation score at epoch start, so well-performing nodes are preferred before hitting the exclusion threshold. Nodes accumulate reputation from successful inferences and lose it from misses.
2. Intra-epoch circuit breaker
Per-node state machine: `ACTIVE → EXCLUDED → PROBE → ACTIVE`
- … `ValidationParams`)
- … `EndBlock` only; query handlers are read-only
3. Probe re-exclusion fix
When a probe succeeds, `UpdateCBStateForBlock` Pass 2 running in the same block could immediately re-exclude the node using stale miss-rate stats. Fixed by adding `LastRestoredBlock` to `CBEntry`: Pass 2 skips nodes where `LastRestoredBlock == blockHeight` (one-block grace period).
4. Circuit breaker params in ValidationParams
CB thresholds (`MissRateThreshold`, `MinSamples`, `CooldownBlocks`, `ProbeInterval`) promoted to `ValidationParams` proto for on-chain governance adjustability.
Files changed
- `inference-chain/x/inference/keeper/circuit_breaker.go`: CB state machine, UpdateCBStateForBlock, GetRandomExecutor
- `inference-chain/x/inference/keeper/circuit_breaker_endblock_test.go`: comprehensive tests
- `inference-chain/x/inference/types/`: CBEntry with LastRestoredBlock, ValidationParams with CB fields
- `inference-chain/x/inference/keeper/params.go`: CB param accessors