feat: intra-epoch circuit breaker with reputation-adjusted executor selection#984

Open
mingles-agent wants to merge 17 commits into gonka-ai:main from MinglesAI:main

Conversation

@mingles-agent

Closes #975

Summary

Two complementary features stop malfunctioning nodes from receiving client inference requests for the rest of an epoch (items 1 and 2), followed by a fix and a parameter promotion that support them (items 3 and 4).

1. Reputation-adjusted executor selection

Stake weight is scaled by the reputation score recorded at epoch start, so well-performing nodes are preferred even before a poorly performing node hits the exclusion threshold. Nodes accumulate reputation from successful inferences and lose it from misses.

2. Intra-epoch circuit breaker

Per-node state machine: ACTIVE → EXCLUDED → PROBE → ACTIVE

  • Exclusion trigger: miss rate > 25% after ≥4 samples (configurable via ValidationParams)
  • Recovery: after cooldown period, node gets one probe slot; success restores it to ACTIVE, failure doubles the cooldown
  • Cosmos-safe: all state writes in EndBlock only — query handlers are read-only
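The exclusion trigger above can be sketched with integer arithmetic only, which avoids floating-point differences between validators. This is a minimal illustration, not the PR's code: the `nodeStats` type and `shouldExclude` helper are hypothetical names, and it assumes `Total` counts both successful and missed inferences.

```go
package main

import "fmt"

// nodeStats holds per-node counters for the current epoch (hypothetical type).
type nodeStats struct {
	Misses uint64
	Total  uint64 // assumed to be successes + misses
}

// shouldExclude reports whether the miss rate is strictly above thresholdPct
// once at least minSamples results have been observed.
func shouldExclude(s nodeStats, thresholdPct, minSamples uint64) bool {
	if s.Total < minSamples {
		return false // not enough evidence yet
	}
	// misses/total > pct/100, rewritten without division.
	return s.Misses*100 > thresholdPct*s.Total
}

func main() {
	fmt.Println(shouldExclude(nodeStats{Misses: 1, Total: 4}, 25, 4)) // false: exactly 25%
	fmt.Println(shouldExclude(nodeStats{Misses: 2, Total: 4}, 25, 4)) // true: 50%
	fmt.Println(shouldExclude(nodeStats{Misses: 3, Total: 3}, 25, 4)) // false: only 3 samples
}
```

With the default values, a node that misses exactly one of four requests stays eligible, because the comparison is strict.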

3. Probe re-exclusion fix

When a probe succeeds, UpdateCBStateForBlock Pass 2 running in the same block could immediately re-exclude the node using stale miss-rate stats. Fixed by adding LastRestoredBlock to CBEntry: Pass 2 skips nodes where LastRestoredBlock == blockHeight (one-block grace period).

4. Circuit breaker params in ValidationParams

CB thresholds (MissRateThreshold, MinSamples, CooldownBlocks, ProbeInterval) promoted to ValidationParams proto for on-chain governance adjustability.

Files changed

  • inference-chain/x/inference/keeper/circuit_breaker.go — CB state machine, UpdateCBStateForBlock, GetRandomExecutor
  • inference-chain/x/inference/keeper/circuit_breaker_endblock_test.go — comprehensive tests
  • inference-chain/x/inference/types/ — CBEntry with LastRestoredBlock, ValidationParams with CB fields
  • inference-chain/x/inference/keeper/params.go — CB param accessors

MinglesAI and others added 17 commits March 24, 2026 17:50
…research

docs: research notes for health-aware executor selection
Add CalculateSelectionWeight helper to the epochgroup package that scales
raw staking weight by a node's reputation score (0-100) using a 1% floor
to ensure new/low-reputation nodes still receive occasional selection.
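
A minimal sketch of how such a helper could behave, assuming integer percent scaling. The lowercase `calculateSelectionWeight` below is a hypothetical stand-in and not the PR's actual signature; the rule that a positive stake never rounds down to zero is also an assumption.

```go
package main

import "fmt"

// calculateSelectionWeight scales stake by reputation (0-100), treating any
// reputation below 1 as 1 so that the node keeps 1% of its stake weight.
func calculateSelectionWeight(stake int64, reputation int32) int64 {
	if stake <= 0 {
		return 0
	}
	pct := int64(reputation)
	if pct < 1 {
		pct = 1 // the 1% floor
	}
	if pct > 100 {
		pct = 100
	}
	w := stake * pct / 100 // may overflow for extremely large stakes
	if w < 1 {
		w = 1 // assumption: a staked node never drops to weight zero
	}
	return w
}

func main() {
	fmt.Println(calculateSelectionWeight(1000, 100)) // 1000
	fmt.Println(calculateSelectionWeight(1000, 50))  // 500
	fmt.Println(calculateSelectionWeight(1000, 0))   // 10
}
```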

Apply this at executor selection time inside GetRandomMember by building an
in-memory selectionWeights map from the stored ValidationWeights (which carry
both Weight and Reputation from epoch start). The map is passed through to
selectRandomParticipant via the updated computeCumulativeArray helper.
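
The lottery step can be illustrated as a cumulative-weight lookup. This is a sketch under assumptions: the `member` type and the `cumulative` and `pick` helpers are hypothetical, and the random draw is passed in as a plain integer rather than taken from the chain's actual randomness source. Members missing from the map fall back to their raw group weight.

```go
package main

import (
	"fmt"
	"sort"
)

// member mirrors a group member with its raw stake weight (hypothetical type).
type member struct {
	Address string
	Weight  int64
}

// cumulative builds running totals of selection weights in member order.
func cumulative(members []member, selectionWeights map[string]int64) []int64 {
	out := make([]int64, len(members))
	var total int64
	for i, m := range members {
		w, ok := selectionWeights[m.Address]
		if !ok {
			w = m.Weight // fall back to raw group weight
		}
		total += w
		out[i] = total
	}
	return out
}

// pick returns the index of the member owning draw, where 0 <= draw < total.
func pick(cum []int64, draw int64) int {
	return sort.Search(len(cum), func(i int) bool { return cum[i] > draw })
}

func main() {
	members := []member{{"a", 100}, {"b", 100}, {"c", 100}}
	weights := map[string]int64{"a": 100, "b": 1} // "c" is missing on purpose
	cum := cumulative(members, weights)
	fmt.Println(cum)            // [100 101 201]
	fmt.Println(pick(cum, 99))  // 0
	fmt.Println(pick(cum, 100)) // 1
	fmt.Println(pick(cum, 101)) // 2
}
```

In this example node "b" owns a single slot out of 201, so it is still selectable but rarely.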

Design decision: the cosmos-sdk group member weights (raw stake) are left
unchanged so that block-production power (GetComputeResults) and PoC
voting weights (GetValidationWeights) are unaffected. Selection weighting
is applied only at lottery time, computed from already-persisted reputation
data — no protobuf schema changes required.

Addresses issue #4
…eight

feat: reputation-adjusted executor selection weight at epoch start
…#5)

Implements a threshold-based circuit breaker in createHealthFilterFn that
complements the existing SPRT-based deactivation (getInactiveStatus). Where
SPRT is accurate but slow (10–50+ inferences), this CB catches degraded nodes
in 4+ samples.

State machine: HEALTHY → EXCLUDED → PROBE → HEALTHY/re-EXCLUDED

- New keeper/circuit_breaker.go: CBState type, CircuitBreakerEntry struct,
  Get/Set/Delete/Clear/Exclude/PromoteToProbe/RecordCBResult methods. State
  persisted as JSON in prefix store (no proto gen required). Addresses issue #5.

- keeper/query_get_random_executor.go: createHealthFilterFn composed into
  createFilterFn for both inference-phase and PoC-phase paths. Safety fallback
  returns full member list if all candidates are excluded.

- module/module.go: RecordCBResult(success=false) called after expired inference
  penalty; ClearAllCBState on epoch boundary (epoch = final backstop).

- msg_server_finish_inference.go: RecordCBResult(success=true) called after
  successful inference completion to resolve PROBE state.

- types/keys.go: CircuitBreakerStatePrefix (53), CircuitBreakerStateKey.
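
The safety fallback in the health filter can be sketched as below. This is a simplified illustration: `cbState` and `filterHealthy` are hypothetical names, CB state is modelled as an in-memory map instead of the prefix store, and it assumes nodes in PROBE state remain eligible so they can receive probe traffic.

```go
package main

import "fmt"

// cbState is a simplified circuit-breaker state (hypothetical type).
type cbState int

const (
	healthy cbState = iota // zero value: nodes without an entry count as healthy
	excluded
	probe
)

// filterHealthy drops excluded members. If that would leave nobody, it
// returns the full list so that executor selection never fails outright.
func filterHealthy(members []string, states map[string]cbState) []string {
	kept := make([]string, 0, len(members))
	for _, m := range members {
		if states[m] != excluded {
			kept = append(kept, m)
		}
	}
	if len(kept) == 0 {
		return members // safety fallback: everyone is excluded
	}
	return kept
}

func main() {
	members := []string{"a", "b"}
	fmt.Println(filterHealthy(members, map[string]cbState{"b": excluded}))                // [a]
	fmt.Println(filterHealthy(members, map[string]cbState{"a": excluded, "b": excluded})) // [a b]
}
```

The function only reads state, so it is safe to call from a query handler.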

CB parameters (all configurable via constants, promotable to proto params):
  MissThresholdPct=25%, MinSamples=4, InitialCooldown=50 blocks (~5 min),
  MaxCooldown=500 blocks (~50 min), backoff=2x per re-exclusion.
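
The backoff schedule implied by these values can be sketched as follows. The `nextCooldown` helper is a hypothetical name; the sketch assumes the cooldown doubles on each re-exclusion and is capped at the maximum.

```go
package main

import "fmt"

// nextCooldown returns the cooldown for the next exclusion: the initial value
// on first exclusion, otherwise double the current one, capped at max.
func nextCooldown(current, initial, max int64) int64 {
	if current <= 0 {
		return initial
	}
	next := current * 2
	if next > max {
		return max
	}
	return next
}

func main() {
	var cd int64
	for i := 0; i < 6; i++ {
		cd = nextCooldown(cd, 50, 500)
		fmt.Print(cd, " ")
	}
	fmt.Println() // 50 100 200 400 500 500
}
```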

Unit tests cover: exclusion, probe promotion, probe success/fail + backoff,
all-degraded fallback, epoch reset.

Addresses issue #5
…own-probe

feat: intra-epoch fast circuit breaker with cooldown + probe traffic recovery
…nts-height-zero

fix: skip proof query when CreatedAtBlockHeight is 0 for old epochs
…om query context (#19)

- Remove k.PromoteCBEntryToProbe() call from CBStateExcluded cooldown-expiry branch
- Remove k.ExcludeCBEntry() call from CBStateHealthy high-miss-rate branch
- Filter now purely classifies include/exclude without writing any state
- EndBlock (PR 2 of 3) will handle all CB state transitions
- Update tests: assert CB state is NOT written by filter (stays Healthy/Excluded)
- Update function comment to document read-only contract
- Add GetAllCBEntries iterator to scan the full CB store
- Add UpdateCBStateForBlock with two-pass logic:
  Pass 1: promote EXCLUDED entries with expired cooldowns to PROBE
  Pass 2: exclude HEALTHY participants crossing the miss-rate threshold
- Wire UpdateCBStateForBlock into EndBlock after inference expiry handling
- Add 5 EndBlock CB tests covering all acceptance criteria
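
The two-pass update might look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: entries live in an in-memory map instead of the keeper store, the `entry` fields and `updateForBlock` name are hypothetical, and the constants mirror the defaults quoted in the commit messages.

```go
package main

import "fmt"

type cbState int

const (
	healthy cbState = iota
	excluded
	probe
)

// entry is a simplified per-node CB record (hypothetical fields).
type entry struct {
	State         cbState
	ExcludedUntil int64 // block height at which the cooldown expires
	Misses, Total uint64
}

const (
	missThresholdPct uint64 = 25
	minSamples       uint64 = 4
	initialCooldown  int64  = 50
)

// updateForBlock applies both passes. Each entry is updated independently, so
// map iteration order does not affect the result here; a real keeper would
// iterate the store in key order.
func updateForBlock(entries map[string]*entry, height int64) {
	// Pass 1: promote excluded entries whose cooldown has expired.
	for _, e := range entries {
		if e.State == excluded && height >= e.ExcludedUntil {
			e.State = probe
		}
	}
	// Pass 2: exclude healthy entries above the miss-rate threshold.
	for _, e := range entries {
		if e.State == healthy && e.Total >= minSamples &&
			e.Misses*100 > missThresholdPct*e.Total {
			e.State = excluded
			e.ExcludedUntil = height + initialCooldown
		}
	}
}

func main() {
	entries := map[string]*entry{
		"a": {State: excluded, ExcludedUntil: 100},
		"b": {State: healthy, Misses: 2, Total: 4},
		"c": {State: healthy, Misses: 1, Total: 4},
	}
	updateForBlock(entries, 100)
	fmt.Println(entries["a"].State == probe)    // true
	fmt.Println(entries["b"].State == excluded) // true
	fmt.Println(entries["c"].State == healthy)  // true
}
```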

Addresses issue #20
refactor: make createHealthFilterFn read-only (remove state writes from query context)
feat: Add UpdateCBStateForBlock to EndBlock — move CB writes out of query context
- Add cb_miss_threshold_pct (field 26), cb_min_samples (field 27),
  cb_initial_cooldown_blocks (field 28), cb_max_cooldown_blocks (field 29)
  to ValidationParams proto message
- Update params.pb.go: struct fields, getters, Equal, MarshalToSizedBuffer,
  Size, and Unmarshal for the four new varint fields
- Set defaults in DefaultValidationParams() matching Go constant values:
  25%, 4 samples, 50 blocks, 500 blocks
- Add getCBParams() keeper helper that reads from ValidationParams with
  fallback to compile-time constants (zero value → constant default)
- Update ExcludeCBEntry to use getCBParams for initial and max cooldown
- Update UpdateCBStateForBlock to use getCBParams for miss threshold and
  min-samples checks
- Update createHealthFilterFn to use getCBParams for miss-rate filtering

Governance proposals can now adjust CB tuning without a chain upgrade.
Existing tests unaffected: DefaultParams includes the same values as
the previous hardcoded constants.
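
The zero-value fallback described above can be sketched as follows. The struct and field names are assumptions modelled on the proto field names, and `uint64` is a guess at the field type, since the commit only says the fields are varints.

```go
package main

import "fmt"

// validationParams stands in for the generated proto struct (assumed shape).
type validationParams struct {
	CbMissThresholdPct      uint64
	CbMinSamples            uint64
	CbInitialCooldownBlocks uint64
	CbMaxCooldownBlocks     uint64
}

// cbParams is the resolved view used by the circuit breaker (hypothetical).
type cbParams struct {
	MissThresholdPct, MinSamples, InitialCooldown, MaxCooldown uint64
}

// orDefault treats zero as "unset", for example params stored before the
// new fields existed.
func orDefault(v, def uint64) uint64 {
	if v == 0 {
		return def
	}
	return v
}

// getCBParams resolves on-chain values against compile-time defaults.
func getCBParams(p validationParams) cbParams {
	return cbParams{
		MissThresholdPct: orDefault(p.CbMissThresholdPct, 25),
		MinSamples:       orDefault(p.CbMinSamples, 4),
		InitialCooldown:  orDefault(p.CbInitialCooldownBlocks, 50),
		MaxCooldown:      orDefault(p.CbMaxCooldownBlocks, 500),
	}
}

func main() {
	fmt.Println(getCBParams(validationParams{}))                // {25 4 50 500}
	fmt.Println(getCBParams(validationParams{CbMinSamples: 8})) // {25 8 50 500}
}
```

One consequence of this design is that governance cannot set any of these values to zero, because zero is indistinguishable from "unset".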

Addresses issue #21
feat: promote circuit breaker params to ValidationParams proto
When RecordCBResult(success=true) fires in block N, the node transitions from PROBE → HEALTHY. However, UpdateCBStateForBlock running in EndBlock of the same block would see a HEALTHY node with stale high miss-rate stats and immediately re-exclude it, undoing the recovery.

Fix: instead of deleting the CB entry on probe success, set State=CBStateHealthy + LastRestoredBlock=blockHeight + ProbeRestored=true. In UpdateCBStateForBlock Pass 2, skip nodes where ProbeRestored==true && LastRestoredBlock==blockHeight (one-block grace period).

Also fixes pre-existing test failures:
- TestUpdateCBStateForBlock_ExcludesHighMissRate: zero-value LastRestoredBlock==0 collided with blockHeight==0 in test context; fixed by adding ProbeRestored bool guard
- TestHealthFilterExcludesHighMissRate / TestHealthFilterExcludedNodeStillInCooldown: single-node tests triggered the safety fallback; fixed by adding a second healthy node
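
The grace-period guard and the zero-value collision it avoids can be shown in a few lines. The `entry` type here carries only the two fields named above, and `inGracePeriod` is a hypothetical helper name.

```go
package main

import "fmt"

// entry carries only the fields relevant to the grace period.
type entry struct {
	ProbeRestored     bool
	LastRestoredBlock int64
}

// inGracePeriod reports whether Pass 2 should skip this entry at height.
// The bool guard matters: without it, a fresh entry (LastRestoredBlock == 0)
// would look "just restored" whenever the block height is 0, as in tests.
func inGracePeriod(e entry, height int64) bool {
	return e.ProbeRestored && e.LastRestoredBlock == height
}

func main() {
	fmt.Println(inGracePeriod(entry{}, 0))                                           // false
	fmt.Println(inGracePeriod(entry{ProbeRestored: true, LastRestoredBlock: 10}, 10)) // true
	fmt.Println(inGracePeriod(entry{ProbeRestored: true, LastRestoredBlock: 10}, 11)) // false
}
```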

Addresses issue #25 and #28
@x0152
Collaborator

x0152 commented Mar 31, 2026

Core blocker from #974 is still present here. This PR is based on a wrong premise.

@Doog-bot534 Doog-bot534 left a comment

Review: feat: intra-epoch circuit breaker with reputation-adjusted executor selection

Approve with suggestions

Substantial, well-architected feature. The reputation-weighted selection (1% floor prevents starvation) and CB state machine (EXCLUDED → PROBE → HEALTHY) are sound. Excellent test coverage including probabilistic traffic distribution tests.

Key strengths

  • Same-block grace period (ProbeRestored + LastRestoredBlock) prevents immediate re-exclusion after probe
  • Exponential backoff cooldowns prevent CB flapping
  • 1% weight floor ensures zero-reputation nodes still get occasional traffic for recovery

Potential issues

  1. JSON for CB state persistence: json.Marshal/Unmarshal is deterministic for simple structs, but fragile compared to protobuf. If field ordering or float serialization diverges across Go versions, this could cause consensus divergence. Recommend migrating to protobuf.

  2. ClearAllCBState at epoch boundary: Referenced in comments but not visible in this diff. Verify it's wired into the epoch transition handler.

  3. selectRandomParticipant fallback: If selectionWeights is missing an address, it falls back to raw group weight, potentially creating inconsistency if buildSelectionWeightsMap silently skips a nil ValidationWeight.

  4. Governance param validation: cb_miss_threshold_pct should be validated to stay in [1, 100].
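
A range check along the lines of item 4 could look like the sketch below. Only the [1, 100] bound on the threshold comes from the suggestion above; the `validateCBParams` name, its signature, and the other checks are assumptions added for illustration.

```go
package main

import "fmt"

// validateCBParams rejects values that would make the circuit breaker
// meaningless. Checks other than the threshold range are assumptions.
func validateCBParams(missPct, minSamples, initialCooldown, maxCooldown uint64) error {
	if missPct < 1 || missPct > 100 {
		return fmt.Errorf("cb_miss_threshold_pct must be in [1, 100], got %d", missPct)
	}
	if minSamples == 0 {
		return fmt.Errorf("cb_min_samples must be positive")
	}
	if initialCooldown == 0 || initialCooldown > maxCooldown {
		return fmt.Errorf("cb_initial_cooldown_blocks must be in [1, %d], got %d",
			maxCooldown, initialCooldown)
	}
	return nil
}

func main() {
	fmt.Println(validateCBParams(25, 4, 50, 500))  // <nil>
	fmt.Println(validateCBParams(101, 4, 50, 500)) // cb_miss_threshold_pct must be in [1, 100], got 101
}
```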

Payout address: gonka10zaal553duxp05nvfpqtsqrm2g0j6j34r8nan7

Development

Successfully merging this pull request may close these issues.

Nodes with high miss rate continue receiving inference requests for the rest of the epoch

3 participants