Skip to content

[P1] Implementing maintenance windows#998

Open
Ryanchen911 wants to merge 29 commits intogonka-ai:upgrade-v0.2.12from
Ryanchen911:rc/maintenance_windows_v2
Open

[P1] Implementing maintenance windows#998
Ryanchen911 wants to merge 29 commits intogonka-ai:upgrade-v0.2.12from
Ryanchen911:rc/maintenance_windows_v2

Conversation

@Ryanchen911
Copy link
Copy Markdown

@Ryanchen911 Ryanchen911 commented Apr 2, 2026

This PR is for #927.

Maintenance Windows Implementation — Status Overview

Proposal Reference

See proposals/maintenance-windows/maintenance-windows.md for the full specification.
Task breakdown: proposals/maintenance-windows/maintenance-windows-todo.md.


Section Status

Section Status Summary
1. Data Model & API ✅ Done Proto messages, state types, params, keeper storage, lifecycle state machine
2. Scheduling & Credit ✅ Done Schedule/cancel handlers, credit accrual, overlap search, schedulability query
3. Consensus Liveness Exemption ✅ Done Adapter + collateral hook guards done; SDK fork patch submitted, pending approval
4. Duty Exemptions ✅ Done Assignment suppression, penalty waiver, validation duty, CPoC exemption, epoch-phase rejection
5. Queries & Observability ✅ Done 6 query endpoints, structured events/logs for all lifecycle actions
6. Upgrade & Genesis ✅ Done v0_2_12 upgrade handler, lazy MaintenanceState init, zero-credit start, genesis verified
7. Testing ✅ Done Unit tests for scheduling/credit/duty done
8. Deferred Items 📋 Documented In-flight inference, subnet interaction, pruning — explicitly deferred

Completed Tasks

  • 1.1 Proto messages (MsgScheduleMaintenance, MsgCancelMaintenance, 6 query types)
  • 1.2 State types (MaintenanceReservation, MaintenanceState, transition schedule, start-height index)
  • 1.3 Governance parameters (7 params with defaults and validation)
  • 1.4 Keeper storage (5 collections with direct-keyed lookups)
  • 1.5 BeginBlock lifecycle state machine (Scheduled → Active → Completed)
  • 2.1 Scheduling validation (10+ checks: lead time, credit, overlap, concurrency, epoch-phase)
  • 2.2 Cancellation with credit restore
  • 2.3 Scheduling-availability preflight query
  • 2.4 Credit accrual in reward claim path (with epoch-usage suppression)
  • 2.5 Epoch maintenance usage marker (LastMaintenanceEpoch)
  • 2.6 Bounded overlap search using start-height index
  • 3.1 Patch Slashing Liveness Path
  • 3.2 Slashing adapter (MaintenanceSlashingAdapter: ConsAddress → AccAddress bridge)
  • 3.3 Collateral hook guards (defense-in-depth for jailing/slashing)
  • 3.4 Activation-time advisory re-check with warning persistence
  • 4.1 Suppress random inference assignment (filterOutMaintenanceParticipants)
  • 4.1a Epoch-critical phase scheduling rejection (PoC + DKG overlap)
  • 4.2 Inference expiry penalty waiver during active maintenance
  • 4.3 Validation duty exemption (skip hasSignificantMissedValidations penalty)
  • 4.4 CPoC duty exemption (weight penalty + ratio skip)
  • 5.1 All 6 query endpoints implemented
  • 5.2 Events and structured logs for schedule/cancel/activate/complete
  • 6.1 Upgrade handler (v0_2_12) with default params initialization
  • 6.2 Genesis export/import verified for maintenance state
  • 7.1 Unit tests for scheduling, credit, lifecycle, concurrency
  • 7.2 Unit tests for assignment filtering, maintenance status checks
  • 7.3 Cosmos SDK Fork Tests
  • 8.1–8.3 Deferred items documented

Bug Fixes Applied (latest commit)

  • Critical: Fixed off-by-one in completion transition height (window was 1 block too long)
  • Critical: Fixed broken bech32 address decoding in slashing adapter
  • High: Fixed Tokens.Int64() panic risk + integer overflow in power calculations
  • Medium: Fixed swallowed errors in cancel, missing param validation, silent credit grant failure, negative scanFrom

Copilot AI review requested due to automatic review settings April 2, 2026 03:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

WIP scaffolding for Issue #927 (maintenance windows for hosts/participants) that introduces the proto/API surface, params, and initial keeper storage/lifecycle plumbing for maintenance reservations in the inference-chain module.

Changes:

  • Added maintenance window proto definitions (messages, queries, params) and regenerated Go/gRPC/gateway bindings.
  • Introduced maintenance state storage (collections prefixes + keeper maps/items) and a BeginBlock-driven transition processor.
  • Added CLI/query wiring, message signer/permissions registration, and a detailed implementation task plan document.

Reviewed changes

Copilot reviewed 27 out of 31 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
proposals/maintenance-windows/maintenance-windows-todo.md Implementation task plan/checklist for the feature.
inference-chain/proto/inference/inference/maintenance.proto New maintenance state protos (reservation/state/transition + enums).
inference-chain/proto/inference/inference/tx.proto Adds ScheduleMaintenance/CancelMaintenance tx messages + service RPCs.
inference-chain/proto/inference/inference/query.proto Adds maintenance query RPCs + request/response types.
inference-chain/proto/inference/inference/params.proto Adds MaintenanceParams to global inference params.
inference-chain/x/inference/types/params.go Default/validation plumbing for MaintenanceParams.
inference-chain/x/inference/types/errors.go New maintenance-related error codes.
inference-chain/x/inference/types/keys.go New store prefixes for maintenance collections.
inference-chain/x/inference/types/logging.go Adds Maintenance subsystem for logging.
inference-chain/x/inference/types/message_signers.go Registers signers for new maintenance messages.
inference-chain/x/inference/keeper/keeper.go Adds maintenance collections to keeper schema.
inference-chain/x/inference/keeper/maintenance.go CRUD helpers + indexes for reservations/state/transitions.
inference-chain/x/inference/keeper/maintenance_lifecycle.go BeginBlock transition processor + activate/complete logic.
inference-chain/x/inference/module/module.go Calls transition processing from BeginBlock.
inference-chain/x/inference/keeper/query_maintenance.go Stub implementations for new maintenance query endpoints.
inference-chain/x/inference/keeper/msg_server_schedule_maintenance.go Stub ScheduleMaintenance msg server endpoint.
inference-chain/x/inference/keeper/msg_server_cancel_maintenance.go Stub CancelMaintenance msg server endpoint.
inference-chain/x/inference/keeper/permissions.go Adds permissions mapping for new maintenance messages.
inference-chain/x/inference/keeper/params.go Adds GetMaintenanceParams helper (default fallback).
inference-chain/x/inference/module/autocli.go Adds AutoCLI query commands for maintenance endpoints.
inference-chain/x/inference/types/query.pb.gw.go Regenerated grpc-gateway handlers for new queries.
inference-chain/api/inference/inference/query_grpc.pb.go Regenerated gRPC query bindings with new maintenance RPCs.
inference-chain/api/inference/inference/tx_grpc.pb.go Regenerated gRPC tx bindings with new maintenance RPCs.
inference-chain/x/inference/types/maintenance.pb.go Generated gogo-proto Go types for maintenance protos.
inference-chain/api/inference/inference/maintenance.pulsar.go Generated Pulsar Go types for maintenance protos.

Keep NextMaintenanceReservationID error handling as-is per project convention.

Ensure transition row is only deleted after successful activate/complete.
Implement scheduling, cancellation, credit logic and overlap validation
Specific Changes:

msg_server_schedule_maintenance.go: Full implementation of MsgScheduleMaintenance, including end-to-end validation for lead time, duration, credit requirements, concurrency limits, and epoch-phase alignment.

msg_server_cancel_maintenance.go: Implemented MsgCancelMaintenance, featuring credit restoration and cleanup of state transitions and indices.

maintenance_validation.go (New File): Centralized scheduling validation logic:

Overlap checks for Epoch PoC/DKG phases.

Concurrency limits and power cap verification.

Participant-specific overlap checks.

Bounded overlap search optimization.

query_maintenance.go: Implemented MaintenanceSchedulability preflight query for checking schedule feasibility.

maintenance.go: Added grantMaintenanceCredit to issue maintenance credits during the reward claim process.

msg_server_claim_rewards.go: Integrated maintenance credit issuance within the finishSettle flow.

maintenance_lifecycle.go: Updated activation logic to record LastMaintenanceEpoch, suppressing credit issuance during the activation epoch.
…ive hook guards

Description:

maintenance_slashing_adapter.go (New): Implemented a ConsAddress to AccAddress adapter to bridge the Slashing keeper's MaintenanceChecker interface with the Inference keeper’s maintenance status queries.

collateral/types/expected_keepers.go: Added the MaintenanceChecker interface definition.

collateral/keeper/keeper.go: Added maintenanceCheckerRef for lazy injection, a SetMaintenanceChecker setter, and the IsParticipantInActiveMaintenance method.

collateral/module/hooks.go: Integrated maintenance checks into AfterValidatorBeginUnbonding and BeforeValidatorSlashed to skip jailing or slashing for participants currently in maintenance (defense in depth).

app/app.go: Injected InferenceKeeper as the MaintenanceChecker for the collateral module.

keeper/maintenance_lifecycle.go: Full implementation of checkActivationTimeConcurrency to verify concurrent count/power caps and persist advisory warnings.
Task 4.1 — Suppress Random Inference Assignment

query_get_random_executor.go: Wrapped the filter function in GetRandomExecutor to additionally exclude maintenance-covered participants via filterOutMaintenanceParticipants
maintenance.go: Added IsParticipantAddressInActiveMaintenance (bech32 string convenience wrapper) and filterOutMaintenanceParticipants (filters []*group.GroupMember to remove maintenance participants)
Task 4.1a — Epoch-Critical Phase Scheduling Rejections: Already complete from Section 2 (checkEpochPhaseOverlap with ErrMaintenanceOverlapsPoCPhase/ErrMaintenanceOverlapsDKGPhase)

Task 4.2 — Waive Maintenance-Covered Inference Penalties

module.go: In handleExpiredInferenceWithContext, added maintenance check before MissedRequests++ — if executor is in active maintenance, the inference is expired with refund but no penalty
Task 4.3 — Suppress Validation Duties: No code changes needed. Validation duties are probabilistic (no "missed validation" penalty exists in the code). Offline maintenance participants simply don't submit validation txs, with no consequences.

Task 4.4 — Suppress CPoC Duties

confirmation_poc.go: In evaluateConfirmation, skip ConfirmationWeight = 0 penalty for maintenance participants who didn't submit batches. In checkConfirmationSlashing, skip ConfirmationPoCRatio computation for maintenance participants so they are not marked INACTIVE.
Task 5.1 — Implement Maintenance Queries (query_maintenance.go)

All 5 stub handlers replaced with full implementations

Task 5.2 — Events and Structured Logs

All required events and logs were already present from Sections 2-4. One additional event added(maintenance_penalty_waived	in module.go)
@tcharchian tcharchian moved this from Todo to In Progress in Upgrade v0.2.12 Apr 2, 2026
@tcharchian tcharchian added this to the v0.2.12 milestone Apr 2, 2026
@tcharchian tcharchian linked an issue Apr 2, 2026 that may be closed by this pull request
Task 6.1 — Add Upgrade Path for Maintenance Feature

Created upgrade v0.2.12:
- v0_2_12/constants.go: UpgradeName = "v0.2.12"
- v0_2_12/upgrades.go: CreateUpgradeHandler + initMaintenanceParams
  (feature disabled by default, zero credit for all participants)
- app/upgrades.go: registered upgrade handler, migration version 13
- module/module.go: bumped ConsensusVersion 13 → 14

Design decisions:
- Feature starts disabled (MaintenanceEnabled = false), governance enables
- No per-participant init needed — MaintenanceState lazily created via
  GetOrCreateMaintenanceState, missing entries default to zero credit
- Collections-based state (reservations, transitions, indexes) starts empty

Task 6.2 — Verify Genesis / Export Behavior

No code changes needed. Verified:
- InitGenesis: SetParams imports MaintenanceParams as part of Params proto
- ExportGenesis: GetParams exports MaintenanceParams as part of Params
- Maintenance collections not in GenesisState (correct) — fresh genesis
  starts with empty state per the proposal
…ions (Section 7)

Task 7.1 — Unit Tests for Maintenance Scheduling and Credit (maintenance_test.go)

26 new tests covering:
- Scheduling success/failure: TestScheduleMaintenance_Success, _Disabled,
  _InsufficientCredit, _InsufficientLeadTime, _DurationExceeded,
  _AlreadyScheduled, _ParticipantNotFound
- Cancellation: TestCancelMaintenance_Success, _NotFound, _NotScheduled,
  _CreditCapRespected
- Credit: TestCreditAccrual_BasicGrant, _CapEnforced
- Lifecycle: TestLifecycle_ActivateAndComplete,
  _NoTransitionsAtWrongHeight
- Scheduling availability: TestSchedulability_Success,
  _InsufficientCredit, _Disabled
- Queries: TestQueryMaintenanceCredit, _NotFound,
  TestQueryMaintenanceStatus, TestQueryMaintenanceScheduled,
  TestQueryMaintenanceActive

Task 7.2 — Unit Tests for Duty Exemptions (same file)

- TestIsParticipantInActiveMaintenance:
  full lifecycle (not in maintenance → scheduled → active → completed)
- TestIsParticipantAddressInActiveMaintenance:
  string address convenience wrapper + invalid address handling
- TestFilterOutMaintenanceParticipants:
  group member filtering for assignment suppression

Supporting files:
- maintenance_export_test.go: exports FilterOutMaintenanceParticipants
  and CreateTestGroupMembers for keeper_test package

Task 7.3 — SDK Fork Tests: deferred, requires SDK fork PR merge + go.mod update
Task 7.4 — Testermint E2E Tests: deferred, requires framework support for pausing participant execution
Section 8 — Deferred / Follow-Up Items

All three tasks are design-open with no implementation work in this version:

8.1 In-Flight Inference Semantics — Deferred
  Proposal explicitly leaves this as an open issue. Current implementation
  already covers the "current direction":
  - New assignments stop immediately (filterOutMaintenanceParticipants)
  - Penalties for disrupted work are waived (expiry penalty skip in
    handleExpiredInferenceWithContext)
  Full in-flight handling requires a separate design decision.

8.2 Subnet Interaction Design — Deferred
  Subnet functionality is still under development. Cannot implement until
  subnet semantics are defined.

8.3 Reservation Pruning — Deferred
  Not required for v1. Storage is already structured to support future pruning (reservations
  indexed by ID, completed/canceled statuses are tracked).
…g adapter, and duty exemptions

Problems found and fixes applied:

1. Off-by-one in completion transition height
   - Problem: endHeight was computed as `startHeight + durationBlocks` instead of
     `startHeight + durationBlocks - 1`, causing the maintenance window to last
     1 block longer than the credit paid for. The validation code used the correct
     inclusive formula `[start, start+duration-1]` but scheduling/cancel did not.
   - Fix: Align endHeight in msg_server_schedule_maintenance.go and
     msg_server_cancel_maintenance.go to use `- 1`, matching all other call sites.

2. Broken address decoding in slashing adapter
   - Problem: `sdk.ValAddress(v.GetOperator())` cast a bech32 string directly to
     raw bytes instead of decoding it via `ValAddressFromBech32`. This produced a
     garbage address that never matched any participant, making the entire slashing
     liveness exemption non-functional.
   - Fix: Use `sdk.ValAddressFromBech32(v.GetOperator())` to properly decode.

3. Potential chain-halting panic in power calculations
   - Problem: `v.Tokens.Int64()` panics if the math.Int value exceeds int64 range.
     Additionally, summing all validator tokens in `getTotalConsensusPower` could
     silently overflow int64, wrapping negative and breaking the power cap check.
   - Fix: Add `safeTokensToInt64()` that checks `IsInt64()` before converting.
     Use saturating addition in `getTotalConsensusPower` capped at math.MaxInt64.

4. Swallowed delete errors in CancelMaintenance
   - Problem: Three `DeleteMaintenanceTransition` / `DeleteMaintenanceStartHeightIndex`
     calls discarded errors with `_ =`. Failed deletes leave orphaned transition
     rows that could trigger phantom activations/completions in BeginBlock.
   - Fix: Propagate errors and return them to the caller.

5. Missing param validation for MaxConcurrentValidators
   - Problem: `MaintenanceParams.Validate()` did not check that
     `MaintenanceMaxConcurrentValidators > 0`. Setting it to 0 via governance
     would silently block all maintenance scheduling forever.
   - Fix: Add the `> 0` check in Validate().

6. Silent credit grant failure
   - Problem: `grantMaintenanceCredit` returned no error. If `SetMaintenanceState`
     failed, credit was silently lost with only a log message.
   - Fix: Change return type to `error` and propagate to caller.

7. Negative scanFrom in bounded overlap searches
   - Problem: `scanFrom = startHeight - maxWindowBlocks` could go negative near
     genesis, causing wasted iteration over negative heights in 4 call sites.
   - Fix: Clamp scanFrom to 0 in checkConcurrencyLimits, checkParticipantOverlap,
     checkActivationTimeConcurrency, and MaintenanceConcurrency query.

8. Missing validation duty exemption — Task 4.3
   - Problem: Participants in active maintenance were still subject to the
     `hasSignificantMissedValidations` penalty check during reward claims,
     causing their rewards to be denied for missing validations they were
     exempt from performing.
   - Fix: Check `LastMaintenanceEpoch` in `hasSignificantMissedValidations`
     and skip the penalty for maintenance-covered epochs.

9. Slashing adapter not wired into app
   - Problem: `MaintenanceSlashingAdapter` was defined but never connected to
     the slashing keeper. Requires SDK fork to add `SetMaintenanceChecker`.
   - Fix: Add TODO with ready-to-uncomment wiring code in app.go for when
     the cosmos-sdk fork (Task 3.1) lands.
Conflicts resolved:
- inference-chain/app/upgrades.go: adopted upstream's v0_2_11 handler
  signature (added app.BlsKeeper param) while keeping our v0_2_12 handler
- inference-chain/x/inference/types/tx.pb.go: regenerated via ignite
  (upstream added MsgSubmitVerificationVector, we added maintenance msgs)
- inference-chain/api/inference/inference/tx.pulsar.go: regenerated via
  ignite (same proto divergence as tx.pb.go)
@Ryanchen911 Ryanchen911 marked this pull request as ready for review April 3, 2026 10:12
@Ryanchen911
Copy link
Copy Markdown
Author

Hi @tcharchian @patimen ,since the task requires changes to the Cosmos SDK slashing module, I have opened a pull request here: gonka-ai/cosmos-sdk#12. Once this PR is approved and merged, I will be able to complete the remaining tasks(3.1, 7.3, 7.4).

…o v0.53.3-ps18

Complete task 3.1: integrate maintenance-aware liveness exemption.

- Update go.mod to gonka-ai/cosmos-sdk v0.53.3-ps18 (includes
  MaintenanceChecker interface in x/slashing)
- Wire MaintenanceSlashingAdapter into SlashingKeeper in app.go
- Fix MaintenanceSlashingAdapter to properly decode bech32 operator address
@tcharchian
Copy link
Copy Markdown
Collaborator

tcharchian commented Apr 7, 2026

Hi @Ryanchen911, gonka-ai/cosmos-sdk#12 and gonka-ai/cosmos-sdk#13 were merged.

Test code is written and ready: MaintenanceWindowTests.kt — 5 E2E tests
Files changed for 7.4:

-testermint/src/test/kotlin/MaintenanceWindowTests.kt (new — 5 tests)
-testermint/src/main/kotlin/data/Maintenance.kt (new — query/response types)
-testermint/src/main/kotlin/data/AppExport.kt (modified — added MaintenanceParams)
-decentralized-api/go.mod (updated cosmos-sdk to v0.53.3-ps19)
@Ryanchen911
Copy link
Copy Markdown
Author

Hi @tcharchian @patimen,
For task 7.4, I've written 5 Testermint E2E tests for maintenance windows (MaintenanceWindowTests.kt), but I can't run them locally — the Docker images are built for linux/amd64 and crash on Apple Silicon with fatal error: lfstack.push.
Could one of you comment /run-integration on this PR to trigger the CI? Thanks!

@tcharchian
Copy link
Copy Markdown
Collaborator

/run-integration

Resolve merge conflicts:
- decentralized-api/go.mod: take upstream grpc v1.80.0 and genproto versions
- inference-chain/app/upgrades/v0_2_12: merge both upstream (removeTopMiner,
  clearTrainingState, adjustParameters, DistrKeeper) and maintenance
  (initMaintenanceParams) upgrade logic
- inference-chain/x/inference/types/errors.go: upstream ErrEpochIndexOutOfRange
  takes code 1174, maintenance errors shifted to 1175-1187
- 6 proto generated files: regenerated via ignite generate proto-go

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Ryanchen911
Copy link
Copy Markdown
Author

hi @tcharchian ,I have resolved these conflicts, maybe you can send again, thanks

@patimen
Copy link
Copy Markdown
Collaborator

patimen commented Apr 9, 2026

/run-integration

Ryanchen911 and others added 2 commits April 10, 2026 16:36
Fixes critical security, consensus, and performance issues identified in
the AI review report for the maintenance windows feature.

Must Fix:
- ScheduleMaintenance now requires msg.Creator == msg.Participant.
  Without this check, anyone could drain another participant's credit
  and force them into maintenance.
- Replace float64 formatting with integer math in
  checkActivationTimeConcurrency. The previous %.1f%% formatting was
  persisted to state via ActivationWarning, risking AppHash mismatches
  across architectures.
- Eliminate O(N) validator-set scans on the slashing hot path and
  concurrency checks. MaintenanceSlashingAdapter now uses
  GetValidatorByConsAddr (O(1)); getParticipantPower uses GetValidator;
  getTotalConsensusPower uses GetLastTotalPower. Extends StakingKeeper
  expected interface accordingly.

Major:
- Switch all power math to cosmossdk_math.Int to eliminate two int64
  overflow paths: large validator tokens silently returning 0, and
  totalPower * bps wrapping negative.
- MaintenanceActive query now iterates a new MaintenanceActiveIndex
  KeySet (prefix 58) instead of scanning every participant's state.
  O(A) where A is bounded by MaintenanceMaxConcurrentValidators.
- query_get_random_executor.go now uses k.LogError/LogInfo with
  types.EpochGroup category instead of k.Logger().

Review Carefully:
- Fix off-by-one in maintenance window completion. The COMPLETE
  transition now fires at startHeight+DurationBlocks (the block AFTER
  the last covered block) so the active duration is exactly
  DurationBlocks. Schedule and Cancel updated together; overlap
  checks (which use inclusive last-covered-block) unchanged. Tests
  updated.
- filterOutMaintenanceParticipants now collects active addresses once
  via the new index, then does O(1) lookup per member instead of one
  collection read per member.
- Move grantMaintenanceCredit from msgServer to Keeper as
  GrantMaintenanceCredit (reusable from BeginBlock/other modules) and
  return the bech32 decode error instead of swallowing it.
- Refactor MaintenanceWindowTests.kt: extract enableMaintenance helper
  and defaultMaintenanceParams, share a lazy DockerClient across the
  test class, replace !! after isNotNull with checkNotNull.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts:
- 4 proto-generated files (params.pulsar.go, tx.pulsar.go, params.pb.go,
  tx.pb.go): regenerated via 'ignite generate proto-go' so the merged
  proto definitions (which include both upstream's new fields and our
  MaintenanceParams) are reflected consistently.

All other files merged automatically. Build passes; full test suite
green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Ryanchen911 Ryanchen911 changed the title WIP: Implementing maintenance windows cleaner Implementing maintenance windows Apr 10, 2026
@Ryanchen911
Copy link
Copy Markdown
Author

Hi @tcharchian ,saw this was removed from the v0.2.12 project — is the plan to retarget v0.2.13, or is there a specific blocker you'd like me to address?

@tcharchian
Copy link
Copy Markdown
Collaborator

Sorry, @Ryanchen911, I missed that you had already implemented the changes requested by @patimen. I may have mistakenly assumed that we wouldn’t make it in time with this pull request. I’ve already asked @patimen to take a look at the updates, and if we’re able to review everything in time and no further changes are needed, we will of course include it in the upgrade.

Resolve conflicts:
- x/inference/types/keys.go: upstream added BridgeMintRefunds /
  BridgeWithdrawalRefunds / BridgeWithdrawalTokenRefs at prefixes
  53-55. Shift the maintenance collection prefixes to 56-61 to avoid
  overlap (previous assignment was 53-58, unshipped so safe to move).
- app/upgrades/v0_2_12/upgrades.go: keep both upstream's
  adjustBLSParameters call and our initMaintenanceParams call, in
  that order.
- api/inference/inference/tx.pulsar.go, x/inference/types/tx.pb.go:
  regenerated via 'ignite generate proto-go' so the merged proto
  definitions are reflected consistently.

Build passes; full test suite green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tcharchian tcharchian moved this from Todo to Needs reviewer in Upgrade v0.2.12 Apr 11, 2026
@tcharchian tcharchian changed the title Implementing maintenance windows [P1] Implementing maintenance windows Apr 11, 2026
Ryanchen911 and others added 4 commits April 12, 2026 08:51
Resolve conflicts:
- api/inference/inference/params.pulsar.go, x/inference/types/params.pb.go:
  regenerated via 'ignite generate proto-go' so the merged proto
  definitions (upstream PoC v2 RNG params + our MaintenanceParams) are
  reflected consistently.

Build passes; full test suite green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…intenance tests

Replace early `return` with `assumeTrue()` so skipped tests are reported
as Skipped instead of silently passing with no assertions. Also derive
block height offsets from `defaultMaintenanceParams` to avoid brittle
hardcoded magic numbers.
Resolve conflicts from upstream PR gonka-ai#981 (Transaction fees):
- proto/inference/inference/params.proto: upstream FeeParams takes
  field 15; shift MaintenanceParams from 15 to 16 to avoid proto
  field number collision.
- app/upgrades/v0_2_12/upgrades.go: keep both upstream's setFeeParams
  + migrateFeegrantsForFees and our initMaintenanceParams, in that
  order. Keep upstream's new authzKeeper/feegrantKeeper parameters.
- x/inference/types/params.go: keep both FeeParams and
  MaintenanceParams validation blocks.
- testermint/src/main/kotlin/data/AppExport.kt: keep both feeParams
  and maintenanceParams fields + FeeParamsData class.
- api/inference/inference/params.pulsar.go, x/inference/types/params.pb.go:
  regenerated via 'ignite generate proto-go'.

Build passes; full test suite green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tcharchian tcharchian modified the milestones: v0.2.12, v0.2.13 Apr 15, 2026
Ryanchen911 and others added 3 commits April 15, 2026 11:09
…mprovements

Address issues identified in AI code review report for maintenance windows:

Performance (BeginBlock/EndBlock):
- Remove MaintenanceStartHeightIndex and its unbounded state growth; replace
  overlap/concurrency checks with bounded iteration over MaintenanceStates
  (active + scheduled reservations only)
- Pre-build maintenance address map once per EndBlock instead of O(log N)
  lookups per validator in evaluateConfirmation, checkConfirmationSlashing,
  and handleExpiredInferenceWithContext

Correctness:
- NextMaintenanceReservationID now checks for collections.ErrNotFound
  specifically; other errors propagate instead of silently resetting counter
  to 0 (prevents ID collisions on DB errors)
- Replace raw fmt.Errorf with registered errors (ErrMaintenanceInvalidParticipant,
  ErrMaintenanceZeroDuration) for deterministic ABCI error codes

Code quality:
- Use strings.Join for warning concatenation in checkActivationTimeConcurrency
- Extract maxFutureEpochsToCheck constant (was magic number 5)
- Add error wrapping context in keeper boundary methods

Test improvements (Go):
- Convert ScheduleMaintenance and Schedulability failure tests to table-driven
- Consolidate IsParticipantInActiveMaintenance tests using t.Run subtests
- Add t.Parallel() to all independent tests
- Clarify StakingKeeper mock comments explaining zero-power behavior

Test improvements (Kotlin):
- Extract awaitMinimumCredit() and nodeContainerFor() helpers to reduce
  duplication
- Replace assumeTrue(false, msg) with idiomatic assumeTrue(condition) { msg }
Resolve conflicts between maintenance windows (ours) and Multi Model
POC + Subnet split (upstream). Both feature sets are preserved:
- Maintenance windows: params field 16, key prefixes 64-68, error codes 1177-1191
- Delegation/PoC: params field 17, key prefixes 48-63, error codes 1175-1176
- Proto-generated Go files regenerated via ignite generate proto-go

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@Doog-bot534 Doog-bot534 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [P1] Implementing maintenance windows

Approve with suggestions

Large, well-structured feature. Credit accrual, lifecycle transitions (Scheduled → Active → Completed/Canceled), concurrency caps, epoch-phase overlap protection, and executor exclusion are all correctly implemented.

Key strengths

  • Credit deduction at scheduling with refund on cancellation
  • Activation-time concurrency check is advisory-only (avoids stuck reservations)
  • filterOutMaintenanceParticipants wraps the existing filter chain cleanly
  • Transition at startHeight + durationBlocks ensures full window honored
  • Comprehensive 700-line test file

Potential issues

  1. Proto field number change: DelegationParams moved from field 16 to 17, MaintenanceParams takes 16. This is a wire-breaking change — verify the v0_2_12 upgrade handler migrates this correctly.

  2. No completed reservation cleanup: MaintenanceReservations stores all reservations forever. Consider adding pruning to prevent unbounded state growth.

  3. Hardcoded maxFutureEpochsToCheck = 5: If epoch length changes via governance, this constant may be insufficient for long maintenance windows. Consider making it a governance param.

  4. CancelMaintenance permissions: Creator check in cancel is redundant since Creator == Participant is enforced at scheduling, but harmless.

Suggestions

  • Confirm upgrade handler for field number migration
  • Add reservation pruning mechanism
  • Consider ParticipantPermission instead of AccountPermission for ScheduleMaintenance

Payout address: gonka10zaal553duxp05nvfpqtsqrm2g0j6j34r8nan7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[P1] Maintenance window for hosts

5 participants