[ARO-25464] Genericise the Monitor bucketing to work for MIMO as well#4718
[ARO-25464] Genericise the Monitor bucketing to work for MIMO as well#4718
Conversation
There was a problem hiding this comment.
Pull request overview
This PR replaces the monitor-specific master/worker bucket allocation with a generic “pool worker” leasing + bucket-balancing mechanism, then reuses it for MIMO (scheduler/actuator) so instances only act on buckets they own.
Changes:
- Introduces
PoolWorkerDocument+database.PoolWorkersand a shared bucket balancer loop inpkg/util/buckets. - Migrates
pkg/monitorbucket coordination fromMonitorsDB to the newPoolWorkersDB. - Updates MIMO scheduler/actuator to use bucket coordination instead of hostname-based static partitioning, and adds Cosmos container + trigger deployments for PoolWorkers.
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/util/buckets/balancer.go |
New generic bucket coordination loop + balancing logic for pool workers. |
pkg/util/buckets/cache.go / pkg/util/buckets/buckets.go |
Bucket worker behavior tweaks (e.g., bucket -1 served by all) and bucket change handling. |
pkg/database/poolworkers.go |
New PoolWorkers DB wrapper and Cosmos queries for leasing/workers/buckets. |
pkg/api/poolworker*.go |
New API types for PoolWorker documents and worker types. |
pkg/monitor/* |
Migrates monitor bucket ownership to PoolWorkers and wires into generic loop. |
pkg/mimo/scheduler/* |
Switches scheduler to coordinated buckets; adds bucket selector data and filtering. |
pkg/mimo/actuator/* |
Switches actuator to coordinated buckets; removes hostname partitioning. |
cmd/aro/* |
Wires PoolWorkers DB into monitor / mimo services. |
pkg/deploy/* |
Adds PoolWorkers Cosmos container + renewLease trigger to generator and baked assets. |
test/database/* |
Adds fake PoolWorkers client wiring for tests. |
pkg/util/buckets/balancer_test.go |
Moves/updates balancing tests to validate PoolWorker bucket balancing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
27c9366 to
9db8036
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 38 out of 38 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
99b1b64 to
0b6e7fe
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 38 out of 38 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if err != nil || doc == nil || doc.PoolWorker == nil { | ||
| return nil, err | ||
| } |
There was a problem hiding this comment.
ListBuckets returns (nil, nil) when the master document doesn't exist yet or when doc.PoolWorker is still nil. Callers treat a nil error as success, which can lead to confusing log spam/behavior during bootstrap. Consider returning a non-nil error (e.g., "bucket allocation not initialized") when doc == nil or doc.PoolWorker == nil, so the caller can handle it explicitly.
| if err != nil || doc == nil || doc.PoolWorker == nil { | |
| return nil, err | |
| } | |
| if err != nil { | |
| return nil, err | |
| } | |
| if doc == nil || doc.PoolWorker == nil { | |
| return nil, fmt.Errorf("bucket allocation not initialized") | |
| } |
Which issue this PR addresses:
https://redhat.atlassian.net/browse/ARO-25464
Fixes some of the problems we've noticed in recent incident where MIMO works against clusters it doesn't have to.
What this PR does / why we need it:
Genericises the Monitor logic for master and bucket allocation, moves it to another name, and then reuses that in MIMO.
Test plan for issue:
CI, E2E
Is there any documentation that needs to be updated for this PR?
Possibly wrt the Monitor dbs?
How do you know this will function as expected in production?
E2E, hopefully :)