Re-work how we select machines for CI... and friends #4619
tuxerrante wants to merge 11 commits into master
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: tuxerrante
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
ecdae06 to bbb8c4e
Please rebase pull request.
Current rebased stack has now been force-pushed onto the PR head branch.
Current status
I reviewed the rebased stack locally, added focused tests, and split the follow-up fixes into separate commits. Fresh local verification run on the current rebased branch:
Commits on the rebased stack
Existing main change stack:
New separate follow-up commits:
Possible leftovers / follow-ups
The main remaining behavior gap I still see is that admin master-size validation is now CI-aware, but it is still allowlist-based rather than fully version-aware like
Other follow-up candidates or residual gaps:
Risks
Suggested split into spin-off PRs
Actionable cherry-pick order:
Reviewer-friendly grouping:
Pull request overview
This PR reworks how VM sizes are selected and validated for CI/E2E cluster creation by removing the RequireD2sWorkers feature flag, centralizing VM-size metadata/constants under pkg/api/util/vms, and threading an isCI signal into static validation so CI can use additional (testing) sizes.
Changes:
- Introduces pkg/api/util/vms as the single source of truth for VM size constants, metadata (family/cores/min OCP version), and CI candidate lists with tiered randomization.
- Replaces requireD2sWorkers with an isCI parameter across static validators and frontend paths, and updates admin endpoints/validation to optionally include CI/testing sizes.
- Updates CI/local cluster creation logic and broad test suites to use vms.VMSize and the new validation/selection behavior; removes the FeatureRequireD2sWorkers enum and related references; updates docs accordingly.
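The isCI idea in the list above can be illustrated with a minimal sketch. It is not the code from pkg/api/util/vms: the table contents, the SizeInfo struct, and the helper names supportedWorkerSizes and workerSizeIsValid are all assumptions made for illustration. The only point it shows is that testing sizes become valid when, and only when, the caller passes isCI=true.

```go
package main

import "fmt"

// VMSize is a string-backed type, in the spirit of vms.VMSize.
type VMSize string

// SizeInfo is hypothetical per-size metadata used only in this sketch.
type SizeInfo struct {
	Family string
	Cores  int
}

// Illustrative tables; the real supported-size maps in the PR may differ.
var productionWorkerSizes = map[VMSize]SizeInfo{
	"Standard_D4s_v3": {Family: "standardDSv3Family", Cores: 4},
	"Standard_D8s_v3": {Family: "standardDSv3Family", Cores: 8},
}

var testingWorkerSizes = map[VMSize]SizeInfo{
	"Standard_D2s_v3": {Family: "standardDSv3Family", Cores: 2},
}

// supportedWorkerSizes returns the production sizes, plus the testing
// sizes when isCI is true. It builds a fresh map so callers cannot
// mutate the package-level tables.
func supportedWorkerSizes(isCI bool) map[VMSize]SizeInfo {
	out := make(map[VMSize]SizeInfo, len(productionWorkerSizes)+len(testingWorkerSizes))
	for size, info := range productionWorkerSizes {
		out[size] = info
	}
	if isCI {
		for size, info := range testingWorkerSizes {
			out[size] = info
		}
	}
	return out
}

// workerSizeIsValid reports whether size is accepted for the given mode.
func workerSizeIsValid(size VMSize, isCI bool) bool {
	_, ok := supportedWorkerSizes(isCI)[size]
	return ok
}

func main() {
	for _, isCI := range []bool{false, true} {
		fmt.Printf("isCI=%v Standard_D2s_v3 valid: %v\n",
			isCI, workerSizeIsValid("Standard_D2s_v3", isCI))
	}
}
```

Passing a plain bool keeps the validators free of any dependency on the environment package; callers decide once (for example via env.IsCI()) and thread the result through.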
Reviewed changes
Copilot reviewed 74 out of 74 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/util/clusterdata/worker_profile_test.go | Updates worker-profile enrichment tests to use vms.VMSize and adds provider-spec coercion/security/DES coverage. |
| pkg/util/clusterdata/worker_profile.go | Switches worker profile VMSize enrichment to vms.VMSize. |
| pkg/util/cluster/delete_roleassignments_test.go | Adds regression test ensuring SP role-assignment cleanup is skipped when workload identity is enabled. |
| pkg/util/cluster/cluster_config_test.go | Adds tests for CI candidate VM-size defaults and explicit size candidates; validates WI role-set requirement. |
| pkg/util/cluster/cluster.go | Uses vms.VMSize in config, selects/shuffles candidate sizes via vms helpers, and renames MI/WI role assignment deletion helper. |
| pkg/operator/controllers/machine/machine.go | Adds VM-size validation for machine provider specs using validate.VMSizeIsValid. |
| pkg/frontend/validate_test.go | Extends admin master-size validation tests to include CI behavior. |
| pkg/frontend/validate.go | Switches admin master-size validation to use vms maps and adds CI-aware supported-size selection helper. |
| pkg/frontend/sku_test.go | Updates SKU validation tests to use vms.VMSize. |
| pkg/frontend/shared_test.go | Removes RequireD2sWorkers feature usage and adds IsCI() expectation to env mock. |
| pkg/frontend/quota_validation.go | Updates quota accounting to accept vms.VMSize and uses Family.String() keys. |
| pkg/frontend/openshiftcluster_putorpatch_test.go | Updates test documents/types to use vms.VMSize (including admin API shapes). |
| pkg/frontend/openshiftcluster_putorpatch.go | Updates static validation call signature to pass env.IsCI() instead of feature-flag. |
| pkg/frontend/openshiftcluster_preflightvalidation_test.go | Updates preflight payload tests to use vms.VMSize. |
| pkg/frontend/openshiftcluster_preflightvalidation.go | Updates static validation call signature to pass env.IsCI(). |
| pkg/frontend/admin_supportvmsizes_list.go | Switches supported-size admin listing to vms + CI-aware selection. |
| pkg/frontend/admin_supportedvmsizes_list_test.go | Updates tests for vms maps and adds CI-only inclusion test using mocked env. |
| pkg/frontend/admin_openshiftcluster_vmresize_pre_validation_test.go | Updates resize pre-validation tests to use vms.VMSize. |
| pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go | Uses vms.VMSize in quota lookups and makes admin validation CI-aware. |
| pkg/frontend/admin_openshiftcluster_vmresize.go | Makes admin master-size validation CI-aware. |
| pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go | Updates resize control plane tests to use vms.VMSize. |
| pkg/frontend/admin_openshiftcluster_resize_controlplane.go | Makes admin master-size validation CI-aware. |
| pkg/env/zz_generated_feature_enumer.go | Regenerates feature enum after removing FeatureRequireD2sWorkers. |
| pkg/env/env.go | Removes FeatureRequireD2sWorkers constant. |
| pkg/env/dev.go | Removes FeatureRequireD2sWorkers from dev feature list. |
| pkg/deploy/devconfig.go | Removes RequireD2sWorkers from dev deployment feature set. |
| pkg/cluster/validate_test.go | Updates tests to use vms.VMSize. |
| pkg/cluster/loadbalancerinternal_test.go | Updates tests to use vms.VMSize. |
| pkg/api/validate/vm.go | Refactors VM-size validation to use vms maps and adds CI-aware size selection + minimum-version enforcement. |
| pkg/api/v20250725/openshiftcluster_validatestatic_test.go | Replaces requireD2sWorkers with isCI in static validation tests; adds CI-only VM-size test cases. |
| pkg/api/v20250725/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize for checks. |
| pkg/api/v20250725/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize for internal representation. |
| pkg/api/v20240812preview/openshiftcluster_validatestatic_test.go | Same isCI refactor + CI-only VM-size tests for this API version. |
| pkg/api/v20240812preview/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20240812preview/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/v20231122/openshiftcluster_validatestatic_test.go | Same isCI refactor + CI-only VM-size tests for this API version. |
| pkg/api/v20231122/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20231122/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/v20230904/openshiftcluster_validatestatic_test.go | Same isCI refactor + CI-only VM-size tests for this API version. |
| pkg/api/v20230904/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20230904/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/v20230701preview/openshiftcluster_validatestatic_test.go | Same isCI refactor for this API version. |
| pkg/api/v20230701preview/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20230701preview/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/v20230401/openshiftcluster_validatestatic_test.go | Same isCI refactor for this API version. |
| pkg/api/v20230401/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20230401/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/v20220904/openshiftcluster_validatestatic_test.go | Same isCI refactor for this API version. |
| pkg/api/v20220904/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20220904/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/v20220401/openshiftcluster_validatestatic_test.go | Same isCI refactor for this API version. |
| pkg/api/v20220401/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20220401/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/v20210901preview/openshiftcluster_validatestatic_test.go | Same isCI refactor for this API version. |
| pkg/api/v20210901preview/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20210901preview/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/v20200430/openshiftcluster_validatestatic_test.go | Same isCI refactor for this API version. |
| pkg/api/v20200430/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20200430/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/v20191231preview/openshiftcluster_validatestatic_test.go | Same isCI refactor for this API version. |
| pkg/api/v20191231preview/openshiftcluster_validatestatic.go | Threads isCI through static validation and uses vms.VMSize. |
| pkg/api/v20191231preview/openshiftcluster_convert.go | Casts external vmSize strings into vms.VMSize. |
| pkg/api/util/vms/types.go | Adds new shared VM-size/role/types + metadata structures (including minimum version handling). |
| pkg/api/util/vms/sizes_test.go | Adds unit tests for size lookup and CI candidate sets. |
| pkg/api/util/vms/sizes.go | Adds centralized supported size maps + CI candidate selection with tiered shuffling. |
| pkg/api/register.go | Updates static validator interface signature to accept isCI. |
| pkg/api/openshiftclusterdocument_example.go | Updates example cluster document to use vms.VMSize. |
| pkg/api/openshiftcluster.go | Migrates core API types from api.VMSize to vms.VMSize and removes duplicated VM-size constants/structs. |
| pkg/api/admin/openshiftcluster_validatestatic_test.go | Updates admin static validator signature to isCI and adds parity tests. |
| pkg/api/admin/openshiftcluster_validatestatic.go | Updates admin static validator signature to accept isCI. |
| pkg/api/admin/openshiftcluster_convert.go | Removes unnecessary VMSize casts by using vms.VMSize directly. |
| pkg/api/admin/openshiftcluster.go | Migrates admin API types from admin.VMSize to vms.VMSize and removes duplicated constants. |
| docs/adding-new-instance-types.md | Updates guidance to reflect removal of FeatureRequireD2sWorkers. |
tuxerrante left a comment
@copilot review
Follow-up for the remaining operator machine-controller CI parity gap is now in #4780:
This is the leftover called out in the earlier status comment:
Code Review
Reviewed the current rebased stack at
Previously flagged issues (all resolved ✅)
Known follow-up (not blocking this PR)
Remaining observations
Overall the PR is in good shape. The only actionable item I'd suggest addressing before merge is adding
The CI failure in that run was actually a missing Added
Please rebase pull request. |
Introduces pkg/api/util/vms/ with canonical VMSize type, size constants, supported size maps (production and testing/CI), and CI candidate selection via shuffleByCoreTier to spread quota pressure across families.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
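One way to picture tiered shuffling is the sketch below. It is an assumption about the general technique, not the PR's shuffleByCoreTier: the candidate struct, the function name shuffleWithinCoreTiers, and the sample sizes are hypothetical. Candidates are ordered by ascending core count (smaller machines first), and the order inside each core-count tier is randomized so that repeated CI runs do not always hit the same family's quota.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// VMSize is a string-backed type, in the spirit of vms.VMSize.
type VMSize string

// candidate is a hypothetical pairing of a size with its core count.
type candidate struct {
	Size  VMSize
	Cores int
}

// newSeededRNG returns a deterministic random source; real code would
// more likely seed from the clock or use the global source.
func newSeededRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// shuffleWithinCoreTiers returns a copy of in, ordered by ascending core
// count, with a random order among candidates that share a core count.
// It shuffles everything first and then applies a stable sort on Cores,
// which preserves the shuffled order inside each tier.
func shuffleWithinCoreTiers(in []candidate, rng *rand.Rand) []candidate {
	out := append([]candidate(nil), in...)
	rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	sort.SliceStable(out, func(i, j int) bool { return out[i].Cores < out[j].Cores })
	return out
}

func main() {
	in := []candidate{
		{Size: "Standard_D8s_v3", Cores: 8},
		{Size: "Standard_D4s_v3", Cores: 4},
		{Size: "Standard_E4s_v3", Cores: 4},
		{Size: "Standard_E8s_v3", Cores: 8},
	}
	for _, c := range shuffleWithinCoreTiers(in, newSeededRNG(42)) {
		fmt.Println(c.Cores, c.Size)
	}
}
```

A caller would then try the candidates in order and fall back to the next one when a size is unavailable or out of quota.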
Move VM size types and constants from api/openshiftcluster.go and admin/openshiftcluster.go into the new vms package. Simplify validate/vm.go by delegating to vms maps. Update Static() validator interface to replace requireD2sWorkers bool with isCI bool, and update all 11 API version implementations, convert files, and tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… IsCI
Remove the FeatureRequireD2sWorkers feature flag and replace all callers with env.IsCI(). Update frontend, operator, cluster tooling, and clusterdata packages to use vms.VMSize types and the new Static() validator signature. Use vms.GetCICandidateMasterVMSizes() and vms.GetCICandidateWorkerVMSizes() with shuffle-by-core-tier for cost-effective quota spreading in CI.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update adding-new-instance-types.md to point to pkg/api/util/vms/ as the new location for VM size definitions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cover the centralized VM-size helpers, CI-only validator paths, and cluster config/enricher behavior so the rebased stack has stronger regression protection before the follow-up fixes.
Made-with: Cursor
Restore the previous fail-closed behavior so malformed or empty cluster versions cannot bypass VM-size validation for otherwise-supported SKUs.
Made-with: Cursor
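What "fail-closed" means here can be sketched as follows. This is a simplified stand-in, not validate.VMSizeIsValidForVersion: the minimum versions in the table, the majorMinor struct, and the helper names are hypothetical. The property being illustrated is that an empty or unparseable cluster version causes rejection rather than skipping the minimum-version check.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// VMSize is a string-backed type, in the spirit of vms.VMSize.
type VMSize string

// majorMinor is a minimal version representation for this sketch.
type majorMinor struct{ Major, Minor int }

// Hypothetical minimum cluster versions per size; not the real table.
var minVersionBySize = map[VMSize]majorMinor{
	"Standard_D8s_v3": {Major: 4, Minor: 0},
	"Standard_D8s_v5": {Major: 4, Minor: 11},
}

// parseMajorMinor extracts "X.Y" from strings such as "4.14.3".
func parseMajorMinor(v string) (majorMinor, error) {
	parts := strings.Split(v, ".")
	if len(parts) < 2 {
		return majorMinor{}, fmt.Errorf("malformed version %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return majorMinor{}, fmt.Errorf("malformed major in %q: %w", v, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return majorMinor{}, fmt.Errorf("malformed minor in %q: %w", v, err)
	}
	return majorMinor{Major: major, Minor: minor}, nil
}

// sizeIsValidForVersion fails closed: unknown sizes and unparseable
// versions are both rejected.
func sizeIsValidForVersion(size VMSize, clusterVersion string) bool {
	floor, ok := minVersionBySize[size]
	if !ok {
		return false
	}
	v, err := parseMajorMinor(clusterVersion)
	if err != nil {
		return false // fail closed instead of skipping the version check
	}
	if v.Major != floor.Major {
		return v.Major > floor.Major
	}
	return v.Minor >= floor.Minor
}

func main() {
	for _, version := range []string{"4.14.3", "4.10.0", "", "garbage"} {
		fmt.Printf("%q -> %v\n", version, sizeIsValidForVersion("Standard_D8s_v5", version))
	}
}
```

The fail-open alternative (treating a parse error as "no version constraint") is what would let a malformed version bypass the check for an otherwise-supported SKU.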
Restore the workload-identity early return so delete flows do not dereference service-principal clients for clusters that no longer use that cleanup path.
Made-with: Cursor
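The shape of such an early return might look like the sketch below. Every name in it (cluster, roleAssignmentsClient, DeleteForPrincipal, recordingClient) is hypothetical and chosen only to illustrate the guard; the real helper in pkg/util/cluster is not reproduced here. The assumption is that workload-identity clusters never initialise the service-principal client, so the cleanup must return before touching it.

```go
package main

import (
	"errors"
	"fmt"
)

// roleAssignmentsClient is a hypothetical stand-in for the client used
// to clean up service-principal role assignments.
type roleAssignmentsClient interface {
	DeleteForPrincipal(principalID string) error
}

// cluster is a hypothetical, minimal view of the delete-flow state.
type cluster struct {
	usesWorkloadIdentity bool
	spRoleAssignments    roleAssignmentsClient // left nil for workload-identity clusters
	spObjectID           string
}

// deleteServicePrincipalRoleAssignments returns early for
// workload-identity clusters so the nil client is never dereferenced.
func (c *cluster) deleteServicePrincipalRoleAssignments() error {
	if c.usesWorkloadIdentity {
		return nil // nothing to clean up on this path
	}
	if c.spRoleAssignments == nil {
		return errors.New("service principal client is not initialised")
	}
	return c.spRoleAssignments.DeleteForPrincipal(c.spObjectID)
}

// recordingClient records calls so the behavior can be observed.
type recordingClient struct{ deleted []string }

func (r *recordingClient) DeleteForPrincipal(id string) error {
	r.deleted = append(r.deleted, id)
	return nil
}

func main() {
	wi := &cluster{usesWorkloadIdentity: true}
	fmt.Println("workload identity:", wi.deleteServicePrincipalRoleAssignments())

	rec := &recordingClient{}
	sp := &cluster{spRoleAssignments: rec, spObjectID: "sp-object-id"}
	fmt.Println("service principal:", sp.deleteServicePrincipalRoleAssignments(), rec.deleted)
}
```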
Use the testing VM-size tables for admin validation and discovery in CI so resize and preflight endpoints stay aligned with the create/update paths on this rebased stack.
Made-with: Cursor
Move the copyright/license header before imports to match repo convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove outdated MAITIU TODO (fields are actively used) and fix "roleassignments" → "role assignments" spacing in log message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…verability
Agent-Logs-Url: https://github.com/Azure/ARO-RP/sessions/480f5a36-6a5e-43f9-b25b-b96f4ed2bd6f
Co-authored-by: tuxerrante <8364469+tuxerrante@users.noreply.github.com>
3139ec8 to 287f6d6
Pull request overview
Copilot reviewed 75 out of 75 changed files in this pull request and generated no new comments.
I haven't done a real review yet, but I do have one overarching question/suggestion:
What do you think?
What this PR does / why we need it:
ARO-24603
This PR increases the usable quota for our E2E tests. Because we now run twice as many E2E tests (CSP and MI), this is required to avoid toil and the overhead of manually re-running suites.
It does this by:
pkg/api/util/vms
Background on how we found this:
We (@mrWinston & I) discovered this initially when seeing this error
On investigation we saw that the size was valid, but we were setting this RPFeatureFlag and that was failing validation. This meant that we couldn't use the other families of machines (and use our Azure capacity to its fullest) with the current setup. Rather than add more code and clauses, we decided to kill the flag and trust ourselves to be responsible and not spin up very large machines.
Further improvements:
Lessons learned
This PR surfaced several non-obvious architectural boundaries in the codebase:
- Three distinct VMSize types coexist: api.VMSize (internal), vms.VMSize (utility/admin), and local VMSize (per external API version). They are all string under the hood, but Go's type system treats them as incompatible, so every boundary crossing needs an explicit cast. Missing a single cast produces compile errors that can cascade across 11 API versions.
- Admin ≠ External: The pkg/api/admin/ package is an internal API with mutual TLS auth, not customer-facing. It correctly uses vms.VMSize directly because it doesn't need to match a swagger contract. Applying the "restore local VMSize" pattern to admin would have been wrong; the admin convert file only needed unnecessary cast removal.
- Conversion files are the bridge: The _convert.go files in each API version are where the type boundary lives. ToExternal casts vms.VMSize → local VMSize, ToInternal casts local VMSize → vms.VMSize. Getting the direction wrong in even one file breaks the build.
- Validation function signatures evolve: validate.VMSizeIsValidForVersion gained a requireD2sWorkers parameter on master. During rebase, this caused conflicts that needed careful resolution; taking the wrong side silently produced the wrong function signature.
- client-generate is destructive: Running make client-generate without a working Docker autorest image deletes all generated SDK files before failing, breaking unit tests. Recovery: git checkout -- pkg/client/ python/client/.
Test plan for issue:
Is there any documentation that needs to be updated for this PR?
Yes, and they're done as part of this PR.
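As an illustration of the VMSize type boundary described under "Lessons learned", here is a minimal sketch of why explicit casts are needed in the conversion files. The type and function names (internalVMSize, externalVMSize, toExternal, toInternal) are stand-ins invented for this example, not the identifiers used in the repository.

```go
package main

import "fmt"

// internalVMSize stands in for the shared vms.VMSize type.
type internalVMSize string

// externalVMSize stands in for the local VMSize declared in one
// versioned API package.
type externalVMSize string

type internalWorkerProfile struct{ VMSize internalVMSize }

type externalWorkerProfile struct{ VMSize externalVMSize }

// toExternal converts internal -> external. Writing `VMSize: in.VMSize`
// without the conversion would not compile, even though both types have
// string as their underlying type.
func toExternal(in internalWorkerProfile) externalWorkerProfile {
	return externalWorkerProfile{VMSize: externalVMSize(in.VMSize)}
}

// toInternal converts external -> internal, the opposite direction.
func toInternal(ext externalWorkerProfile) internalWorkerProfile {
	return internalWorkerProfile{VMSize: internalVMSize(ext.VMSize)}
}

func main() {
	in := internalWorkerProfile{VMSize: "Standard_D4s_v3"}
	ext := toExternal(in)
	back := toInternal(ext)
	fmt.Println(ext.VMSize, back.VMSize, back == in)
}
```

Because the conversion is a compile-time requirement, a missed cast shows up as a build error in every API version that shares the pattern, which matches the cascade described above.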