Skip to content

Re-work how we select machines for CI... and friends#4619

Open
tuxerrante wants to merge 11 commits intomasterfrom
tuxerrante/min-vm-sizes-on-top-of-4594
Open

Re-work how we select machines for CI... and friends#4619
tuxerrante wants to merge 11 commits intomasterfrom
tuxerrante/min-vm-sizes-on-top-of-4594

Conversation

@tuxerrante
Copy link
Copy Markdown
Collaborator

@tuxerrante tuxerrante commented Feb 19, 2026

What this PR does / why we need it:

ARO-24603

This PR increases the usable quota for our E2E tests. Given that we now run double the amount of E2E tests (CSP and MI) this is required to avoid toil and the overhead of manually re-running suites.

It does this by:

  • removing the D2sWorker flag and allows the good work done here by @bizz001 to be realized.
  • centralising all the VMsizes in pkg/api/util/vms
  • changing the supported VM sizes in CI with a little bit of randomness to ensure even spread

Background on how we found this:

We (@mrWinston & I) discovered this initially when seeing this error

Code="InvalidParameter" Message="The provided worker VM size 'Standard_D4s_v5' is invalid." Target="properties.workerProfiles['worker'].vmSize"

On investigation we saw that the size was valid but we were setting this RPFeatureFlag and that was failing validation. This meant that we couldn't use the other family of machines (and use our Azure capacity to it's fullest) with the current setup. Rather than add more code and clauses we decided to kill it and trust ourselves to be responsible and not spin up very large machines.

Further improvments:

Lessons learned

This PR surfaced several non-obvious architectural boundaries in the codebase:

  1. Three distinct VMSize types coexist: api.VMSize (internal), vms.VMSize (utility/admin), and local VMSize (per external API version). They are all string under the hood but Go's type system treats them as incompatible — every boundary crossing needs an explicit cast. Missing a single cast produces compile errors that can cascade across 11 API versions.

  2. Admin ≠ External: The pkg/api/admin/ package is an internal API with mutual TLS auth, not customer-facing. It correctly uses vms.VMSize directly because it doesn't need to match a swagger contract. Applying the "restore local VMSize" pattern to admin would have been wrong — the admin convert file only needed unnecessary cast removal.

  3. Conversion files are the bridge: The _convert.go files in each API version are where the type boundary lives. ToExternal casts vms.VMSize → local VMSize, ToInternal casts local VMSize → vms.VMSize. Getting the direction wrong in even one file breaks the build.

  4. Validation function signatures evolve: validate.VMSizeIsValidForVersion gained a requireD2sWorkers parameter on master. During rebase, this caused conflicts that needed careful resolution — taking the wrong side silently produced the wrong function signature.

  5. client-generate is destructive: Running make client-generate without a working Docker autorest image deletes all generated SDK files before failing, breaking unit tests. Recovery: git checkout -- pkg/client/ python/client/.

Test plan for issue:

  • Current tests

Is there any documentation that needs to be updated for this PR?

Yes and they're done as part of this PR.

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Feb 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tuxerrante

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mociarain mociarain changed the title Restore local VMSize string types in external API packages Re-work how we select machines for CI... and friends Feb 19, 2026
@tuxerrante tuxerrante added the size-large Size large label Feb 19, 2026
@mociarain mociarain marked this pull request as draft February 19, 2026 16:05
@mociarain mociarain force-pushed the tuxerrante/min-vm-sizes-on-top-of-4594 branch 3 times, most recently from ecdae06 to bbb8c4e Compare March 4, 2026 10:34
@tuxerrante
Copy link
Copy Markdown
Collaborator Author

Is this detected as an error?
I'd like to add OS_CLUSTER_VERSION to our env.example file if that helps (cc @alcasim ).
Then something else would be needed to mark it also as default..

@github-actions github-actions Bot added the needs-rebase branch needs a rebase label Mar 16, 2026
@github-actions
Copy link
Copy Markdown

Please rebase pull request.

@tuxerrante
Copy link
Copy Markdown
Collaborator Author

tuxerrante commented Apr 16, 2026

Current rebased stack has now been force-pushed onto the PR head branch tuxerrante/min-vm-sizes-on-top-of-4594 at d1531881b.

Current status

I reviewed the rebased stack locally, added focused tests, and split the follow-up fixes into separate commits.

Fresh local verification run on the current rebased branch:

  • go test ./pkg/util/cluster ./pkg/util/clusterdata
  • go test ./pkg/frontend -run 'TestAdminVMResize|TestAdminResizeControlPlane|TestPreResizeControlPlaneVMsValidation|TestSupportedvmsizes|TestValidateAdminMasterVMSize' -count=1
  • cd pkg/api && go test ./validate ./util/vms ./admin ./v20230904 ./v20231122 ./v20240812preview ./v20250725

Commits on the rebased stack

Existing main change stack:

  • 57921a737 [ARO-24603] Add centralized vms package for VM size management
  • 32a7e0161 [ARO-24603] Refactor API module to use centralized vms package
  • bac4a20d5 [ARO-24603] Remove FeatureRequireD2sWorkers and update callers to use IsCI
  • e22517b3a [ARO-24603] Update docs to reference new vms package location

New separate follow-up commits:

  • f72d8ee37 [ARO-24603] Add focused VM size regression tests
  • adbb9ca8f [ARO-24603] Reject malformed versions in VM size validation
  • 81b194c56 [ARO-24603] Skip SP cleanup for workload identity deletes
  • d1531881b [ARO-24603] Make admin VM size checks CI-aware

Possible leftovers / follow-ups

The main remaining behavior gap I still see is that admin master-size validation is now CI-aware, but it is still allowlist-based rather than fully version-aware like VMSizeIsValidForVersion(). So admin resize/preflight flows are still not perfectly aligned with create/update validation for version-gated SKUs.
--> #4780

Other follow-up candidates or residual gaps:

  • the operator machine controller still validates machine VM sizes with isCI=false, so CI-only worker sizes can still be reported as invalid there
  • package-level coverage for several touched packages is still below 80%; I added focused regression tests, but did not add broad filler coverage just to move the metric

Risks

  • adbb9ca8f intentionally restores fail-closed behavior for malformed/empty cluster versions; if any persisted cluster documents still carry a bad properties.clusterProfile.version, later validation paths will now reject them until repaired
  • the admin CI parity fix changes discovery + validation behavior in CI/dev
  • the PR is now on the correct rebased tip, but the long-running CI matrix is still in flight, so this is not merge-ready until those checks settle

Suggested split into spin-off PRs

Actionable cherry-pick order:

  1. 57921a737
  2. 32a7e0161
  3. bac4a20d5
  4. e22517b3a
  5. f72d8ee37
  6. adbb9ca8f
  7. 81b194c56
  8. d1531881b

Reviewer-friendly grouping:

Copilot AI review requested due to automatic review settings April 17, 2026 07:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reworks how VM sizes are selected and validated for CI/E2E cluster creation by removing the RequireD2sWorkers feature flag, centralizing VM-size metadata/constants under pkg/api/util/vms, and threading an isCI signal into static validation so CI can use additional (testing) sizes.

Changes:

  • Introduces pkg/api/util/vms as the single source of truth for VM size constants, metadata (family/cores/min OCP version), and CI candidate lists with tiered randomization.
  • Replaces requireD2sWorkers with an isCI parameter across static validators and frontend paths, and updates admin endpoints/validation to optionally include CI/testing sizes.
  • Updates CI/local cluster creation logic and broad test suites to use vms.VMSize and the new validation/selection behavior; removes the FeatureRequireD2sWorkers enum and related references; updates docs accordingly.

Reviewed changes

Copilot reviewed 74 out of 74 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/util/clusterdata/worker_profile_test.go Updates worker-profile enrichment tests to use vms.VMSize and adds provider-spec coercion/security/DES coverage.
pkg/util/clusterdata/worker_profile.go Switches worker profile VMSize enrichment to vms.VMSize.
pkg/util/cluster/delete_roleassignments_test.go Adds regression test ensuring SP role-assignment cleanup is skipped when workload identity is enabled.
pkg/util/cluster/cluster_config_test.go Adds tests for CI candidate VM-size defaults and explicit size candidates; validates WI role-set requirement.
pkg/util/cluster/cluster.go Uses vms.VMSize in config, selects/shuffles candidate sizes via vms helpers, and renames MI/WI role assignment deletion helper.
pkg/operator/controllers/machine/machine.go Adds VM-size validation for machine provider specs using validate.VMSizeIsValid.
pkg/frontend/validate_test.go Extends admin master-size validation tests to include CI behavior.
pkg/frontend/validate.go Switches admin master-size validation to use vms maps and adds CI-aware supported-size selection helper.
pkg/frontend/sku_test.go Updates SKU validation tests to use vms.VMSize.
pkg/frontend/shared_test.go Removes RequireD2sWorkers feature usage and adds IsCI() expectation to env mock.
pkg/frontend/quota_validation.go Updates quota accounting to accept vms.VMSize and uses Family.String() keys.
pkg/frontend/openshiftcluster_putorpatch_test.go Updates test documents/types to use vms.VMSize (including admin API shapes).
pkg/frontend/openshiftcluster_putorpatch.go Updates static validation call signature to pass env.IsCI() instead of feature-flag.
pkg/frontend/openshiftcluster_preflightvalidation_test.go Updates preflight payload tests to use vms.VMSize.
pkg/frontend/openshiftcluster_preflightvalidation.go Updates static validation call signature to pass env.IsCI().
pkg/frontend/admin_supportvmsizes_list.go Switches supported-size admin listing to vms + CI-aware selection.
pkg/frontend/admin_supportedvmsizes_list_test.go Updates tests for vms maps and adds CI-only inclusion test using mocked env.
pkg/frontend/admin_openshiftcluster_vmresize_pre_validation_test.go Updates resize pre-validation tests to use vms.VMSize.
pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go Uses vms.VMSize in quota lookups and makes admin validation CI-aware.
pkg/frontend/admin_openshiftcluster_vmresize.go Makes admin master-size validation CI-aware.
pkg/frontend/admin_openshiftcluster_resize_controlplane_test.go Updates resize control plane tests to use vms.VMSize.
pkg/frontend/admin_openshiftcluster_resize_controlplane.go Makes admin master-size validation CI-aware.
pkg/env/zz_generated_feature_enumer.go Regenerates feature enum after removing FeatureRequireD2sWorkers.
pkg/env/env.go Removes FeatureRequireD2sWorkers constant.
pkg/env/dev.go Removes FeatureRequireD2sWorkers from dev feature list.
pkg/deploy/devconfig.go Removes RequireD2sWorkers from dev deployment feature set.
pkg/cluster/validate_test.go Updates tests to use vms.VMSize.
pkg/cluster/loadbalancerinternal_test.go Updates tests to use vms.VMSize.
pkg/api/validate/vm.go Refactors VM-size validation to use vms maps and adds CI-aware size selection + minimum-version enforcement.
pkg/api/v20250725/openshiftcluster_validatestatic_test.go Replaces requireD2sWorkers with isCI in static validation tests; adds CI-only VM-size test cases.
pkg/api/v20250725/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize for checks.
pkg/api/v20250725/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize for internal representation.
pkg/api/v20240812preview/openshiftcluster_validatestatic_test.go Same isCI refactor + CI-only VM-size tests for this API version.
pkg/api/v20240812preview/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20240812preview/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/v20231122/openshiftcluster_validatestatic_test.go Same isCI refactor + CI-only VM-size tests for this API version.
pkg/api/v20231122/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20231122/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/v20230904/openshiftcluster_validatestatic_test.go Same isCI refactor + CI-only VM-size tests for this API version.
pkg/api/v20230904/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20230904/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/v20230701preview/openshiftcluster_validatestatic_test.go Same isCI refactor for this API version.
pkg/api/v20230701preview/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20230701preview/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/v20230401/openshiftcluster_validatestatic_test.go Same isCI refactor for this API version.
pkg/api/v20230401/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20230401/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/v20220904/openshiftcluster_validatestatic_test.go Same isCI refactor for this API version.
pkg/api/v20220904/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20220904/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/v20220401/openshiftcluster_validatestatic_test.go Same isCI refactor for this API version.
pkg/api/v20220401/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20220401/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/v20210901preview/openshiftcluster_validatestatic_test.go Same isCI refactor for this API version.
pkg/api/v20210901preview/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20210901preview/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/v20200430/openshiftcluster_validatestatic_test.go Same isCI refactor for this API version.
pkg/api/v20200430/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20200430/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/v20191231preview/openshiftcluster_validatestatic_test.go Same isCI refactor for this API version.
pkg/api/v20191231preview/openshiftcluster_validatestatic.go Threads isCI through static validation and uses vms.VMSize.
pkg/api/v20191231preview/openshiftcluster_convert.go Casts external vmSize strings into vms.VMSize.
pkg/api/util/vms/types.go Adds new shared VM-size/role/types + metadata structures (including minimum version handling).
pkg/api/util/vms/sizes_test.go Adds unit tests for size lookup and CI candidate sets.
pkg/api/util/vms/sizes.go Adds centralized supported size maps + CI candidate selection with tiered shuffling.
pkg/api/register.go Updates static validator interface signature to accept isCI.
pkg/api/openshiftclusterdocument_example.go Updates example cluster document to use vms.VMSize.
pkg/api/openshiftcluster.go Migrates core API types from api.VMSize to vms.VMSize and removes duplicated VM-size constants/structs.
pkg/api/admin/openshiftcluster_validatestatic_test.go Updates admin static validator signature to isCI and adds parity tests.
pkg/api/admin/openshiftcluster_validatestatic.go Updates admin static validator signature to accept isCI.
pkg/api/admin/openshiftcluster_convert.go Removes unnecessary VMSize casts by using vms.VMSize directly.
pkg/api/admin/openshiftcluster.go Migrates admin API types from admin.VMSize to vms.VMSize and removes duplicated constants.
docs/adding-new-instance-types.md Updates guidance to reflect removal of FeatureRequireD2sWorkers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/operator/controllers/machine/machine.go
Comment thread pkg/api/util/vms/types.go Outdated
Comment thread pkg/util/cluster/cluster.go Outdated
Comment thread pkg/util/cluster/cluster.go Outdated
Copy link
Copy Markdown
Collaborator Author

@tuxerrante tuxerrante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot review

@tuxerrante
Copy link
Copy Markdown
Collaborator Author

Follow-up for the remaining operator machine-controller CI parity gap is now in #4780:

This is the leftover called out in the earlier status comment:

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 17, 2026

@copilot review

Code Review

Reviewed the current rebased stack at a3d6f55. All tests in pkg/api and pkg/util/cluster pass locally.

Previously flagged issues — all resolved ✅

  • Copyright header in vms/types.go — fixed in 44bc0e1
  • Stale personal TODO in cluster.go — removed in a3d6f55
  • Log message spacing ("roleassignments" → "role assignments") — fixed in a3d6f55

Known follow-up (not blocking this PR)

  • machine.go:49 isCI=false hard-code: Reconciler already has isLocalDevelopmentMode, so the one-line fix is r.isLocalDevelopmentMode instead of false. Tracked in the "Possible leftovers" section; agreed this should be a separate PR.

Remaining observations

  1. OS_CLUSTER_VERSION missing from env.example: The field is mapped in ClusterConfig and present in docker-compose.yml but absent from env.example. Worth adding as a commented-out option with a note about its purpose, so local developers know it exists. The CI failure referenced in #3997476895 was a separate IsCI() mock issue (already fixed), not related to this variable.

  2. deleteRoleAssignments skip log message (cluster.go:1253): The message says "Skipping deletion of service principal role assignments" — clear and correct. The companion deleteWimiRoleAssignments log says "Skipping deletion of wimi role assignments" — consistent with function naming.

  3. adbb9ca fail-closed version parsing: Malformed/empty cluster versions in VMSizeIsValidForVersion now return false (PR description notes this as an intentional risk). Looks correct.

Overall the PR is in good shape. The only actionable item I'd suggest addressing before merge is adding OS_CLUSTER_VERSION to env.example as a commented-out option for developer discoverability.

Is this detected as an error?
I'd like to add OS_CLUSTER_VERSION to our env.example file if that helps (cc @alc...

The CI failure in that run was actually a missing IsCI() mock expectation in the preflight validation test — not related to OS_CLUSTER_VERSION. That has since been fixed.

Added OS_CLUSTER_VERSION to env.example as a commented-out entry in 3139ec8 so developers can discover and override the default cluster version without having to dig through the source. The field already defaults to version.DefaultInstallStream.Version.String() when unset, so no behavior change is needed.

Copilot AI review requested due to automatic review settings April 17, 2026 07:58
@tuxerrante tuxerrante review requested due to automatic review settings April 17, 2026 07:58
@github-actions github-actions Bot added the needs-rebase branch needs a rebase label Apr 20, 2026
@github-actions
Copy link
Copy Markdown

Please rebase pull request.

tuxerrante and others added 11 commits April 21, 2026 10:11
Introduces pkg/api/util/vms/ with canonical VMSize type, size constants,
supported size maps (production and testing/CI), and CI candidate
selection via shuffleByCoreTier to spread quota pressure across families.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move VM size types and constants from api/openshiftcluster.go and
admin/openshiftcluster.go into the new vms package. Simplify
validate/vm.go by delegating to vms maps. Update Static() validator
interface to replace requireD2sWorkers bool with isCI bool, and update
all 11 API version implementations, convert files, and tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… IsCI

Remove the FeatureRequireD2sWorkers feature flag and replace all
callers with env.IsCI(). Update frontend, operator, cluster tooling,
and clusterdata packages to use vms.VMSize types and the new Static()
validator signature. Use vms.GetCICandidateMasterVMSizes() and
vms.GetCICandidateWorkerVMSizes() with shuffle-by-core-tier for
cost-effective quota spreading in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update adding-new-instance-types.md to point to pkg/api/util/vms/
as the new location for VM size definitions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cover the centralized VM-size helpers, CI-only validator paths, and cluster config/enricher behavior so the rebased stack has stronger regression protection before the follow-up fixes.

Made-with: Cursor
Restore the previous fail-closed behavior so malformed or empty cluster versions cannot bypass VM-size validation for otherwise-supported SKUs.

Made-with: Cursor
Restore the workload-identity early return so delete flows do not dereference service-principal clients for clusters that no longer use that cleanup path.

Made-with: Cursor
Use the testing VM-size tables for admin validation and discovery in CI so resize and preflight endpoints stay aligned with the create/update paths on this rebased stack.

Made-with: Cursor
Move the copyright/license header before imports to match repo convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove outdated MAITIU TODO (fields are actively used) and fix
"roleassignments" → "role assignments" spacing in log message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…verability

Agent-Logs-Url: https://github.com/Azure/ARO-RP/sessions/480f5a36-6a5e-43f9-b25b-b96f4ed2bd6f

Co-authored-by: tuxerrante <8364469+tuxerrante@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 21, 2026 08:14
@tuxerrante tuxerrante force-pushed the tuxerrante/min-vm-sizes-on-top-of-4594 branch from 3139ec8 to 287f6d6 Compare April 21, 2026 08:14
@github-actions github-actions Bot removed the needs-rebase branch needs a rebase label Apr 21, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 75 out of 75 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mrWinston
Copy link
Copy Markdown
Collaborator

I didn't do a real review yet, but i do have one overarching question/suggestion:
What do you think about renaming and moving the whole package from ‎pkg/api/util/vms to ‎pkg/util/sku?

  1. It doesn't particularly feel like the SKU parsing and types are part of our API, and the fact that it's now in a utils folder, tells me that you thought the same, since utils is where packages go to die.
  2. I feel like the name vms is too vague for the package.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants