Skip to content

ARO-25194: Fetch per-VM master sizes from Azure for resize quota validation#4719

Open
bizz001 wants to merge 4 commits intomasterfrom
bizz001/ARO-25194
Open

ARO-25194: Fetch per-VM master sizes from Azure for resize quota validation#4719
bizz001 wants to merge 4 commits intomasterfrom
bizz001/ARO-25194

Conversation

@bizz001
Copy link
Copy Markdown
Collaborator

@bizz001 bizz001 commented Mar 27, 2026

Which issue this PR addresses:

Fixes ARO-25194

What this PR does / why we need it:

The /preresizevalidation endpoint's quota check previously read the master VM size from the cluster document and assumed all three masters were the same size. After a partial resize, masters can have different sizes and the cluster document may not reflect the actual Azure VM size. This PR adds a MasterVMSizes() method to AzureActions that queries ARM (virtualMachines.List) to get the actual size of each master VM, and updates checkResizeComputeQuota to calculate quota deltas per VM instead of multiplying by a fixed node count.

It also adds panic recovery to the validation goroutines. The dynamicRESTMapper in controller-runtime v0.11.2 nil-pointer panics when the API server is unreachable, and since sync.WaitGroup.Go re-panics recovered panics, an unrecovered panic in these goroutines crashes the RP process.

Test plan for issue:

Unit tests: 28 test cases across 5 suites, all passing.

Manual testing against a live dev cluster (eastus, 3x Standard_D8s_v5 masters, OCP 4.16.30). 15/15 scenarios passed:

  • Input validation: empty/missing vmSize, unsupported size (Standard_B2s), non-existent SKU (Standard_Fake_v99) all correctly return 400.
  • Size changes: same size, upsize, downsize, cross-family, cross-version, large upsize (96 cores) all return 200.
  • Mixed master sizes: used the admin resize endpoint to resize master-0 to Standard_E8s_v5 (while master-1 and master-2 remained Standard_D8s_v5). Validated resizing to match the majority (Standard_D8s_v5) and resizing all to a new size (Standard_D16s_v5), both return 200.
  • API server unreachable (all masters deallocated): panic recovered, returns 400 with error details. RP stays alive.

Is there any documentation that needs to be updated for this PR?

N/A, internal admin endpoint.

How do you know this will function as expected in production?

  • The endpoint is read-only (GET), it does not mutate cluster state.
  • Quota calculation is covered by unit tests for same-family, cross-family, mixed sizes, and downsizing scenarios.
  • Panic recovery prevents RP crashes when the API server is unreachable, confirmed by manual test with deallocated masters.

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

context "context"
io "io"
reflect "reflect"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have your pre-commit config in place and enabled?

@tuxerrante
Copy link
Copy Markdown
Collaborator

Please fix title and description :)

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Apr 1, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 1, 2026

Please rebase pull request.

@bizz001 bizz001 changed the title ARO-25194: Fetch per-VM master sizes from Azure for resize quota vali… ARO-25194: Fetch per-VM master sizes from Azure for resize quota validation Apr 1, 2026
return a.virtualMachines.CreateOrUpdateAndWait(ctx, clusterRGName, vmName, vm)
}

func (a *azureActions) MasterVMSizes(ctx context.Context) ([]string, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a helper for the validation functions that gets a list of machines: https://github.com/Azure/ARO-RP/blob/master/pkg/frontend/admin_openshiftcluster_resize_validation_helpers.go#L55

It feels like duplicating code, so there are two paths that we could follow: either removing the get logic from here (which is based on names, whereas in the helper function is based on labels) getting the list from the helper; or create a simpler "GetMasterVMs" and reuse the code here and the helper.

I would lean towards the second, as it seems that GetMasterVMs would be a legit type method that can be used in many places.

@tuxerrante
Copy link
Copy Markdown
Collaborator

tuxerrante commented Apr 2, 2026

  1. Medium: Master detection by VM name substring is brittle

    • MasterVMSizes() currently classifies masters using strings.Contains(*vm.Name, \"-master-\").
    • This can misclassify non-control-plane VMs if naming conventions evolve or include master in non-master names, which can skew quota calculations.
  2. Medium: Partial master inventory can under-calculate quota

    • Pre-validation only rejects len(currentVMSizes) == 0.
    • If ARM returns only 1-2 masters transiently (delete/recreate/incomplete list), quota is computed only for returned VMs and may understate required subscription quota for full control-plane resize.

Missing test coverage:

  • No tests covering MasterVMSizes() classification collisions (e.g., non-master VM names containing -master-).
  • No tests covering partial inventory (len(currentVMSizes)==1/2) to verify pre-validation fails closed instead of under-calculating quota.

wg.Go(safeGo(func() error { return f.validateVMSKU(ctx, doc, subscriptionDoc, desiredVMSize, log) }))
wg.Go(safeGo(func() error { return validateAPIServerHealth(ctx, k) }))
wg.Go(safeGo(func() error { return validateEtcdHealth(ctx, k) }))
wg.Go(safeGo(func() error { return validateClusterSP(ctx, k) }))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #4733 also has some changes conflicting here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants