ARO-25194: Fetch per-VM master sizes from Azure for resize quota validation #4719
Skipping CI for Draft Pull Request.
	context "context"
	io "io"
	reflect "reflect"
Do you have your pre-commit config in place and enabled?
Please fix title and description :)
Please rebase pull request.
	return a.virtualMachines.CreateOrUpdateAndWait(ctx, clusterRGName, vmName, vm)
}

func (a *azureActions) MasterVMSizes(ctx context.Context) ([]string, error) {
There is a helper for the validation functions that gets a list of machines: https://github.com/Azure/ARO-RP/blob/master/pkg/frontend/admin_openshiftcluster_resize_validation_helpers.go#L55
This feels like duplicated code, so there are two paths we could follow: either remove the get logic from here (which is name-based, whereas the helper function is label-based) and get the list from the helper; or create a simpler "GetMasterVMs" and reuse it both here and in the helper.
I would lean towards the second, as GetMasterVMs seems like a legitimate type method that could be used in many places.
Missing test coverage:
	wg.Go(safeGo(func() error { return f.validateVMSKU(ctx, doc, subscriptionDoc, desiredVMSize, log) }))
	wg.Go(safeGo(func() error { return validateAPIServerHealth(ctx, k) }))
	wg.Go(safeGo(func() error { return validateEtcdHealth(ctx, k) }))
	wg.Go(safeGo(func() error { return validateClusterSP(ctx, k) }))
PR #4733 also has some changes conflicting here.
Which issue this PR addresses:
Fixes ARO-25194
What this PR does / why we need it:
The `/preresizevalidation` endpoint's quota check previously read the master VM size from the cluster document and assumed all three masters were the same size. After a partial resize, masters can have different sizes and the cluster document may not reflect the actual Azure VM size. This PR adds a `MasterVMSizes()` method to `AzureActions` that queries ARM (`virtualMachines.List`) to get the actual size of each master VM, and updates `checkResizeComputeQuota` to calculate quota deltas per VM instead of multiplying by a fixed node count.
It also adds panic recovery to the validation goroutines. The `dynamicRESTMapper` in controller-runtime v0.11.2 nil-pointer panics when the API server is unreachable, and since `sync.WaitGroup.Go` re-panics recovered panics, an unrecovered panic in these goroutines crashes the RP process.
Test plan for issue:
Unit tests: 28 test cases across 5 suites, all passing.
Manual testing against a live dev cluster (eastus, 3x `Standard_D8s_v5` masters, OCP 4.16.30). 15/15 scenarios passed:
- [...] `Standard_B2s`), non-existent SKU (`Standard_Fake_v99`) all correctly return 400.
- [...] `Standard_E8s_v5` (while master-1 and master-2 remained `Standard_D8s_v5`). Validated resizing to match the majority (`Standard_D8s_v5`) and resizing all to a new size (`Standard_D16s_v5`), both return 200.
Is there any documentation that needs to be updated for this PR?
N/A, internal admin endpoint.
How do you know this will function as expected in production?