[ARO-25537] Additional E2E Tests for new Control Plane Resize by mrWinston · Pull Request #4787 · Azure/ARO-RP

mrWinston · 2026-04-22T14:53:59Z

Which issue this PR addresses:

What this PR does / why we need it:

- Create 3 new e2e test cases for the control plane resize admin action:
  - Skip resizing if machines already have the correct size
  - Don't attempt a resize with insufficient quota
  - Perform the actual resize (this is the happy path)
- Add a new test tag, `slow`
- Make CI skip tests tagged as `slow`
- Add a `make e2e` target that runs the e2e tests directly via `go test`, without building an intermediate binary, and that supports passing Ginkgo's `-focus` parameter to select individual tests by regex
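For the first case, the no-op check boils down to comparing each master's current VM size with the target SKU. A minimal sketch, assuming a hypothetical `masterVM` type and `vmsNeedingResize` helper (neither is taken from the RP code) and treating size names as case-insensitive, which is also an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// masterVM is a hypothetical stand-in for a control plane VM as read from ARM.
type masterVM struct {
	Name string
	Size string
}

// vmsNeedingResize returns the names of masters whose size differs from target.
// Sizes are compared case-insensitively, which is an assumption of this sketch.
func vmsNeedingResize(vms []masterVM, target string) []string {
	var names []string
	for _, vm := range vms {
		if !strings.EqualFold(vm.Size, target) {
			names = append(names, vm.Name)
		}
	}
	return names
}

func main() {
	vms := []masterVM{
		{Name: "master-0", Size: "Standard_D8s_v5"},
		{Name: "master-1", Size: "Standard_D8s_v5"},
		{Name: "master-2", Size: "Standard_D8s_v5"},
	}

	// Nothing to do: the admin action can return early without touching any VM.
	fmt.Println(len(vmsNeedingResize(vms, "Standard_D8s_v5"))) // 0

	// All three masters would be resized.
	fmt.Println(vmsNeedingResize(vms, "Standard_D16s_v5")) // [master-0 master-1 master-2]
}
```

An empty result is what lets the no-op test stay cheap: it only has to assert that no VM was touched.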

Test plan for issue:

Tested manually with a local RP.
If you have a local RP with a running cluster, use this command to run only the new e2e tests via the new make target:

make E2E_FOKUS="Resize control plane" e2e
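Ginkgo matches the focus regex against a spec's full text, roughly the texts of its enclosing containers plus its own `It` text joined by spaces, so a substring such as "Resize control plane" selects every spec nested under a container with that description. The sketch approximates that matching with the standard `regexp` package; the `spec` type, the `focused` helper and the spec texts are hypothetical and not Ginkgo's internals:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// spec is a hypothetical, simplified view of a Ginkgo spec.
type spec struct {
	Containers []string // Describe/Context texts, outermost first
	Text       string   // the It text
}

// fullText approximates the string Ginkgo matches the focus regex against.
func (s spec) fullText() string {
	parts := append(append([]string{}, s.Containers...), s.Text)
	return strings.Join(parts, " ")
}

// focused returns the full texts of the specs whose full text matches focus.
func focused(specs []spec, focus string) ([]string, error) {
	re, err := regexp.Compile(focus)
	if err != nil {
		return nil, err
	}
	var matches []string
	for _, s := range specs {
		if t := s.fullText(); re.MatchString(t) {
			matches = append(matches, t)
		}
	}
	return matches, nil
}

func main() {
	// Hypothetical spec texts, for illustration only.
	specs := []spec{
		{Containers: []string{"Resize control plane"}, Text: "skips masters that already have the target size"},
		{Containers: []string{"Resize control plane"}, Text: "rejects a size without enough quota"},
		{Containers: []string{"Update cluster"}, Text: "updates the cluster"},
	}
	matches, err := focused(specs, "Resize control plane")
	if err != nil {
		panic(err)
	}
	for _, m := range matches {
		fmt.Println(m)
	}
}
```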

How do you know this will function as expected in production?

The PR needs to be tested in canary as well, to make sure it doesn't block the release deployment.
These tests will only run in our Ring 1 deployments, not in Ring 2 or in CI.

Copilot

Pull request overview

Adds additional E2E coverage and tooling around the new control-plane resize admin action, including a new slow test label and CI filtering to keep long-running tests out of standard CI runs.

Changes:

Add new E2E test cases for /resizecontrolplane (no-op when same size, quota failure path, and a slow happy-path resize).
Introduce a new Ginkgo label slow and update CI E2E label filtering to exclude it.
Add a make e2e target to run E2Es directly via go test with focus/label filtering support.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

File	Description
test/e2e/update.go	Switch E2E controller-runtime client usage to `clients.KubeClient`.
test/e2e/setup.go	Add `slow` label constant; extend `clientSet` with `Usages` client and rename `Client` -> `KubeClient`.
test/e2e/adminapi_resize_controlplane.go	Add new resize-control-plane E2E cases and helper functions for VM/label validation.
Makefile	Add `e2e` target and focus variable; adjust license validation ignore list.
.pipelines/ci.yml	Update CI E2E label filter to skip `slow` tests.

swiencki

dupe comment

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

tuxerrante · 2026-04-29T08:57:05Z

+
+		By("Validating machine and node labels")
+		validateMasterVMSizeLabels(ctx, targetSku)
+	}, NodeTimeout(30*time.Minute))


A few thoughts on optimizing this test and the suite:

Parallelization — The no-op and quota tests are non-destructive and could run in parallel with most other E2E specs. Only this actual-resize test needs Serial (correctly applied). If the suite grows, consider wrapping the non-destructive cases in their own Describe without Serial so Ginkgo can schedule them concurrently.

Cluster cleanup is not critical — since testing clusters are deleted automatically, the lack of DeferCleanup to resize back is acceptable. If the team does want to save subscription-wide resources, #4619 (allow smaller VM sizes for test clusters) would help more than restoring size here.

CI coverage — slow tests are excluded from all CI stages including IndividualCI/BatchedCI (unlike regressiontest which is re-included there). The PR description says they run in Ring 1 — worth adding a comment in ci.yml near the override clarifying where slow tests actually run, so future readers don't assume they're orphaned.
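To make the CI-coverage point concrete: a stage filter that excludes a label drops every spec carrying it, so if every stage excludes `slow`, those specs never run in CI at all. A minimal sketch that models only simple exclusion filters (Ginkgo's real `--label-filter` grammar is much richer); the per-stage exclusion lists are hypothetical and not copied from `.pipelines/ci.yml`:

```go
package main

import "fmt"

// runsInStage reports whether a spec with the given labels survives a stage
// filter of the form "!a && !b && ...", i.e. a filter that only excludes labels.
func runsInStage(specLabels, excluded []string) bool {
	skip := make(map[string]bool, len(excluded))
	for _, l := range excluded {
		skip[l] = true
	}
	for _, l := range specLabels {
		if skip[l] {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical exclusion lists; the real filters live in .pipelines/ci.yml.
	stages := []struct {
		name     string
		excluded []string
	}{
		{"default PR run", []string{"regressiontest", "slow"}},
		{"IndividualCI/BatchedCI", []string{"slow"}},
	}
	specs := []struct {
		name   string
		labels []string
	}{
		{"resize happy path", []string{"slow"}},
		{"regression spec", []string{"regressiontest"}},
	}
	for _, st := range stages {
		for _, sp := range specs {
			fmt.Printf("%-24s %-18s runs=%v\n", st.name, sp.name, runsInStage(sp.labels, st.excluded))
		}
	}
}
```

The `slow` spec reports `runs=false` for every stage, which is why a comment in `ci.yml` pointing at Ring 1 would help future readers.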

tuxerrante · 2026-04-29T08:57:21Z

+		Expect(resp.StatusCode).To(Equal(http.StatusBadRequest))
+		Expect(out.Message).To(Equal("Pre-flight validation failed."))
+		Expect(out.Details).To(HaveLen(1))
+		Expect(out.Details[0].Code).To(Equal("ResourceQuotaExceeded"))
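The assertions above check the decoded error envelope. For reference, a sketch that decodes a body in the common Azure error format (`{"error": {"code", "message", "details"}}`) and applies the same checks; the `cloudErrorBody`/`cloudErrorEnvelope` types are local stand-ins for the RP's own error types, and the sample body is hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// cloudErrorBody is a local stand-in for the Azure error body.
type cloudErrorBody struct {
	Code    string           `json:"code,omitempty"`
	Message string           `json:"message,omitempty"`
	Details []cloudErrorBody `json:"details,omitempty"`
}

// cloudErrorEnvelope wraps the body under the top-level "error" key.
type cloudErrorEnvelope struct {
	Error cloudErrorBody `json:"error"`
}

// isQuotaPreflightFailure applies the same checks as the E2E assertions.
func isQuotaPreflightFailure(status int, body []byte) (bool, error) {
	var env cloudErrorEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return false, err
	}
	e := env.Error
	return status == http.StatusBadRequest &&
		e.Message == "Pre-flight validation failed." &&
		len(e.Details) == 1 &&
		e.Details[0].Code == "ResourceQuotaExceeded", nil
}

func main() {
	// Hypothetical body containing only the fields the test asserts on.
	body := []byte(`{"error":{"message":"Pre-flight validation failed.","details":[{"code":"ResourceQuotaExceeded"}]}}`)
	ok, err := isQuotaPreflightFailure(http.StatusBadRequest, body)
	fmt.Println(ok, err) // true <nil>
}
```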


For future consideration: approaches to mock/reproduce resize failures in E2E, sorted by ease of integration in the current stack:

1. Quota-based failures (already done here): find a SKU with zero quota. Easiest, already working.

2. Invalid/unsupported SKU (already done above): request an invalid VM size. Trivial.

3. Azure Policy deny rules: create a temporary Azure Policy that denies Microsoft.Compute/virtualMachines/write for a specific SKU or resource group. Can be set up/torn down in test setup. No code changes needed, just ARM calls.

4. RBAC restriction: temporarily remove the RP service principal's Contributor role on the cluster resource group. Simulates permission failures. Easy to script, but risks side effects on other tests if not restored.

5. Mock at the compute client level in unit tests: use the existing pkg/util/mocks patterns to inject errors in VirtualMachines.Update or VirtualMachines.Deallocate. Not E2E, but gives fine-grained control over which step fails (pre-flight, resize, start, uncordon).

6. Azure Chaos Studio: inject faults at the VM level (stop/start failures, delayed responses). Most realistic, but requires Chaos Studio setup on the subscription and experiment definitions.

Options 3-4 are probably the sweet spot for new E2E failure paths — they test real ARM error handling without needing infrastructure beyond what the test subscription already has.
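For option 3, the policy rule itself is plain JSON. A sketch that builds a deny rule modelled on the common "allowed VM size SKUs" pattern, the same kind of restriction customers apply, using the `Microsoft.Compute/virtualMachines/sku.name` alias. Creating the definition and assignment through ARM is left out, and the `allowedSKUsPolicyRule` helper is hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// allowedSKUsPolicyRule builds a policy rule that denies creating or updating
// VMs whose size is not in allowed. Hypothetical helper: it only produces the
// "policyRule" JSON, not the policy definition or assignment resources.
func allowedSKUsPolicyRule(allowed []string) ([]byte, error) {
	rule := map[string]any{
		"if": map[string]any{
			"allOf": []any{
				map[string]any{"field": "type", "equals": "Microsoft.Compute/virtualMachines"},
				map[string]any{"not": map[string]any{
					"field": "Microsoft.Compute/virtualMachines/sku.name",
					"in":    allowed,
				}},
			},
		},
		"then": map[string]any{"effect": "deny"},
	}
	// encoding/json sorts map keys, so the output is deterministic.
	return json.Marshal(rule)
}

func main() {
	// Allowing only the current master size makes a resize to any other SKU fail.
	b, err := allowedSKUsPolicyRule([]string{"Standard_D8s_v5"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```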

Re 4:
I don't believe this is a realistic scenario, since the RP's permissions in the customer's subscription are managed by Azure itself via the First Party Service Principal. This isn't something the customer can modify.

However, re 3:
This is actually a known failure condition for resizes. Customers can have custom subscription-level policies restricting VMs to an approved list of SKUs. Unfortunately, I don't believe this is something we can catch during the pre-flight checks, but it would be worthwhile to have a good error message for SREs in case we encounter this issue. Since we'll need to reach out to the customer via azcomm, the error message needs to include everything the customer needs to adjust their policies, such as the policy ID, assignment ID, assignment scope and target VM SKU.
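One way to make sure such an error carries everything the customer needs is to collect the four identifiers up front and render them together, flagging anything that could not be determined. A hypothetical sketch; the `policyViolation` type and its fields are illustrative, the IDs are placeholders, and extracting them from the ARM error is not shown:

```go
package main

import (
	"fmt"
	"strings"
)

// policyViolation is a hypothetical summary of the Azure Policy deny that
// blocked a resize; the fields mirror what the customer needs to act on.
type policyViolation struct {
	PolicyDefinitionID string
	AssignmentID       string
	AssignmentScope    string
	TargetSKU          string
}

// orUnknown makes missing identifiers visible instead of silently empty.
func orUnknown(s string) string {
	if s == "" {
		return "<unknown>"
	}
	return s
}

// message renders one line per identifier so it can be pasted into a customer communication.
func (p policyViolation) message() string {
	return strings.Join([]string{
		fmt.Sprintf("Resize to %s was denied by Azure Policy.", orUnknown(p.TargetSKU)),
		"Policy definition: " + orUnknown(p.PolicyDefinitionID),
		"Policy assignment: " + orUnknown(p.AssignmentID),
		"Assignment scope: " + orUnknown(p.AssignmentScope),
	}, "\n")
}

func main() {
	// Placeholder IDs for illustration.
	v := policyViolation{
		PolicyDefinitionID: "/providers/Microsoft.Authorization/policyDefinitions/example",
		AssignmentID:       "/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Authorization/policyAssignments/example",
		AssignmentScope:    "/subscriptions/00000000-0000-0000-0000-000000000000",
		TargetSKU:          "Standard_D16s_v5",
	}
	fmt.Println(v.message())
}
```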

tuxerrante · 2026-04-29T08:58:20Z

E2E coverage gaps relative to other open resize PRs

This PR tests the core resize flow from #4733 (merged). Three open PRs add features that will need E2E coverage once they land:

#4786 — Response messages (verbose parameter)

The current tests only assert on resp.StatusCode. Once #4786 merges, the success response includes structured JSON (status, summary.totalNodes, summary.nodesResized, summary.nodesSkipped, executionOrder). Suggested additions:

Parse the success response body in the happy-path test and validate nodesResized vs nodesSkipped counts
Add a variant with verbose=true and assert that phases/steps are present in the response
The failure tests already validate CloudError fields, which aligns with #4786's approach (failures stay fully detailed)
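A sketch of the first suggested addition, assuming the success body looks like the field list above (`status`, `summary.totalNodes`, `summary.nodesResized`, `summary.nodesSkipped`, `executionOrder`). The Go types, the integer counters, the sample values and the rule that every node is either resized or skipped are all assumptions until #4786 lands:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resizeResponse is an assumed shape for the #4786 success body.
type resizeResponse struct {
	Status  string `json:"status"`
	Summary struct {
		TotalNodes   int `json:"totalNodes"`
		NodesResized int `json:"nodesResized"`
		NodesSkipped int `json:"nodesSkipped"`
	} `json:"summary"`
	ExecutionOrder []string `json:"executionOrder"`
}

// checkSummary decodes body and checks the counters against the number of
// masters the test expects to be resized.
func checkSummary(body []byte, wantResized int) error {
	var r resizeResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return fmt.Errorf("decoding response: %w", err)
	}
	s := r.Summary
	if s.NodesResized+s.NodesSkipped != s.TotalNodes {
		return fmt.Errorf("resized (%d) + skipped (%d) != total (%d)", s.NodesResized, s.NodesSkipped, s.TotalNodes)
	}
	if s.NodesResized != wantResized {
		return fmt.Errorf("want %d resized nodes, got %d", wantResized, s.NodesResized)
	}
	return nil
}

func main() {
	// Hypothetical body for a resize where all three masters changed size.
	body := []byte(`{"status":"Succeeded","summary":{"totalNodes":3,"nodesResized":3,"nodesSkipped":0},"executionOrder":["master-2","master-1","master-0"]}`)
	fmt.Println(checkSummary(body, 3)) // <nil>
}
```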

#4719 — Per-VM quota validation (mixed master sizes)

The quota test here assumes all 3 masters are the same size. #4719 changes quota calculation to query actual per-VM sizes from ARM to handle partial-resize scenarios. Once it merges:

A mixed-size scenario test would be the highest-value addition: partially resize one master to a different size, then call pre-resize validation and verify quota is computed against individual VM deltas, not a flat 3 * delta
The panic recovery for an unreachable API server (added in #4719) could be tested by checking that validation returns a 400 error (not a crash) when the API server is degraded, though this is hard to trigger deterministically in E2E
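The difference between the two quota calculations shows up as soon as the masters differ in size. A minimal sketch in vCPU counts, assuming only increases consume extra quota; the helper names are hypothetical and this is not the #4719 implementation:

```go
package main

import "fmt"

// perVMDelta sums each master's own vCPU increase needed to reach targetCores.
// Masters already at or above the target add nothing (assumption).
func perVMDelta(currentCores []int, targetCores int) int {
	total := 0
	for _, c := range currentCores {
		if d := targetCores - c; d > 0 {
			total += d
		}
	}
	return total
}

// flatDelta is the naive estimate: every master is assumed to have the same
// size as the first one, so the delta is multiplied by the master count.
func flatDelta(currentCores []int, targetCores int) int {
	if len(currentCores) == 0 {
		return 0
	}
	d := targetCores - currentCores[0]
	if d < 0 {
		d = 0
	}
	return len(currentCores) * d
}

func main() {
	// master-0 is already at 16 vCPUs after an interrupted earlier resize;
	// master-1 and master-2 are still at 8.
	cores := []int{16, 8, 8}
	fmt.Println(perVMDelta(cores, 16), flatDelta(cores, 16)) // 16 0
}
```

Here the flat estimate reports that no extra quota is needed while two masters still need 8 vCPUs each; with a different VM order it would overestimate instead.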

#4707 — Capacity Reservation (useCapacityReservation parameter)

No current coverage for CRG-backed resize. Once #4707 merges, suggested cases:

Happy path: useCapacityReservation=true&zone=<valid> — resize succeeds and CRG is cleaned up afterward
Invalid parameter: useCapacityReservation=invalid → 400
Zone without CRG flag: zone=1 without useCapacityReservation=true → 400
Zone mismatch: zone=<wrong> with useCapacityReservation=true → 400

These could be added incrementally as each PR merges rather than blocking this one.

Copilot AI review requested due to automatic review settings April 22, 2026 14:54

mrWinston requested review from alcasim, bennerv, cadenmarchese, hawkowl, hlipsig, jharrington22, kevinobriendotca, kimorris27, mociarain, rogbas, sankur-codes, tiguelu, tsatam, tuxerrante, ventifus, wanghaoran1988 and yjst2012 as code owners April 22, 2026 14:54

Copilot started reviewing on behalf of mrWinston April 22, 2026 14:57

Copilot AI reviewed Apr 22, 2026

swiencki reviewed Apr 22, 2026

Comment thread test/e2e/adminapi_resize_controlplane.go Outdated

Copilot AI review requested due to automatic review settings April 24, 2026 09:04

Copilot started reviewing on behalf of mrWinston April 24, 2026 09:05

Copilot AI reviewed Apr 24, 2026

Comment thread test/e2e/adminapi_resize_controlplane.go Outdated

Comment thread test/e2e/adminapi_resize_controlplane.go Outdated

mrWinston force-pushed the ARO-25537-e2e-test-for-new-cp-resize branch from d259c3d to 323579b April 28, 2026 11:38

Copilot AI review requested due to automatic review settings April 28, 2026 11:53

mrWinston force-pushed the ARO-25537-e2e-test-for-new-cp-resize branch from 323579b to 3f9fac5 April 28, 2026 11:53

Copilot started reviewing on behalf of mrWinston April 28, 2026 11:53

Copilot AI reviewed Apr 28, 2026

Comment thread test/e2e/adminapi_resize_controlplane.go Outdated

Comment thread test/e2e/setup.go Outdated

Comment thread .pipelines/ci.yml

mrWinston force-pushed the ARO-25537-e2e-test-for-new-cp-resize branch from 3f9fac5 to c8aef41 April 28, 2026 12:00

additional e2e tests for resize control plane

18cf2f7

Copilot AI review requested due to automatic review settings April 28, 2026 12:09

mrWinston force-pushed the ARO-25537-e2e-test-for-new-cp-resize branch from c8aef41 to 18cf2f7 April 28, 2026 12:09

Copilot started reviewing on behalf of mrWinston April 28, 2026 12:10

Copilot AI reviewed Apr 28, 2026

Comment thread test/e2e/adminapi_resize_controlplane.go

Comment thread test/e2e/adminapi_resize_controlplane.go

Comment thread test/e2e/adminapi_resize_controlplane.go

Comment thread test/e2e/adminapi_resize_controlplane.go

Comment thread test/e2e/adminapi_resize_controlplane.go

tuxerrante reviewed Apr 29, 2026

Comment thread Makefile Outdated

tuxerrante reviewed Apr 29, 2026

Comment thread Makefile Outdated

tuxerrante reviewed Apr 29, 2026

comment and rename to focus

e185ce8

mrWinston commented Apr 22, 2026 • edited by openshift-ci Bot

swiencki left a comment • edited

tuxerrante Apr 29, 2026

mrWinston Apr 30, 2026

4 participants