[ARO-25537] Additional E2E Tests for new Control Plane Resize#4787
[ARO-25537] Additional E2E Tests for new Control Plane Resize#4787
Conversation
There was a problem hiding this comment.
Pull request overview
Adds additional E2E coverage and tooling around the new control-plane resize admin action, including a new slow test label and CI filtering to keep long-running tests out of standard CI runs.
Changes:
- Add new E2E test cases for
/resizecontrolplane(no-op when same size, quota failure path, and a slow happy-path resize). - Introduce a new Ginkgo label
slowand update CI E2E label filtering to exclude it. - Add a
make e2etarget to run E2Es directly viago testwith focus/label filtering support.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| test/e2e/update.go | Switch E2E controller-runtime client usage to clients.KubeClient. |
| test/e2e/setup.go | Add slow label constant; extend clientSet with Usages client and rename Client -> KubeClient. |
| test/e2e/adminapi_resize_controlplane.go | Add new resize-control-plane E2E cases and helper functions for VM/label validation. |
| Makefile | Add e2e target and focus variable; adjust license validation ignore list. |
| .pipelines/ci.yml | Update CI E2E label filter to skip slow tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
d259c3d to
323579b
Compare
323579b to
3f9fac5
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
3f9fac5 to
c8aef41
Compare
c8aef41 to
18cf2f7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| By("Validating machine and node labels") | ||
| validateMasterVMSizeLabels(ctx, targetSku) | ||
| }, NodeTimeout(30*time.Minute)) |
There was a problem hiding this comment.
A few thoughts on optimizing this test and the suite:
Parallelization — The no-op and quota tests are non-destructive and could run in parallel with most other E2E specs. Only this actual-resize test needs Serial (correctly applied). If the suite grows, consider wrapping the non-destructive cases in their own Describe without Serial so Ginkgo can schedule them concurrently.
Cluster cleanup is not critical — since testing clusters are deleted automatically, the lack of DeferCleanup to resize back is acceptable. If the team does want to save subscription-wide resources, #4619 (allow smaller VM sizes for test clusters) would help more than restoring size here.
CI coverage — slow tests are excluded from all CI stages including IndividualCI/BatchedCI (unlike regressiontest which is re-included there). The PR description says they run in Ring 1 — worth adding a comment in ci.yml near the override clarifying where slow tests actually run, so future readers don't assume they're orphaned.
| Expect(resp.StatusCode).To(Equal(http.StatusBadRequest)) | ||
| Expect(out.Message).To(Equal("Pre-flight validation failed.")) | ||
| Expect(out.Details).To(HaveLen(1)) | ||
| Expect(out.Details[0].Code).To(Equal("ResourceQuotaExceeded")) |
There was a problem hiding this comment.
For future consideration: approaches to mock/reproduce resize failures in E2E, sorted by ease of integration in the current stack:
- Quota-based failures (already done here) — find a SKU with zero quota. Easiest, already working.
- Invalid/unsupported SKU (already done above) — request an invalid VM size. Trivial.
- Azure Policy deny rules — create a temporary Azure Policy that denies
Microsoft.Compute/virtualMachines/writefor a specific SKU or resource group. Can be set up/torn down in test setup. No code changes needed, just ARM calls. - RBAC restriction — temporarily remove the RP service principal's
Contributorrole on the cluster resource group. Simulates permission failures. Easy to script but risks side effects on other tests if not restored. - Mock at the compute client level in unit tests — use the existing
pkg/util/mockspatterns to inject errors inVirtualMachines.UpdateorVirtualMachines.Deallocate. Not E2E, but gives fine-grained control over which step fails (pre-flight, resize, start, uncordon). - Azure Chaos Studio — inject faults at the VM level (stop/start failures, delayed responses). Most realistic but requires Chaos Studio setup on the subscription and experiment definitions.
Options 3-4 are probably the sweet spot for new E2E failure paths — they test real ARM error handling without needing infrastructure beyond what the test subscription already has.
There was a problem hiding this comment.
Re 4.:
I don't believe this is a realistic scenario since the RP's permissions in the CUs subscription are managed by azure itself via the First Party Service Principal. This isn't something the customer can modify.
However, Re 3:
This is actually a known failure condition for resizes. Customers can have custom subscription level policies restricting VMs to only use an approved list of SKUs. I don't believe this is something we can catch during the preflight checks unfortunately, but it should be worthwhile to have a good error message for SREs in case we encounter this issue. Since we'll need to reach out to the CU via azcomm, the error message needs to include everything the cu needs to adjust their policies, like the policy id, assignment id, assignment scope, target vm sku.
E2E coverage gaps relative to other open resize PRsThis PR tests the core resize flow from #4733 (merged). Three open PRs add features that will need E2E coverage once they land: #4786 — Response messages ( The current tests only assert on
#4719 — Per-VM quota validation (mixed master sizes) The quota test here assumes all 3 masters are the same size. #4719 changes quota calculation to query actual per-VM sizes from ARM to handle partial-resize scenarios. Once it merges:
#4707 — Capacity Reservation ( No current coverage for CRG-backed resize. Once #4707 merges, suggested cases:
These could be added incrementally as each PR merges rather than blocking this one. |
Which issue this PR addresses:
Fixes ARO-25537
What this PR does / why we need it:
slowslowmake e2ethat directly runs e2e viago testwithout creating an intermediate binary and that supports passing ginkgo's-focusparameter to select invidiual tests to be run based on a regexTest plan for issue:
How do you know this will function as expected in production?