[ARO-25537] Additional E2E Tests for new Control Plane Resize #4787

Open
mrWinston wants to merge 2 commits into master from ARO-25537-e2e-test-for-new-cp-resize

Conversation

@mrWinston
Collaborator

@mrWinston mrWinston commented Apr 22, 2026

Which issue this PR addresses:

Fixes ARO-25537

What this PR does / why we need it:

  • Create 3 new e2e test cases for the controlplane resize admin action:
    • Skip resizing if machines already have the correct size
    • Don't attempt resize with insufficient quota
    • Perform the actual resize (this is the happy path)
  • Add new test tag slow
  • Make CI skip tests tagged as slow
  • Add a make target, make e2e, that runs the e2e suite directly via go test without building an intermediate binary. It supports passing Ginkgo's -focus parameter to select individual tests by regex

Test plan for issue:

  • Tested manually with local RP
  • If you have a local RP with a cluster running, use this command to only run the new e2e tests with the new make target:
make E2E_FOKUS="Resize control plane" e2e

How do you know this will function as expected in production?

  • This PR needs to be tested in canary as well, to make sure it doesn't block the release deployment
  • These tests will only run in our Ring 1 deployments, not Ring 2, nor in CI.

Contributor

Copilot AI left a comment

Pull request overview

Adds additional E2E coverage and tooling around the new control-plane resize admin action, including a new slow test label and CI filtering to keep long-running tests out of standard CI runs.

Changes:

  • Add new E2E test cases for /resizecontrolplane (no-op when same size, quota failure path, and a slow happy-path resize).
  • Introduce a new Ginkgo label slow and update CI E2E label filtering to exclude it.
  • Add a make e2e target to run E2Es directly via go test with focus/label filtering support.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| test/e2e/update.go | Switch E2E controller-runtime client usage to clients.KubeClient. |
| test/e2e/setup.go | Add slow label constant; extend clientSet with Usages client and rename Client -> KubeClient. |
| test/e2e/adminapi_resize_controlplane.go | Add new resize-control-plane E2E cases and helper functions for VM/label validation. |
| Makefile | Add e2e target and focus variable; adjust license validation ignore list. |
| .pipelines/ci.yml | Update CI E2E label filter to skip slow tests. |

Comment thread Makefile Outdated
Comment thread test/e2e/adminapi_resize_controlplane.go
Comment thread test/e2e/adminapi_resize_controlplane.go
Comment thread test/e2e/adminapi_resize_controlplane.go
Comment thread test/e2e/adminapi_resize_controlplane.go Outdated
Comment thread test/e2e/adminapi_resize_controlplane.go Outdated
Comment thread test/e2e/adminapi_resize_controlplane.go Outdated
Comment thread test/e2e/adminapi_resize_controlplane.go
Comment thread test/e2e/adminapi_resize_controlplane.go Outdated
Comment thread test/e2e/adminapi_resize_controlplane.go
Collaborator

@swiencki swiencki left a comment

dupe comment

Comment thread test/e2e/adminapi_resize_controlplane.go Outdated
Copilot AI review requested due to automatic review settings April 24, 2026 09:04
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


Comment thread test/e2e/adminapi_resize_controlplane.go Outdated
Comment thread test/e2e/adminapi_resize_controlplane.go Outdated
@mrWinston mrWinston force-pushed the ARO-25537-e2e-test-for-new-cp-resize branch from d259c3d to 323579b Compare April 28, 2026 11:38
Copilot AI review requested due to automatic review settings April 28, 2026 11:53
@mrWinston mrWinston force-pushed the ARO-25537-e2e-test-for-new-cp-resize branch from 323579b to 3f9fac5 Compare April 28, 2026 11:53
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


Comment thread test/e2e/adminapi_resize_controlplane.go Outdated
Comment thread test/e2e/setup.go Outdated
Comment thread .pipelines/ci.yml
@mrWinston mrWinston force-pushed the ARO-25537-e2e-test-for-new-cp-resize branch from 3f9fac5 to c8aef41 Compare April 28, 2026 12:00
Copilot AI review requested due to automatic review settings April 28, 2026 12:09
@mrWinston mrWinston force-pushed the ARO-25537-e2e-test-for-new-cp-resize branch from c8aef41 to 18cf2f7 Compare April 28, 2026 12:09
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


Comment thread test/e2e/adminapi_resize_controlplane.go
Comment thread test/e2e/adminapi_resize_controlplane.go
Comment thread test/e2e/adminapi_resize_controlplane.go
Comment thread test/e2e/adminapi_resize_controlplane.go
Comment thread test/e2e/adminapi_resize_controlplane.go
Comment thread Makefile Outdated
Comment thread Makefile Outdated

By("Validating machine and node labels")
validateMasterVMSizeLabels(ctx, targetSku)
}, NodeTimeout(30*time.Minute))
Collaborator

A few thoughts on optimizing this test and the suite:

Parallelization — The no-op and quota tests are non-destructive and could run in parallel with most other E2E specs. Only this actual-resize test needs Serial (correctly applied). If the suite grows, consider wrapping the non-destructive cases in their own Describe without Serial so Ginkgo can schedule them concurrently.

Cluster cleanup is not critical — since testing clusters are deleted automatically, the lack of DeferCleanup to resize back is acceptable. If the team does want to save subscription-wide resources, #4619 (allow smaller VM sizes for test clusters) would help more than restoring size here.

CI coverage: slow tests are excluded from all CI stages, including IndividualCI/BatchedCI (unlike regressiontest, which is re-included there). The PR description says they run in Ring 1, so it is worth adding a comment in ci.yml near the override clarifying where slow tests actually run, so future readers don't assume they're orphaned.

Expect(resp.StatusCode).To(Equal(http.StatusBadRequest))
Expect(out.Message).To(Equal("Pre-flight validation failed."))
Expect(out.Details).To(HaveLen(1))
Expect(out.Details[0].Code).To(Equal("ResourceQuotaExceeded"))
Collaborator

For future consideration: approaches to mock/reproduce resize failures in E2E, sorted by ease of integration in the current stack:

  1. Quota-based failures (already done here) — find a SKU with zero quota. Easiest, already working.
  2. Invalid/unsupported SKU (already done above) — request an invalid VM size. Trivial.
  3. Azure Policy deny rules — create a temporary Azure Policy that denies Microsoft.Compute/virtualMachines/write for a specific SKU or resource group. Can be set up/torn down in test setup. No code changes needed, just ARM calls.
  4. RBAC restriction — temporarily remove the RP service principal's Contributor role on the cluster resource group. Simulates permission failures. Easy to script but risks side effects on other tests if not restored.
  5. Mock at the compute client level in unit tests — use the existing pkg/util/mocks patterns to inject errors in VirtualMachines.Update or VirtualMachines.Deallocate. Not E2E, but gives fine-grained control over which step fails (pre-flight, resize, start, uncordon).
  6. Azure Chaos Studio — inject faults at the VM level (stop/start failures, delayed responses). Most realistic but requires Chaos Studio setup on the subscription and experiment definitions.

Options 3-4 are probably the sweet spot for new E2E failure paths — they test real ARM error handling without needing infrastructure beyond what the test subscription already has.
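As a rough illustration of option 3, the sketch below builds the rule body of an Azure Policy definition that denies VM writes for one SKU. The helper name `denySKUPolicyRule` is hypothetical and not part of this PR. The `if`/`then` structure and the `Microsoft.Compute/virtualMachines/sku.name` alias follow the standard Azure Policy rule format, but creating the definition and assigning it to the cluster resource group via ARM is left out here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// denySKUPolicyRule is a hypothetical helper (not from this PR). It returns
// the "policyRule" body of an Azure Policy definition that denies creating or
// updating any virtual machine whose size equals sku.
func denySKUPolicyRule(sku string) map[string]any {
	return map[string]any{
		"if": map[string]any{
			"allOf": []any{
				// Only evaluate virtual machine resources.
				map[string]any{"field": "type", "equals": "Microsoft.Compute/virtualMachines"},
				// Match the VM size that the resize would request.
				map[string]any{"field": "Microsoft.Compute/virtualMachines/sku.name", "equals": sku},
			},
		},
		"then": map[string]any{"effect": "deny"},
	}
}

func main() {
	// The SKU is only an example value.
	out, err := json.MarshalIndent(denySKUPolicyRule("Standard_D16s_v3"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A test would create and assign this in setup and remove both in cleanup, so that other specs in the same subscription are not affected.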

Collaborator Author

Re 4.:
I don't believe this is a realistic scenario, since the RP's permissions in the CU's subscription are managed by Azure itself via the First Party Service Principal. This isn't something the customer can modify.

However, Re 3:
This is actually a known failure condition for resizes. Customers can have custom subscription-level policies restricting VMs to an approved list of SKUs. Unfortunately, I don't believe this is something we can catch during the preflight checks, but it would be worthwhile to have a good error message for SREs in case we encounter this issue. Since we'll need to reach out to the CU via azcomm, the error message needs to include everything the CU needs to adjust their policies: the policy ID, assignment ID, assignment scope, and target VM SKU.

@tuxerrante
Collaborator

E2E coverage gaps relative to other open resize PRs

This PR tests the core resize flow from #4733 (merged). Three open PRs add features that will need E2E coverage once they land:

#4786 — Response messages (verbose parameter)

The current tests only assert on resp.StatusCode. Once #4786 merges, the success response includes structured JSON (status, summary.totalNodes, summary.nodesResized, summary.nodesSkipped, executionOrder). Suggested additions:

  • Parse the success response body in the happy-path test and validate nodesResized vs nodesSkipped counts
  • Add a variant with verbose=true and assert that phases/steps are present in the response
  • The failure tests already validate CloudError fields, which aligns with #4786's approach (failures stay fully detailed)
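A sketch of the first suggestion, assuming the response uses the field names listed above. The Go types, the element type of executionOrder, the sample payload, and the check that resized plus skipped equals total are all assumptions until #4786 lands.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resizeSummary mirrors the summary fields named in the comment above.
// Using int for the counts is an assumption.
type resizeSummary struct {
	TotalNodes   int `json:"totalNodes"`
	NodesResized int `json:"nodesResized"`
	NodesSkipped int `json:"nodesSkipped"`
}

// resizeResponse is a hypothetical shape for the success body.
// The []string type for executionOrder is a guess.
type resizeResponse struct {
	Status         string        `json:"status"`
	Summary        resizeSummary `json:"summary"`
	ExecutionOrder []string      `json:"executionOrder"`
}

// checkResizeResponse decodes body and compares the counts with what the
// test expects. It also checks that resized + skipped adds up to the total,
// which is an assumed invariant.
func checkResizeResponse(body []byte, wantResized, wantSkipped int) error {
	var r resizeResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return fmt.Errorf("decoding response: %w", err)
	}
	s := r.Summary
	if s.NodesResized != wantResized || s.NodesSkipped != wantSkipped {
		return fmt.Errorf("got resized=%d skipped=%d, want resized=%d skipped=%d",
			s.NodesResized, s.NodesSkipped, wantResized, wantSkipped)
	}
	if s.NodesResized+s.NodesSkipped != s.TotalNodes {
		return fmt.Errorf("resized+skipped=%d does not match totalNodes=%d",
			s.NodesResized+s.NodesSkipped, s.TotalNodes)
	}
	return nil
}

func main() {
	// Made-up payload for illustration only.
	body := []byte(`{"status":"Succeeded","summary":{"totalNodes":3,"nodesResized":3,"nodesSkipped":0},"executionOrder":["master-0","master-1","master-2"]}`)
	if err := checkResizeResponse(body, 3, 0); err != nil {
		fmt.Println("unexpected:", err)
		return
	}
	fmt.Println("response counts look consistent")
}
```

In the E2E suite the same check would more naturally be written with Gomega matchers; the plain function form just keeps the sketch dependency-free.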

#4719 — Per-VM quota validation (mixed master sizes)

The quota test here assumes all 3 masters are the same size. #4719 changes quota calculation to query actual per-VM sizes from ARM to handle partial-resize scenarios. Once it merges:

  • A mixed-size scenario test would be the highest-value addition: partially resize one master to a different size, then call pre-resize validation and verify quota is computed against individual VM deltas, not a flat 3 * delta
  • The panic recovery for an unreachable API server (#4719 adds this) could be tested by checking that validation returns a 400 error (not a crash) when the API server is degraded, though this is hard to trigger deterministically in E2E
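To make the difference between the two calculations concrete, here is a small worked example. Both function names are hypothetical, and clamping negative deltas to zero is an assumption rather than something #4719 is known to do. The scenario is a cluster where one master is already at 16 cores, two are still at 8, and the target size has 16.

```go
package main

import "fmt"

// flatDelta models the assumption that every master has the same current
// size: required extra cores = masters * (target - current).
func flatDelta(currentCores, targetCores, masters int) int {
	return masters * (targetCores - currentCores)
}

// perVMDelta sums the delta for each master individually. A master that is
// already at (or above) the target contributes nothing; that clamping is an
// assumption made for this sketch.
func perVMDelta(currentCores []int, targetCores int) int {
	total := 0
	for _, c := range currentCores {
		if d := targetCores - c; d > 0 {
			total += d
		}
	}
	return total
}

func main() {
	// One master already resized to 16 cores, two still at 8; target is 16.
	fmt.Println(flatDelta(8, 16, 3), perVMDelta([]int{16, 8, 8}, 16)) // prints: 24 16
}
```

With only 16 cores of quota left, the flat calculation would reject a resize that the per-VM calculation accepts, which is the case a mixed-size E2E test would need to pin down.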

#4707 — Capacity Reservation (useCapacityReservation parameter)

No current coverage for CRG-backed resize. Once #4707 merges, suggested cases:

  • Happy path: useCapacityReservation=true&zone=<valid> — resize succeeds and CRG is cleaned up afterward
  • Invalid parameter: useCapacityReservation=invalid → 400
  • Zone without CRG flag: zone=1 without useCapacityReservation=true → 400
  • Zone mismatch: zone=<wrong> with useCapacityReservation=true → 400

These could be added incrementally as each PR merges rather than blocking this one.

4 participants