ARO-25193 - Block control plane resize if apiserver is unhealthy by aasserzo · Pull Request #4715 · Azure/ARO-RP

aasserzo · 2026-03-26T18:31:02Z

Which issue this PR addresses:

https://redhat.atlassian.net/browse/ARO-25193

What this PR does / why we need it:

Enhances the pre-flight validation for control plane VM resize by adding more robust API server health checks.

Previously, the validation only checked if the kube-apiserver ClusterOperator was healthy. This PR adds two additional checks:

- Direct API server reachability: calls the /healthz endpoint to verify that the API server is reachable from the RP before proceeding with validation.
- Per-pod health validation: verifies that all 3 kube-apiserver pods are Running and Ready, ensuring the cluster has full API server redundancy before allowing a resize operation that will replace control plane nodes one by one.

These checks help prevent resize operations on clusters where the API server appears healthy at the operator level but may have underlying issues (e.g., a pod stuck in Pending, or the API server not reachable from the RP).
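The per-pod check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `pod` struct is a simplified stand-in for a Kubernetes Pod object, the `app=openshift-kube-apiserver` label used for filtering is an assumption, and the error wording is hypothetical.

```go
package main

import "fmt"

// pod is a simplified, hypothetical stand-in for a Kubernetes Pod.
type pod struct {
	Name   string
	Labels map[string]string
	Phase  string // e.g. "Running", "Pending", "Failed"
	Ready  bool   // value of the pod's Ready condition
}

// expectedAPIServerPods assumes one kube-apiserver pod per control plane node.
const expectedAPIServerPods = 3

// validateAPIServerPods returns an error unless exactly three
// kube-apiserver pods exist and each of them is Running and Ready.
func validateAPIServerPods(pods []pod) error {
	var apiservers []pod
	for _, p := range pods {
		// Assumed label; other pods in the namespace are ignored.
		if p.Labels["app"] == "openshift-kube-apiserver" {
			apiservers = append(apiservers, p)
		}
	}
	if len(apiservers) != expectedAPIServerPods {
		return fmt.Errorf("expected %d kube-apiserver pods, found %d",
			expectedAPIServerPods, len(apiservers))
	}

	var unhealthy []string
	for _, p := range apiservers {
		switch {
		case p.Phase != "Running":
			unhealthy = append(unhealthy, fmt.Sprintf("%s (phase: %s)", p.Name, p.Phase))
		case !p.Ready:
			unhealthy = append(unhealthy, fmt.Sprintf("%s (not ready)", p.Name))
		}
	}
	if len(unhealthy) > 0 {
		return fmt.Errorf("unhealthy kube-apiserver pods: %v", unhealthy)
	}
	return nil
}

func main() {
	apiLabels := map[string]string{"app": "openshift-kube-apiserver"}
	pods := []pod{
		{Name: "kube-apiserver-master-0", Labels: apiLabels, Phase: "Running", Ready: true},
		{Name: "kube-apiserver-master-1", Labels: apiLabels, Phase: "Pending", Ready: false},
		{Name: "kube-apiserver-master-2", Labels: apiLabels, Phase: "Running", Ready: true},
		{Name: "installer-5-master-0", Labels: map[string]string{"app": "installer"}, Phase: "Succeeded"},
	}
	fmt.Println(validateAPIServerPods(pods))
}
```

In this sketch a pod that is Running but not Ready is reported separately from one in a non-Running phase, so the caller can tell the two failure modes apart.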

Test plan for issue:

Unit Tests:

TestValidateAPIServerHealth — Tests healthz check failure and ClusterOperator status checks
TestValidateAPIServerPods — Tests pod count validation, phase checks (Running/Pending/Failed), Ready condition checks, and filtering of non-apiserver pods

Manual Testing:

Deploy a dev cluster using go run ./hack/cluster create

Run pre-validation against a healthy cluster:

curl -sk "https://localhost:8443/admin/subscriptions/$AZURE_SUBSCRIPTION_ID/resourceGroups/$RESOURCEGROUP/providers/Microsoft.RedHatOpenShift/openShiftClusters/$CLUSTER/preresizevalidation?vmSize=Standard_D16s_v3"

Verify that it returns "All pre-flight checks passed".

Simulate unhealthy apiserver pod:
```
kubectl delete pod -n openshift-kube-apiserver <apiserver-pod-name>
```
- Immediately run pre-validation.
- Verify that it returns the error: "Unhealthy kube-apiserver pods: [...(phase: Pending)]. Resize is not safe without full API server redundancy."
Wait for pod recovery and re-run pre-validation:
- Verify that it returns "All pre-flight checks passed".

Is there any documentation that needs to be updated for this PR?

No. The existing documentation in docs/agent-guides/package-deployment-context.md describes the pre-validation endpoint as checking "API server health" — this description remains accurate. The changes in this PR enhance the robustness of those checks but don't change the endpoint's interface or purpose.

How do you know this will function as expected in production?

Unit tests cover all code paths including error handling, edge cases (wrong pod count, unhealthy pods, API failures), and the happy path.
Manual testing against a real dev cluster validated:
- All checks pass on a healthy cluster
- Unhealthy apiserver pod (Pending phase) is detected and blocks resize with a clear error message
- Recovery is detected when pod returns to healthy state
No changes to endpoint interface — The endpoint returns the same success/error responses as before; only the internal validation logic is enhanced.
Fail-safe behavior — The new checks add stricter validation. If they fail incorrectly, the worst case is blocking a resize that would have succeeded, which is safer than allowing a resize that could destabilize the cluster.

tuxerrante

Minor

Success response is an unstructured JSON string (line 131 of admin_openshiftcluster_vmresize_pre_validation.go, pre-existing)

json.Marshal("All pre-flight checks passed") returns a raw JSON string literal. If the endpoint later needs to return structured data (e.g., which checks passed, warnings), this would be a breaking change for any caller parsing the response. Consider returning a simple struct from the start, e.g. json.Marshal(map[string]string{"status": "passed"}).
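To illustrate the difference, here is a small sketch comparing the two encodings. The `preflightResult` type and its `status` field are hypothetical and only mirror the suggestion above; they are not part of the RP.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// preflightResult is a hypothetical structured response; further fields
// (for example warnings) could be added later without breaking callers
// that already parse a JSON object.
type preflightResult struct {
	Status string `json:"status"`
}

// encodeBare mirrors the current behaviour: a bare JSON string literal.
func encodeBare() ([]byte, error) {
	return json.Marshal("All pre-flight checks passed")
}

// encodeStructured returns a JSON object instead.
func encodeStructured() ([]byte, error) {
	return json.Marshal(preflightResult{Status: "passed"})
}

func main() {
	bare, err := encodeBare()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(bare)) // "All pre-flight checks passed"

	structured, err := encodeStructured()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(structured)) // {"status":"passed"}
}
```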

Extra

OperatorStatusText omits the Degraded condition (pre-existing in pkg/util/clusteroperators/isavailable.go)

When Degraded=True is the actual problem, the error text reads:

kube-apiserver Available=True, Progressing=False

...which looks healthy at a glance. The Degraded field — the actual root cause — is missing. Worth a follow-up to include Degraded in OperatorStatusText when it is True.
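One possible shape for that follow-up, as a sketch only: the `operatorConditions` struct and `statusText` function below are hypothetical stand-ins, not the actual types or signature in pkg/util/clusteroperators.

```go
package main

import "fmt"

// operatorConditions is a simplified, hypothetical view of the three
// ClusterOperator conditions discussed in this comment.
type operatorConditions struct {
	Name        string
	Available   bool
	Progressing bool
	Degraded    bool
}

// condition renders a bool the way condition statuses are usually printed.
func condition(b bool) string {
	if b {
		return "True"
	}
	return "False"
}

// statusText appends Degraded=True only when the operator is degraded,
// so the text for a healthy operator stays unchanged.
func statusText(c operatorConditions) string {
	text := fmt.Sprintf("%s Available=%s, Progressing=%s",
		c.Name, condition(c.Available), condition(c.Progressing))
	if c.Degraded {
		text += ", Degraded=True"
	}
	return text
}

func main() {
	degraded := operatorConditions{Name: "kube-apiserver", Available: true, Degraded: true}
	fmt.Println(statusText(degraded))
	// kube-apiserver Available=True, Progressing=False, Degraded=True
}
```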

add apiserver pod health check and apiserver healthz endpoint check t…

0627575

…o control plane resize pre-flight checks

aasserzo requested review from alcasim, bennerv, cadenmarchese, hawkowl, hlipsig, jharrington22, kimorris27, mociarain, mrWinston, rogbas, sankur-codes, tiguelu, tsatam, tuxerrante, ventifus, wanghaoran1988 and yjst2012 as code owners March 26, 2026 18:31

fix linting errors

98f5484

rh-returners reviewed Mar 27, 2026

pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go (outdated, resolved)

tuxerrante reviewed Mar 27, 2026

pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go (outdated, resolved)

pkg/frontend/adminactions/kubeactions.go (outdated, resolved)

pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go (resolved)

Anton Asserzon added 5 commits March 27, 2026 15:07

Simplify CheckAPIServerHealthz to reuse existing discovery REST client

c7c5ca3

Remove redundant body check from CheckAPIServerHealthz

30d10ae

Use /readyz endpoint instead of /healthz for API server check

7e500f4

Run readyz check as synchronous gate and group API server validations

c775bba

Use 500 instead of 503 for API server non-ready status

d105bae

aasserzo requested a review from kevinobriendotca as a code owner March 27, 2026 14:45

Merge branch 'master' into aasserzo/ARO-25193-cp-resize-apiserver-check

169dcfa

tuxerrante reviewed Apr 1, 2026

pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go (outdated, resolved)

aasserzo requested a review from tuxerrante April 1, 2026 15:47

aasserzo requested a review from rh-returners April 1, 2026 15:47

tuxerrante reviewed Apr 2, 2026

pkg/frontend/admin_openshiftcluster_vmresize_pre_validation.go (resolved)

aasserzo wants to merge 8 commits into master from aasserzo/ARO-25193-cp-resize-apiserver-check

4 participants
