ARO-25193 - Block control plane resize if apiserver is unhealthy #4715
tuxerrante left a comment
Minor
Success response is an unstructured JSON string (line 131 of admin_openshiftcluster_vmresize_pre_validation.go, pre-existing)
json.Marshal("All pre-flight checks passed") returns a raw JSON string literal. If the endpoint later needs to return structured data (e.g., which checks passed, warnings), this would be a breaking change for any caller parsing the response. Consider returning a simple struct from the start, e.g. json.Marshal(map[string]string{"status": "passed"}).
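To make the difference concrete, here is a minimal sketch comparing the two encodings. The `preflightResult` type and both helper names are hypothetical and are not taken from the PR; only the success message and the `{"status": "passed"}` shape come from the comment above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// preflightResult is a hypothetical response shape, not a type from the PR.
// New fields (for example a list of warnings) could be added later without
// changing the top-level JSON type that callers parse.
type preflightResult struct {
	Status string `json:"status"`
}

// marshalLegacy mirrors the behaviour described in the review:
// the success response is a bare JSON string literal.
func marshalLegacy() ([]byte, error) {
	return json.Marshal("All pre-flight checks passed")
}

// marshalStructured returns a JSON object instead of a bare string.
func marshalStructured() ([]byte, error) {
	return json.Marshal(preflightResult{Status: "passed"})
}

func main() {
	legacy, err := marshalLegacy()
	if err != nil {
		panic(err)
	}
	structured, err := marshalStructured()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(legacy))     // "All pre-flight checks passed"
	fmt.Println(string(structured)) // {"status":"passed"}
}
```

A caller that decodes the first form into a Go string would fail on the second, which is why the choice is cheaper to make before anything depends on the response body.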
Extra
OperatorStatusText omits the Degraded condition (pre-existing in pkg/util/clusteroperators/isavailable.go)
When Degraded=True is the actual problem, the error text reads:
kube-apiserver Available=True, Progressing=False
...which looks healthy at a glance. The Degraded field — the actual root cause — is missing. Worth a follow-up to include Degraded in OperatorStatusText when it is True.
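One possible shape for that follow-up is sketched below. It uses a simplified stand-in `condition` type rather than the OpenShift API types, and `statusText` is a hypothetical function: the real signature of OperatorStatusText is not shown in this conversation, so treat this as an illustration of the idea only.

```go
package main

import (
	"fmt"
	"strings"
)

// condition is a simplified stand-in for a ClusterOperator status condition.
type condition struct {
	Type   string
	Status string
}

// statusText is a hypothetical variant of OperatorStatusText that appends
// Degraded=True when that condition is set, so the root cause is visible.
func statusText(name string, conds []condition) string {
	byType := map[string]string{}
	for _, c := range conds {
		byType[c.Type] = c.Status
	}
	// Conditions that are absent are reported as Unknown.
	get := func(t string) string {
		if v, ok := byType[t]; ok {
			return v
		}
		return "Unknown"
	}
	parts := []string{
		"Available=" + get("Available"),
		"Progressing=" + get("Progressing"),
	}
	if byType["Degraded"] == "True" {
		parts = append(parts, "Degraded=True")
	}
	return name + " " + strings.Join(parts, ", ")
}

func main() {
	conds := []condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "False"},
		{Type: "Degraded", Status: "True"},
	}
	// Prints: kube-apiserver Available=True, Progressing=False, Degraded=True
	fmt.Println(statusText("kube-apiserver", conds))
}
```

Emitting Degraded only when it is True keeps the message for a healthy operator unchanged.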
Which issue this PR addresses:
https://redhat.atlassian.net/browse/ARO-25193
What this PR does / why we need it:
Enhances the pre-flight validation for control plane VM resize by adding more robust API server health checks.
Previously, the validation only checked if the kube-apiserver ClusterOperator was healthy. This PR adds two additional checks:
Direct API server reachability — Calls the /healthz endpoint to verify the API server is reachable from the RP before proceeding with validation.
Per-pod health validation — Verifies that all 3 kube-apiserver pods are Running and Ready, ensuring the cluster has full API server redundancy before allowing a resize operation that will replace control plane nodes one by one.
These checks help prevent resize operations on clusters where the API server appears healthy at the operator level but may have underlying issues (e.g., a pod stuck in Pending, or the API server not reachable from the RP).
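The per-pod check can be sketched roughly as follows. This is not the PR's implementation: the `pod` type is a simplified stand-in for the Kubernetes Pod object, `validateAPIServerPods` is a hypothetical name, and the input is assumed to be already filtered down to kube-apiserver pods. The error wording is loosely modelled on the message quoted in the test plan.

```go
package main

import (
	"fmt"
	"strings"
)

// expectedAPIServerPods reflects the "all 3 kube-apiserver pods" requirement.
const expectedAPIServerPods = 3

// pod is a simplified stand-in for the fields of a Kubernetes Pod that the
// check needs: its name, its phase, and whether its Ready condition is true.
type pod struct {
	Name  string
	Phase string
	Ready bool
}

// validateAPIServerPods (hypothetical) returns an error unless exactly three
// pods are present and every one of them is Running and Ready.
func validateAPIServerPods(pods []pod) error {
	if len(pods) != expectedAPIServerPods {
		return fmt.Errorf("expected %d kube-apiserver pods, found %d",
			expectedAPIServerPods, len(pods))
	}
	var unhealthy []string
	for _, p := range pods {
		switch {
		case p.Phase != "Running":
			unhealthy = append(unhealthy, fmt.Sprintf("%s (phase: %s)", p.Name, p.Phase))
		case !p.Ready:
			unhealthy = append(unhealthy, fmt.Sprintf("%s (not ready)", p.Name))
		}
	}
	if len(unhealthy) > 0 {
		return fmt.Errorf("Unhealthy kube-apiserver pods: [%s]. Resize is not safe without full API server redundancy.",
			strings.Join(unhealthy, ", "))
	}
	return nil
}

func main() {
	pods := []pod{
		{Name: "kube-apiserver-master-0", Phase: "Running", Ready: true},
		{Name: "kube-apiserver-master-1", Phase: "Pending", Ready: false},
		{Name: "kube-apiserver-master-2", Phase: "Running", Ready: true},
	}
	// One pod is Pending, so the check reports it and blocks the resize.
	fmt.Println(validateAPIServerPods(pods))
}
```

Checking the count first means a missing pod is reported as a redundancy problem and not silently accepted because the remaining pods happen to be healthy.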
Test plan for issue:
Unit Tests:
TestValidateAPIServerHealth: Tests healthz check failure and ClusterOperator status checks
TestValidateAPIServerPods: Tests pod count validation, phase checks (Running/Pending/Failed), Ready condition checks, and filtering of non-apiserver pods
Manual Testing:
Deploy a dev cluster using
go run ./hack/cluster create
Run pre-validation against a healthy cluster:
curl -sk "https://localhost:8443/admin/subscriptions/$AZURE_SUBSCRIPTION_ID/resourceGroups/$RESOURCEGROUP/providers/Microsoft.RedHatOpenShift/openShiftClusters/$CLUSTER/preresizevalidation?vmSize=Standard_D16s_v3"
"All pre-flight checks passed"
Simulate unhealthy apiserver pod:
"Unhealthy kube-apiserver pods: [...(phase: Pending)]. Resize is not safe without full API server redundancy."
Wait for pod recovery and re-run pre-validation:
"All pre-flight checks passed"
Is there any documentation that needs to be updated for this PR?
No. The existing documentation in docs/agent-guides/package-deployment-context.md describes the pre-validation endpoint as checking "API server health", and this description remains accurate. The changes in this PR enhance the robustness of those checks but don't change the endpoint's interface or purpose.
How do you know this will function as expected in production?
Unit tests cover all code paths including error handling, edge cases (wrong pod count, unhealthy pods, API failures), and the happy path.
Manual testing against a real dev cluster validated:
No changes to endpoint interface — The endpoint returns the same success/error responses as before; only the internal validation logic is enhanced.
Fail-safe behavior — The new checks add stricter validation. If they fail incorrectly, the worst case is blocking a resize that would have succeeded, which is safer than allowing a resize that could destabilize the cluster.