ARO-25193 - Block control plane resize if apiserver is unhealthy #4715

Open
aasserzo wants to merge 8 commits into master from aasserzo/ARO-25193-cp-resize-apiserver-check

@aasserzo
Collaborator

Which issue this PR addresses:

https://redhat.atlassian.net/browse/ARO-25193

What this PR does / why we need it:

Enhances the pre-flight validation for control plane VM resize by adding more robust API server health checks.

Previously, the validation only checked whether the kube-apiserver ClusterOperator was healthy. This PR adds two additional checks:

Direct API server reachability — Calls the /healthz endpoint to verify the API server is reachable from the RP before proceeding with validation.

Per-pod health validation — Verifies that all 3 kube-apiserver pods are Running and Ready, ensuring the cluster has full API server redundancy before allowing a resize operation that will replace control plane nodes one by one.

These checks help prevent resize operations on clusters where the API server appears healthy at the operator level but may have underlying issues (e.g., a pod stuck in Pending, or the API server not reachable from the RP).

Test plan for issue:

Unit Tests:

  • TestValidateAPIServerHealth — Tests healthz check failure and ClusterOperator status checks
  • TestValidateAPIServerPods — Tests pod count validation, phase checks (Running/Pending/Failed), Ready condition checks, and filtering of non-apiserver pods

Manual Testing:

  1. Deploy a dev cluster using go run ./hack/cluster create

  2. Run pre-validation against a healthy cluster:

    curl -sk "https://localhost:8443/admin/subscriptions/$AZURE_SUBSCRIPTION_ID/resourceGroups/$RESOURCEGROUP/providers/Microsoft.RedHatOpenShift/openShiftClusters/$CLUSTER/preresizevalidation?vmSize=Standard_D16s_v3"
    • Verify that the response is "All pre-flight checks passed"

  3. Simulate an unhealthy apiserver pod:

    kubectl delete pod -n openshift-kube-apiserver <apiserver-pod-name>
    • Immediately run pre-validation again
    • Verify that it returns the error: "Unhealthy kube-apiserver pods: [...(phase: Pending)]. Resize is not safe without full API server redundancy."

  4. Wait for the pod to recover and re-run pre-validation:

    • Verify that the response is "All pre-flight checks passed"

Is there any documentation that needs to be updated for this PR?

No. The existing documentation in docs/agent-guides/package-deployment-context.md describes the pre-validation endpoint as checking "API server health" — this description remains accurate. The changes in this PR enhance the robustness of those checks but don't change the endpoint's interface or purpose.

How do you know this will function as expected in production?

  1. Unit tests cover all code paths including error handling, edge cases (wrong pod count, unhealthy pods, API failures), and the happy path.

  2. Manual testing against a real dev cluster validated:

    • All checks pass on a healthy cluster
    • Unhealthy apiserver pod (Pending phase) is detected and blocks resize with a clear error message
    • Recovery is detected when pod returns to healthy state
  3. No changes to endpoint interface — The endpoint returns the same success/error responses as before; only the internal validation logic is enhanced.

  4. Fail-safe behavior — The new checks add stricter validation. If they fail incorrectly, the worst case is blocking a resize that would have succeeded, which is safer than allowing a resize that could destabilize the cluster.

Collaborator

@tuxerrante tuxerrante left a comment


Minor

Success response is an unstructured JSON string (line 131 of admin_openshiftcluster_vmresize_pre_validation.go, pre-existing)

json.Marshal("All pre-flight checks passed") returns a raw JSON string literal. If the endpoint later needs to return structured data (e.g., which checks passed, warnings), this would be a breaking change for any caller parsing the response. Consider returning a simple struct from the start, e.g. json.Marshal(map[string]string{"status": "passed"}).
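To make the difference concrete, here is a small sketch comparing the two encodings. The validationResult type and its warnings field are hypothetical and only show how a struct could grow without changing the shape existing callers parse; they are not part of the PR.

```go
package main

import "encoding/json"

// validationResult is a hypothetical structured response body.
type validationResult struct {
	Status string `json:"status"`
	// Warnings is omitted from the output when empty, so adding it later
	// does not change responses that carry no warnings.
	Warnings []string `json:"warnings,omitempty"`
}

// marshalLegacy mirrors the current behaviour: a bare JSON string literal.
func marshalLegacy() ([]byte, error) {
	return json.Marshal("All pre-flight checks passed")
}

// marshalResult encodes the structured alternative as a JSON object.
func marshalResult(r validationResult) ([]byte, error) {
	return json.Marshal(r)
}
```

marshalLegacy yields the quoted string "All pre-flight checks passed", while marshalResult(validationResult{Status: "passed"}) yields {"status":"passed"}. A caller that decodes the object form keeps working when new optional fields appear, whereas moving from a string to an object later would break any caller that decodes a string.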

Extra

OperatorStatusText omits the Degraded condition (pre-existing in pkg/util/clusteroperators/isavailable.go)

When Degraded=True is the actual problem, the error text reads:

kube-apiserver Available=True, Progressing=False

...which looks healthy at a glance. The Degraded field — the actual root cause — is missing. Worth a follow-up to include Degraded in OperatorStatusText when it is True.
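One possible shape for that follow-up is sketched below. It uses a simplified, hypothetical operatorConditions type instead of the real ClusterOperator object, so the signature differs from the existing OperatorStatusText; only the output format is modelled on the text quoted above.

```go
package main

import "fmt"

// operatorConditions is a simplified stand-in for the conditions of a
// ClusterOperator (hypothetical type). Each field holds "True", "False"
// or "Unknown".
type operatorConditions struct {
	Name        string
	Available   string
	Progressing string
	Degraded    string
}

// operatorStatusText keeps the existing "Available=..., Progressing=..."
// format and appends Degraded only when it is True, so output for healthy
// operators stays unchanged.
func operatorStatusText(c operatorConditions) string {
	text := fmt.Sprintf("%s Available=%s, Progressing=%s", c.Name, c.Available, c.Progressing)
	if c.Degraded == "True" {
		text += ", Degraded=True"
	}
	return text
}
```

With Degraded set to True, the example above would read kube-apiserver Available=True, Progressing=False, Degraded=True, which points at the actual cause instead of looking healthy.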

@aasserzo aasserzo requested a review from tuxerrante April 1, 2026 15:47
@aasserzo aasserzo requested a review from rh-returners April 1, 2026 15:47