Fix CSR approval deadlock in assisted installer scale-up #147

Merged: wkulhanek merged 2 commits into main from fix-assisted-scale-csr-deadlock on Apr 27, 2026
@rut31337 (Contributor)

Problem

CNV cluster provisioning fails with a 3-hour timeout for any lab using host_ocp4_assisted_scale to add worker nodes. Observed on LB2391 and LB2863 — all 12 worker VMs install OCP successfully, reboot, and generate kubelet CSRs, but the CSRs are never approved and the job times out.

Root cause: deadlock between CSR approval and wait_for_hosts

The machine-approver operator rejects CSRs for assisted-installer workers because no Machine API objects exist for them (these nodes are added via assisted installer, not the Machine API). The role handles this with approve_csr_nodes.yaml which manually approves pending CSRs — but it runs after wait_for_hosts.

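wait_for_hosts blocks until the assisted installer marks hosts as "ready". Hosts only become ready once their nodes join the cluster, and nodes cannot join until their CSRs are approved. Because CSR approval is scheduled after wait_for_hosts, neither step can make progress: a deadlock.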

wait_for_hosts (blocks)  →  needs nodes to join cluster
                             →  needs CSR approval
                                →  runs AFTER wait_for_hosts  ← deadlock
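
The ordering problem can be reproduced with a toy model. The sketch below is not the role's code: the `Cluster` class, its fields, and the two step functions are hypothetical stand-ins, and the polling limits are arbitrary. It only illustrates why running the approval step second can never succeed, while running it first can.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy cluster state; the field names are illustrative, not real API fields."""
    csrs_approved: bool = False
    nodes_joined: bool = False

    def tick(self) -> None:
        # A worker can only join once its kubelet CSRs have been approved.
        if self.csrs_approved:
            self.nodes_joined = True


def approve_csrs(cluster: Cluster, max_polls: int = 5) -> bool:
    """Stand-in for approve_csr_nodes.yaml: approve, then wait for nodes to join."""
    for _ in range(max_polls):
        cluster.csrs_approved = True
        cluster.tick()
        if cluster.nodes_joined:
            return True
    return False


def wait_for_hosts(cluster: Cluster, max_polls: int = 5) -> bool:
    """Stand-in for wait_for_hosts: succeeds only once nodes have joined."""
    for _ in range(max_polls):
        cluster.tick()
        if cluster.nodes_joined:
            return True
    return False  # timed out without any node joining


def run(steps) -> list[bool]:
    """Run steps in order on a fresh cluster, stopping at the first failure."""
    cluster = Cluster()
    results = []
    for step in steps:
        ok = step(cluster)
        results.append(ok)
        if not ok:
            break
    return results


old_order = run([wait_for_hosts, approve_csrs])  # wait_for_hosts times out first
new_order = run([approve_csrs, wait_for_hosts])  # both steps complete
print(old_order, new_order)
```

In the old order the run stops at the first step, so the approval step is never reached; in the new order both steps return successfully.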

Evidence from cluster-lsrl5 (LB2391)

  • 12 worker VMs: all Running on CNV host, OCP installed successfully
  • 209 pending CSRs, 0 approved (for 5+ hours)
  • machine-approver logs: failed to find machine with InternalDNS matching worker-cluster-lsrl5-3, cannot approve
  • Manually approving CSRs → all 12 workers joined and went Ready within 60 seconds
  • Tower job 28088 on event0: ran for 10803s, killed by 10800s timeout
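
For anyone repeating the pending-CSR count, a minimal sketch of the filter is shown here. It assumes input shaped like `oc get csr -o json` output and treats a CSR as pending when its status carries no conditions. The helper name and the sample CSR names are hypothetical.

```python
import json


def pending_csr_names(csr_list: dict) -> list[str]:
    """Return names of CSRs with no status conditions (neither approved nor denied)."""
    names = []
    for item in csr_list.get("items", []):
        conditions = (item.get("status") or {}).get("conditions") or []
        if not conditions:
            names.append(item["metadata"]["name"])
    return names


# Hypothetical sample shaped like `oc get csr -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "csr-aaaaa"}, "status": {}},
    {"metadata": {"name": "csr-bbbbb"},
     "status": {"conditions": [{"type": "Approved", "status": "True"}]}},
    {"metadata": {"name": "csr-ccccc"}}
  ]
}
""")

pending = pending_csr_names(sample)
print(pending)
```

The resulting names are what would be passed to `oc adm certificate approve` when approving by hand.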

Fix

  1. Move CSR approval before wait_for_hosts — the approve_csr_nodes.yaml task already has a retry loop that waits for the expected number of ready worker nodes. Running it first lets CSRs get approved as workers come up, so by the time wait_for_hosts runs, nodes have already joined.

  2. Increase approve_csr_retries from 90 to 180 (15 min → 30 min) — CSR approval now runs before wait_for_hosts and needs to cover the full install + reboot + kubelet startup cycle.
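
The retry figures can be sanity-checked with a little arithmetic. This sketch assumes the delay between retries is fixed and unchanged by this PR; the 10-second value is inferred from the "90 retries, 15 min" figure above rather than read from the role.

```python
# Figures taken from the description: 90 retries ~ 15 min, 180 retries ~ 30 min.
OLD_RETRIES = 90
NEW_RETRIES = 180
OLD_WINDOW_SECONDS = 15 * 60

# Assumption: a fixed per-retry delay, inferred from the old figures.
delay_seconds = OLD_WINDOW_SECONDS // OLD_RETRIES

new_window_seconds = NEW_RETRIES * delay_seconds
JOB_TIMEOUT_SECONDS = 10800  # Tower job timeout quoted in the evidence

print(delay_seconds)                             # seconds between retries
print(new_window_seconds // 60)                  # new approval window in minutes
print(new_window_seconds < JOB_TIMEOUT_SECONDS)  # window fits inside the job timeout
```

Under that assumption the new window is 30 minutes, well inside the 3-hour job timeout.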

Files changed

| File | Change |
|---|---|
| tasks/main.yaml | Move CSR approval block before wait_for_hosts |
| defaults/main.yml | approve_csr_retries: 90 → 180 |

The machine-approver operator rejects CSRs for assisted-installer
workers because no Machine API objects exist for them. The role's
approve_csr_nodes.yaml handles this by manually approving CSRs,
but it ran AFTER wait_for_hosts — which blocks until the assisted
installer marks hosts "ready", which requires nodes to join the
cluster, which requires CSR approval. Deadlock.

Move CSR approval before wait_for_hosts so CSRs are approved as
workers come up. Increase approve_csr_retries default from 90 to
180 (30 min) to cover the full install+reboot+join cycle.

Observed on OCP 4.17 CNV clusters: all 12 worker VMs installed
successfully but sat with 209 pending CSRs for 3 hours until
the Tower job timeout killed the playbook.

Workers need time to install OCP from ISO, reboot, and generate
CSRs. The previous default of 90 retries (15 min) was too short
when CSR approval runs before wait_for_hosts.
@rut31337 rut31337 requested a review from a team as a code owner April 27, 2026 20:07
@wkulhanek wkulhanek merged commit 272c034 into main Apr 27, 2026
2 checks passed
@wkulhanek wkulhanek deleted the fix-assisted-scale-csr-deadlock branch April 27, 2026 20:10
wkulhanek added a commit that referenced this pull request Apr 28, 2026