Fix CSR approval deadlock in assisted installer scale-up #147
Merged
The machine-approver operator rejects CSRs for assisted-installer workers because no Machine API objects exist for them. The role's approve_csr_nodes.yaml handles this by manually approving CSRs, but it ran after wait_for_hosts. wait_for_hosts blocks until the assisted installer marks hosts "ready", which requires the nodes to join the cluster, which in turn requires CSR approval. The result is a deadlock.

Move CSR approval before wait_for_hosts so CSRs are approved as workers come up. Increase the approve_csr_retries default from 90 to 180 (30 min) to cover the full install, reboot, and join cycle.

Observed on OCP 4.17 CNV clusters: all 12 worker VMs installed successfully but sat with 209 pending CSRs for 3 hours until the tower job timeout killed the playbook.
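The reordering described in this commit message might look roughly like the following in tasks/main.yaml. This is only an illustrative sketch: the diff itself is not shown here, the task names are invented, and wait_for_hosts.yaml is an assumed file name for the wait_for_hosts step.

```yaml
# tasks/main.yaml (illustrative order only; not the actual diff)

# 1. Approve CSRs first, so workers can join the cluster as they come up.
- name: Approve CSRs for new worker nodes
  ansible.builtin.include_tasks: approve_csr_nodes.yaml

# 2. Only then wait for the assisted installer to mark the hosts "ready".
#    wait_for_hosts.yaml is an assumed file name.
- name: Wait for hosts to become ready
  ansible.builtin.include_tasks: wait_for_hosts.yaml
```

With the old order, step 2 ran first and could never finish, because the hosts only become "ready" after the approvals in step 1.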
Workers need time to install OCP from ISO, reboot, and generate CSRs. The previous default of 90 retries (15 min) was too short when CSR approval runs before wait_for_hosts.
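The retry budget can be worked out from the numbers above. "90 retries (15 min)" implies a delay of about 10 seconds between attempts; that delay is inferred, not stated in the PR. Under that assumption the new default would read:

```yaml
# defaults/main.yml
# Retry budget = retries x delay between attempts.
# Assuming a 10-second delay (inferred from "90 retries (15 min)"):
#    90 x 10 s =  900 s = 15 min  (previous default)
#   180 x 10 s = 1800 s = 30 min  (new default)
approve_csr_retries: 180
```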
wkulhanek approved these changes on Apr 27, 2026
wkulhanek added a commit that referenced this pull request on Apr 28, 2026
Problem
CNV cluster provisioning fails with a 3-hour timeout for any lab using host_ocp4_assisted_scale to add worker nodes. Observed on LB2391 and LB2863: all 12 worker VMs install OCP successfully, reboot, and generate kubelet CSRs, but the CSRs are never approved and the job times out.
Root cause: deadlock between CSR approval and wait_for_hosts
The machine-approver operator rejects CSRs for assisted-installer workers because no Machine API objects exist for them (these nodes are added via assisted installer, not the Machine API). The role handles this with approve_csr_nodes.yaml, which manually approves pending CSRs, but it runs after wait_for_hosts. wait_for_hosts blocks until the assisted installer marks hosts as "ready", which requires nodes to join the cluster, which requires CSR approval: a deadlock.
Evidence from cluster-lsrl5 (LB2391)
Running on CNV host, OCP installed successfully
machine-approver logs: failed to find machine with InternalDNS matching worker-cluster-lsrl5-3, cannot approve
Ready within 60 seconds
Fix
Move CSR approval before wait_for_hosts: the approve_csr_nodes.yaml task already has a retry loop that waits for the expected number of ready worker nodes. Running it first lets CSRs get approved as workers come up, so by the time wait_for_hosts runs, nodes have already joined.
Increase approve_csr_retries from 90 to 180 (15 min → 30 min): CSR approval now runs before wait_for_hosts and needs to cover the full install + reboot + kubelet startup cycle.
Files changed
tasks/main.yaml: CSR approval moved before wait_for_hosts
defaults/main.yml: approve_csr_retries: 90 → 180
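For readers unfamiliar with the pattern, a retry loop of the kind described under Fix (approve pending CSRs, then check how many workers are Ready) could be sketched as below. This is a hypothetical illustration, not the contents of approve_csr_nodes.yaml: the variable expected_worker_count, the registered name r_ready_workers, and the 10-second delay are all assumptions.

```yaml
# Hypothetical sketch of a CSR approval retry loop; not the role's actual task.
- name: Approve pending CSRs until the expected workers are Ready
  ansible.builtin.shell: |
    set -o pipefail
    # Approve every CSR currently listed as Pending.
    oc get csr --no-headers | awk '/Pending/ {print $1}' \
      | xargs --no-run-if-empty oc adm certificate approve
    # Print the number of worker nodes whose STATUS column is exactly "Ready".
    oc get nodes -l node-role.kubernetes.io/worker --no-headers \
      | awk '$2 == "Ready"' | wc -l
  args:
    executable: /bin/bash
  register: r_ready_workers
  # The node count is the last line of stdout; retry until it reaches the target.
  # expected_worker_count is an assumed variable name.
  until: >-
    (r_ready_workers.stdout_lines | default([]) | last | default(0) | int)
    >= (expected_worker_count | int)
  retries: "{{ approve_csr_retries }}"
  delay: 10  # assumed: 180 retries x 10 s = 30 min
  changed_when: false
```

Each worker typically submits more than one CSR over time (a client certificate request, then a serving certificate request once the node has registered), so a loop like this is a reasonable fit: every pass approves whatever is pending at that moment and then checks the Ready count again.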