Handle CSR approval race condition in assisted scale-up #148

Closed
rut31337 wants to merge 1 commit into main from fix-assisted-scale-csr-deadlock-v2

@rut31337

Contributor

Problem

Follow-up to #147 (merged). LB2863 cluster provisioning failed with a different root cause than LB2391's deadlock — the CSR approval step itself crashed due to a race condition.

What happened (LB2863, tower job 28080)

  1. Worker VMs came up (45s)
  2. wait_for_hosts passed (328s) — assisted installer completed
  3. approve_csr_nodes.yaml ran — listed 2 pending CSRs
  4. PUT to approve csr-6m6t7 returned an error (409/422 — the machine-approver touched it between list and approve)
  5. The task has no error handling → entire playbook aborted

The machine-approver operator runs continuously and sometimes approves or rejects a CSR in the window between when we list pending CSRs and when we PUT the approval. This is a benign race — the CSR is already handled — but the uri task treats any non-2xx as fatal.
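The list step that opens this window might look roughly like the sketch below. It is an illustration only, not the contents of tasks/approve_csr_nodes.yaml: the variable names _api_url, _api_token, _r_csr_list, and _pending_csrs are hypothetical, and it assumes pending CSRs are recognized by having no status.conditions yet.

```yaml
# Sketch only. _api_url, _api_token, _r_csr_list and _pending_csrs are
# hypothetical names, not taken from the actual task file.
- name: List all CSRs
  ansible.builtin.uri:
    url: "{{ _api_url }}/apis/certificates.k8s.io/v1/certificatesigningrequests"
    method: GET
    headers:
      Authorization: "Bearer {{ _api_token }}"
  register: _r_csr_list

# Assumption: a CSR without any status.conditions is still pending.
- name: Keep only CSRs that have no Approved/Denied condition yet
  ansible.builtin.set_fact:
    _pending_csrs: >-
      {{ _r_csr_list.json['items']
         | selectattr('status.conditions', 'undefined')
         | list }}
```

_pending_csrs is a snapshot. Anything machine-approver does to those CSRs after this point is invisible to the approval loop that consumes the list, which is where the race comes from.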

Fix

Register the result of the approval task as _r_approve_csr, then add status_code: [200, 201] and failed_when: _r_approve_csr.status >= 500 to it. This way:

  • 200/201: approved successfully (normal)
  • 409 Conflict / 422 Unprocessable: CSR already handled by another controller (harmless, continue)
  • 5xx: actual server error (fail)
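A minimal sketch of what the approval task could look like with these settings. Only status_code, register: _r_approve_csr, and the failed_when expression come from this PR. The URL layout, _api_url, _api_token, _pending_csrs, _approval_patch, and the reason/message strings are assumptions for illustration.

```yaml
# Sketch only. Names other than _r_approve_csr are hypothetical.
- name: Approve each pending CSR, tolerating ones already handled elsewhere
  ansible.builtin.uri:
    url: "{{ _api_url }}/apis/certificates.k8s.io/v1/certificatesigningrequests/{{ item.metadata.name }}/approval"
    method: PUT
    headers:
      Authorization: "Bearer {{ _api_token }}"
    body_format: json
    # Send the listed CSR object back with an Approved condition merged in.
    body: "{{ item | combine(_approval_patch, recursive=True) }}"
    status_code: [200, 201]
  vars:
    _approval_patch:
      apiVersion: certificates.k8s.io/v1
      kind: CertificateSigningRequest
      status:
        conditions:
          - type: Approved
            status: "True"
            reason: ApprovedByPlaybook          # hypothetical reason string
            message: Approved during assisted scale-up
  loop: "{{ _pending_csrs }}"
  loop_control:
    label: "{{ item.metadata.name }}"
  register: _r_approve_csr
  # Inside a loop, _r_approve_csr refers to the current item's result here.
  failed_when: _r_approve_csr.status >= 500
```

One design point worth checking before relying on this: failed_when replaces the module's own failure verdict, so status_code no longer decides whether the task fails. The condition above lets every sub-500 outcome through, including 401, 403, and 404, and likely also connection failures, which uri reports with status -1. If only the two race outcomes should be tolerated, a stricter variant would drop failed_when and list the accepted codes directly, e.g. status_code: [200, 201, 409, 422].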

Files changed

| File | Change |
|---|---|
| tasks/approve_csr_nodes.yaml | Add failed_when and register to CSR approval URI task |

The uri PUT to approve a CSR can fail with 409/422 if another
controller (e.g. machine-approver) touches the CSR between the
time we list it and when we try to approve it. This killed the
entire playbook for LB2863 — one CSR raced, the task failed,
and the playbook aborted.

Add failed_when to only fail on server errors (5xx), not on
conflicts or already-approved responses which are harmless.
@rut31337 requested a review from a team as a code owner on April 27, 2026, 20:19
@wkulhanek closed this on Apr 28, 2026
@wkulhanek

Collaborator

Please test on a branch before submitting PRs that AI "suggested" to avoid breaking prod.

2 participants