Handle CSR approval race condition in assisted scale-up #148
Closed
rut31337 wants to merge 1 commit into
The uri PUT that approves a CSR can fail with 409/422 if another controller (e.g. machine-approver) touches the CSR between the time we list it and the time we try to approve it. This killed the entire playbook for LB2863: one CSR raced, the task failed, and the playbook aborted. Add failed_when so the task fails only on server errors (5xx), not on conflicts or already-approved responses, which are harmless.
Collaborator
Please test on a branch before submitting PRs that AI "suggested" to avoid breaking prod.
Problem
Follow-up to #147 (merged). LB2863 cluster provisioning failed with a different root cause than LB2391's deadlock — the CSR approval step itself crashed due to a race condition.
What happened (LB2863, tower job 28080)
1. `wait_for_hosts` passed (328s): the assisted installer completed.
2. `approve_csr_nodes.yaml` ran and listed 2 pending CSRs.
3. `csr-6m6t7` returned an error (409/422): the `machine-approver` touched it between list and approve.
The `machine-approver` operator runs continuously and sometimes approves or rejects a CSR in the window between when we list pending CSRs and when we PUT the approval. This is a benign race (the CSR is already handled), but the `uri` task treats any non-2xx as fatal.
Fix
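Add `failed_when: _r_approve_csr.status >= 500` and `status_code: [200, 201]` to the approval task. This way the task fails only on server errors (5xx), while conflicts and already-approved responses, which are harmless, no longer abort the playbook.
Files changed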
`tasks/approve_csr_nodes.yaml`: add `failed_when` and `register` to the CSR approval URI task
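For illustration, the change described under Fix might look roughly like the task below. This is a minimal sketch, not the actual contents of `tasks/approve_csr_nodes.yaml`: the variables `csr_approval_url`, `csr_approval_body`, and `api_token` are hypothetical placeholders, and only `_r_approve_csr`, `failed_when`, and `status_code` are taken from the description above.

```yaml
# Sketch only. csr_approval_url, csr_approval_body and api_token are
# hypothetical placeholders; the real task may build these differently.
- name: Approve CSR (tolerate races with machine-approver)
  ansible.builtin.uri:
    url: "{{ csr_approval_url }}"
    method: PUT
    headers:
      Authorization: "Bearer {{ api_token }}"
    body_format: json
    body: "{{ csr_approval_body }}"
    # Statuses the module itself regards as success.
    status_code: [200, 201]
  # Keep the response so failed_when can inspect the HTTP status.
  register: _r_approve_csr
  # Fail only on server errors; a 409/422 from a raced CSR is reported as ok.
  failed_when: _r_approve_csr.status >= 500
```

Because `failed_when` replaces the module's own failure decision, `status_code` mostly documents the expected responses here. One thing worth checking before relying on this pattern: when no HTTP response is received at all, `uri` typically reports `status: -1`, which the condition above would not treat as a failure.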