fix: handle initial paused state correctly in ScaledObject controller#7563
fix: handle initial paused state correctly in ScaledObject controller#7563rohansood10 wants to merge 2 commits intokedacore:mainfrom
Conversation
|
Thank you for your contribution! 🙏 Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer as expected. While you are waiting, make sure to:
Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient. Learn more about our contribution guide. |
|
we have had perfornance issues because of status propagation. I'm not fully sure about writting the status immediatelly as it's an extra request that KEDA does. Nevertheless, the overriding issue could be a problem too. |
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
|
Good call on the performance concern - I definitely want to avoid adding extra API requests if we can help it. What if instead of writing the status immediately, we propagate the unpause state through an in-memory flag on the ScaledObject reconciler? Something like tracking a The core issue is the race between the reconciler's deferred status write and the scale loop reading stale cached status. If we can make the scale loop aware of the pending state change without hitting etcd, that should solve it cleanly. I can rework the approach if that direction sounds right - let me know what you think. |
When a ScaledObject was created with the paused annotation and then unpaused by removing the annotation, the Paused condition would flip back to True and the Active condition would stay stuck on Unknown. Root cause: The unpause code path only set Paused=False in the local conditions variable, deferring the server-side status update to the end of the Reconcile function. However, the scale loop (started via requestScaleLoop later in the same reconcile) could read the stale Paused=True condition from the API server before the final status update, and overwrite it back via a JSON merge patch that replaces the entire conditions array. Fix: Write the Paused=False condition to the API server immediately in the unpause code path, before proceeding to create the HPA and start the scale loop. This mirrors the pause path which already writes Paused=True to the server before stopping the scale loop, and eliminates the race condition between the reconciler and the scale loop goroutine. Fixes kedacore#6421 Signed-off-by: Rohan Sood <56945243+rohansood10@users.noreply.github.com>
Signed-off-by: Rohan Sood <56945243+rohansood10@users.noreply.github.com>
859b432 to
fc0e90b
Compare
|
I'd like to propose an alternative that maybe fixes this. What if we patch individual conditions by type instead of replacing the entire conditions array. This makes it structurally impossible for the scale loop to overwrite |
|
that's a much cleaner approach - patching conditions individually avoids the race entirely without any extra writes. I'll rework the PR to use per-condition patching instead of the early status write. |
When a ScaledObject is created with
autoscaling.keda.sh/paused: "true"and thenunpaused by removing the annotation, the Paused condition flips back to
Truewithinseconds and the Active condition stays stuck on
Unknown.Root cause: The unpause code path (
case isPausedInStatus:inreconcileScaledObject)only set
Paused=Falsein the local conditions variable, deferring the server-side statusupdate to the end of the
Reconcilefunction. However, the scale loop started viarequestScaleLooplater in the same reconcile could read the stalePaused=Trueconditionfrom the API server before the final status update, and overwrite it back via a JSON merge
patch that replaces the entire conditions array.
Fix: Write the
Paused=Falsecondition to the API server immediately in the unpause codepath, before proceeding to create the HPA and start the scale loop. This mirrors the pause
path which already writes
Paused=Trueto the server before stopping the scale loop(lines 260-264), eliminating the race between the reconciler and the scale loop goroutine.
Checklist
make generate-scalers-schemahas been run to update any outdated generated filesFixes #6421