
fix: handle initial paused state correctly in ScaledObject controller #7563

Open
rohansood10 wants to merge 2 commits into kedacore:main from
rohansood10:fix/6421-paused-annotation-initial-state

Conversation

@rohansood10
Contributor

When a ScaledObject is created with autoscaling.keda.sh/paused: "true" and then
unpaused by removing the annotation, the Paused condition flips back to True within
seconds and the Active condition stays stuck on Unknown.

Root cause: The unpause code path (case isPausedInStatus: in reconcileScaledObject)
only set Paused=False in the local conditions variable, deferring the server-side status
update to the end of the Reconcile function. However, the scale loop started via
requestScaleLoop later in the same reconcile could read the stale Paused=True condition
from the API server before the final status update, and overwrite it back via a JSON merge
patch that replaces the entire conditions array.
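
To illustrate the overwrite described above: under JSON merge patch semantics (RFC 7386), objects are merged key by key but an array in the patch replaces the target array as a whole. The sketch below is a minimal stand-alone merge-patch implementation, not KEDA code; the JSON documents and the helper names (`mergePatch`, `pausedStatus`, `applyStalePatch`) are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergePatch applies an RFC 7386 style merge patch to target.
// Objects are merged key by key; any other value, including an
// array, replaces the target value wholesale.
func mergePatch(target, patch any) any {
	patchObj, ok := patch.(map[string]any)
	if !ok {
		return patch
	}
	targetObj, ok := target.(map[string]any)
	if !ok {
		targetObj = map[string]any{}
	}
	for k, v := range patchObj {
		if v == nil {
			delete(targetObj, k) // null in a merge patch removes the key
			continue
		}
		targetObj[k] = mergePatch(targetObj[k], v)
	}
	return targetObj
}

// pausedStatus returns the status of the Paused condition in doc.
func pausedStatus(doc any) string {
	status := doc.(map[string]any)["status"].(map[string]any)
	for _, c := range status["conditions"].([]any) {
		cond := c.(map[string]any)
		if cond["type"] == "Paused" {
			return cond["status"].(string)
		}
	}
	return ""
}

// applyStalePatch models a writer that only wanted to update Active but
// sends the whole conditions array it read earlier (hypothetical data).
func applyStalePatch() string {
	server := `{"status":{"conditions":[{"type":"Paused","status":"False"},{"type":"Active","status":"Unknown"}]}}`
	stale := `{"status":{"conditions":[{"type":"Paused","status":"True"},{"type":"Active","status":"True"}]}}`

	var doc, patch any
	if err := json.Unmarshal([]byte(server), &doc); err != nil {
		panic(err)
	}
	if err := json.Unmarshal([]byte(stale), &patch); err != nil {
		panic(err)
	}
	return pausedStatus(mergePatch(doc, patch))
}

func main() {
	fmt.Println("Paused after stale patch:", applyStalePatch())
}
```

Because the array is replaced rather than merged, the stale Paused entry travels along with the Active update and wins.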

Fix: Write the Paused=False condition to the API server immediately in the unpause code
path, before proceeding to create the HPA and start the scale loop. This mirrors the pause
path which already writes Paused=True to the server before stopping the scale loop
(lines 260-264), eliminating the race between the reconciler and the scale loop goroutine.
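
The ordering argument can be sketched with a deterministic toy model. This is not the controller code: `server`, `scaleLoop`, `startLoop` and `writeBack` are hypothetical stand-ins, and the interleaving is fixed by hand rather than produced by real goroutines.

```go
package main

import "fmt"

// server stands in for the API server's copy of the Paused condition.
type server struct{ paused bool }

// scaleLoop models a loop that snapshots the condition when it starts
// and later writes that snapshot back with the rest of the conditions.
type scaleLoop struct{ snapshot bool }

func startLoop(s *server) *scaleLoop     { return &scaleLoop{snapshot: s.paused} }
func (l *scaleLoop) writeBack(s *server) { s.paused = l.snapshot }

// unpauseDeferred: the loop starts before the status write reaches the server.
func unpauseDeferred() bool {
	s := &server{paused: true}
	loop := startLoop(s) // reads stale Paused=True
	s.paused = false     // status update at the end of Reconcile
	loop.writeBack(s)    // loop overwrites with its stale snapshot
	return s.paused
}

// unpauseImmediate: the status write happens before the loop starts.
func unpauseImmediate() bool {
	s := &server{paused: true}
	s.paused = false     // written in the unpause path
	loop := startLoop(s) // reads Paused=False
	loop.writeBack(s)    // writes back the value it read
	return s.paused
}

func main() {
	fmt.Println("deferred write, Paused:", unpauseDeferred())
	fmt.Println("immediate write, Paused:", unpauseImmediate())
}
```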

Checklist

  • I have verified that my change is according to the deprecations & breaking changes policy
  • Tests have been added
  • Ensure make generate-scalers-schema has been run to update any outdated generated files
  • Changelog has been updated and is aligned with our changelog requirements, only when the change impacts end users
  • A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #6421

@rohansood10 rohansood10 requested a review from a team as a code owner March 19, 2026 21:00
@keda-automation keda-automation requested a review from a team March 19, 2026 21:00
@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@JorTurFer
Member

We have had performance issues because of status propagation. I'm not fully sure about writing the status immediately, as it's an extra request that KEDA makes. Nevertheless, the overwriting issue could be a problem too.
Can we somehow propagate the state without storing it in the status twice? Maybe with some extra flag or similar that we propagate but do not save in etcd?

@snyk-io

snyk-io Bot commented Mar 19, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- | --- |
|  | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |


@rohansood10
Contributor Author

Good call on the performance concern - I definitely want to avoid adding extra API requests if we can help it.

What if instead of writing the status immediately, we propagate the unpause state through an in-memory flag on the ScaledObject reconciler? Something like tracking a pendingUnpause state that gets checked before the scale loop starts, so the scale loop knows not to re-assert Paused=True from the cached status. Then the actual status write only happens once at the end of the reconcile cycle as it does today.

The core issue is the race between the reconciler's deferred status write and the scale loop reading stale cached status. If we can make the scale loop aware of the pending state change without hitting etcd, that should solve it cleanly.
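
A rough sketch of what such an in-memory flag could look like, purely to make the idea concrete. Everything here is hypothetical: the `pendingUnpause` type, its methods, and `effectivePaused` do not exist in KEDA, and the key is assumed to be a namespace/name string.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingUnpause is a hypothetical tracker shared by the reconciler and
// the scale loop. It lives only in memory and is never written to etcd.
type pendingUnpause struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

func newPendingUnpause() *pendingUnpause {
	return &pendingUnpause{keys: map[string]struct{}{}}
}

// Mark records that an unpause is in flight for the given object key.
func (p *pendingUnpause) Mark(key string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.keys[key] = struct{}{}
}

// Clear is meant to be called once the final status write has succeeded.
func (p *pendingUnpause) Clear(key string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.keys, key)
}

// IsPending reports whether an unpause is still in flight for key.
func (p *pendingUnpause) IsPending(key string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	_, ok := p.keys[key]
	return ok
}

// effectivePaused is what the scale loop could consult instead of
// trusting the cached status on its own.
func effectivePaused(p *pendingUnpause, key string, cachedPaused bool) bool {
	if p.IsPending(key) {
		return false // the reconciler has already decided to unpause
	}
	return cachedPaused
}

func main() {
	p := newPendingUnpause()
	key := "default/my-scaledobject" // made-up object key

	p.Mark(key)
	fmt.Println("during reconcile:", effectivePaused(p, key, true))

	p.Clear(key)
	fmt.Println("after status write:", effectivePaused(p, key, false))
}
```

One open design question with this shape is lifetime: the flag would need to be cleared on failed reconciles and on controller restart, since it is not persisted anywhere.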

I can rework the approach if that direction sounds right - let me know what you think.

When a ScaledObject was created with the paused annotation and then
unpaused by removing the annotation, the Paused condition would flip
back to True and the Active condition would stay stuck on Unknown.

Root cause: The unpause code path only set Paused=False in the local
conditions variable, deferring the server-side status update to the
end of the Reconcile function. However, the scale loop (started via
requestScaleLoop later in the same reconcile) could read the stale
Paused=True condition from the API server before the final status
update, and overwrite it back via a JSON merge patch that replaces
the entire conditions array.

Fix: Write the Paused=False condition to the API server immediately
in the unpause code path, before proceeding to create the HPA and
start the scale loop. This mirrors the pause path which already
writes Paused=True to the server before stopping the scale loop,
and eliminates the race condition between the reconciler and the
scale loop goroutine.

Fixes kedacore#6421

Signed-off-by: Rohan Sood <56945243+rohansood10@users.noreply.github.com>
@rohansood10 rohansood10 force-pushed the fix/6421-paused-annotation-initial-state branch from 859b432 to fc0e90b Compare March 20, 2026 18:52
@rickbrouwer
Member

I'd like to propose an alternative that may fix this. What if we patch individual conditions by type instead of replacing the entire conditions array? That makes it structurally impossible for the scale loop to overwrite Paused, with no early write needed in the reconciler and no extra etcd round trip per unpause.
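
As a minimal in-memory sketch of the idea (not a proposal for the actual patch format): the writer sends only the condition it owns, and the update is matched by type against the current list. `Condition` and `patchCondition` are illustrative names, not KEDA types.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type   string
	Status string
}

// patchCondition returns a copy of current in which only the entry with
// the same Type as update is replaced. Other conditions are untouched.
// If no entry with that Type exists, update is appended.
func patchCondition(current []Condition, update Condition) []Condition {
	out := make([]Condition, len(current))
	copy(out, current)
	for i := range out {
		if out[i].Type == update.Type {
			out[i] = update
			return out
		}
	}
	return append(out, update)
}

func main() {
	// Current server-side conditions after the reconciler has unpaused.
	server := []Condition{
		{Type: "Paused", Status: "False"},
		{Type: "Active", Status: "Unknown"},
	}

	// The scale loop only reports Active; it never carries a Paused value,
	// so a stale view of Paused cannot leak into the write.
	result := patchCondition(server, Condition{Type: "Active", Status: "True"})
	fmt.Println(result)
}
```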

@rohansood10
Contributor Author

That's a much cleaner approach. Patching conditions individually avoids the race entirely without any extra writes. I'll rework the PR to use per-condition patching instead of the early status write.


Development

Successfully merging this pull request may close these issues.

Creating a new ScaledObject with the “paused” annotation is not working

3 participants