fix(scaling): guard GetCurrentReplicas against nil ScaleTargetGVKR #7661

Open

ggarb wants to merge 1 commit into kedacore:main from ggarb:fix-getcurrentreplicas-nil-gvkr

Conversation


@ggarb ggarb commented Apr 17, 2026

Problem

At ~10k ScaledObjects being created at 10/s, the KEDA operator panics:

panic: runtime error: invalid memory address or nil pointer dereference
pkg/scaling/resolver/scale_resolvers.go:731 GetCurrentReplicas
pkg/scaling/executor/scale_scaledobjects.go:44 RequestScale
pkg/scaling/scale_handler.go:282 checkScalers
pkg/scaling/scale_handler.go:199 startScaleLoop

Root cause is the same informer-cache race described in #4389 / tracked
in #4955: scaledObject.Status.ScaleTargetGVKR can be nil when the
scale loop first invokes GetCurrentReplicas. The existing code
dereferences .Group / .Kind on the nil pointer, crashing the whole
operator process and taking down every other scale loop with it.

ResolveScaleTargetPodSpec already defends against this race
(scale_resolvers.go L103-L119). GetCurrentReplicas does not.
This PR applies the same pattern.

Fix

If Status.ScaleTargetGVKR is nil on entry, re-fetch the ScaledObject
via the client. If it is still nil after re-fetch, return a descriptive
error instead of panicking.
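A minimal sketch of the guard described above, using simplified stand-in types rather than the real `kedav1alpha1.ScaledObject` and controller-runtime client (names and shapes here are illustrative assumptions, not KEDA's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the real KEDA types (illustration only).
type GroupVersionKindResource struct {
	Group string
	Kind  string
}

type ScaledObject struct {
	Name string
	// ScaleTargetGVKR can be nil right after creation, before the
	// status subresource is populated (the informer-cache race).
	ScaleTargetGVKR *GroupVersionKindResource
}

// ensureScaleTargetGVKR applies the pattern from the PR: if the cached
// object's GVKR is nil, re-fetch via the client; if it is still nil
// after the re-fetch, return a descriptive error instead of letting
// the caller dereference a nil pointer.
func ensureScaleTargetGVKR(so *ScaledObject, fetch func(name string) (*ScaledObject, error)) (*ScaledObject, error) {
	if so.ScaleTargetGVKR != nil {
		return so, nil
	}
	fresh, err := fetch(so.Name)
	if err != nil {
		return nil, fmt.Errorf("failed to re-fetch ScaledObject %q: %w", so.Name, err)
	}
	if fresh.ScaleTargetGVKR == nil {
		return nil, errors.New("scaleTargetGVKR is still nil after re-fetch, probably invalid ScaledObject cache")
	}
	return fresh, nil
}

func main() {
	// Stale cache entry: status not yet populated.
	stale := &ScaledObject{Name: "demo"}

	// Re-fetch returns the fully populated object.
	fetch := func(string) (*ScaledObject, error) {
		return &ScaledObject{
			Name:            "demo",
			ScaleTargetGVKR: &GroupVersionKindResource{Group: "apps", Kind: "Deployment"},
		}, nil
	}

	so, err := ensureScaleTargetGVKR(stale, fetch)
	if err != nil {
		panic(err)
	}
	fmt.Println(so.ScaleTargetGVKR.Kind) // Deployment
}
```

In the real code the `fetch` closure would be a `client.Get` against the API server, bypassing the stale informer cache, as `ResolveScaleTargetPodSpec` already does.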

Repro context

Observed during a 10k-ScaledObject KWOK load test at Netflix, after
raising --kube-api-qps/burst to eliminate client-go throttling
(previous 1k bottleneck). Once client-side throttling was gone, fast
ScaledObject creation widened the cache-race window enough that the
nil-pointer panic fired reliably before the 750th object was created.

Tests

  • Added TestGetCurrentReplicas_NilScaleTargetGVKR with three cases:
    • nil on input, re-fetch succeeds with populated GVKR → returns
      correct replica count (Deployment path)
    • nil on input, re-fetch also returns nil → returns a descriptive
      "probably invalid ScaledObject cache" error
    • nil on input, re-fetch fails (SO missing) → returns fetch error
  • All existing tests in ./pkg/scaling/... pass.

Fixes / refs

Refs: #4389, #4955, #6176

@ggarb ggarb requested a review from a team as a code owner April 17, 2026 17:32
@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team April 17, 2026 17:32

snyk-io Bot commented Apr 17, 2026

Snyk checks have passed. No issues have been found so far.

Status  Scan Engine           Critical  High  Medium  Low  Total
Open    Open Source Security  0         0     0       0    0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@ggarb ggarb force-pushed the fix-getcurrentreplicas-nil-gvkr branch from 3d1ea68 to 4414c85 Compare April 17, 2026 17:46
@ggarb ggarb force-pushed the fix-getcurrentreplicas-nil-gvkr branch 2 times, most recently from f55e568 to db90396 Compare April 23, 2026 15:11
Member

@JorTurFer JorTurFer left a comment


Thanks for the fix!
As this is now starting to appear in multiple places, does it make sense to extract that code into a function that we can reuse and that implicitly tracks all the usages of this hack?

When the informer cache races with ScaledObject creation,
scaledObject.Status.ScaleTargetGVKR can be nil at the point the scale
loop invokes GetCurrentReplicas. The current code then dereferences
.Group / .Kind on a nil pointer and panics, taking down the operator.

This applies the same defensive pattern already used in
ResolveScaleTargetPodSpec: re-fetch the ScaledObject via the client
when Status.ScaleTargetGVKR is nil, and if it is still nil after
re-fetch, return a descriptive error instead of panicking.

Observed in a 10k-ScaledObject KWOK load test where kube-burner
created ScaledObjects at 10/s; the cache-race window opened wide
enough that the panic fired reliably within the first 750 objects.

Refs: kedacore#4389, kedacore#4955, kedacore#6176

Signed-off-by: Greg Garber <ggarb@netflix.com>
@ggarb ggarb force-pushed the fix-getcurrentreplicas-nil-gvkr branch from db90396 to 07cd22a Compare May 4, 2026 17:20
@keda-automation keda-automation requested a review from a team May 4, 2026 17:21
@rickbrouwer rickbrouwer added Awaiting/2nd-approval This PR needs one more approval review merge-conflict This PR has a merge conflict labels May 7, 2026