fix(scaling): guard GetCurrentReplicas against nil ScaleTargetGVKR #7661
ggarb wants to merge 1 commit into kedacore:main
Conversation
Thank you for your contribution! 🙏 Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected. While you are waiting, make sure to:
Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient. Learn more about our contribution guide.
✅ Snyk checks have passed. No issues have been found so far.
JorTurFer left a comment
Thanks for the fix!
As this is now starting to appear in multiple places, does it make sense to extract that code into a function that we can reuse and that implicitly tracks all the usages of this hack?
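Something along these lines could work as the shared helper. This is a minimal sketch only: the name ensureScaleTargetGVKR, its placement in the resolver package, and the exact signature are assumptions, not code from this PR.

```go
// Hypothetical shared helper (name, package, and signature are illustrative):
// re-fetch the ScaledObject when the cached copy has no Status.ScaleTargetGVKR,
// and return an error if it is still missing afterwards.
package resolver

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1"
)

func ensureScaleTargetGVKR(ctx context.Context, kubeClient client.Client, scaledObject *kedav1alpha1.ScaledObject) error {
	if scaledObject.Status.ScaleTargetGVKR != nil {
		return nil
	}
	// The copy handed to the scale loop may predate the status update; read it again.
	key := types.NamespacedName{Namespace: scaledObject.Namespace, Name: scaledObject.Name}
	if err := kubeClient.Get(ctx, key, scaledObject); err != nil {
		return fmt.Errorf("error getting ScaledObject %s: %w", key, err)
	}
	if scaledObject.Status.ScaleTargetGVKR == nil {
		// Still nil after re-fetch: fail with a descriptive error instead of panicking later.
		return fmt.Errorf("scaledObject %s has no Status.ScaleTargetGVKR, probably invalid ScaledObject in cache", key)
	}
	return nil
}
```

Both GetCurrentReplicas and ResolveScaleTargetPodSpec could then call such a helper before touching .Group / .Kind, which would also keep every usage of the re-fetch hack greppable in one place.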
When the informer cache races with ScaledObject creation, scaledObject.Status.ScaleTargetGVKR can be nil at the point the scale loop invokes GetCurrentReplicas. The current code then dereferences .Group / .Kind on a nil pointer and panics, taking down the operator.

This applies the same defensive pattern already used in ResolveScaleTargetPodSpec: re-fetch the ScaledObject via the client when Status.ScaleTargetGVKR is nil, and if it is still nil after re-fetch, return a descriptive error instead of panicking.

Observed in a 10k-ScaledObject KWOK load test where kube-burner created ScaledObjects at 10/s; the cache-race window opened wide enough that the panic fired reliably within the first 750 objects.

Refs: kedacore#4389, kedacore#4955, kedacore#6176

Signed-off-by: Greg Garber <ggarb@netflix.com>
Problem
With ~10k ScaledObjects being created at a rate of 10/s, the KEDA operator panics:
panic: runtime error: invalid memory address or nil pointer dereference
pkg/scaling/resolver/scale_resolvers.go:731 GetCurrentReplicas
pkg/scaling/executor/scale_scaledobjects.go:44 RequestScale
pkg/scaling/scale_handler.go:282 checkScalers
pkg/scaling/scale_handler.go:199 startScaleLoop
Root cause is the same informer-cache race described in #4389 / tracked in #4955: scaledObject.Status.ScaleTargetGVKR can be nil when the scale loop first invokes GetCurrentReplicas. The existing code dereferences .Group/.Kind on the nil pointer, crashing the whole operator process and taking down every other scale loop with it.
ResolveScaleTargetPodSpec already defends against this race (scale_resolvers.go L103-L119); GetCurrentReplicas does not. This PR applies the same pattern.
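For illustration only, a tiny stand-alone program (with stand-in types, not KEDA's actual code) that reproduces the shape of the failure:

```go
package main

import "fmt"

// Stand-in types, purely illustrative of the failure mode.
type GroupVersionKindResource struct{ Group, Kind string }

type scaledObjectStatus struct {
	ScaleTargetGVKR *GroupVersionKindResource // nil until the status is populated
}

func main() {
	status := scaledObjectStatus{} // mimics the informer-cache race: GVKR not yet set
	// Mirrors the unguarded access pattern: dereferencing the nil pointer panics,
	// and in the operator that panic takes down every scale loop in the process.
	fmt.Println(status.ScaleTargetGVKR.Group, status.ScaleTargetGVKR.Kind)
}
```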
Fix
If Status.ScaleTargetGVKR is nil on entry, re-fetch the ScaledObject via the client. If it is still nil after re-fetch, return a descriptive error instead of panicking.
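A minimal sketch of that guard at the entry of GetCurrentReplicas, assuming a controller-runtime client is in scope; the variable names and the numeric return value are illustrative, not the exact diff:

```go
// Sketch only: guard at the entry of GetCurrentReplicas against a nil
// Status.ScaleTargetGVKR caused by the informer-cache race.
if scaledObject.Status.ScaleTargetGVKR == nil {
	// Re-fetch: the copy handed to the scale loop may predate the status update.
	key := client.ObjectKey{Namespace: scaledObject.Namespace, Name: scaledObject.Name}
	if err := kubeClient.Get(ctx, key, scaledObject); err != nil {
		return 0, fmt.Errorf("error getting ScaledObject %s: %w", key, err)
	}
	if scaledObject.Status.ScaleTargetGVKR == nil {
		// Still nil after re-fetch: return a descriptive error instead of panicking.
		return 0, fmt.Errorf("scaledObject %s has no Status.ScaleTargetGVKR, probably invalid ScaledObject in cache", key)
	}
}
// ... existing .Group / .Kind handling continues below ...
```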
Repro context
Observed during a 10k-ScaledObject KWOK load test at Netflix, after
raising --kube-api-qps/burst to eliminate client-go throttling
(previous 1k bottleneck). Once client-side throttling was gone, fast
ScaledObject creation widened the cache-race window enough that the
nil-pointer panic fired reliably before the 750th object was created.
Tests
TestGetCurrentReplicas_NilScaleTargetGVKR with three cases:
- correct replica count (Deployment path)
- "probably invalid ScaledObject cache" error
All tests under ./pkg/scaling/... pass (a hedged sketch of the error case is included at the end of this description).
Fixes / refs
Refs #4389, #4955, #6176 (this PR guards the panic site, it doesn't close the underlying race).
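For completeness, a sketch of how the "probably invalid ScaledObject cache" case could be exercised; it targets the hypothetical ensureScaleTargetGVKR helper sketched earlier rather than the PR's actual test, whose setup and signatures may differ:

```go
package resolver

import (
	"context"
	"strings"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"

	kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1"
)

func TestEnsureScaleTargetGVKR_NilAfterRefetch(t *testing.T) {
	scheme := runtime.NewScheme()
	_ = kedav1alpha1.AddToScheme(scheme)

	// ScaledObject stored without Status.ScaleTargetGVKR, mimicking the cache race.
	so := &kedav1alpha1.ScaledObject{
		ObjectMeta: metav1.ObjectMeta{Name: "so", Namespace: "default"},
	}
	kubeClient := fake.NewClientBuilder().WithScheme(scheme).WithObjects(so).Build()

	err := ensureScaleTargetGVKR(context.TODO(), kubeClient, so.DeepCopy())
	if err == nil || !strings.Contains(err.Error(), "probably invalid ScaledObject") {
		t.Fatalf("expected a 'probably invalid ScaledObject' error, got: %v", err)
	}
}
```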