
test: Accurate scaling strategy e2e test and functional tests for replica count defaults#7607

Open
jansworld wants to merge 10 commits into kedacore:main from jansworld:new-scaledjob-tests

Conversation


@jansworld jansworld commented Apr 6, 2026

Added a test for the accurate scaling strategy (within its own sub-directory so as to not modify the package name on the eager scaling strategy test). This test verifies both cases of the accurate scaling strategy. That is, when (maxScale + runningJobCount) <= maxReplicaCount, the number of new jobs created is maxScale - pendingJobCount (case 1). When (maxScale + runningJobCount) > maxReplicaCount, the number of new jobs created is maxReplicaCount - runningJobCount (case 2). We use a maxReplicaCount of 10.
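
For reference, the two cases reduce to roughly the following decision logic (a sketch of the behavior described above, not a verbatim copy of KEDA's executor code):

```go
// Sketch of the two cases described above; names follow the PR
// description, not necessarily KEDA's executor source.
func accurateScale(maxScale, runningJobCount, pendingJobCount, maxReplicaCount int64) int64 {
	if maxScale+runningJobCount > maxReplicaCount {
		// Case 2: clamp so the total of running jobs never exceeds maxReplicaCount.
		return maxReplicaCount - runningJobCount
	}
	// Case 1: room for everything; subtract jobs still pending.
	return maxScale - pendingJobCount
}
```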

It's worth noting that since we want jobs that run long enough to stick around while case 2 is exercised, the decision was made to simply use sleeper pods instead of an actual message processor. To simulate message consumption, the queue is cleared, as this mimics the behavior of Azure Storage Queue's length property when a processor is consuming a message (locked messages are not reported in the queue length). If this is a problem, I can put some more time in and actually create a message processor that behaves similarly to the sleeper pod.

Also note that these tests are performed with the pending job count as effectively 0.

Case 1: Send 4 messages into the queue. Wait for 4 jobs to be running. Clear the queue to simulate message consumption. Wait for all jobs to succeed.

Case 2: Send 4 messages into the queue. Wait for 4 jobs to be running. Clear the queue to simulate message consumption. Send 8 more messages into the queue. Assert that running jobs is clamped by maxReplicaCount (put differently, wait for 10 jobs to be running). This verifies the accurate strategy. Clean up by clearing the queue & waiting for all jobs to succeed.
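
Sketched as test code, case 2 looks roughly like this (enqueueMessages and WaitForRunningJobCount appear in this PR's diff; clearQueue is an illustrative stand-in for the queue-clearing step):

```go
// Case 2 flow, using helpers that appear in this PR's diff; clearQueue
// is an illustrative stand-in for the queue-clearing step.
enqueueMessages(ctx, t, client, 4)
assert.True(t, WaitForRunningJobCount(t, kc, scaledJobName, testNamespace, 4, iterationCount, 1))

clearQueue(ctx, t, client) // simulate consumption: queue length drops to 0

enqueueMessages(ctx, t, client, 8)
// maxScale (8) + runningJobCount (4) > maxReplicaCount (10), so only
// maxReplicaCount - runningJobCount = 6 new jobs should be created.
assert.True(t, WaitForRunningJobCount(t, kc, scaledJobName, testNamespace, 10, iterationCount, 1))
```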

Also added 3 functional tests confirming minReplicaCount and maxReplicaCount behavior in ScaledJobs. The first 2 tests verify default behavior, checking that the 2 fields evaluate to nil when omitted in the ScaledJob spec. The third test verifies that when minReplicaCount > maxReplicaCount, minReplicaCount is set to maxReplicaCount.
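
In outline, the three checks amount to something like this (a hypothetical test body; KEDA's ScaledJob spec fields for the replica counts are pointers, so omitting them leaves them nil):

```go
// Hypothetical shape of the three checks; field and method names follow
// KEDA's v1alpha1 ScaledJob API as described in this PR.
var sj kedav1alpha1.ScaledJob // spec omits both replica-count fields

// Defaults: omitted pointer fields evaluate to nil.
assert.Nil(t, sj.Spec.MinReplicaCount)
assert.Nil(t, sj.Spec.MaxReplicaCount)

// min > max: the effective minimum is clamped to the maximum.
minVal, maxVal := int32(12), int32(10)
sj.Spec.MinReplicaCount, sj.Spec.MaxReplicaCount = &minVal, &maxVal
assert.EqualValues(t, 10, sj.MinReplicaCount())
```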

Potential additional e2e tests: default & custom strategies

Checklist

  • When introducing a new scaler, I agree with the scaling governance policy
  • I have verified that my change is according to the deprecations & breaking changes policy
  • Tests have been added (if applicable)
  • Ensure make generate-scalers-schema has been run to update any outdated generated files
  • Changelog has been updated and is aligned with our changelog requirements, only when the change impacts end users
  • A PR is opened to update our Helm chart (repo) (if applicable, i.e. when deployment manifests are modified)
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #3661

Relates to #

@jansworld jansworld requested a review from a team as a code owner April 6, 2026 03:03

snyk-io Bot commented Apr 6, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Passed | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.


github-actions Bot commented Apr 6, 2026

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • Verify that GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team April 6, 2026 03:03

@rickbrouwer rickbrouwer left a comment


Thanks for adding this test! Happy with it 🙂 My first feedback.


```go
// Queue up 8 more messages to trigger the cap condition
enqueueMessages(ctx, t, client, 8)
assert.True(t, WaitForRunningJobCount(t, kc, scaledJobName, testNamespace, 10, iterationCount, 1),
```
rickbrouwer (Member):

I think there's a potential race condition in the cap condition test. After WaitForRunningJobCount returns, the pods may still be in Pending state, which affects KEDA's scaling decision when you put the 8 messages. What do you think, can we wait until the pods are actually Running?
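
The concern, sketched with standard client-go calls (the assertion values are illustrative):

```go
// Illustrative sketch of the suggested check: count pods that are
// actually Running rather than trusting the Job's Active status.
pods, err := kc.CoreV1().Pods(testNamespace).List(ctx, metav1.ListOptions{})
require.NoError(t, err)
running := 0
for _, pod := range pods.Items {
	if pod.Status.Phase == corev1.PodRunning {
		running++
	}
}
assert.Equal(t, 4, running) // only then enqueue the next 8 messages
```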

@jansworld jansworld (Author) commented Apr 6, 2026

@rickbrouwer Good point. I wasn't thinking about the underlying pod at the time. I should be able to fix this in the helper function. I'll probably rename the helper function to WaitForRunningPodCount instead of WaitForRunningJobCount due to the distinction you pointed out.

rickbrouwer (Member):

Yeah. I haven't looked properly myself yet (something about a lot of PRs to review lately), but could you check if WaitForAllPodRunningInNamespace might be used? Again, I'm not sure, but it's worth a check.

jansworld (Author):

I thought about using that, but felt it didn't give enough assurance that only a certain number of jobs were created. Alternatively, would WaitForScaledJobCount followed by WaitForAllPodRunningInNamespace be preferable to WaitForRunningPodCount?

rickbrouwer (Member):

I think WaitForAllPodRunningInNamespace will work at first glance. You can adjust it and then test it locally to check. If you still find that difficult to test locally, you can also adjust it and I can start an e2e test here.

jansworld (Author):

Sounds good. I can work that in after a WaitForScaledJobCount call to verify job count. I'll see about testing it locally as well. I'll just need to create a test Storage Account & Queue so the connection string actually goes somewhere. I should be able to get to that after work.

jansworld (Author):

@rickbrouwer I wasn't able to test things locally, but I didn't want to hold this up, so I pushed changes for the time being. If you could start the e2e test I would be very appreciative. I'm having some issues getting the ScaledJob to talk to Azurite at the moment when testing locally. This would probably work better with a real storage account, but I was trying to be frugal and not pay for one.

rickbrouwer (Member):

@jansworld started here: #7607 (comment)

Comment thread on tests/helper/helper.go (outdated)
@keda-automation keda-automation requested a review from a team April 6, 2026 15:26
@rickbrouwer rickbrouwer (Member) commented

One test is now nicely presented in a folder. What do you think, would it be nice to give the other one its own folder as well?

[screenshot: test folder layout]

Member

zroubalik commented Apr 8, 2026

/run-e2e internal
Update: You can check the progress here

Member

zroubalik commented Apr 9, 2026

/run-e2e internal
Update: You can check the progress here

@jansworld jansworld force-pushed the new-scaledjob-tests branch from 474993c to 9e8c90b on April 10, 2026 01:44
Added tests to confirm that minReplicaCount and maxReplicaCount are nil when not specified in the ScaledJob. Added a test to confirm that when minReplicaCount is greater than maxReplicaCount, minReplicaCount is set to maxReplicaCount

Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
Added test to verify accurate scaling strategy behavior. The test covers 2 cases. The first case checks that when maxScale + runningJobs <= maxReplicaCount, jobs created = maxScale - pendingJobs. Pending jobs here is 0. The second case checks that when maxScale + runningJobs > maxReplicaCount, jobs created is clamped by maxReplicaCount (maxReplicaCount - runningJobCount).

Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
Added an accurate scaling strategy subdirectory test to avoid package naming conflicts with the eager test without modifying that go file.

Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
Deleted the test after reading the code for MinReplicaCount() and realizing that nothing actually gets set. It simply returns maxReplicaCount in the case that the min is greater than the max (sketched below). Something like this belongs in pkg/scaling/executor/scaled_job_test.go if anything.

Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
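
Paraphrased, the accessor described in the commit above behaves roughly like this (a sketch of the described behavior, not the verbatim KEDA source):

```go
// Paraphrase of the described getter behavior; not the verbatim KEDA
// source. The 0 default for an unset minimum is an assumption here.
func (s *ScaledJob) MinReplicaCount() int64 {
	if s.Spec.MinReplicaCount == nil {
		return 0
	}
	if min := int64(*s.Spec.MinReplicaCount); min <= s.MaxReplicaCount() {
		return min
	}
	// min > max: nothing is mutated; the getter just returns the max.
	return s.MaxReplicaCount()
}
```
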
…changes

Renamed WaitForRunningJobCount to WaitForRunningPodCount to address the fact that despite a job showing as Running, its underlying pod could still be in a Pending state. So, we instead wait for the pods to be Running. The other change made was to have the helper return false if the number of pods in a Running state differs from the target, instead of if the count was greater than or equal to the target. This will help ensure test accuracy (see the sketch after this commit).

Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
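
A minimal sketch of the renamed helper's loop, under the assumptions noted in its comments:

```go
// Simplified sketch of the renamed helper; the real signature in
// tests/helper/helper.go may differ. Note the exact-match comparison.
func WaitForRunningPodCount(t *testing.T, kc *kubernetes.Clientset, namespace string, target, iterations, intervalSeconds int) bool {
	for i := 0; i < iterations; i++ {
		pods, err := kc.CoreV1().Pods(namespace).List(context.Background(), metav1.ListOptions{})
		require.NoError(t, err)
		running := 0
		for _, pod := range pods.Items {
			if pod.Status.Phase == corev1.PodRunning {
				running++
			}
		}
		if running == target { // exact match, not >=, per the change above
			return true
		}
		time.Sleep(time.Duration(intervalSeconds) * time.Second)
	}
	return false
}
```
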
Fixed a mistake in the WaitForRunningPodCount helper function: pods were correctly targeted instead of Jobs, but the check for whether a pod was in a Running state used the wrong types.

Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
Removed WaitForRunningPodCount and opted to go with WaitForScaledJobCount followed by WaitForAllPodRunningInNamespace since that keeps the changes a bit more concise and works for what this test needs.

Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
…point of assertion for testing the accurate strategy

I hastily assumed that a combination of WaitForScaledJobCount and WaitForAllPodRunningInNamespace would satisfy what is needed for the test. However, that approach comes with a major flaw: unless you delete the jobs, those pods count towards the condition in WaitForAllPodRunningInNamespace. Thus, a much simpler way to test the strategy is to wait for the correct number of pods to be running. A running pod in the given namespace implies a corresponding Job exists, so we test the 2 strategy cases with pod count. This also passes local tests.

Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
@jansworld jansworld force-pushed the new-scaledjob-tests branch from 9e8c90b to 6283b40 on April 10, 2026 01:58

rickbrouwer commented Apr 10, 2026

/run-e2e internal
Update: You can check the progress here

Added WaitForScaledJobCount back into the checks since it allows for quicker deletion of the queue messages instead of having to wait for the pods to be running. Additionally, increased the job sleep time to 120 seconds in order to account for any slowness in dispatching jobs (illustrated below). In the e2e tests that were kicked off, it looked like some jobs finished before the queue was cleared, causing the running pod count to differ from the expected value. This issue looks to be fixed with the increased sleep time in the pod.

Issue kedacore#3661

Signed-off-by: jansworld <navon.josh@gmail.com>
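
In the job template, the change described above amounts to a longer sleep in the container command (an illustrative fragment; the real manifest lives in the test file):

```go
// Illustrative fragment of the sleeper job template (container image and
// structure are assumptions); the longer sleep keeps pods Running while
// the queue is cleared and refilled.
const sleeperContainerTemplate = `
        containers:
          - name: sleeper
            image: busybox
            command: ["sh", "-c", "sleep 120"]
`
```
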
@jansworld jansworld (Author) commented

@rickbrouwer are you able to kick off the e2e tests when you have a moment?


rickbrouwer commented Apr 17, 2026

/run-e2e internal
Update: You can check the progress here

@rickbrouwer rickbrouwer added the Awaiting/2nd-approval label (This PR needs one more approval review) on Apr 18, 2026

Labels

Awaiting/2nd-approval (This PR needs one more approval review)


Development

Successfully merging this pull request may close these issues.

Extend e2e and unit test coverage for ScaledJob

3 participants