test: Accurate scaling strategy e2e test and functional tests for replica count defaults #7607
jansworld wants to merge 10 commits into kedacore:main
Conversation
✅ Snyk checks have passed. No issues have been found so far.
Thank you for your contribution! 🙏 Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected. While you are waiting, make sure to:
Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient. Learn more about our contribution guide.
rickbrouwer left a comment
Thanks for adding this test! Happy with it 🙂 My first feedback.
```go
// Queue up 8 more messages to trigger the cap condition
enqueueMessages(ctx, t, client, 8)
assert.True(t, WaitForRunningJobCount(t, kc, scaledJobName, testNamespace, 10, iterationCount, 1),
```
I think there's a potential race condition in the cap condition test. After WaitForRunningJobCount returns, the pods may still be in a Pending state, which affects KEDA's scaling decision when you enqueue the 8 messages. What do you think? Can we wait until the pods are actually Running?
@rickbrouwer Good point. I wasn't thinking about the underlying pod at the time. I should be able to fix this in the helper function. I'll probably rename the helper function to WaitForRunningPodCount instead of WaitForRunningJobCount due to the distinction you pointed out.
Yeah. I haven't looked properly myself yet (something about a lot of PRs to review lately), but could you check if WaitForAllPodRunningInNamespace might be used? Again, I'm not sure, but it's worth a check.
I thought about using that, but felt it didn't give enough assurance that only a certain number of jobs were created. Alternatively, would WaitForScaledJobCount followed by WaitForAllPodRunningInNamespace be preferable to WaitForRunningPodCount?
I think WaitForAllPodRunningInNamespace will work at first glance. You can adjust it and then test it locally to check. If you still find that difficult to test locally, you can also adjust it and I can start an e2e test here.
Sounds good. I can work that in after a WaitForScaledJobCount call to verify job count. I'll see about testing it locally as well. I'll just need to create a test Storage Account & Queue so the connection string actually goes somewhere. I should be able to get to that after work.
@rickbrouwer I wasn't able to test things locally, but I didn't want to hold this up, so I pushed changes for the time being. If you could start the e2e test I would be very appreciative. I'm having some issues getting the ScaledJob to talk to Azurite at the moment when testing locally. This would probably work better with a real storage account, but I was trying to be frugal and not pay for one.
/run-e2e internal
/run-e2e internal
Force-pushed from 474993c to 9e8c90b (Compare)
Added tests to confirm that minReplicaCount and maxReplicaCount are nil when not specified in the ScaledJob. Added a test to confirm that when minReplicaCount is greater than maxReplicaCount, minReplicaCount is set to maxReplicaCount Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
Added test to verify accurate scaling strategy behavior. The test covers 2 cases. The first case checks that when maxScale + runningJobs <= maxReplicaCount, jobs created = maxScale - pendingJobs. Pending jobs here is 0. The second case checks that when maxScale + runningJobs > maxReplicaCount, jobs created is clamped by maxReplicaCount (maxReplicaCount - runningJobCount). Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
Added an accurate scaling strategy subdirectory test to avoid package naming conflicts with the eager test without modifying that go file. Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
Deleted the test after reading the code for MinReplicaCount() and realizing that nothing actually gets set. It simply returns maxReplicaCount in the case that the min is greater than the max. Something like this belongs in pkg/scaling/executor/scaled_job_test.go if anything. Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
…changes Renamed WaitForRunningJobCount to WaitForRunningPodCount to reflect the fact that even when a Job reports Running, its underlying pod could still be in a Pending state; the helper now waits for the pods themselves to be Running (a rough sketch follows the commit list below). The helper also now returns false if the number of Running pods differs from the target, rather than succeeding whenever the count was greater than or equal to the target. This helps ensure test accuracy. Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
Fixed a mistake in the WaitForRunningPodCount helper function: pods were targeted instead of Jobs, but the check for whether a pod was in a Running state used the wrong types. Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
Removed WaitForRunningPodCount and opted to go with WaitForScaledJobCount followed by WaitForAllPodRunningInNamespace since that keeps the changes a bit more concise and works for what this test needs. Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
…point of assertion for testing the accurate strategy I hastily assumed that a combination of WaitForScaledJobCount and WaitForAllPodRunningInNamespace would satisfy what is needed for the test. However, that approach comes with a major flaw: unless you delete the jobs, those pods count towards the condition in WaitForAllPodRunningInNamespace. Thus, a much simpler way to test the strategy is to wait for the correct number of pods to be running. A running pod in the given namespace implies a corresponding Job exists, so we test the 2 strategy cases with pod count. This also passes local tests Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
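For context, a pod-count wait of the kind described in these commits might look roughly like this (a hypothetical sketch; the real e2e helpers live in tests/helper and their signatures may differ):

```go
package helper

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// WaitForRunningPodCount polls until exactly `target` pods in the namespace
// report phase Running, returning false once the iteration budget runs out.
// Requiring an exact match (rather than >=) catches over-scaling as well.
func WaitForRunningPodCount(t *testing.T, kc *kubernetes.Clientset, namespace string,
	target, iterations, intervalSeconds int) bool {
	for i := 0; i < iterations; i++ {
		pods, err := kc.CoreV1().Pods(namespace).List(context.Background(), metav1.ListOptions{})
		if err == nil {
			running := 0
			for _, pod := range pods.Items {
				if pod.Status.Phase == corev1.PodRunning {
					running++
				}
			}
			t.Logf("want %d Running pods, found %d", target, running)
			if running == target {
				return true
			}
		}
		time.Sleep(time.Duration(intervalSeconds) * time.Second)
	}
	return false
}
```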
Force-pushed from 9e8c90b to 6283b40 (Compare)
/run-e2e internal
Added WaitForScaledJobCount back into the checks since it allows for quicker deletion of the queue messages instead of having to wait for the pods to be running. Additionally, increased the job sleep time to 120 seconds in order to account for any slowness in dispatching jobs. In e2e tests that were kicked off, it looked like some jobs finished before the queue was cleared, causing the running pod count to differ from the expected value. This issue looks to be fixed with the increased sleep time in the pod. Issue kedacore#3661 Signed-off-by: jansworld <navon.josh@gmail.com>
@rickbrouwer are you able to kick off the e2e tests when you have a moment?
/run-e2e internal

Added a test for the accurate scaling strategy (within its own sub-directory so as not to modify the package name of the eager scaling strategy test). This test verifies both cases of the accurate scaling strategy: when (maxScale + runningJobCount) <= maxReplicaCount, the number of new jobs created is maxScale - pendingJobCount (case 1); when (maxScale + runningJobCount) > maxReplicaCount, the number of new jobs created is maxReplicaCount - runningJobCount (case 2). We use a maxReplicaCount of 10.
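For reference, the clamping behavior under test boils down to something like the following (a paraphrase of the two cases above, not the actual executor code; the real logic lives in pkg/scaling/executor):

```go
package main

import "fmt"

// Sketch of the accurate strategy's job-count calculation, paraphrased from
// the two cases described above; the real implementation in
// pkg/scaling/executor may differ in detail.
func accurateJobsToCreate(maxScale, runningJobCount, pendingJobCount, maxReplicaCount int64) int64 {
	if maxScale+runningJobCount > maxReplicaCount {
		// Case 2: creating maxScale new jobs would exceed the cap,
		// so only fill the remaining headroom.
		return maxReplicaCount - runningJobCount
	}
	// Case 1: enough headroom; discount jobs that are already pending.
	return maxScale - pendingJobCount
}

func main() {
	fmt.Println(accurateJobsToCreate(4, 0, 0, 10)) // case 1: 4 new jobs
	fmt.Println(accurateJobsToCreate(8, 4, 0, 10)) // case 2: clamped to 6 new jobs
}
```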
*It's worth noting that since we want jobs that run long enough to stick around for case 2, the decision was made to use sleeper pods instead of an actual message processor. To simulate message consumption, the queue is cleared; this mimics the behavior of Azure Storage Queue's length property when a processor is consuming messages (locked messages are not reported in the queue length). If this is a problem, I can put in more time and create a message processor that behaves similarly to the sleeper pod.
*Also note that these tests are performed with the pending job count as effectively 0.
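For illustration, the enqueue-then-clear pattern might look like this with the azure-sdk-for-go azqueue client (a sketch with a placeholder queue name and connection string; the e2e suite's own queue helpers may differ):

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/storage/azqueue"
)

func main() {
	ctx := context.Background()
	// Placeholder connection details; the e2e suite wires these up itself.
	connStr := os.Getenv("AZURE_STORAGE_CONNECTION_STRING")
	queue, err := azqueue.NewQueueClientFromConnectionString(connStr, "test-queue", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Enqueue messages to drive scaling.
	for i := 0; i < 4; i++ {
		if _, err := queue.EnqueueMessage(ctx, "hello", nil); err != nil {
			log.Fatal(err)
		}
	}
	// Clear the queue in one call: the reported queue length drops to 0,
	// mimicking messages being locked by an active consumer.
	if _, err := queue.ClearMessages(ctx, nil); err != nil {
		log.Fatal(err)
	}
}
```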
Case 1: Send 4 messages into the queue. Wait for 4 jobs to be running. Clear the queue to simulate message consumption. Wait for all jobs to succeed.
Case 2: Send 4 messages into the queue. Wait for 4 jobs to be running. Clear the queue to simulate message consumption. Send 8 more messages into the queue. Assert that running jobs is clamped by maxReplicaCount (put differently, wait for 10 jobs to be running). This verifies the accurate strategy. Clean up by clearing the queue & waiting for all jobs to succeed.
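In numbers for case 2: the 8 additional messages yield maxScale = 8, and with 4 jobs still running, 8 + 4 = 12 > 10, so the executor creates 10 - 4 = 6 new jobs, bringing the running total to exactly maxReplicaCount = 10.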
Also added 3 functional tests confirming minReplicaCount and maxReplicaCount behavior in ScaledJobs. The first 2 tests verify default behavior, checking that the 2 fields evaluate to nil when omitted in the ScaledJob spec. The third test verifies that when minReplicaCount > maxReplicaCount, minReplicaCount is set to maxReplicaCount.
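A rough shape for those functional tests (a hypothetical sketch; type and method names follow apis/keda/v1alpha1, but the actual tests may differ):

```go
package v1alpha1

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// Sketch of the default-value checks: both fields are pointers, so an
// empty spec should leave them nil.
func TestReplicaCountsDefaultToNil(t *testing.T) {
	scaledJob := &ScaledJob{Spec: ScaledJobSpec{}}
	assert.Nil(t, scaledJob.Spec.MinReplicaCount)
	assert.Nil(t, scaledJob.Spec.MaxReplicaCount)
}

func TestMinReplicaCountClampedToMax(t *testing.T) {
	minCount, maxCount := int32(5), int32(2)
	scaledJob := &ScaledJob{Spec: ScaledJobSpec{
		MinReplicaCount: &minCount,
		MaxReplicaCount: &maxCount,
	}}
	// Per the discussion above, MinReplicaCount() returns maxReplicaCount
	// when min > max rather than mutating the spec.
	assert.EqualValues(t, 2, scaledJob.MinReplicaCount())
}
```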
Potential additional e2e tests: default & custom strategies
Checklist
make generate-scalers-schema has been run to update any outdated generated files
Fixes #3661
Relates to #