Conversation
There was a problem hiding this comment.
Pull request overview
Improves e2e test resilience by waiting for Deployments to be ready (not just present) and by skipping MUO tests when MUO isn’t enabled/managed, addressing flakes caused by timeout/race conditions around reconciliation readiness.
Changes:
- Add
Eventually-based helpers to wait for MUO/Gatekeeper Deployments to become ready. - Update MUO restore test to skip when MUO isn’t enabled/managed and to wait for readiness after reconciliation.
- Update Guardrails Gatekeeper restore tests to wait for readiness after reconciliation.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: hlipsig <8000786+hlipsig@users.noreply.github.com>
Revert flakey test changes and increase MUO SpecTimeout to 3 minutes
test/e2e/operator.go
Outdated
| By("waiting for the MUO deployment to be reconciled") | ||
| GetK8sObjectWithRetry(ctx, getFunc, managedUpgradeOperatorDeployment, metav1.GetOptions{}) | ||
| }, SpecTimeout(2*time.Minute)) | ||
| }, SpecTimeout(3*time.Minute)) |
There was a problem hiding this comment.
Rather than specify a timeout for this specific test, can we leverage the Eventually() pattern like other tests, using the DefaultEventuallyTimeout at test/e2e/helpers.go, which is 5 minutes?
There was a problem hiding this comment.
If the .bingo change is required, can we leave it to a new PR since this one is meant to fix the e2e flake?
There was a problem hiding this comment.
Yeah. I'll do that.
… explicit goversion from bingo's go.mod
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
test/e2e/operator.go:432
Describecontainer bodies in this test suite appear to consistently usefunc()(withctx context.Contextonly onIt/hooks). Changing thisDescribetofunc(ctx context.Context)is likely an invalid signature for Ginkgo container nodes and will break compilation. Please revert thisDescribebody back tofunc()and keep context usage withinIt/BeforeEach/AfterEach where supported.
var _ = Describe("ARO Operator - MUO Deployment", func() {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| By("waiting for the MUO deployment to be reconciled") | ||
| GetK8sObjectWithRetry(ctx, getFunc, managedUpgradeOperatorDeployment, metav1.GetOptions{}) | ||
| }, SpecTimeout(2*time.Minute)) | ||
| Eventually(func(g Gomega, ctx context.Context) { | ||
| _, err := getFunc(ctx, managedUpgradeOperatorDeployment, metav1.GetOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred()) | ||
| }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed()) | ||
| }) |
There was a problem hiding this comment.
This spec previously had a spec-level timeout and the PR description/linked context indicates the intent is to increase it to ~3 minutes to reduce flakes. The current change removes the SpecTimeout(...) entirely and instead relies on DefaultEventuallyTimeout (currently 5 minutes) for this wait, which doesn’t match the stated goal and changes the overall bounding behavior of the spec. Please reintroduce an explicit spec-level timeout (e.g., 3 minutes) or set this Eventually timeout to the intended value and keep a spec-level bound for the whole test.
| Eventually(func(g Gomega, ctx context.Context) { | ||
| _, err := getFunc(ctx, managedUpgradeOperatorDeployment, metav1.GetOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred()) | ||
| }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed()) |
There was a problem hiding this comment.
This Eventually block duplicates the existing GetK8sObjectWithRetry helper (used throughout this file) and also changes retry semantics (helper uses DefaultTimeout and PollingInterval, while this uses DefaultEventuallyTimeout and the suite’s default polling interval). To keep retries/timeouts consistent across the e2e suite and avoid future tuning in multiple places, prefer using the shared helper here (or explicitly match its timeout/polling settings).
| Eventually(func(g Gomega, ctx context.Context) { | |
| _, err := getFunc(ctx, managedUpgradeOperatorDeployment, metav1.GetOptions{}) | |
| g.Expect(err).NotTo(HaveOccurred()) | |
| }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed()) | |
| GetK8sObjectWithRetry(ctx, getFunc, managedUpgradeOperatorDeployment, metav1.GetOptions{}) |
| Eventually(func(g Gomega, ctx context.Context) { | ||
| _, err := getFunc(ctx, gkControllerManagerDeployment, metav1.GetOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred()) | ||
| }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed()) |
There was a problem hiding this comment.
This reconciliation wait was changed from GetK8sObjectWithRetry to a custom Eventually with different timeout/polling defaults. Unless there’s a specific behavior change needed here, consider using the shared helper for consistency (or explicitly set polling/timeout to match it) to avoid subtly different retry behavior across similar tests.
| Eventually(func(g Gomega, ctx context.Context) { | ||
| _, err := getFunc(ctx, gkAuditDeployment, metav1.GetOptions{}) | ||
| g.Expect(err).NotTo(HaveOccurred()) | ||
| }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed()) |
There was a problem hiding this comment.
This reconciliation wait was changed from GetK8sObjectWithRetry to a custom Eventually with different timeout/polling defaults. Unless there’s a specific behavior change needed here, consider using the shared helper for consistency (or explicitly set polling/timeout to match it) to avoid subtly different retry behavior across similar tests.
| Eventually(func(g Gomega, ctx context.Context) { | |
| _, err := getFunc(ctx, gkAuditDeployment, metav1.GetOptions{}) | |
| g.Expect(err).NotTo(HaveOccurred()) | |
| }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed()) | |
| GetK8sObjectWithRetry(ctx, getFunc, gkAuditDeployment, metav1.GetOptions{}) |
Which issue this PR addresses:
ARO-25407 Flakes for timeouts in MUO test.
What this PR does / why we need it:
Improves tests resilience by changing the timeout condition to 3 minutes.
Test plan for issue:
Local run looked good. Green e2e signal is what's needed.
Is there any documentation that needs to be updated for this PR?
How do you know this will function as expected in production?
Only fixes tests.