
Attempt at fixing flakey tests #4700

Open
hlipsig wants to merge 7 commits into master from hlipsig/ARO-25407/fix-muo-gatekeeper-e2e-flake

Collaborator

@hlipsig hlipsig commented Mar 19, 2026

Which issue this PR addresses:

ARO-25407: timeout flakes in the MUO test.

What this PR does / why we need it:

Improves test resilience by raising the MUO spec timeout from 2 minutes to 3 minutes.

Test plan for issue:

Local run looked good. Green e2e signal is what's needed.

Is there any documentation that needs to be updated for this PR?

How do you know this will function as expected in production?

Only fixes tests.

Contributor

Copilot AI left a comment

Pull request overview

Improves e2e test resilience by waiting for Deployments to be ready (not just present) and by skipping MUO tests when MUO isn’t enabled/managed, addressing flakes caused by timeout/race conditions around reconciliation readiness.

Changes:

  • Add Eventually-based helpers to wait for MUO/Gatekeeper Deployments to become ready.
  • Update MUO restore test to skip when MUO isn’t enabled/managed and to wait for readiness after reconciliation.
  • Update Guardrails Gatekeeper restore tests to wait for readiness after reconciliation.
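The difference between a Deployment being present and being ready can be sketched in plain Go. This is only an illustration, not the helper added in this PR: `deploymentStatus` and `isReady` are hypothetical names, and the struct is a stand-in for the few `appsv1.Deployment` fields such a check would typically read.

```go
package main

import "fmt"

// deploymentStatus is a hypothetical stand-in for the fields of a
// Kubernetes Deployment that a readiness check usually inspects.
type deploymentStatus struct {
	Generation         int64 // metadata.generation
	ObservedGeneration int64 // status.observedGeneration
	DesiredReplicas    int32 // spec.replicas
	UpdatedReplicas    int32 // status.updatedReplicas
	AvailableReplicas  int32 // status.availableReplicas
}

// isReady reports whether the controller has observed the latest spec and
// every desired replica is both updated and available. A freshly recreated
// Deployment exists but fails this check until its pods come up.
func isReady(s deploymentStatus) bool {
	return s.ObservedGeneration >= s.Generation &&
		s.UpdatedReplicas == s.DesiredReplicas &&
		s.AvailableReplicas == s.DesiredReplicas
}

func main() {
	justCreated := deploymentStatus{Generation: 1, DesiredReplicas: 2}
	rolledOut := deploymentStatus{
		Generation: 1, ObservedGeneration: 1,
		DesiredReplicas: 2, UpdatedReplicas: 2, AvailableReplicas: 2,
	}
	fmt.Println(isReady(justCreated), isReady(rolledOut)) // prints "false true"
}
```

A test that only checks for presence would pass on `justCreated`, which is the race the overview describes.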


Copilot AI commented Mar 20, 2026

@hlipsig I've opened a new pull request, #4701, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 3 commits March 20, 2026 00:25
Co-authored-by: hlipsig <8000786+hlipsig@users.noreply.github.com>
Revert flakey test changes and increase MUO SpecTimeout to 3 minutes
      By("waiting for the MUO deployment to be reconciled")
      GetK8sObjectWithRetry(ctx, getFunc, managedUpgradeOperatorDeployment, metav1.GetOptions{})
- }, SpecTimeout(2*time.Minute))
+ }, SpecTimeout(3*time.Minute))
Member

Rather than specifying a timeout for this one test, can we use the Eventually() pattern like the other tests, with the DefaultEventuallyTimeout defined in test/e2e/helpers.go (5 minutes)?
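For readers unfamiliar with the pattern, the idea behind an Eventually-style wait can be sketched with the standard library alone: call a check repeatedly until it succeeds or a timeout elapses. This is an illustration, not Gomega's implementation or the repo's helpers; `pollUntil`, `pollUntilMillis`, and `errNotFound` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for the error a Get call returns while the
// object has not been recreated yet.
var errNotFound = errors.New("deployment not found")

// pollUntil calls check immediately and then once per interval until it
// returns nil, the timeout elapses, or the parent context is cancelled.
func pollUntil(ctx context.Context, interval, timeout time.Duration, check func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		err := check(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up waiting: %w (last error: %v)", ctx.Err(), err)
		case <-ticker.C:
			// Interval elapsed; try again.
		}
	}
}

// pollUntilMillis is a small wrapper that takes plain millisecond values.
func pollUntilMillis(intervalMs, timeoutMs int, check func() error) error {
	return pollUntil(context.Background(),
		time.Duration(intervalMs)*time.Millisecond,
		time.Duration(timeoutMs)*time.Millisecond,
		func(context.Context) error { return check() })
}

func main() {
	attempts := 0
	err := pollUntilMillis(10, 1000, func() error {
		attempts++
		if attempts < 3 {
			return errNotFound // object not recreated yet
		}
		return nil
	})
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

With a wait like this, the timeout bounds only the condition being polled, whereas a spec-level timeout bounds the whole test body.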

Member

If the .bingo change is required, can we leave it to a new PR since this one is meant to fix the e2e flake?

Collaborator Author

@hlipsig hlipsig Apr 3, 2026

Yeah. I'll do that.

Copilot AI review requested due to automatic review settings April 3, 2026 15:40
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

test/e2e/operator.go:432

  • Describe container bodies in this test suite appear to consistently use func() (with ctx context.Context only on It/hooks). Changing this Describe to func(ctx context.Context) is likely an invalid signature for Ginkgo container nodes and will break compilation. Please revert this Describe body back to func() and keep context usage within It/BeforeEach/AfterEach where supported.
var _ = Describe("ARO Operator - MUO Deployment", func() {


Comment on lines 448 to +453
      By("waiting for the MUO deployment to be reconciled")
-     GetK8sObjectWithRetry(ctx, getFunc, managedUpgradeOperatorDeployment, metav1.GetOptions{})
- }, SpecTimeout(2*time.Minute))
+     Eventually(func(g Gomega, ctx context.Context) {
+         _, err := getFunc(ctx, managedUpgradeOperatorDeployment, metav1.GetOptions{})
+         g.Expect(err).NotTo(HaveOccurred())
+     }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed())
+ })

Copilot AI Apr 3, 2026

This spec previously had a spec-level timeout and the PR description/linked context indicates the intent is to increase it to ~3 minutes to reduce flakes. The current change removes the SpecTimeout(...) entirely and instead relies on DefaultEventuallyTimeout (currently 5 minutes) for this wait, which doesn’t match the stated goal and changes the overall bounding behavior of the spec. Please reintroduce an explicit spec-level timeout (e.g., 3 minutes) or set this Eventually timeout to the intended value and keep a spec-level bound for the whole test.
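The interaction between an outer bound and an inner wait can be illustrated with standard-library contexts. This is a sketch under a stated assumption: the spec-level timeout is modeled as a deadline on the context handed to the spec, and the inner wait as a child context with its own timeout. `cappedByOuter` is a hypothetical name.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cappedByOuter reports whether an inner wait of waitMs simply inherits the
// deadline of an enclosing budget of specMs, meaning the outer bound is the
// one that would fire first.
func cappedByOuter(specMs, waitMs int) bool {
	specCtx, cancelSpec := context.WithTimeout(context.Background(),
		time.Duration(specMs)*time.Millisecond)
	defer cancelSpec()

	// A child context never outlives its parent: if the parent's deadline
	// is earlier, the child reports the parent's deadline.
	waitCtx, cancelWait := context.WithTimeout(specCtx,
		time.Duration(waitMs)*time.Millisecond)
	defer cancelWait()

	specDeadline, _ := specCtx.Deadline()
	waitDeadline, _ := waitCtx.Deadline()
	return waitDeadline.Equal(specDeadline)
}

func main() {
	// 3-minute outer bound around a 5-minute inner wait: the outer bound wins.
	fmt.Println(cappedByOuter(180_000, 300_000)) // prints "true"
	// 10-minute outer bound around a 5-minute inner wait: the inner timeout wins.
	fmt.Println(cappedByOuter(600_000, 300_000)) // prints "false"
}
```

Under that model, keeping a 3-minute spec-level bound alongside a 5-minute wait would leave the effective limit at 3 minutes, while dropping the spec-level bound leaves only the 5-minute wait in force.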

Comment on lines +449 to +452
+ Eventually(func(g Gomega, ctx context.Context) {
+     _, err := getFunc(ctx, managedUpgradeOperatorDeployment, metav1.GetOptions{})
+     g.Expect(err).NotTo(HaveOccurred())
+ }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed())

Copilot AI Apr 3, 2026

This Eventually block duplicates the existing GetK8sObjectWithRetry helper (used throughout this file) and also changes retry semantics (helper uses DefaultTimeout and PollingInterval, while this uses DefaultEventuallyTimeout and the suite’s default polling interval). To keep retries/timeouts consistent across the e2e suite and avoid future tuning in multiple places, prefer using the shared helper here (or explicitly match its timeout/polling settings).

Suggested change
- Eventually(func(g Gomega, ctx context.Context) {
-     _, err := getFunc(ctx, managedUpgradeOperatorDeployment, metav1.GetOptions{})
-     g.Expect(err).NotTo(HaveOccurred())
- }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed())
+ GetK8sObjectWithRetry(ctx, getFunc, managedUpgradeOperatorDeployment, metav1.GetOptions{})

Comment on lines +755 to +758
+ Eventually(func(g Gomega, ctx context.Context) {
+     _, err := getFunc(ctx, gkControllerManagerDeployment, metav1.GetOptions{})
+     g.Expect(err).NotTo(HaveOccurred())
+ }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed())

Copilot AI Apr 3, 2026

This reconciliation wait was changed from GetK8sObjectWithRetry to a custom Eventually with different timeout/polling defaults. Unless there’s a specific behavior change needed here, consider using the shared helper for consistency (or explicitly set polling/timeout to match it) to avoid subtly different retry behavior across similar tests.

Comment on lines +780 to +783
+ Eventually(func(g Gomega, ctx context.Context) {
+     _, err := getFunc(ctx, gkAuditDeployment, metav1.GetOptions{})
+     g.Expect(err).NotTo(HaveOccurred())
+ }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed())

Copilot AI Apr 3, 2026

This reconciliation wait was changed from GetK8sObjectWithRetry to a custom Eventually with different timeout/polling defaults. Unless there’s a specific behavior change needed here, consider using the shared helper for consistency (or explicitly set polling/timeout to match it) to avoid subtly different retry behavior across similar tests.

Suggested change
- Eventually(func(g Gomega, ctx context.Context) {
-     _, err := getFunc(ctx, gkAuditDeployment, metav1.GetOptions{})
-     g.Expect(err).NotTo(HaveOccurred())
- }).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed())
+ GetK8sObjectWithRetry(ctx, getFunc, gkAuditDeployment, metav1.GetOptions{})
