br/pkg/streamhelper: stabilize flaky TestSubBasic #67859

Open
flaky-claw wants to merge 2 commits into pingcap:master from flaky-claw:flakyfixer/case_900c57431749-a6

flaky-claw (Contributor) commented Apr 17, 2026

What problem does this PR solve?

Issue Number: close #67839

Problem Summary:
Flaky test TestSubBasic in br/pkg/streamhelper intermittently fails, so this PR stabilizes that path.

What changed and how does it work?

Root Cause

TEST_ISSUE: the waiting logic in TestSubBasic treated a single quiet poll as proof that the event drain had completed, even though queued flush events could still be in flight.

Fix

TestSubBasic must wait deterministically for the full expected set of flush events before calling Drop(). Otherwise event delivery is truncated and the test reports a false checkpoint mismatch.

Verification

Spec:

  • target: br/pkg/streamhelper :: TestSubBasic
  • strategy: tidb.go_flaky.default
  • plan mode: BASELINE_ONLY
  • requirements: required case must execute; no skip; repeat count = 1
  • baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate

Observed result:

  • status: passed
  • required case executed: yes
  • submission decision: ALLOWED
  • scope debt present: yes

Gate checklist:

  • Required flaky gate: PASS
  • Build safety gate: PASS
  • Intent guard gate: PASS
  • Repo-wide advisory gate: SKIPPED
  • Feedback specific gate: SKIPPED

Commands:

  • go test -json ./br/pkg/streamhelper -run '^TestSubBasic$' -count=1
  • go test -json ./br/pkg/streamhelper -count=1
  • make build

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Fixes #67839

Summary by CodeRabbit

  • Tests
    • Reworked subscription tests to use checkpoint-based synchronization instead of the prior stabilization wait. Adds a hard timeout and early-closure detection for more robust failure handling, removes one initial flush round, and waits deterministically for subscriber progress. The final event merge and progress assertion remain unchanged.

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/needs-triage-completed and release-note-none (denotes a PR that doesn't merit a release note) labels Apr 17, 2026

pantheon-ai Bot commented Apr 17, 2026

@flaky-claw I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) Apr 17, 2026

ti-chi-bot Bot commented Apr 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign 3pointer for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


tiprow Bot commented Apr 17, 2026

Hi @flaky-claw. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


coderabbitai Bot commented Apr 17, 2026

📝 Walkthrough


Replaces a timing-based stabilization wait in TestSubBasic with a new test helper waitCheckpoint that reads from sub.Events() until a merged span's MinValue() equals a target checkpoint, uses a 30s timeout, and reorganizes the test to coordinate failpoint pause/advance/flush before waiting for subscriber progress.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test changes: br/pkg/streamhelper/subscription_test.go | Added waitCheckpoint helper that continuously drains sub.Events() and merges spans until the merged MinValue() reaches a target checkpoint (30s timeout, fail on early close). Modified TestSubBasic to remove one initial checkpoint/flush round, insert a failpoint pause (aboutToSend), advance+flush to produce cp, release pause asynchronously, then wait for the subscriber to reach cp via waitCheckpoint; final merge/assert unchanged. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

ok-to-test, approved, lgtm

Suggested reviewers

  • YuJuncen
  • Leavrth
  • 3pointer

Poem

🐰 I hopped through events, one by one,
Merged the spans until the checkpoint shone,
A thirty-second watch, no frantic race,
Now tests align with steady pace,
Thump-thump — the stream helper's home.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately summarizes the main change: stabilizing a flaky test in br/pkg/streamhelper. It is concise, clear, and directly related to the changeset. |
| Description check | ✅ Passed | The PR description includes the required template sections: issue number linking to #67839, problem summary explaining the flakiness, root cause analysis, fix explanation, verification details, and test checklist with unit test selected. |
| Linked Issues check | ✅ Passed | The PR addresses the flaky test issue #67839 by modifying the test's waiting logic to deterministically wait for expected flush events before calling Drop(), which directly resolves the intermittent failure. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the failing test TestSubBasic: adding a waitCheckpoint helper and modifying the test logic. No unrelated changes to production code or other components are present. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed




@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
br/pkg/streamhelper/subscription_test.go (1)

47-51: Optional: include an informative failure message.

If the subscriber stalls below expected, require.Eventually will fail with a generic "condition never satisfied" message and no visibility into how many events actually arrived, which makes future flake diagnosis harder. Consider capturing the last observed count and passing a msgAndArgs to aid debugging.

♻️ Proposed refactor
 func waitAtLeastEvents(t *testing.T, sub *streamhelper.FlushSubscriber, expected int) {
+	var last int
 	require.Eventually(t, func() bool {
-		return len(sub.Events()) >= expected
-	}, 30*time.Second, 10*time.Millisecond)
+		last = len(sub.Events())
+		return last >= expected
+	}, 30*time.Second, 10*time.Millisecond, "got %d events, expected at least %d", last, expected)
 }
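One caveat about the proposed refactor: Go evaluates call arguments before the call runs, so a `last` passed by value as a message argument is read before the predicate has executed even once, and the failure message would report the initial value. The sketch below reproduces this with a hypothetical `eventually` helper that only imitates the shape of a polling assertion. It is not testify's implementation, and `staleMessage` and `freshMessage` are illustrative names.

```go
package main

import (
	"fmt"
	"time"
)

// eventually is a hypothetical polling helper: it calls cond until it returns
// true or waitFor elapses. On failure it formats the message from args, which
// were already evaluated when eventually was called.
func eventually(cond func() bool, waitFor, tick time.Duration, format string, args ...any) (bool, string) {
	deadline := time.Now().Add(waitFor)
	for {
		if cond() {
			return true, ""
		}
		if time.Now().After(deadline) {
			return false, fmt.Sprintf(format, args...)
		}
		time.Sleep(tick)
	}
}

// staleMessage passes last by value, so the message carries the value last
// had before polling started.
func staleMessage(events <-chan int, expected int) string {
	var last int
	_, msg := eventually(func() bool {
		last = len(events)
		return last >= expected
	}, 20*time.Millisecond, time.Millisecond, "got %d events, expected at least %d", last, expected)
	return msg
}

// freshMessage polls first and builds the message afterwards, so it reports
// the count seen by the final poll.
func freshMessage(events <-chan int, expected int) string {
	var last int
	ok, _ := eventually(func() bool {
		last = len(events)
		return last >= expected
	}, 20*time.Millisecond, time.Millisecond, "")
	if ok {
		return ""
	}
	return fmt.Sprintf("got %d events, expected at least %d", last, expected)
}

func main() {
	events := make(chan int, 4)
	events <- 1
	events <- 2
	fmt.Println(staleMessage(events, 3)) // reports 0 events
	fmt.Println(freshMessage(events, 3)) // reports 2 events
}
```

In a real test, a similar effect could presumably be obtained by checking the boolean result of a non-fatal polling assertion and then failing with a message built after polling has finished.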
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@br/pkg/streamhelper/subscription_test.go` around lines 47 - 51, The helper
waitAtLeastEvents currently calls require.Eventually with a bare predicate so
failures only show "condition never satisfied"; modify waitAtLeastEvents to
record the last observed count inside the predicate (calling sub.Events()) and
pass a descriptive msgAndArgs to require.Eventually that includes that
lastObserved value and the expected value so test failures show how many events
actually arrived; update references in this function (waitAtLeastEvents,
FlushSubscriber, Events()) only—no other API changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8fde9afe-1b3e-43f4-9813-48f92c709a5b

📥 Commits

Reviewing files that changed from the base of the PR and between eea8b1e and 7769048.

📒 Files selected for processing (1)
  • br/pkg/streamhelper/subscription_test.go


codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.6346%. Comparing base (cc186b3) to head (7769048).
⚠️ Report is 71 commits behind head on master.

⚠️ Current head 7769048 differs from pull request most recent head 27b50b3

Please upload reports for the commit 27b50b3 to get more accurate results.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67859        +/-   ##
================================================
+ Coverage   77.5862%   78.6346%   +1.0484%     
================================================
  Files          1982       1984         +2     
  Lines        548966     549132       +166     
================================================
+ Hits         425922     431808      +5886     
+ Misses       122239     116308      -5931     
- Partials        805       1016       +211     
| Flag | Coverage Δ |
|---|---|
| integration | 44.2974% <ø> (+9.9573%) ⬆️ |
| unit | 76.6397% <ø> (+0.2997%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ |
|---|---|
| dumpling | 61.5065% <ø> (+0.0901%) ⬆️ |
| parser | ∅ <ø> (∅) |
| br | 66.1005% <ø> (+5.5880%) ⬆️ |

@yinsustart

/retest


tiprow Bot commented Apr 22, 2026

@yinsustart: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

@YuJuncen YuJuncen left a comment

Better to extend expected checkpoint range instead of dropping events.


ti-chi-bot Bot commented Apr 23, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-23 04:10:48 UTC: ✖️🔁 reset by YuJuncen.

}
}()

req.Equal(cp, s.MinValue(), "%d vs %d", cp, s.MinValue())
Contributor

ASSERT: cp ≥ s.MinValue()

If you see cp < s.MinValue() in a "flaky" test, there must be some other problem. Upload the full log and ask for help.

@flaky-claw
Contributor Author

FlakyFixer PR update summary

Fix

  • The test must wait until the merged checkpoint actually covers the expected value, because Drop can cancel the last in-flight flush event after a window that merely looks idle, before the assertion would hold.

Verification

Spec:

  • target: br/pkg/streamhelper :: TestSubBasic
  • strategy: tidb.go_flaky.default
  • plan mode: BASELINE_PLUS_FEEDBACK_DELTA
  • requirements: required case must execute; no skip; repeat count = 1
  • baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate
  • feedback delta: 1 finding(s) via feedback_specific_gate (non-blocking fallback)
  • feedback surface source: baseline_only

Observed result:

  • status: passed
  • required case executed: yes
  • submission decision: ALLOWED
  • scope debt present: yes

Gate checklist:

  • Required flaky gate: PASS
  • Build safety gate: PASS
  • Intent guard gate: PASS
  • Repo-wide advisory gate: SKIPPED
  • Feedback specific gate: SKIPPED

Commands:

  • go test -json ./br/pkg/streamhelper -run '^TestSubBasic$' -count=1
  • go test -json ./br/pkg/streamhelper -count=1
  • make build

@ti-chi-bot ti-chi-bot Bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) Apr 27, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
br/pkg/streamhelper/subscription_test.go (1)

77-89: Consider hardening failpoint cleanup with t.Cleanup.

In the happy path the spawned goroutine reliably disables subscription.listenOver.aboutToSend (it survives t.FailNow() on the test goroutine since Goexit only terminates the calling goroutine, and releaseErr is buffered). However, if failpoint.Disable itself returns an error, req.NoError(<-releaseErr) fails the test while leaving the failpoint enabled and leaking it into subsequent tests in the package. A t.Cleanup registered immediately after Enable would make disable idempotent and guaranteed.

♻️ Suggested hardening
 fp := "github.com/pingcap/tidb/br/pkg/streamhelper/subscription.listenOver.aboutToSend"
 req.NoError(failpoint.Enable(fp, "pause"))
+t.Cleanup(func() { _ = failpoint.Disable(fp) })
 releaseErr := make(chan error, 1)
 cp = c.advanceCheckpoints()
 c.flushAll()
 go func() {
 	time.Sleep(10 * time.Millisecond)
 	releaseErr <- failpoint.Disable(fp)
 }()

As per coding guidelines: "Unit tests in a package that uses failpoints: MUST enable failpoints before tests and disable afterward."
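The cleanup pattern suggested above can be sketched without the failpoint package. Everything here is an assumption made for illustration: `registry` is a hypothetical stand-in for a failpoint registry whose Disable returns an error when the name is not enabled, `cleaner` is the one-method subset of *testing.T that the pattern needs, and `fakeT` exists only so the sketch runs outside `go test`.

```go
package main

import (
	"errors"
	"fmt"
)

// cleaner is the subset of *testing.T used here; *testing.T satisfies it.
type cleaner interface {
	Cleanup(func())
}

// registry is a hypothetical stand-in for a failpoint registry.
type registry struct {
	enabled map[string]bool
}

func (r *registry) Enable(name string) { r.enabled[name] = true }

// Disable returns an error if name is not currently enabled.
func (r *registry) Disable(name string) error {
	if !r.enabled[name] {
		return errors.New("not enabled: " + name)
	}
	delete(r.enabled, name)
	return nil
}

// enableForTest enables name and registers a fallback disable that runs when
// the test ends, whether or not the test disabled it explicitly.
func enableForTest(t cleaner, r *registry, name string) {
	r.Enable(name)
	t.Cleanup(func() { _ = r.Disable(name) }) // error ignored: a second disable is harmless
}

// fakeT records cleanups and runs them last-in, first-out, as testing.T does.
type fakeT struct {
	cleanups []func()
}

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

func main() {
	r := &registry{enabled: map[string]bool{}}
	t := &fakeT{}
	enableForTest(t, r, "aboutToSend")
	fmt.Println(r.Disable("aboutToSend")) // explicit release in the happy path
	t.finish()                            // fallback runs anyway and changes nothing
	fmt.Println(r.enabled["aboutToSend"])
}
```

Registering the fallback immediately after enabling means the release no longer depends on the goroutine or on the test reaching its final assertions.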

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@br/pkg/streamhelper/subscription_test.go` around lines 77 - 89, After calling
failpoint.Enable (fp :=
"github.com/pingcap/tidb/br/pkg/streamhelper/subscription.listenOver.aboutToSend"
and req.NoError(failpoint.Enable(fp, "pause"))), register a t.Cleanup that
disables the failpoint (call failpoint.Disable(fp) and ignore or log its error)
so disable is idempotent and always run even if the test fails; keep the
existing goroutine that writes to releaseErr but ensure the cleanup is the
guaranteed fallback to avoid leaking the failpoint.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a612be71-da5e-4a3f-a676-f50ecc46e0b2

📥 Commits

Reviewing files that changed from the base of the PR and between 7769048 and 27b50b3.

📒 Files selected for processing (1)
  • br/pkg/streamhelper/subscription_test.go


ti-chi-bot Bot commented Apr 28, 2026

@flaky-claw: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-integration-e2e-test | 27b50b3 | link | true | /test pull-integration-e2e-test |
| pull-br-integration-test | 27b50b3 | link | true | /test pull-br-integration-test |
| idc-jenkins-ci-tidb/mysql-test | 27b50b3 | link | true | /test mysql-test |
| idc-jenkins-ci-tidb/check_dev_2 | 27b50b3 | link | true | /test check-dev2 |
| pull-mysql-client-test | 27b50b3 | link | true | /test pull-mysql-client-test |
| pull-build-next-gen | 27b50b3 | link | true | /test pull-build-next-gen |
| pull-mysql-client-test-next-gen | 27b50b3 | link | true | /test pull-mysql-client-test-next-gen |
| pull-unit-test-next-gen | 27b50b3 | link | true | /test pull-unit-test-next-gen |
| pull-integration-realcluster-test-next-gen | 27b50b3 | link | true | /test pull-integration-realcluster-test-next-gen |
| idc-jenkins-ci-tidb/unit-test | 27b50b3 | link | true | /test unit-test |
| idc-jenkins-ci-tidb/build | 27b50b3 | link | true | /test build |
| idc-jenkins-ci-tidb/check_dev | 27b50b3 | link | true | /test check-dev |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

  • release-note-none: denotes a PR that doesn't merit a release note.
  • size/M: denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Flaky test: TestSubBasic in br/pkg/streamhelper

3 participants