br/pkg/streamhelper: stabilize flaky TestSubBasic by flaky-claw · Pull Request #67859 · pingcap/tidb

flaky-claw · 2026-04-17T14:36:59Z

What problem does this PR solve?

Issue Number: close #67839

Problem Summary:
The flaky test TestSubBasic in br/pkg/streamhelper fails intermittently; this PR stabilizes its waiting logic.

What changed and how does it work?

Root Cause

Classification: TEST_ISSUE. The waiting logic in TestSubBasic could treat a single quiet poll as proof that the event drain was complete, before all queued flush events had arrived.

Fix

TestSubBasic must deterministically wait for the full expected flush-event set before calling Drop(); otherwise event delivery is truncated and the test reports false checkpoint mismatches.
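
To illustrate the failure mode described under Root Cause, here is a minimal, self-contained sketch. It is not the code from subscription_test.go: `drainUntilQuiet`, `waitForN`, `produce`, `demo`, and the integer events are hypothetical stand-ins. The first helper stops as soon as one poll window passes with no event; the second keeps reading until the expected number of events has arrived, the channel closes, or a deadline passes.

```go
package main

import (
	"fmt"
	"time"
)

// drainUntilQuiet mimics the fragile pattern: it treats a single quiet
// poll window as proof that no more events will arrive.
func drainUntilQuiet(events <-chan int, quiet time.Duration) []int {
	var got []int
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return got
			}
			got = append(got, ev)
		case <-time.After(quiet):
			return got // may return before late events are delivered
		}
	}
}

// waitForN keeps reading until n events have arrived, the channel is
// closed, or the timeout expires.
func waitForN(events <-chan int, n int, timeout time.Duration) ([]int, error) {
	deadline := time.After(timeout)
	got := make([]int, 0, n)
	for len(got) < n {
		select {
		case ev, ok := <-events:
			if !ok {
				return got, fmt.Errorf("channel closed after %d of %d events", len(got), n)
			}
			got = append(got, ev)
		case <-deadline:
			return got, fmt.Errorf("timed out after %d of %d events", len(got), n)
		}
	}
	return got, nil
}

// produce sends one event immediately and a second one after delay.
func produce(delay time.Duration) <-chan int {
	ch := make(chan int, 2)
	ch <- 1
	go func() {
		time.Sleep(delay)
		ch <- 2
	}()
	return ch
}

// demo runs both strategies against the same kind of delayed producer and
// reports how many events each one saw.
func demo() (quietCount, waitedCount int, err error) {
	early := drainUntilQuiet(produce(500*time.Millisecond), 10*time.Millisecond)
	all, err := waitForN(produce(500*time.Millisecond), 2, 5*time.Second)
	return len(early), len(all), err
}

func main() {
	quietCount, waitedCount, err := demo()
	fmt.Println("quiet-poll drain saw", quietCount, "event(s)")
	fmt.Println("count-based wait saw", waitedCount, "event(s), err:", err)
}
```

In this sketch the quiet-poll drain returns after the first event because the second arrives later than one poll window, which is the same shape of race the PR description attributes to the original test.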

Verification

Spec:

target: br/pkg/streamhelper :: TestSubBasic
strategy: tidb.go_flaky.default
plan mode: BASELINE_ONLY
requirements: required case must execute; no skip; repeat count = 1
baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate

Observed result:

status: passed
required case executed: yes
submission decision: ALLOWED
scope debt present: yes

Gate checklist:

Required flaky gate: PASS
Build safety gate: PASS
Intent guard gate: PASS
Repo-wide advisory gate: SKIPPED
Feedback specific gate: SKIPPED

Commands:

go test -json ./br/pkg/streamhelper -run '^TestSubBasic$' -count=1
go test -json ./br/pkg/streamhelper -count=1
make build

Check List

Tests

- [x] Unit test
- [ ] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test
  - [ ] I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Fixes #67839

Summary by CodeRabbit

Tests
- Reworked subscription tests to use checkpoint-based synchronization instead of the prior stabilization wait. The change adds a hard timeout and early-closure detection for more robust failure handling, removes one initial flush round, and waits deterministically for subscriber progress. The final event merge and progress assertion remain unchanged.

pantheon-ai · 2026-04-17T14:37:06Z

@flaky-claw I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ti-chi-bot · 2026-04-17T14:37:11Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign 3pointer for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

br/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tiprow · 2026-04-17T14:37:20Z

Hi @flaky-claw. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-04-17T14:37:23Z

📝 Walkthrough

Replaces a timing-based stabilization wait in TestSubBasic with a new test helper waitCheckpoint that reads from sub.Events() until a merged span's MinValue() equals a target checkpoint, uses a 30s timeout, and reorganizes the test to coordinate failpoint pause/advance/flush before waiting for subscriber progress.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test changes `br/pkg/streamhelper/subscription_test.go` | Added `waitCheckpoint` helper that continuously drains `sub.Events()` and merges spans until the merged `MinValue()` reaches a target checkpoint (30s timeout, fail on early close). Modified `TestSubBasic` to remove one initial checkpoint/flush round, insert a failpoint pause (`aboutToSend`), advance+flush to produce `cp`, release pause asynchronously, then wait for the subscriber to reach `cp` via `waitCheckpoint`; final merge/assert unchanged. |
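
The helper described in the summary can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the helper from the PR: `flushEvent`, `tracker`, `errClosedEarly`, and `demo` are hypothetical, a map of per-range checkpoints stands in for the merged span structure, and the loop stops once the minimum is at or above the target.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// flushEvent is a hypothetical stand-in for a subscription event: one key
// range (identified here by name) reporting a new checkpoint.
type flushEvent struct {
	rangeName  string
	checkpoint uint64
}

// tracker is a simplified stand-in for the merged spans: it remembers the
// highest checkpoint seen per range.
type tracker map[string]uint64

func newTracker(ranges ...string) tracker {
	t := make(tracker, len(ranges))
	for _, r := range ranges {
		t[r] = 0
	}
	return t
}

func (t tracker) merge(ev flushEvent) {
	if ev.checkpoint > t[ev.rangeName] {
		t[ev.rangeName] = ev.checkpoint
	}
}

// minValue returns the lowest checkpoint across all tracked ranges.
func (t tracker) minValue() uint64 {
	first := true
	var lowest uint64
	for _, cp := range t {
		if first || cp < lowest {
			lowest, first = cp, false
		}
	}
	return lowest
}

var errClosedEarly = errors.New("event channel closed before reaching target")

// waitCheckpoint drains events into t until t.minValue() reaches target,
// failing on early channel closure or when the timeout expires.
func waitCheckpoint(events <-chan flushEvent, t tracker, target uint64, timeout time.Duration) error {
	deadline := time.After(timeout)
	for t.minValue() < target {
		select {
		case ev, ok := <-events:
			if !ok {
				return errClosedEarly
			}
			t.merge(ev)
		case <-deadline:
			return fmt.Errorf("timed out: min checkpoint %d < target %d", t.minValue(), target)
		}
	}
	return nil
}

// demo exercises the success path and the early-closure path.
func demo() (minCP uint64, waitErr, closedErr error) {
	events := make(chan flushEvent, 4)
	events <- flushEvent{"a", 10}
	events <- flushEvent{"b", 7}
	events <- flushEvent{"b", 10}
	t := newTracker("a", "b")
	waitErr = waitCheckpoint(events, t, 10, time.Second)

	close(events)
	closedErr = waitCheckpoint(events, newTracker("a"), 10, time.Second)
	return t.minValue(), waitErr, closedErr
}

func main() {
	minCP, waitErr, closedErr := demo()
	fmt.Println("min checkpoint:", minCP, "wait error:", waitErr)
	fmt.Println("after close:", closedErr)
}
```

Waiting on the merged minimum rather than on elapsed quiet time ties the wait to the condition the final assertion checks, so a slow event delays the test instead of failing it.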

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

br/pkg/streamhelper: stabilize flaky TestStoreRemoved #67954 — similar changes to br/pkg/streamhelper/subscription_test.go: adds a helper that drains sub.Events() and replaces timing-based waits with checkpoint-based synchronization.

Suggested labels

ok-to-test, approved, lgtm

Suggested reviewers

YuJuncen
Leavrth
3pointer

Poem

🐰 I hopped through events, one by one,
Merged the spans until the checkpoint shone,
A thirty-second watch, no frantic race,
Now tests align with steady pace,
Thump-thump — the stream helper's home.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately summarizes the main change: stabilizing a flaky test in br/pkg/streamhelper. It is concise, clear, and directly related to the changeset. |
| Description check | ✅ Passed | The PR description includes the required template sections: issue number linking to `#67839`, problem summary explaining the flakiness, root cause analysis, fix explanation, verification details, and test checklist with unit test selected. |
| Linked Issues check | ✅ Passed | The PR addresses the flaky test issue `#67839` by modifying the test's waiting logic to deterministically wait for expected flush events before calling Drop(), which directly resolves the intermittent failure. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the failing test TestSubBasic: adding a waitCheckpoint helper and modifying the test logic. No unrelated changes to production code or other components are present. |

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

⚔️ Resolve merge conflicts

Resolve merge conflict in branch flakyfixer/case_900c57431749-a6

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed

coderabbitai

🧹 Nitpick comments (1)

br/pkg/streamhelper/subscription_test.go (1)

47-51: Optional: include an informative failure message.

If the subscriber stalls below expected, require.Eventually will fail with a generic "condition never satisfied" message and no visibility into how many events actually arrived, which makes future flake diagnosis harder. Consider capturing the last observed count and passing a msgAndArgs to aid debugging.

♻️ Proposed refactor

 func waitAtLeastEvents(t *testing.T, sub *streamhelper.FlushSubscriber, expected int) {
+	var last int
 	require.Eventually(t, func() bool {
-		return len(sub.Events()) >= expected
-	}, 30*time.Second, 10*time.Millisecond)
+		last = len(sub.Events())
+		return last >= expected
+	}, 30*time.Second, 10*time.Millisecond, "got %d events, expected at least %d", last, expected)
 }
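
One caveat worth checking before adopting a message like the one in the proposed refactor: Go evaluates call arguments before the call starts, so a variable passed as a format argument carries the value it had before polling began, not the value the predicate last stored. The sketch below demonstrates this with a hypothetical stdlib-only `eventually` helper (it is not testify's implementation; `staleMessage` and `freshMessage` are likewise illustrative names) and contrasts it with formatting the message after polling has ended.

```go
package main

import (
	"fmt"
	"time"
)

// eventually is a hypothetical stand-in for a polling assertion: it polls
// cond until it returns true or timeout elapses and, on failure, returns a
// message formatted from args that the caller already evaluated.
func eventually(cond func() bool, timeout, tick time.Duration, format string, args ...any) string {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return ""
		}
		time.Sleep(tick)
	}
	return fmt.Sprintf(format, args...)
}

// staleMessage passes `last` by value at call time, so the failure message
// reports the zero value rather than the last observed count.
func staleMessage(observed, expected int) string {
	var last int
	return eventually(func() bool {
		last = observed
		return last >= expected
	}, 30*time.Millisecond, 5*time.Millisecond,
		"got %d events, expected at least %d", last, expected)
}

// freshMessage formats the message only after polling has finished, so it
// sees the value the predicate stored.
func freshMessage(observed, expected int) string {
	var last int
	failed := eventually(func() bool {
		last = observed
		return last >= expected
	}, 30*time.Millisecond, 5*time.Millisecond, "timed out") != ""
	if !failed {
		return ""
	}
	return fmt.Sprintf("got %d events, expected at least %d", last, expected)
}

func main() {
	fmt.Println(staleMessage(3, 5)) // got 0 events, expected at least 5
	fmt.Println(freshMessage(3, 5)) // got 3 events, expected at least 5
}
```

The same evaluation order applies to any variadic message arguments, so a diagnostic that should reflect the final observation needs to be built after the wait returns or logged from inside the predicate.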

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@br/pkg/streamhelper/subscription_test.go` around lines 47 - 51, The helper
waitAtLeastEvents currently calls require.Eventually with a bare predicate so
failures only show "condition never satisfied"; modify waitAtLeastEvents to
record the last observed count inside the predicate (calling sub.Events()) and
pass a descriptive msgAndArgs to require.Eventually that includes that
lastObserved value and the expected value so test failures show how many events
actually arrived; update references in this function (waitAtLeastEvents,
FlushSubscriber, Events()) only—no other API changes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8fde9afe-1b3e-43f4-9813-48f92c709a5b

📥 Commits

Reviewing files that changed from the base of the PR and between eea8b1e and 7769048.

📒 Files selected for processing (1)

br/pkg/streamhelper/subscription_test.go

codecov · 2026-04-17T14:54:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.6346%. Comparing base (cc186b3) to head (7769048).
⚠️ Report is 71 commits behind head on master.

⚠️ Current head 7769048 differs from pull request most recent head 27b50b3

Please upload reports for the commit 27b50b3 to get more accurate results.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #67859        +/-   ##
================================================
+ Coverage   77.5862%   78.6346%   +1.0484%     
================================================
  Files          1982       1984         +2     
  Lines        548966     549132       +166     
================================================
+ Hits         425922     431808      +5886     
+ Misses       122239     116308      -5931     
- Partials        805       1016       +211

| Flag | Coverage Δ | |
|---|---|---|
| integration | `44.2974% <ø> (+9.9573%)` | ⬆️ |
| unit | `76.6397% <ø> (+0.2997%)` | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ | |
|---|---|---|
| dumpling | `61.5065% <ø> (+0.0901%)` | ⬆️ |
| parser | `∅ <ø> (∅)` | |
| br | `66.1005% <ø> (+5.5880%)` | ⬆️ |

yinsustart · 2026-04-22T11:26:33Z

/retest

tiprow · 2026-04-22T11:26:57Z

@yinsustart: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

YuJuncen

Better to extend the expected checkpoint range instead of dropping events.

ti-chi-bot · 2026-04-23T04:10:49Z

[LGTM Timeline notifier]

Timeline:

2026-04-23 04:10:48.892074246 +0000 UTC m=+2225454.097434304: ✖️🔁 reset by YuJuncen.

YuJuncen · 2026-04-23T04:25:38Z

 		}
 	}()

 	req.Equal(cp, s.MinValue(), "%d vs %d", cp, s.MinValue())


ASSERT: cp ≥ s.MinValue()

If you observe cp < s.MinValue() in a "flaky" test, there must be another problem. Upload the full log and ask for help.
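
The inequality above separates two outcomes that a plain equality check reports identically. A small sketch makes the distinction explicit; `verdict` and `classify` are hypothetical names used only for illustration and do not appear in the PR.

```go
package main

import "fmt"

// verdict classifies how an observed minimum checkpoint relates to the
// checkpoint cp that the test advanced to. All names here are hypothetical.
type verdict string

const (
	caughtUp verdict = "caught up"          // observed == cp
	behind   verdict = "behind"             // observed < cp, still satisfies cp >= observed
	ahead    verdict = "invariant violated" // observed > cp, breaks cp >= observed
)

// classify compares the observed minimum against cp.
func classify(cp, observed uint64) verdict {
	switch {
	case observed > cp:
		return ahead
	case observed < cp:
		return behind
	default:
		return caughtUp
	}
}

func main() {
	for _, observed := range []uint64{8, 10, 12} {
		fmt.Printf("cp=10 observed=%d: %s\n", observed, classify(10, observed))
	}
}
```

Only the `ahead` case contradicts the stated assertion; per the comment above, that case points to a problem beyond test timing and warrants a full log.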

flaky-claw · 2026-04-27T13:05:52Z

FlakyFixer PR update summary

Fix

The test must wait for semantic checkpoint coverage, because Drop() can cancel the last in-flight flush event after an idle-looking window and before the assertion can hold.

Verification

Spec:

target: br/pkg/streamhelper :: TestSubBasic
strategy: tidb.go_flaky.default
plan mode: BASELINE_PLUS_FEEDBACK_DELTA
requirements: required case must execute; no skip; repeat count = 1
baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate
feedback delta: 1 finding(s) via feedback_specific_gate (non-blocking fallback)
feedback surface source: baseline_only

Observed result:

status: passed
required case executed: yes
submission decision: ALLOWED
scope debt present: yes

Gate checklist:

Required flaky gate: PASS
Build safety gate: PASS
Intent guard gate: PASS
Repo-wide advisory gate: SKIPPED
Feedback specific gate: SKIPPED

Commands:

go test -json ./br/pkg/streamhelper -run '^TestSubBasic$' -count=1
go test -json ./br/pkg/streamhelper -count=1
make build

coderabbitai

🧹 Nitpick comments (1)

br/pkg/streamhelper/subscription_test.go (1)

77-89: Consider hardening failpoint cleanup with t.Cleanup.

In the happy path the spawned goroutine reliably disables subscription.listenOver.aboutToSend (it survives t.FailNow() on the test goroutine since Goexit only terminates the calling goroutine, and releaseErr is buffered). However, if failpoint.Disable itself returns an error, req.NoError(<-releaseErr) fails the test while leaving the failpoint enabled and leaking it into subsequent tests in the package. A t.Cleanup registered immediately after Enable would make disable idempotent and guaranteed.

♻️ Suggested hardening

 fp := "github.com/pingcap/tidb/br/pkg/streamhelper/subscription.listenOver.aboutToSend"
 req.NoError(failpoint.Enable(fp, "pause"))
+t.Cleanup(func() { _ = failpoint.Disable(fp) })
 releaseErr := make(chan error, 1)
 cp = c.advanceCheckpoints()
 c.flushAll()
 go func() {
 	time.Sleep(10 * time.Millisecond)
 	releaseErr <- failpoint.Disable(fp)
 }()

As per coding guidelines: "Unit tests in a package that uses failpoints: MUST enable failpoints before tests and disable afterward."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@br/pkg/streamhelper/subscription_test.go` around lines 77 - 89, After calling
failpoint.Enable (fp :=
"github.com/pingcap/tidb/br/pkg/streamhelper/subscription.listenOver.aboutToSend"
and req.NoError(failpoint.Enable(fp, "pause"))), register a t.Cleanup that
disables the failpoint (call failpoint.Disable(fp) and ignore or log its error)
so disable is idempotent and always run even if the test fails; keep the
existing goroutine that writes to releaseErr but ensure the cleanup is the
guaranteed fallback to avoid leaking the failpoint.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a612be71-da5e-4a3f-a676-f50ecc46e0b2

📥 Commits

Reviewing files that changed from the base of the PR and between 7769048 and 27b50b3.

📒 Files selected for processing (1)

br/pkg/streamhelper/subscription_test.go

ti-chi-bot · 2026-04-28T10:18:32Z

@flaky-claw: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-integration-e2e-test | `27b50b3` | link | true | `/test pull-integration-e2e-test` |
| pull-br-integration-test | `27b50b3` | link | true | `/test pull-br-integration-test` |
| idc-jenkins-ci-tidb/mysql-test | `27b50b3` | link | true | `/test mysql-test` |
| idc-jenkins-ci-tidb/check_dev_2 | `27b50b3` | link | true | `/test check-dev2` |
| pull-mysql-client-test | `27b50b3` | link | true | `/test pull-mysql-client-test` |
| pull-build-next-gen | `27b50b3` | link | true | `/test pull-build-next-gen` |
| pull-mysql-client-test-next-gen | `27b50b3` | link | true | `/test pull-mysql-client-test-next-gen` |
| pull-unit-test-next-gen | `27b50b3` | link | true | `/test pull-unit-test-next-gen` |
| pull-integration-realcluster-test-next-gen | `27b50b3` | link | true | `/test pull-integration-realcluster-test-next-gen` |
| idc-jenkins-ci-tidb/unit-test | `27b50b3` | link | true | `/test unit-test` |
| idc-jenkins-ci-tidb/build | `27b50b3` | link | true | `/test build` |
| idc-jenkins-ci-tidb/check_dev | `27b50b3` | link | true | `/test check-dev` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

fix: stabilize flaky issue pingcap#67839

7769048

ti-chi-bot Bot added the do-not-merge/needs-triage-completed and release-note-none labels Apr 17, 2026

ti-chi-bot Bot added the size/S label (10-29 changed lines, ignoring generated files) Apr 17, 2026

coderabbitai Bot reviewed Apr 17, 2026

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label Apr 22, 2026

YuJuncen requested changes Apr 23, 2026

YuJuncen reviewed Apr 23, 2026

fix: stabilize flaky issue pingcap#67839

27b50b3

ti-chi-bot Bot added the size/M label (30-99 changed lines, ignoring generated files) and removed the size/S label (10-29 changed lines, ignoring generated files) Apr 27, 2026

coderabbitai Bot reviewed Apr 27, 2026
