br/pkg/streamhelper: stabilize flaky TestOwnerDropped by flaky-claw · Pull Request #67791 · pingcap/tidb

flaky-claw · 2026-04-15T18:43:15Z

What problem does this PR solve?

Issue Number: close #67556

Problem Summary:
Flaky test TestOwnerDropped in br/pkg/streamhelper intermittently fails, so this PR stabilizes that path.

What changed and how does it work?

Root Cause

TestOwnerDropped is flaky because the test races owner teardown against the in-flight subscriber refresh and fallback manual-poll window, so it can read a stale checkpoint tree even when production behavior is correct.

Fix

The current change set is necessary to preserve the deterministic owner-drop/manual-poll timing control while removing the reviewer-rejected production hook surface and keeping all synchronization on the test side.

Verification

Spec:

target: br/pkg/streamhelper :: TestOwnerDropped
strategy: tidb.go_flaky.default
plan mode: BASELINE_ONLY
requirements: required case must execute; no skip; repeat count = 1
baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate

Observed result:

status: passed
required case executed: yes
submission decision: ALLOWED
scope debt present: yes
Required flaky gate passed.
Build safety gate passed.
Intent guard gate passed.

Gate checklist:

Required flaky gate: PASS
Build safety gate: PASS
Intent guard gate: PASS
Repo-wide advisory gate: SKIPPED
Feedback specific gate: SKIPPED

Commands:

go test -json ./br/pkg/streamhelper -run '^TestOwnerDropped$' -count=1
go test -json ./br/pkg/streamhelper -count=1
make build

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Fixes #67556

Summary by CodeRabbit

Release Notes

Tests
- Replaced global failpoint synchronization with per-test timing hooks to improve control and reliability of concurrent test flows.
- Added hook-based timing support in test helpers to coordinate asynchronous operations and simplify deterministic test sequencing.

pantheon-ai · 2026-04-15T18:43:21Z

@flaky-claw I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

_{ℹ️ Learn more details on Pantheon AI.}

ti-chi-bot · 2026-04-15T18:43:23Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bornchanger for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

br/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tiprow · 2026-04-15T18:43:34Z

Hi @flaky-claw. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-04-15T18:43:40Z

📝 Walkthrough

Walkthrough

Replace failpoint-based test synchronization with per-test timing hooks and hook-aware testEnv methods; update TestOwnerDropped to coordinate subscriber refresh, OnStop, and manual poll using channels and sync.Once, and run tick execution in a goroutine reporting via a buffered channel.

Changes

Cohort / File(s)	Summary
Test: advancer synchronization `br/pkg/streamhelper/advancer_test.go`	Remove global failpoint pause/resume code. Introduce test-local timing hooks, channels, and `sync.Once` to (1) block until subscriber refresh reaches a hook, (2) start `OnStop()` while refresh is in-flight, and (3) prevent manual poll until `OnStop()` completes. Move tick execution to a goroutine and wait for its result via a buffered `tickDone` channel; remove prior failpoint enable/disable and channel close/receive patterns.
Test infra: timing hooks on testEnv `br/pkg/streamhelper/basic_lib_for_test.go`	Add `testEnvOption` and `withTestEnvTimingHooks` to register per-test timing hooks (`beforeStores`, `beforeScan`). Store hooks on `testEnv`, update `newTestEnv` to accept options, and implement `testEnv.Stores` and `testEnv.RegionScan` to invoke hooks immediately before delegating to underlying cluster methods.

Sequence Diagram(s)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

br/pkg/streamhelper: stabilize flaky TestOwnerChangeCheckPointLagged #67827 — Changes test synchronization around CheckpointAdvancer lifecycle (OnStart/OnStop/OnTick) in br/pkg/streamhelper tests to avoid races.

Suggested labels

approved, lgtm

Suggested reviewers

YuJuncen
3pointer
Leavrth

Poem

🐰 I hop through hooks and channels wide,
I guard the tick with buffered pride,
No failpoint traps to trip the run,
The test flows neat, each step well-timed—hop, done!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: stabilizing a flaky test in br/pkg/streamhelper named TestOwnerDropped.
Description check	✅ Passed	The description provides problem summary, root cause analysis, fix explanation, verification results, and follows the repository template with proper issue linking and test checklist completion.
Linked Issues check	✅ Passed	The PR directly addresses the flaky test issue `#67556` by implementing deterministic synchronization through test-local timing hooks to remove race conditions in TestOwnerDropped.
Out of Scope Changes check	✅ Passed	All changes are scoped to fixing TestOwnerDropped flakiness: modifications to advancer_test.go and basic_lib_for_test.go introduce test-only synchronization infrastructure without affecting production code or the public API.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

br/pkg/streamhelper/advancer_test.go (1)

462-475: Add bounded waits for sync channels to avoid long hangs.

Line 462, Line 469, Line 470, and Line 474 use unbounded receives. If hook ordering regresses, this can stall until suite timeout instead of failing quickly.

⏱️ Proposed refactor (fail-fast waits)

-	<-getSubscriberReached
+	select {
+	case <-getSubscriberReached:
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for subscriber hook")
+	}
@@
-	<-stopDone
-	<-beforeManualPollReached
+	select {
+	case <-stopDone:
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for OnStop")
+	}
+	select {
+	case <-beforeManualPollReached:
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for manual poll hook")
+	}
@@
-	require.NoError(t, <-tickDone)
+	select {
+	case err := <-tickDone:
+		require.NoError(t, err)
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for tick completion")
+	}

As per coding guidelines "Keep test changes minimal and deterministic; avoid broad golden/testdata churn unless required."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@br/pkg/streamhelper/advancer_test.go` around lines 462 - 475, The test uses
unbounded receives on sync channels (getSubscriberReached, stopDone,
beforeManualPollReached, tickDone) which can hang if hooks mis-order; update
each receive in advancer_test.go to a fail-fast bounded wait by using a select
that waits for the channel OR a time.After timeout (short, e.g. a few hundred
ms) and fail the test immediately (require.FailNow or require.NoError with a
timeout error) if the timeout fires; apply the same pattern for the receives
after closing releaseSubscriber and releaseManualPoll so the test
deterministically fails fast on regressions instead of stalling.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@br/pkg/streamhelper/advancer_test.go`:
- Around line 462-475: The test uses unbounded receives on sync channels
(getSubscriberReached, stopDone, beforeManualPollReached, tickDone) which can
hang if hooks mis-order; update each receive in advancer_test.go to a fail-fast
bounded wait by using a select that waits for the channel OR a time.After
timeout (short, e.g. a few hundred ms) and fail the test immediately
(require.FailNow or require.NoError with a timeout error) if the timeout fires;
apply the same pattern for the receives after closing releaseSubscriber and
releaseManualPoll so the test deterministically fails fast on regressions
instead of stalling.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a0b17285-b005-4958-a245-57743a7d0ceb

📥 Commits

Reviewing files that changed from the base of the PR and between 757952a and 0da05e7.

📒 Files selected for processing (2)

br/pkg/streamhelper/advancer_test.go
br/pkg/streamhelper/basic_lib_for_test.go

codecov · 2026-04-15T19:02:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.6260%. Comparing base (757952a) to head (388fdf1).
⚠️ Report is 88 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #67791        +/-   ##
================================================
+ Coverage   77.6156%   78.6260%   +1.0104%     
================================================
  Files          1982       1993        +11     
  Lines        548909     562473     +13564     
================================================
+ Hits         426039     442250     +16211     
+ Misses       122060     118500      -3560     
- Partials        810       1723       +913

Flag	Coverage Δ
integration	`44.7265% <ø> (+10.3868%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`61.5065% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`65.8948% <ø> (+5.4660%)`	⬆️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

dveeden · 2026-04-18T15:09:34Z

/ok-to-test

dveeden · 2026-04-18T15:10:35Z

/check-issue-triage-completed

dveeden · 2026-04-18T15:18:07Z

/check-issue-triage-complete

yinsustart · 2026-04-22T05:34:21Z

/retest

YuJuncen · 2026-04-27T06:32:19Z

+func (t *testEnv) setTimingHooks(beforeStores, beforeScan func()) {
+	t.hookMu.Lock()
+	defer t.hookMu.Unlock()
+	t.beforeStores = beforeStores
+	t.beforeScan = beforeScan
+}
+
+func (t *testEnv) Stores(ctx context.Context) ([]streamhelper.Store, error) {
+	t.hookMu.Lock()
+	hook := t.beforeStores
+	t.hookMu.Unlock()
+	if hook != nil {
+		hook()
+	}
+	return t.Cluster.Stores(ctx)
+}
+
+func (t *testEnv) RegionScan(
+	ctx context.Context,
+	key []byte,
+	endKey []byte,
+	limit int,
+) ([]streamhelper.RegionWithLeader, error) {
+	t.hookMu.Lock()
+	hook := t.beforeScan
+	t.hookMu.Unlock()
+	if hook != nil {
+		hook()
+	}
+	return t.Cluster.RegionScan(ctx, key, endKey, limit)
+}
+


Make your hooks immutable. Remove hookMu and setTimingHooks.

YuJuncen · 2026-04-27T06:33:19Z

+		select {
+		case getSubscriberReached <- struct{}{}:
+		default:
+		}
+		<-releaseSubscriber
+	}, func() {
+		select {
+		case beforeManualPollReached <- struct{}{}:
+		default:
+		}
+		<-releaseManualPoll


Explain why these selects MUST be here in comment or remove them.

YuJuncen · 2026-04-27T06:35:26Z

 	adv.OnStart(ctx)
 	adv.SpawnSubscriptionHandler(ctx)
 	require.NoError(t, adv.OnTick(ctx))
-	failpoint.Enable(fp, "pause")


Explain why remove failpoint and impl it in env, what is the advantage?

YuJuncen · 2026-04-27T06:43:50Z

@flaky-claw Would you show me the origin flaky test result? Honestly say I'm not sure what problem you are trying to fix...

YuJuncen · 2026-04-27T07:04:28Z

@flaky-claw This case was introduced by #47537 which focus on verifing no panic but NOT checkpoint advanced.
You'd better DON'T modify event sequnce or hook system but relax the assertion from EQ to LE.

flaky-claw · 2026-04-27T12:00:30Z

FlakyFixer PR update summary

Fix

The new stopDone handoff is necessary to make the existing test-local timing repro actually serialize owner drop before manual polling while preserving the immutable-hook cleanup requested in review.

Verification

Spec:

target: br/pkg/streamhelper :: TestOwnerDropped
strategy: tidb.go_flaky.default
plan mode: BASELINE_PLUS_FEEDBACK_DELTA
requirements: required case must execute; no skip; repeat count = 1
baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate
feedback delta: 3 finding(s) via feedback_specific_gate (non-blocking fallback)
feedback surface source: baseline_only

Observed result:

status: passed
required case executed: yes
submission decision: ALLOWED
scope debt present: yes

Gate checklist:

Required flaky gate: PASS
Build safety gate: PASS
Intent guard gate: PASS
Repo-wide advisory gate: SKIPPED
Feedback specific gate: SKIPPED

Commands:

go test -json ./br/pkg/streamhelper -run '^TestOwnerDropped$' -count=1
go test -json ./br/pkg/streamhelper -count=1
make build

tiprow · 2026-04-27T12:06:00Z

@flaky-claw: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
fast_test_tiprow	`388fdf1`	link	true	`/test fast_test_tiprow`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

coderabbitai

🧹 Nitpick comments (1)

br/pkg/streamhelper/advancer_test.go (1)
428-458: Optional: consider encapsulating the hook coordination state.

Five channels, two sync.Onces, and an atomic flag at the top of the test are a lot of moving parts to track. A small helper struct (e.g., ownerDropCoord with subscriberReached, manualPollReached, releaseSubscriber, releaseManualPoll, stopDone and a hooks() (func(), func()) method) would make the orchestration narrative at lines 475-488 read more linearly and would localize the timingHooksEnabled gate. Purely a readability nit — feel free to skip if you'd rather keep the test self-contained.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@br/pkg/streamhelper/advancer_test.go` around lines 428 - 458, The test
currently uses many loose coordination variables (getSubscriberReached,
beforeManualPollReached, releaseSubscriber, releaseManualPoll, stopDone,
getSubscriberReachedOnce, beforeManualPollReachedOnce, timingHooksEnabled)
passed into newTestEnv via withTestEnvTimingHooks, which makes the orchestration
hard to follow; refactor by encapsulating those into a small helper struct
(e.g., ownerDropCoord) that holds subscriberReached, manualPollReached,
releaseSubscriber, releaseManualPoll, stopDone, the two sync.Once fields and the
atomic timingHooksEnabled, and expose a hooks() (func(), func()) method that
returns the two hook functions to pass to withTestEnvTimingHooks, then replace
the scattered variables with a single instantiation of ownerDropCoord and call
ownerDropCoord.hooks() when constructing newTestEnv/newTestEnv
withTestEnvTimingHooks to simplify and localize coordination state.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@br/pkg/streamhelper/advancer_test.go`:
- Around line 428-458: The test currently uses many loose coordination variables
(getSubscriberReached, beforeManualPollReached, releaseSubscriber,
releaseManualPoll, stopDone, getSubscriberReachedOnce,
beforeManualPollReachedOnce, timingHooksEnabled) passed into newTestEnv via
withTestEnvTimingHooks, which makes the orchestration hard to follow; refactor
by encapsulating those into a small helper struct (e.g., ownerDropCoord) that
holds subscriberReached, manualPollReached, releaseSubscriber,
releaseManualPoll, stopDone, the two sync.Once fields and the atomic
timingHooksEnabled, and expose a hooks() (func(), func()) method that returns
the two hook functions to pass to withTestEnvTimingHooks, then replace the
scattered variables with a single instantiation of ownerDropCoord and call
ownerDropCoord.hooks() when constructing newTestEnv/newTestEnv
withTestEnvTimingHooks to simplify and localize coordination state.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ee9723fa-1812-422b-9f2a-36a0ced7658a

📥 Commits

Reviewing files that changed from the base of the PR and between 0da05e7 and 388fdf1.

📒 Files selected for processing (2)

br/pkg/streamhelper/advancer_test.go
br/pkg/streamhelper/basic_lib_for_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

br/pkg/streamhelper/basic_lib_for_test.go

fix: stabilize flaky issue pingcap#67556

0da05e7

ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 15, 2026

ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 15, 2026

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

ti-chi-bot Bot added the ok-to-test Indicates a PR is ready to be tested. label Apr 18, 2026

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label Apr 18, 2026

yinsustart requested review from 3pointer, RidRisR and YuJuncen April 27, 2026 05:52

YuJuncen reviewed Apr 27, 2026

View reviewed changes

fix: stabilize flaky issue pingcap#67556

388fdf1

ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 27, 2026

coderabbitai Bot reviewed Apr 27, 2026

View reviewed changes

Conversation

flaky-claw commented Apr 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What changed and how does it work?

Root Cause

Fix

Verification

Check List

Release note

Summary by CodeRabbit

Release Notes

Uh oh!

pantheon-ai Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ti-chi-bot Bot commented Apr 15, 2026

Uh oh!

tiprow Bot commented Apr 15, 2026

Uh oh!

coderabbitai Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dveeden commented Apr 18, 2026

Uh oh!

dveeden commented Apr 18, 2026

Uh oh!

dveeden commented Apr 18, 2026

Uh oh!

yinsustart commented Apr 22, 2026

Uh oh!

YuJuncen Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

YuJuncen Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

YuJuncen Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

YuJuncen commented Apr 27, 2026

Uh oh!

YuJuncen commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

flaky-claw commented Apr 27, 2026

Uh oh!

tiprow Bot commented Apr 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

flaky-claw commented Apr 15, 2026 •

edited by coderabbitai Bot

Loading

pantheon-ai Bot commented Apr 15, 2026 •

edited

Loading

coderabbitai Bot commented Apr 15, 2026 •

edited

Loading

codecov Bot commented Apr 15, 2026 •

edited

Loading

YuJuncen Apr 27, 2026 •

edited

Loading

YuJuncen commented Apr 27, 2026 •

edited

Loading