br/pkg/streamhelper: stabilize flaky TestOwnerDropped#67791

Open
flaky-claw wants to merge 2 commits into pingcap:master from flaky-claw:flakyfixer/case_6c5faf3740de-a1

Conversation

@flaky-claw
Contributor

@flaky-claw flaky-claw commented Apr 15, 2026

What problem does this PR solve?

Issue Number: close #67556

Problem Summary:
Flaky test TestOwnerDropped in br/pkg/streamhelper intermittently fails, so this PR stabilizes that path.

What changed and how does it work?

Root Cause

TestOwnerDropped is flaky because the test races owner teardown against the in-flight subscriber refresh and fallback manual-poll window, so it can read a stale checkpoint tree even when production behavior is correct.

Fix

The change set keeps deterministic control over the owner-drop and manual-poll timing, removes the production hook surface that reviewers rejected, and keeps all synchronization on the test side.

Verification

Spec:

  • target: br/pkg/streamhelper :: TestOwnerDropped
  • strategy: tidb.go_flaky.default
  • plan mode: BASELINE_ONLY
  • requirements: required case must execute; no skip; repeat count = 1
  • baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate

Observed result:

  • status: passed
  • required case executed: yes
  • submission decision: ALLOWED
  • scope debt present: yes
    Required flaky gate passed.
    Build safety gate passed.
    Intent guard gate passed.

Gate checklist:

  • Required flaky gate: PASS
  • Build safety gate: PASS
  • Intent guard gate: PASS
  • Repo-wide advisory gate: SKIPPED
  • Feedback specific gate: SKIPPED

Commands:

  • go test -json ./br/pkg/streamhelper -run '^TestOwnerDropped$' -count=1
  • go test -json ./br/pkg/streamhelper -count=1
  • make build

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Fixes #67556

Summary by CodeRabbit

Release Notes

  • Tests
    • Replaced global failpoint synchronization with per-test timing hooks to improve control and reliability of concurrent test flows.
    • Added hook-based timing support in test helpers to coordinate asynchronous operations and simplify deterministic test sequencing.

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 15, 2026
@pantheon-ai

pantheon-ai Bot commented Apr 15, 2026

@flaky-claw I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot

ti-chi-bot Bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bornchanger for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 15, 2026
@tiprow

tiprow Bot commented Apr 15, 2026

Hi @flaky-claw. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai Bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

Replace failpoint-based test synchronization with per-test timing hooks and hook-aware testEnv methods; update TestOwnerDropped to coordinate subscriber refresh, OnStop, and manual poll using channels and sync.Once, and run tick execution in a goroutine reporting via a buffered channel.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test: advancer synchronization (br/pkg/streamhelper/advancer_test.go) | Remove global failpoint pause/resume code. Introduce test-local timing hooks, channels, and sync.Once to (1) block until subscriber refresh reaches a hook, (2) start OnStop() while refresh is in-flight, and (3) prevent manual poll until OnStop() completes. Move tick execution to a goroutine and wait for its result via a buffered tickDone channel; remove prior failpoint enable/disable and channel close/receive patterns. |
| Test infra: timing hooks on testEnv (br/pkg/streamhelper/basic_lib_for_test.go) | Add testEnvOption and withTestEnvTimingHooks to register per-test timing hooks (beforeStores, beforeScan). Store hooks on testEnv, update newTestEnv to accept options, and implement testEnv.Stores and testEnv.RegionScan to invoke hooks immediately before delegating to underlying cluster methods. |
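
The "tick in a goroutine, result via a buffered channel" step in the summary above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `runTick` is a hypothetical helper, and in the real test the closure would presumably wrap `adv.OnTick(ctx)`.

```go
package main

import (
	"errors"
	"fmt"
)

// runTick is a hypothetical helper, not code from the PR. It runs tick in its
// own goroutine and returns a channel that will carry the result. The buffer
// of one lets the goroutine complete its send and exit even if nobody ever
// receives, so an early test failure does not leave it blocked forever.
func runTick(tick func() error) <-chan error {
	done := make(chan error, 1)
	go func() { done <- tick() }()
	return done
}

func main() {
	release := make(chan struct{})
	done := runTick(func() error {
		<-release // stand-in for a timing hook that parks the tick
		return errors.New("tick finished")
	})

	// While the tick is parked, the test body can drive other steps,
	// for example stopping the owner.
	close(release)

	fmt.Println(<-done) // prints "tick finished"
}
```

With an unbuffered channel the goroutine would stay blocked on its send whenever the test returned before receiving, which is the usual reason for choosing a capacity of one here.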

Sequence Diagram(s)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

approved, lgtm

Suggested reviewers

  • YuJuncen
  • 3pointer
  • Leavrth

Poem

🐰 I hop through hooks and channels wide,
I guard the tick with buffered pride,
No failpoint traps to trip the run,
The test flows neat, each step well-timed—hop, done!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: stabilizing a flaky test in br/pkg/streamhelper named TestOwnerDropped. |
| Description check | ✅ Passed | The description provides problem summary, root cause analysis, fix explanation, verification results, and follows the repository template with proper issue linking and test checklist completion. |
| Linked Issues check | ✅ Passed | The PR directly addresses the flaky test issue #67556 by implementing deterministic synchronization through test-local timing hooks to remove race conditions in TestOwnerDropped. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing TestOwnerDropped flakiness: modifications to advancer_test.go and basic_lib_for_test.go introduce test-only synchronization infrastructure without affecting production code or the public API. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed



@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
br/pkg/streamhelper/advancer_test.go (1)

462-475: Add bounded waits for sync channels to avoid long hangs.

Line 462, Line 469, Line 470, and Line 474 use unbounded receives. If hook ordering regresses, this can stall until suite timeout instead of failing quickly.

⏱️ Proposed refactor (fail-fast waits)
-	<-getSubscriberReached
+	select {
+	case <-getSubscriberReached:
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for subscriber hook")
+	}
@@
-	<-stopDone
-	<-beforeManualPollReached
+	select {
+	case <-stopDone:
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for OnStop")
+	}
+	select {
+	case <-beforeManualPollReached:
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for manual poll hook")
+	}
@@
-	require.NoError(t, <-tickDone)
+	select {
+	case err := <-tickDone:
+		require.NoError(t, err)
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for tick completion")
+	}

As per coding guidelines "Keep test changes minimal and deterministic; avoid broad golden/testdata churn unless required."
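
The four near-identical `select` blocks in the proposed diff could be folded into one small helper. The sketch below is only an illustration of the bounded-wait idea: `recvWithin` is a hypothetical name, and in the test each `false` result would be turned into a `t.Fatal` call with a message naming the step that timed out.

```go
package main

import (
	"fmt"
	"time"
)

// recvWithin is a hypothetical helper, not part of the PR. It receives one
// value from ch, or reports ok=false if nothing arrives within d. A closed
// channel counts as an arrival, so it also works for close-as-signal channels.
func recvWithin[T any](ch <-chan T, d time.Duration) (v T, ok bool) {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case v = <-ch:
		return v, true
	case <-timer.C:
		return v, false
	}
}

func main() {
	ready := make(chan struct{}, 1)
	ready <- struct{}{}
	_, ok := recvWithin(ready, time.Second)
	fmt.Println(ok) // true: a value was already waiting

	never := make(chan struct{})
	_, ok = recvWithin(never, 20*time.Millisecond)
	fmt.Println(ok) // false: the wait expired
}
```

The timeout only bounds how long a regression takes to surface; it should stay generous enough that a slow CI machine does not turn it into a new source of flakiness.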

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@br/pkg/streamhelper/advancer_test.go` around lines 462 - 475, The test uses
unbounded receives on sync channels (getSubscriberReached, stopDone,
beforeManualPollReached, tickDone) which can hang if hooks mis-order; update
each receive in advancer_test.go to a fail-fast bounded wait by using a select
that waits for the channel OR a time.After timeout (short, e.g. a few hundred
ms) and fail the test immediately (require.FailNow or require.NoError with a
timeout error) if the timeout fires; apply the same pattern for the receives
after closing releaseSubscriber and releaseManualPoll so the test
deterministically fails fast on regressions instead of stalling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a0b17285-b005-4958-a245-57743a7d0ceb

📥 Commits

Reviewing files that changed from the base of the PR and between 757952a and 0da05e7.

📒 Files selected for processing (2)
  • br/pkg/streamhelper/advancer_test.go
  • br/pkg/streamhelper/basic_lib_for_test.go

@codecov

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.6260%. Comparing base (757952a) to head (388fdf1).
⚠️ Report is 88 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67791        +/-   ##
================================================
+ Coverage   77.6156%   78.6260%   +1.0104%     
================================================
  Files          1982       1993        +11     
  Lines        548909     562473     +13564     
================================================
+ Hits         426039     442250     +16211     
+ Misses       122060     118500      -3560     
- Partials        810       1723       +913     
Flag Coverage Δ
integration 44.7265% <ø> (+10.3868%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.5065% <ø> (ø)
parser ∅ <ø> (∅)
br 65.8948% <ø> (+5.4660%) ⬆️

@dveeden
Contributor

dveeden commented Apr 18, 2026

/ok-to-test

@ti-chi-bot ti-chi-bot Bot added the ok-to-test Indicates a PR is ready to be tested. label Apr 18, 2026
@dveeden
Contributor

dveeden commented Apr 18, 2026

/check-issue-triage-completed

@dveeden
Contributor

dveeden commented Apr 18, 2026

/check-issue-triage-complete

@yinsustart

/retest

Comment on lines +238 to +269
func (t *testEnv) setTimingHooks(beforeStores, beforeScan func()) {
	t.hookMu.Lock()
	defer t.hookMu.Unlock()
	t.beforeStores = beforeStores
	t.beforeScan = beforeScan
}

func (t *testEnv) Stores(ctx context.Context) ([]streamhelper.Store, error) {
	t.hookMu.Lock()
	hook := t.beforeStores
	t.hookMu.Unlock()
	if hook != nil {
		hook()
	}
	return t.Cluster.Stores(ctx)
}

func (t *testEnv) RegionScan(
	ctx context.Context,
	key []byte,
	endKey []byte,
	limit int,
) ([]streamhelper.RegionWithLeader, error) {
	t.hookMu.Lock()
	hook := t.beforeScan
	t.hookMu.Unlock()
	if hook != nil {
		hook()
	}
	return t.Cluster.RegionScan(ctx, key, endKey, limit)
}

Contributor

Make your hooks immutable. Remove hookMu and setTimingHooks.
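
One way to read this request is to fix the hooks at construction time through an option and never write them again, so that neither the mutex nor the setter is needed. The sketch below is an assumption about the intended shape, not the PR's code: `cluster`, `env`, `option`, `withBeforeStores` and `newEnv` are hypothetical stand-ins for the real `testEnv`, `testEnvOption`, `withTestEnvTimingHooks` and `newTestEnv`, whose bodies are not shown in this thread.

```go
package main

import "fmt"

// cluster stands in for the embedded fake cluster (hypothetical).
type cluster struct{}

func (cluster) Stores() []string { return []string{"store-1"} }

// env holds a hook that is written once in newEnv and only read afterwards,
// so no mutex is needed to guard it.
type env struct {
	cluster
	beforeStores func()
}

// option configures an env during construction only.
type option func(*env)

// withBeforeStores registers a hook that runs before every Stores call.
func withBeforeStores(hook func()) option {
	return func(e *env) { e.beforeStores = hook }
}

func newEnv(opts ...option) *env {
	e := &env{}
	for _, opt := range opts {
		opt(e)
	}
	return e
}

// Stores shadows the embedded method: run the hook, then delegate.
func (e *env) Stores() []string {
	if e.beforeStores != nil {
		e.beforeStores()
	}
	return e.cluster.Stores()
}

func main() {
	calls := 0
	e := newEnv(withBeforeStores(func() { calls++ }))
	fmt.Println(e.Stores(), calls) // prints "[store-1] 1"
}
```

Reading the field without a lock is safe as long as the env is fully constructed before any goroutine that uses it is started, which is the normal order in a test.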

Comment thread br/pkg/streamhelper/advancer_test.go Outdated
Comment on lines +444 to +454
	select {
	case getSubscriberReached <- struct{}{}:
	default:
	}
	<-releaseSubscriber
}, func() {
	select {
	case beforeManualPollReached <- struct{}{}:
	default:
	}
	<-releaseManualPoll
Contributor

Explain in a comment why these selects MUST be here, or remove them.
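
For context, a `select` with a `default` branch is Go's non-blocking send. One plausible reason for using it in a hook, which the PR author would need to confirm, is that the hook can fire more than once while the test only waits for the first arrival. The sketch below shows the behaviour in isolation; `notify` is a hypothetical name.

```go
package main

import "fmt"

// notify performs a non-blocking send: it delivers the signal if the channel
// can accept it right now and otherwise drops it instead of blocking.
func notify(ch chan<- struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	reached := make(chan struct{}, 1)
	fmt.Println(notify(reached)) // true: the buffer had room
	fmt.Println(notify(reached)) // false: buffer full, signal dropped without blocking
	<-reached
	fmt.Println(notify(reached)) // true: the receiver drained the buffer
}
```

Note that on an unbuffered channel with no receiver already waiting, the signal is dropped silently, so this pattern can hide an ordering problem unless the channel has a buffer of at least one.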

adv.OnStart(ctx)
adv.SpawnSubscriptionHandler(ctx)
require.NoError(t, adv.OnTick(ctx))
failpoint.Enable(fp, "pause")
Contributor

@YuJuncen YuJuncen Apr 27, 2026

Explain why you removed the failpoint and implemented it in the env. What is the advantage?

@YuJuncen
Contributor

@flaky-claw Would you show me the original flaky test result? Honestly, I'm not sure what problem you are trying to fix...

@YuJuncen
Contributor

YuJuncen commented Apr 27, 2026

@flaky-claw This case was introduced by #47537, which focuses on verifying that there is no panic, NOT that the checkpoint advanced.
You'd better NOT modify the event sequence or the hook system; instead, relax the assertion from EQ to LE.

@flaky-claw
Contributor Author

FlakyFixer PR update summary

Fix

  • The new stopDone handoff is necessary to make the existing test-local timing repro actually serialize owner drop before manual polling while preserving the immutable-hook cleanup requested in review.

Verification

Spec:

  • target: br/pkg/streamhelper :: TestOwnerDropped
  • strategy: tidb.go_flaky.default
  • plan mode: BASELINE_PLUS_FEEDBACK_DELTA
  • requirements: required case must execute; no skip; repeat count = 1
  • baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate
  • feedback delta: 3 finding(s) via feedback_specific_gate (non-blocking fallback)
  • feedback surface source: baseline_only

Observed result:

  • status: passed
  • required case executed: yes
  • submission decision: ALLOWED
  • scope debt present: yes

Gate checklist:

  • Required flaky gate: PASS
  • Build safety gate: PASS
  • Intent guard gate: PASS
  • Repo-wide advisory gate: SKIPPED
  • Feedback specific gate: SKIPPED

Commands:

  • go test -json ./br/pkg/streamhelper -run '^TestOwnerDropped$' -count=1
  • go test -json ./br/pkg/streamhelper -count=1
  • make build

@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 27, 2026
@tiprow

tiprow Bot commented Apr 27, 2026

@flaky-claw: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| fast_test_tiprow | 388fdf1 | link | true | /test fast_test_tiprow |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
br/pkg/streamhelper/advancer_test.go (1)

428-458: Optional: consider encapsulating the hook coordination state.

Five channels, two sync.Onces, and an atomic flag at the top of the test are a lot of moving parts to track. A small helper struct (e.g., ownerDropCoord with subscriberReached, manualPollReached, releaseSubscriber, releaseManualPoll, stopDone and a hooks() (func(), func()) method) would make the orchestration narrative at lines 475-488 read more linearly and would localize the timingHooksEnabled gate. Purely a readability nit — feel free to skip if you'd rather keep the test self-contained.
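
A rough sketch of what such a helper might look like is given below. The type name, field names and `hooks()` signature follow the suggestion above; everything else is an illustrative guess. In particular, the sketch signals arrival by closing a channel under `sync.Once`, leaves out `stopDone` and the `timingHooksEnabled` gate, and assumes the first hook maps to the subscriber refresh and the second to the manual poll.

```go
package main

import (
	"fmt"
	"sync"
)

// ownerDropCoord groups the coordination state in one place (hypothetical).
type ownerDropCoord struct {
	subscriberReached chan struct{}
	manualPollReached chan struct{}
	releaseSubscriber chan struct{}
	releaseManualPoll chan struct{}
	subscriberOnce    sync.Once
	manualPollOnce    sync.Once
}

func newOwnerDropCoord() *ownerDropCoord {
	return &ownerDropCoord{
		subscriberReached: make(chan struct{}),
		manualPollReached: make(chan struct{}),
		releaseSubscriber: make(chan struct{}),
		releaseManualPoll: make(chan struct{}),
	}
}

// hooks returns the two timing hooks. Each announces its first arrival by
// closing a channel, then parks until the test closes the release channel.
func (c *ownerDropCoord) hooks() (beforeStores, beforeScan func()) {
	beforeStores = func() {
		c.subscriberOnce.Do(func() { close(c.subscriberReached) })
		<-c.releaseSubscriber
	}
	beforeScan = func() {
		c.manualPollOnce.Do(func() { close(c.manualPollReached) })
		<-c.releaseManualPoll
	}
	return beforeStores, beforeScan
}

func main() {
	c := newOwnerDropCoord()
	beforeStores, beforeScan := c.hooks()

	done := make(chan struct{})
	go func() { // stand-in for the code under test
		beforeStores()
		beforeScan()
		close(done)
	}()

	<-c.subscriberReached
	fmt.Println("refresh parked; owner drop would start here")
	close(c.releaseSubscriber)

	<-c.manualPollReached
	fmt.Println("manual poll parked; owner drop has finished")
	close(c.releaseManualPoll)
	<-done
}
```

Because the release channels are closed rather than sent on, later invocations of either hook pass straight through, which keeps repeated ticks from parking again.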

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@br/pkg/streamhelper/advancer_test.go` around lines 428 - 458, The test
currently uses many loose coordination variables (getSubscriberReached,
beforeManualPollReached, releaseSubscriber, releaseManualPoll, stopDone,
getSubscriberReachedOnce, beforeManualPollReachedOnce, timingHooksEnabled)
passed into newTestEnv via withTestEnvTimingHooks, which makes the orchestration
hard to follow; refactor by encapsulating those into a small helper struct
(e.g., ownerDropCoord) that holds subscriberReached, manualPollReached,
releaseSubscriber, releaseManualPoll, stopDone, the two sync.Once fields and the
atomic timingHooksEnabled, and expose a hooks() (func(), func()) method that
returns the two hook functions to pass to withTestEnvTimingHooks, then replace
the scattered variables with a single instantiation of ownerDropCoord and call
ownerDropCoord.hooks() when constructing newTestEnv/newTestEnv
withTestEnvTimingHooks to simplify and localize coordination state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ee9723fa-1812-422b-9f2a-36a0ced7658a

📥 Commits

Reviewing files that changed from the base of the PR and between 0da05e7 and 388fdf1.

📒 Files selected for processing (2)
  • br/pkg/streamhelper/advancer_test.go
  • br/pkg/streamhelper/basic_lib_for_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • br/pkg/streamhelper/basic_lib_for_test.go

Labels

ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestOwnerDropped in br/pkg/streamhelper

4 participants