br/pkg/streamhelper: stabilize flaky TestOwnerDropped#67791
br/pkg/streamhelper: stabilize flaky TestOwnerDropped#67791flaky-claw wants to merge 2 commits intopingcap:masterfrom
Conversation
|
@flaky-claw I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @flaky-claw. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
📝 WalkthroughWalkthroughReplace failpoint-based test synchronization with per-test timing hooks and hook-aware testEnv methods; update TestOwnerDropped to coordinate subscriber refresh, OnStop, and manual poll using channels and sync.Once, and run tick execution in a goroutine reporting via a buffered channel. Changes
Sequence Diagram(s)Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)Command failed Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
br/pkg/streamhelper/advancer_test.go (1)
462-475: Add bounded waits for sync channels to avoid long hangs.Line 462, Line 469, Line 470, and Line 474 use unbounded receives. If hook ordering regresses, this can stall until suite timeout instead of failing quickly.
⏱️ Proposed refactor (fail-fast waits)
- <-getSubscriberReached + select { + case <-getSubscriberReached: + case <-time.After(5 * time.Second): + t.Fatal("timed out waiting for subscriber hook") + } @@ - <-stopDone - <-beforeManualPollReached + select { + case <-stopDone: + case <-time.After(5 * time.Second): + t.Fatal("timed out waiting for OnStop") + } + select { + case <-beforeManualPollReached: + case <-time.After(5 * time.Second): + t.Fatal("timed out waiting for manual poll hook") + } @@ - require.NoError(t, <-tickDone) + select { + case err := <-tickDone: + require.NoError(t, err) + case <-time.After(5 * time.Second): + t.Fatal("timed out waiting for tick completion") + }As per coding guidelines "Keep test changes minimal and deterministic; avoid broad golden/testdata churn unless required."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@br/pkg/streamhelper/advancer_test.go` around lines 462 - 475, The test uses unbounded receives on sync channels (getSubscriberReached, stopDone, beforeManualPollReached, tickDone) which can hang if hooks mis-order; update each receive in advancer_test.go to a fail-fast bounded wait by using a select that waits for the channel OR a time.After timeout (short, e.g. a few hundred ms) and fail the test immediately (require.FailNow or require.NoError with a timeout error) if the timeout fires; apply the same pattern for the receives after closing releaseSubscriber and releaseManualPoll so the test deterministically fails fast on regressions instead of stalling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@br/pkg/streamhelper/advancer_test.go`:
- Around line 462-475: The test uses unbounded receives on sync channels
(getSubscriberReached, stopDone, beforeManualPollReached, tickDone) which can
hang if hooks mis-order; update each receive in advancer_test.go to a fail-fast
bounded wait by using a select that waits for the channel OR a time.After
timeout (short, e.g. a few hundred ms) and fail the test immediately
(require.FailNow or require.NoError with a timeout error) if the timeout fires;
apply the same pattern for the receives after closing releaseSubscriber and
releaseManualPoll so the test deterministically fails fast on regressions
instead of stalling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: a0b17285-b005-4958-a245-57743a7d0ceb
📒 Files selected for processing (2)
br/pkg/streamhelper/advancer_test.gobr/pkg/streamhelper/basic_lib_for_test.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #67791 +/- ##
================================================
+ Coverage 77.6156% 78.6260% +1.0104%
================================================
Files 1982 1993 +11
Lines 548909 562473 +13564
================================================
+ Hits 426039 442250 +16211
+ Misses 122060 118500 -3560
- Partials 810 1723 +913
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
/ok-to-test |
|
/check-issue-triage-completed |
|
/check-issue-triage-complete |
|
/retest |
| func (t *testEnv) setTimingHooks(beforeStores, beforeScan func()) { | ||
| t.hookMu.Lock() | ||
| defer t.hookMu.Unlock() | ||
| t.beforeStores = beforeStores | ||
| t.beforeScan = beforeScan | ||
| } | ||
|
|
||
| func (t *testEnv) Stores(ctx context.Context) ([]streamhelper.Store, error) { | ||
| t.hookMu.Lock() | ||
| hook := t.beforeStores | ||
| t.hookMu.Unlock() | ||
| if hook != nil { | ||
| hook() | ||
| } | ||
| return t.Cluster.Stores(ctx) | ||
| } | ||
|
|
||
| func (t *testEnv) RegionScan( | ||
| ctx context.Context, | ||
| key []byte, | ||
| endKey []byte, | ||
| limit int, | ||
| ) ([]streamhelper.RegionWithLeader, error) { | ||
| t.hookMu.Lock() | ||
| hook := t.beforeScan | ||
| t.hookMu.Unlock() | ||
| if hook != nil { | ||
| hook() | ||
| } | ||
| return t.Cluster.RegionScan(ctx, key, endKey, limit) | ||
| } | ||
|
|
There was a problem hiding this comment.
Make your hooks immutable. Remove hookMu and setTimingHooks.
| select { | ||
| case getSubscriberReached <- struct{}{}: | ||
| default: | ||
| } | ||
| <-releaseSubscriber | ||
| }, func() { | ||
| select { | ||
| case beforeManualPollReached <- struct{}{}: | ||
| default: | ||
| } | ||
| <-releaseManualPoll |
There was a problem hiding this comment.
Explain why these selects MUST be here in comment or remove them.
| adv.OnStart(ctx) | ||
| adv.SpawnSubscriptionHandler(ctx) | ||
| require.NoError(t, adv.OnTick(ctx)) | ||
| failpoint.Enable(fp, "pause") |
There was a problem hiding this comment.
Explain why remove failpoint and impl it in env, what is the advantage?
|
@flaky-claw Would you show me the origin flaky test result? Honestly say I'm not sure what problem you are trying to fix... |
|
@flaky-claw This case was introduced by #47537 which focus on verifing no panic but NOT checkpoint advanced. |
|
FlakyFixer PR update summary Fix
Verification Spec:
Observed result:
Gate checklist:
Commands:
|
|
@flaky-claw: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
br/pkg/streamhelper/advancer_test.go (1)
428-458: Optional: consider encapsulating the hook coordination state.Five channels, two
sync.Onces, and an atomic flag at the top of the test are a lot of moving parts to track. A small helper struct (e.g.,ownerDropCoordwithsubscriberReached,manualPollReached,releaseSubscriber,releaseManualPoll,stopDoneand ahooks() (func(), func())method) would make the orchestration narrative at lines 475-488 read more linearly and would localize thetimingHooksEnabledgate. Purely a readability nit — feel free to skip if you'd rather keep the test self-contained.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@br/pkg/streamhelper/advancer_test.go` around lines 428 - 458, The test currently uses many loose coordination variables (getSubscriberReached, beforeManualPollReached, releaseSubscriber, releaseManualPoll, stopDone, getSubscriberReachedOnce, beforeManualPollReachedOnce, timingHooksEnabled) passed into newTestEnv via withTestEnvTimingHooks, which makes the orchestration hard to follow; refactor by encapsulating those into a small helper struct (e.g., ownerDropCoord) that holds subscriberReached, manualPollReached, releaseSubscriber, releaseManualPoll, stopDone, the two sync.Once fields and the atomic timingHooksEnabled, and expose a hooks() (func(), func()) method that returns the two hook functions to pass to withTestEnvTimingHooks, then replace the scattered variables with a single instantiation of ownerDropCoord and call ownerDropCoord.hooks() when constructing newTestEnv/newTestEnv withTestEnvTimingHooks to simplify and localize coordination state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@br/pkg/streamhelper/advancer_test.go`:
- Around line 428-458: The test currently uses many loose coordination variables
(getSubscriberReached, beforeManualPollReached, releaseSubscriber,
releaseManualPoll, stopDone, getSubscriberReachedOnce,
beforeManualPollReachedOnce, timingHooksEnabled) passed into newTestEnv via
withTestEnvTimingHooks, which makes the orchestration hard to follow; refactor
by encapsulating those into a small helper struct (e.g., ownerDropCoord) that
holds subscriberReached, manualPollReached, releaseSubscriber,
releaseManualPoll, stopDone, the two sync.Once fields and the atomic
timingHooksEnabled, and expose a hooks() (func(), func()) method that returns
the two hook functions to pass to withTestEnvTimingHooks, then replace the
scattered variables with a single instantiation of ownerDropCoord and call
ownerDropCoord.hooks() when constructing newTestEnv/newTestEnv
withTestEnvTimingHooks to simplify and localize coordination state.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: ee9723fa-1812-422b-9f2a-36a0ced7658a
📒 Files selected for processing (2)
br/pkg/streamhelper/advancer_test.gobr/pkg/streamhelper/basic_lib_for_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- br/pkg/streamhelper/basic_lib_for_test.go
What problem does this PR solve?
Issue Number: close #67556
Problem Summary:
Flaky test
TestOwnerDroppedinbr/pkg/streamhelperintermittently fails, so this PR stabilizes that path.What changed and how does it work?
Root Cause
TestOwnerDropped is flaky because the test races owner teardown against the in-flight subscriber refresh and fallback manual-poll window, so it can read a stale checkpoint tree even when production behavior is correct.
Fix
The current change set is necessary to preserve the deterministic owner-drop/manual-poll timing control while removing the reviewer-rejected production hook surface and keeping all synchronization on the test side.
Verification
Spec:
br/pkg/streamhelper :: TestOwnerDroppedtidb.go_flaky.defaultBASELINE_ONLYObserved result:
Required flaky gate passed.
Build safety gate passed.
Intent guard gate passed.
Gate checklist:
Commands:
go test -json ./br/pkg/streamhelper -run '^TestOwnerDropped$' -count=1go test -json ./br/pkg/streamhelper -count=1make buildCheck List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Fixes #67556
Summary by CodeRabbit
Release Notes