pkg/util/topsql/reporter: stabilize flaky TestTopRUPipelineInProcessIntegration#67600
flaky-claw wants to merge 4 commits into pingcap:master from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Hi @flaky-claw. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the setting. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough

Added RWMutex synchronization and package-level test hooks to TopSQL datamodel and reporter to coordinate timing between in-flight SQL/plan meta registrations and the report worker; added a concurrency test to verify racing registrations are included in emitted payloads; added a legacy Makefile alias.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Registrar as RegisterSQL / RegisterPlan
    participant Hook as normalizedMetaRegisterAfterLoadHook
    participant Reporter as RemoteTopSQLReporter
    participant Map as normalizedSQLMap (sync.Map)
    Client->>Registrar: call RegisterSQL (loads map pointer)
    Registrar->>Hook: invoke hook after load (may block)
    Hook-->>Registrar: unblock to continue
    Registrar->>Map: LoadOrStore metadata
    Reporter->>Reporter: dequeues payload from channel
    Reporter->>reportWorkerBeforeBuildReportDataHook: invoke hook (signals worker started)
    reportWorkerBeforeBuildReportDataHook-->>Reporter: unblocks worker
    Reporter->>Map: read metas to build ReportData
    Reporter->>Client: send ReportData (contains seed + racing metas)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4): Command failed
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@             Coverage Diff              @@
##            master    #67600      +/-  ##
===========================================
- Coverage   77.5455%  76.9877%  -0.5578%
===========================================
  Files          1963      1946       -17
  Lines        544697    548513     +3816
===========================================
- Hits         422388    422288      -100
- Misses       121499    125931     +4432
+ Partials        810       294      -516
```
Flags with carried forward coverage won't be shown. Click here to find out more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/util/topsql/reporter/datamodel.go (1)
640-655: ⚠️ Potential issue | 🟠 Major

Post-take() stale registrations increment the wrong generation's count.

Both register() paths snapshot data before take() can swap the backing map, but m.length.Add(1) always updates the receiver's current counter afterward. That lets a stale insert land in the old map while inflating the new generation's length, which can make the next interval hit MaxCollect early and drop fresh metas. The worker delay only fixes old-map visibility; it does not fix this counter drift.

Please swap the map and its count as one generation object, or otherwise wait for pre-take() registrations before resetting the counter.

Also applies to: 708-723
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/util/topsql/reporter/datamodel.go` around lines 640 - 655, The register() path can increment the wrong generation counter because it snapshots m.data but always calls m.length.Add(1) on the current receiver; change the generation model so map+counter are swapped together: introduce a generation struct (e.g., gen { data *sync.Map; length *atomic.Int64 }) and have m.data.Load()/Store() return that generation object; in normalizedSQLMap.register() load the gen atomically, call gen.data.LoadOrStore(...), and if not loaded call gen.length.Add(1) so the increment applies to the same generation whose map you inserted into; apply the same change to the other similar block referenced (lines ~708-723) so all registrations update the correct generation counter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/util/topsql/reporter/reporter_test.go`:
- Around line 197-208: The test must ensure reportWorker() exits before
restoring package hooks; change the cleanup order so tsr.Close() runs first,
then wait for the worker goroutine to finish via a done channel, and only after
that restore reportWorkerBeforeBuildReportDataHook and
normalizedMetaRegisterAfterLoadHook and close releaseRegister; specifically,
create a done chan (e.g. done := make(chan struct{}), have the worker or test
signal close(done) when reportWorker returns, call tsr.Close() in cleanup, then
block on <-done before resetting the package globals and closing releaseRegister
to avoid races and leaks.
In `@pkg/util/topsql/reporter/reporter.go`:
- Around line 377-383: The 100ms time.Sleep in reportWorker is a brittle timing
workaround; replace it with an explicit generation/in-flight handoff so
RegisterSQL/RegisterPlan cannot race with reportWorker serialization. Implement
a generation token or in-flight counter (e.g., an atomic uint64 or
sync.WaitGroup) that RegisterSQL/RegisterPlan bump or register with when
mutating the maps and that reportWorker reads before building payload (see
reportWorker and reportWorkerBeforeBuildReportDataHook); then have reportWorker
wait deterministically for that generation/in-flight set to quiesce (or perform
a safe map swap under a mutex) instead of sleeping so serialization only
proceeds after all LoadOrStore operations for that generation have completed.
Ensure users of Load/LoadOrStore observe the same generation protocol.
---
Outside diff comments:
In `@pkg/util/topsql/reporter/datamodel.go`:
- Around line 640-655: The register() path can increment the wrong generation
counter because it snapshots m.data but always calls m.length.Add(1) on the
current receiver; change the generation model so map+counter are swapped
together: introduce a generation struct (e.g., gen { data *sync.Map; length
*atomic.Int64 }) and have m.data.Load()/Store() return that generation object;
in normalizedSQLMap.register() load the gen atomically, call
gen.data.LoadOrStore(...), and if not loaded call gen.length.Add(1) so the
increment applies to the same generation whose map you inserted into; apply the
same change to the other similar block referenced (lines ~708-723) so all
registrations update the correct generation counter.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c98bdfbb-c48a-4bfd-bced-902c85203f5b
📒 Files selected for processing (3)
- pkg/util/topsql/reporter/datamodel.go
- pkg/util/topsql/reporter/reporter.go
- pkg/util/topsql/reporter/reporter_test.go
```go
tsr := NewRemoteTopSQLReporter(mockPlanBinaryDecoderFunc, mockPlanBinaryCompressFunc)
tsr.BindKeyspaceName([]byte("ks-race"))
t.Cleanup(tsr.Close)
t.Cleanup(func() {
	select {
	case <-releaseRegister:
	default:
		close(releaseRegister)
	}
	normalizedMetaRegisterAfterLoadHook = origRegisterHook
	reportWorkerBeforeBuildReportDataHook = origReportHook
})
```
Wait for reportWorker() to exit before restoring the package hooks.
t.Cleanup runs LIFO, so this currently restores reportWorkerBeforeBuildReportDataHook / normalizedMetaRegisterAfterLoadHook before tsr.Close(). Also, Close() does not join reportWorker(). On an early-failing path, that leaves a real background worker racing with hook restoration and can leak hook state into later tests.
Track the worker with a done channel, close the reporter first, wait for the goroutine to exit, and only then restore the package globals.
Also applies to: 210-212
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/util/topsql/reporter/reporter_test.go` around lines 197 - 208, The test
must ensure reportWorker() exits before restoring package hooks; change the
cleanup order so tsr.Close() runs first, then wait for the worker goroutine to
finish via a done channel, and only after that restore
reportWorkerBeforeBuildReportDataHook and normalizedMetaRegisterAfterLoadHook
and close releaseRegister; specifically, create a done chan (e.g. done :=
make(chan struct{}), have the worker or test signal close(done) when
reportWorker returns, call tsr.Close() in cleanup, then block on <-done before
resetting the package globals and closing releaseRegister to avoid races and
leaks.
2d64e36 to 7a687fc (Compare)
/retest-required
@flaky-claw: PRs from untrusted users cannot be marked as trusted with

Details: In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
FlakyFixer PR update summary

Fix:
Verification Spec:
Observed result:
Commands:
♻️ Duplicate comments (1)
pkg/util/topsql/reporter/reporter_test.go (1)
198-213: ⚠️ Potential issue | 🟠 Major

Restore the package hooks only after the worker is gone.

Because t.Cleanup is LIFO, the cleanup on Lines 201-209 runs before the tsr.Close registered on Line 200. If this test fails early, the goroutine started on Line 213 can still run while reportWorkerBeforeBuildReportDataHook / normalizedMetaRegisterAfterLoadHook are being reset, which can leak package-global state into later tests.

Suggested fix
```diff
 tsr := NewRemoteTopSQLReporter(mockPlanBinaryDecoderFunc, mockPlanBinaryCompressFunc)
 tsr.BindKeyspaceName([]byte("ks-race"))
-t.Cleanup(tsr.Close)
+reportWorkerDone := make(chan struct{})
+close(reportWorkerDone) // cleanup-safe before the worker starts
 t.Cleanup(func() {
+	tsr.Close()
+	<-reportWorkerDone
+	normalizedMetaRegisterAfterLoadHook = origRegisterHook
+	reportWorkerBeforeBuildReportDataHook = origReportHook
 	select {
 	case <-releaseRegister:
 	default:
 		close(releaseRegister)
 	}
-	normalizedMetaRegisterAfterLoadHook = origRegisterHook
-	reportWorkerBeforeBuildReportDataHook = origReportHook
 })
 ch := make(chan *ReportData, 1)
 require.NoError(t, tsr.Register(newMockDataSink(ch)))
-go tsr.reportWorker()
+reportWorkerDone = make(chan struct{})
+go func() {
+	defer close(reportWorkerDone)
+	tsr.reportWorker()
+}()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/util/topsql/reporter/reporter_test.go` around lines 198 - 213, The package-level hooks (reportWorkerBeforeBuildReportDataHook, normalizedMetaRegisterAfterLoadHook) are being restored before tsr.Close runs due to t.Cleanup LIFO ordering; register the cleanup that restores those hooks (and closes releaseRegister) before registering t.Cleanup(tsr.Close) so the tsr.Close (and the goroutine shutdown) runs first, then the hooks are reset afterward; reference the existing tsr.Close call, the hook vars reportWorkerBeforeBuildReportDataHook and normalizedMetaRegisterAfterLoadHook, and the releaseRegister channel to locate where to reorder the t.Cleanup registrations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/util/topsql/reporter/reporter_test.go`:
- Around line 198-213: The package-level hooks
(reportWorkerBeforeBuildReportDataHook, normalizedMetaRegisterAfterLoadHook) are
being restored before tsr.Close runs due to t.Cleanup LIFO ordering; register
the cleanup that restores those hooks (and closes releaseRegister) before
registering t.Cleanup(tsr.Close) so the tsr.Close (and the goroutine shutdown)
runs first, then the hooks are reset afterward; reference the existing tsr.Close
call, the hook vars reportWorkerBeforeBuildReportDataHook and
normalizedMetaRegisterAfterLoadHook, and the releaseRegister channel to locate
where to reorder the t.Cleanup registrations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: ec12e1a0-238f-456b-9874-c31eeac12feb
📒 Files selected for processing (2)
- pkg/util/topsql/reporter/datamodel.go
- pkg/util/topsql/reporter/reporter_test.go
@flaky-claw: The following tests failed, say

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
♻️ Duplicate comments (1)
pkg/util/topsql/reporter/reporter_test.go (1)
200-213: ⚠️ Potential issue | 🟠 Major

Join the manual reportWorker before restoring package hooks.

Line 213 starts a background worker, but the cleanup on Lines 200-209 still only relies on tsr.Close() before the globals are reset. The rest of this file already treats Close() as non-joining by waiting on reportWorkerDone, so an early-failing path here can still leave this worker alive while reportWorkerBeforeBuildReportDataHook / normalizedMetaRegisterAfterLoadHook are being restored.

🔧 Suggested cleanup pattern
```diff
 tsr := NewRemoteTopSQLReporter(mockPlanBinaryDecoderFunc, mockPlanBinaryCompressFunc)
 tsr.BindKeyspaceName([]byte("ks-race"))
+reportWorkerDone := make(chan struct{})
 t.Cleanup(func() {
+	tsr.Close()
+	<-reportWorkerDone
 	select {
 	case <-releaseRegister:
 	default:
 		close(releaseRegister)
 	}
 	normalizedMetaRegisterAfterLoadHook = origRegisterHook
 	reportWorkerBeforeBuildReportDataHook = origReportHook
 })
-t.Cleanup(tsr.Close)
@@
-go tsr.reportWorker()
+go func() {
+	defer close(reportWorkerDone)
+	tsr.reportWorker()
+}()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/util/topsql/reporter/reporter_test.go` around lines 200 - 213, The test starts a background goroutine via reportWorker but the t.Cleanup restores global hooks and calls tsr.Close without ensuring the worker has exited; change the cleanup to call tsr.Close() then wait for the worker to finish (e.g. wait on reportWorkerDone or otherwise join the reportWorker) before restoring normalizedMetaRegisterAfterLoadHook and reportWorkerBeforeBuildReportDataHook and closing releaseRegister so the background goroutine cannot run while globals are reset.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/util/topsql/reporter/reporter_test.go`:
- Around line 200-213: The test starts a background goroutine via reportWorker
but the t.Cleanup restores global hooks and calls tsr.Close without ensuring the
worker has exited; change the cleanup to call tsr.Close() then wait for the
worker to finish (e.g. wait on reportWorkerDone or otherwise join the
reportWorker) before restoring normalizedMetaRegisterAfterLoadHook and
reportWorkerBeforeBuildReportDataHook and closing releaseRegister so the
background goroutine cannot run while globals are reset.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c191df9a-c506-45bc-8e7e-ca3b0d26f5d0
📒 Files selected for processing (2)
- pkg/util/topsql/reporter/datamodel.go
- pkg/util/topsql/reporter/reporter_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/util/topsql/reporter/datamodel.go
What problem does this PR solve?
Issue Number: close #67578
Problem Summary:
Flaky test TestTopRUPipelineInProcessIntegration in pkg/util/topsql/reporter intermittently fails, so this PR stabilizes that path.
Root Cause
reportWorker was changed to treat taken SQL/plan meta maps as immutable, even though in-flight RegisterSQL/RegisterPlan calls can still write into the old map after enqueue via a stale loaded pointer.
Fix
Restored the reportWorker settle delay and replaced the prior regression test with a deterministic stale-pointer timing repro that validates the real in-flight meta registration race.
Verification
The native flaky test stayed weak pre-fix; the new timing-only repro failed before the fix; after the fix, the repro plus TestTopRUPipelineInProcessIntegration passed and make lint succeeded.
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Fixes #67578