WIP: statistics, executor: collect singleton sketches for row sampling#68157

Open
0xPoe wants to merge 4 commits into pingcap:master from 0xPoe:stats-row-sampler-singleton-sketch

Conversation

@0xPoe
Member

@0xPoe 0xPoe commented May 4, 2026

What problem does this PR solve?

Issue Number: ref #67449

Problem Summary:

Row-sampling analyze needs singleton sketches to improve NDV estimation from distributed samples.

What changed and how does it work?

  • Bump tipb to include row-sample singleton sketch fields.
  • Collect and serialize singleton sketches in row sampler.
  • Use per-worker singleton sketches to estimate NDV during analyze stats building.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Manual test:

  • make bazel_prepare
  • make lint

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Improve NDV estimation for row-sampling ANALYZE by collecting singleton sketches.

Summary by CodeRabbit

  • Improvements

    • Enhanced number-of-distinct-values estimation for column statistics using sketch-based methods.
    • Improved row sampling accuracy with singleton sketch mechanism for statistics collection.
  • Dependencies

    • Updated github.com/pingcap/tipb dependency.
  • Tests

    • Added tests for sketch-based NDV estimation and singleton sketch sampling functionality.

@ti-chi-bot

ti-chi-bot Bot commented May 4, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels May 4, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bb7133, time-and-fate for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tiprow

tiprow Bot commented May 4, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Walkthrough

This PR updates the github.com/pingcap/tipb dependency and implements a singleton FM-sketch based mechanism for estimating column NDV (number of distinct values) in histogram sampling, integrating sketch collection, merging, and NDV estimation across the statistics and executor layers.

Changes

Dependency Update

  • Go Toolchain (DEPS.bzl, go.mod): github.com/pingcap/tipb pinned to v0.0.0-20260414032333-da912b84de6f (updated from 20260210113932-1447c9d7e9fe).

Singleton FM-Sketch NDV Estimation

  • Sketch Hashing (pkg/statistics/fmsketch.go): InsertValue and InsertRowValue refactored to delegate hashing to new hashDatum and hashRow helpers, centralizing error propagation for encode/hash failures.
  • Sketch Collection & Merging (pkg/statistics/row_sampler.go): baseCollector extended with singleton sketch builders (singletonBuilders, SingletonSketches) and warm-up–throttled sketch sampling via sketchSampleRate. Updated Collect, collectColumns, collectColumnGroups, and merge methods to build and merge singleton sketches; protobuf serialization/deserialization updated to include singleton sketches and sample counts.
  • NDV Estimation Integration (pkg/executor/analyze_col_sampling.go): buildSamplingStats accumulates per-node sketch data and computes sketch-based NDV estimates via new estimateNDVsBySketch, which skips "special" indexes to avoid overwriting pushdown results. subBuildWorker accepts estimateNDVs and conditionally overrides histogram NDV when sketch estimates exceed sample-derived NDV. New helpers copySketches and estimateNDVsBySketch extract and estimate NDV from collected sketches.
  • Tests (pkg/executor/analyze_utils_test.go, pkg/statistics/sample_test.go): TestEstimateNDVsBySketch validates sketch-based NDV estimation with singleton sketches and special-index handling. SubTestRowSampleSingletonSketches exercises row sampling with singleton sketches, validates proto round-trip, and asserts NDV correctness. Helper mustBuildFMSketch constructs FM sketches for testing.

Sequence Diagram

sequenceDiagram
    participant Exec as Executor
    participant Sampler as Row Sampler
    participant Builder as Sketch Builder
    participant Merger as Sketch Merger
    participant NDVEst as NDV Estimator

    Exec->>Sampler: Collect rows with FMSketch
    Sampler->>Builder: Sample & insert into singleton builders (warm-up throttled)
    Builder->>Sampler: Accumulate hashed values per column/group
    Sampler->>Merger: BuildSingletonSketches() after iteration
    Merger->>Sampler: Populate SingletonSketches from builders
    Exec->>Merger: MergeCollector() to combine node results
    Merger->>Sampler: Merge singleton sketches & sample counts
    Exec->>NDVEst: Compute estimateNDVsBySketch(root sketches, node data, special index map)
    NDVEst->>Exec: Return per-column NDV estimates (skip special indexes)
    Exec->>Exec: Override hist.NDV when sketch estimate > sample NDV

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • time-and-fate
  • mjonss
  • henrybw

🐰 Sketches now dance in pairs, both FM and singletons fair,
NDV flows through warm-up gates, with warm embrace it calculates,
No special index left behind—just merged and merged so fine!
Histogram histograms bloom, as singleton sketches light the room.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 13.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description check: the description follows the template with required sections completed: issue reference, problem summary, changes explained, tests checked, and release note provided.
  • Linked Issues check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: skipped because no linked issues were found for this pull request.
  • Title check: the title 'WIP: statistics, executor: collect singleton sketches for row sampling' clearly and specifically describes the main changes: collecting singleton sketches within the statistics/executor packages for row sampling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 4, 2026
@0xPoe 0xPoe marked this pull request as ready for review May 4, 2026 09:35
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 4, 2026
@pantheon-ai

pantheon-ai Bot commented May 4, 2026

@0xPoe I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@0xPoe 0xPoe changed the title statistics, executor: collect singleton sketches for row sampling WIP: statistics, executor: collect singleton sketches for row sampling May 4, 2026
@ti-chi-bot ti-chi-bot Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 4, 2026
@0xPoe 0xPoe force-pushed the stats-row-sampler-singleton-sketch branch from 8a78790 to 1d78d22 Compare May 4, 2026 09:42
@0xPoe 0xPoe marked this pull request as draft May 4, 2026 09:42
@0xPoe 0xPoe changed the title WIP: statistics, executor: collect singleton sketches for row sampling statistics, executor: collect singleton sketches for row sampling May 4, 2026
@0xPoe 0xPoe marked this pull request as ready for review May 4, 2026 09:44
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 4, 2026
@pantheon-ai

pantheon-ai Bot commented May 4, 2026

Review Complete

Findings: 0 issues
Posted: 0
Duplicates/Skipped: 0

ℹ️ Learn more details on Pantheon AI.

@0xPoe 0xPoe changed the title statistics, executor: collect singleton sketches for row sampling WIP: statistics, executor: collect singleton sketches for row sampling May 4, 2026
@ti-chi-bot ti-chi-bot Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 4, 2026
@tiprow

tiprow Bot commented May 4, 2026

@0xPoe: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • fast_test_tiprow (commit 1d78d22, required): /test fast_test_tiprow

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


@pantheon-ai pantheon-ai Bot left a comment


✅ Code looks good. No issues found.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
pkg/statistics/sample_test.go (1)

275-316: ⚡ Quick win

Please cover the merge path too.

This subtest only exercises build + proto round-trip. The analyze path consumes singleton sketches after FromProto() and MergeCollector(), so a deterministic case where the same value is singleton in two children and must disappear after merge would protect the new behavior much better.

As per coding guidelines, "Prefer extending existing test suites and fixtures over creating new scaffolding."


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f100b334-ed63-4074-9cae-e68df8946d0b

📥 Commits

Reviewing files that changed from the base of the PR and between 33ae9e3 and 1d78d22.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
  • DEPS.bzl
  • go.mod
  • pkg/executor/analyze_col_sampling.go
  • pkg/executor/analyze_utils_test.go
  • pkg/statistics/fmsketch.go
  • pkg/statistics/row_sampler.go
  • pkg/statistics/sample_test.go

Comment thread DEPS.bzl
Comment on lines +6585 to +6591
sha256 = "68768a27ed6c35716fcb01a0b4a15ff13e5c1a5dc11acc7a3d44ba02a2742077",
strip_prefix = "github.com/pingcap/tipb@v0.0.0-20260414032333-da912b84de6f",
urls = [
"http://bazel-cache.pingcap.net:8080/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260210113932-1447c9d7e9fe.zip",
"http://ats.apps.svc/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260210113932-1447c9d7e9fe.zip",
"https://cache.hawkingrei.com/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260210113932-1447c9d7e9fe.zip",
"https://storage.googleapis.com/pingcapmirror/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260210113932-1447c9d7e9fe.zip",
"http://bazel-cache.pingcap.net:8080/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f.zip",
"http://ats.apps.svc/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f.zip",
"https://cache.hawkingrei.com/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f.zip",
"https://storage.googleapis.com/pingcapmirror/gomod/github.com/pingcap/tipb/com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f.zip",


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Bazel build is broken — mirror artifacts return 404 for the new tipb version.

The pipeline confirms that the zip for com_github_pingcap_tipb-v0.0.0-20260414032333-da912b84de6f does not exist at either the cache.hawkingrei.com or storage.googleapis.com mirror URLs. Bazel resolves URLs in order, so once all four fail the build is completely blocked.

This typically means the artifact still needs to be uploaded/mirrored before the DEPS.bzl entry can be merged. The steps are usually:

  1. Ensure go.sum contains the correct hash for the new pseudo-version.
  2. Upload the module zip to all four mirror locations (internal cluster mirrors + the two public caches).
  3. Re-run the pipeline to confirm the fetch succeeds before removing the do-not-merge label.

Comment on lines +965 to +973
var sampleSize uint64
for _, size := range nodeSketchSampleCounts {
sampleSize += uint64(size)
}
if sampleSize == 0 {
for _, size := range nodeSampleSizes {
sampleSize += uint64(size)
}
}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Use per-slice sketch sample counts here.

sampleSize is aggregated once for the whole collector, but the sketches are not populated uniformly: collectColumns() skips null single-column values while multi-column groups still hash them. That means sampleNDV/singletonItems for a nullable slice can be computed from fewer sampled rows than the shared sampleSize passed to EstimateNDVByGEE(), which skews the estimate. Please track the number of rows that actually contributed to each sketch and use that per i.

Also applies to: 992-999


Comment on lines +435 to +450
func (s *baseCollector) mergeSingletonSketches(singletonSketches []*FMSketch) {
if len(singletonSketches) == 0 {
return
}
if len(s.SingletonSketches) < len(singletonSketches) {
s.SingletonSketches = append(s.SingletonSketches, make([]*FMSketch, len(singletonSketches)-len(s.SingletonSketches))...)
}
for i, singletonSketch := range singletonSketches {
if singletonSketch == nil {
continue
}
if s.SingletonSketches[i] == nil {
s.SingletonSketches[i] = singletonSketch.Copy()
} else {
s.SingletonSketches[i].MergeFMSketch(singletonSketch)
}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't merge singleton sketches with FM union.

singletonSketch means “seen exactly once in this partition”, but MergeFMSketch() only unions hash membership. If a value is singleton in two child collectors, it still survives in the merged sketch even though it is no longer a singleton for the merged partition. buildSamplingStats() later feeds these merged sketches into estimateNDVsBySketch(), so singletonItems is biased upward and NDV can be overstated. Please preserve singleton sketches at the original collector granularity or merge them with once/multiple state instead.


Comment on lines +485 to +487
s.SingletonSketches = make([]*FMSketch, 0, len(pbCollector.GetSingletonSketch()))
for _, pbSketch := range pbCollector.GetSingletonSketch() {
s.SingletonSketches = append(s.SingletonSketches, FMSketchFromProto(pbSketch))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize deserialized singleton sketches with a non-zero maxSize.

FMSketchFromProto() leaves maxSize == 0. Here that sketch can later be copied into s.SingletonSketches and become the destination of MergeFMSketch() in mergeSingletonSketches(), which makes every insert trigger shrink logic and corrupts the sketch. Please normalize maxSize during deserialization before reusing these sketches.

Suggested fix
for _, pbSketch := range pbCollector.GetSingletonSketch() {
-	s.SingletonSketches = append(s.SingletonSketches, FMSketchFromProto(pbSketch))
+	sketch := FMSketchFromProto(pbSketch)
+	if sketch != nil && sketch.maxSize == 0 {
+		sketch.maxSize = MaxSketchSize
+	}
+	s.SingletonSketches = append(s.SingletonSketches, sketch)
}

@ti-chi-bot

ti-chi-bot Bot commented May 4, 2026

@0xPoe: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • pull-build-next-gen (commit 1d78d22, required): /test pull-build-next-gen
  • idc-jenkins-ci-tidb/build (commit 1d78d22, required): /test build
  • idc-jenkins-ci-tidb/check_dev (commit 1d78d22, required): /test check-dev
  • idc-jenkins-ci-tidb/unit-test (commit 1d78d22, required): /test unit-test
  • pull-unit-test-next-gen (commit 1d78d22, required): /test pull-unit-test-next-gen
  • idc-jenkins-ci-tidb/check_dev_2 (commit 1d78d22, required): /test check-dev2
  • pull-integration-realcluster-test-next-gen (commit 1d78d22, required): /test pull-integration-realcluster-test-next-gen
  • idc-jenkins-ci-tidb/mysql-test (commit 1d78d22, required): /test mysql-test

Full PR test history. Your PR dashboard.


@codecov

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.2950%. Comparing base (33ae9e3) to head (1d78d22).
⚠️ Report is 18 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #68157        +/-   ##
================================================
- Coverage   77.7624%   77.2950%   -0.4675%     
================================================
  Files          1990       1984         -6     
  Lines        551788     555565      +3777     
================================================
+ Hits         429084     429424       +340     
- Misses       121784     125397      +3613     
+ Partials        920        744       -176     
Flags:
  • integration: 50.7858% <ø> (+10.9839%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components:
  • dumpling: 60.4888% <ø> (ø)
  • parser: ∅ <ø> (∅)
  • br: 50.0549% <ø> (-13.0386%) ⬇️


Labels

component/statistics do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/planner SIG: Planner size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
