statistics: add global singleton estimation for sample-based NDV#67593
statistics: add global singleton estimation for sample-based NDV#67593ti-chi-bot[bot] merged 21 commits intopingcap:masterfrom
Conversation
Signed-off-by: 0xPoe <techregister@pm.me>
Signed-off-by: 0xPoe <techregister@pm.me>
Signed-off-by: 0xPoe <techregister@pm.me>
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdded exported function Changes
Sequence Diagram(s)(omitted) Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Skipping CI for Draft Pull Request. |
Signed-off-by: 0xPoe <techregister@pm.me>
|
@0xPoe I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/statistics/estimate.go (1)
119-156: Skip regions with nil singleton sketches to avoid unnecessary O(k) work.At Line 119-156, when
singletonSketches[i] == nil, contribution is always 0, so buildingotheris wasted work. A fast-pathcontinuetrims needless merges without changing behavior.♻️ Proposed refactor
func EstimateGlobalSingletonBySketches(ndvSketches, singletonSketches []*FMSketch) uint64 { @@ var globalSingleton int64 for i := range ndvSketches { + if singletonSketches[i] == nil { + // No local singleton contribution from this region. + continue + } + // Merge NDV sketches from all regions except region i. var other *FMSketch for j, ns := range ndvSketches { @@ - if other != nil { - if singletonSketches[i] != nil { - other.MergeFMSketch(singletonSketches[i]) - } - } else if singletonSketches[i] != nil { - other = singletonSketches[i].Copy() + if other != nil { + other.MergeFMSketch(singletonSketches[i]) + } else { + other = singletonSketches[i].Copy() }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/estimate.go` around lines 119 - 156, The loop over ndvSketches does unnecessary work when singletonSketches[i] is nil; add a fast-path at the start of the loop to continue if singletonSketches[i] == nil so we skip building/merging the `other` FMSketch for that region. Update the loop in the function using ndvSketches/singletonSketches (references: ndvSketches, singletonSketches, FMSketch, MergeFMSketch, NDV, globalSingleton) to check singletonSketches[i] first and only perform the existing merging logic when a non-nil singleton exists.pkg/statistics/estimate_test.go (1)
148-160: Consider adding coverage for the non-assert fallback path (return 0).Line 148-160 validates panic behavior, but the function also intentionally returns
0for invalid input when assertions are not active. A small extra test for that path would make behavior guarantees clearer across build modes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/estimate_test.go` around lines 148 - 160, Add a test that exercises the non-panic fallback of EstimateGlobalSingletonBySketches (returns 0 when inputs are invalid) by creating a separate test file compiled under a custom build tag (e.g., //go:build noassert) so it runs in a build where assertions are disabled; in that file implement TestEstimateGlobalSingletonBySketches_Fallback which calls EstimateGlobalSingletonBySketches(nil, nil) and asserts the result equals 0 (use require.Equal), and document that the test is executed with go test -tags noassert.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/statistics/estimate_test.go`:
- Around line 148-160: Add a test that exercises the non-panic fallback of
EstimateGlobalSingletonBySketches (returns 0 when inputs are invalid) by
creating a separate test file compiled under a custom build tag (e.g.,
//go:build noassert) so it runs in a build where assertions are disabled; in
that file implement TestEstimateGlobalSingletonBySketches_Fallback which calls
EstimateGlobalSingletonBySketches(nil, nil) and asserts the result equals 0 (use
require.Equal), and document that the test is executed with go test -tags
noassert.
In `@pkg/statistics/estimate.go`:
- Around line 119-156: The loop over ndvSketches does unnecessary work when
singletonSketches[i] is nil; add a fast-path at the start of the loop to
continue if singletonSketches[i] == nil so we skip building/merging the `other`
FMSketch for that region. Update the loop in the function using
ndvSketches/singletonSketches (references: ndvSketches, singletonSketches,
FMSketch, MergeFMSketch, NDV, globalSingleton) to check singletonSketches[i]
first and only perform the existing merging logic when a non-nil singleton
exists.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5ae4a6e0-710e-41eb-945d-e1d3da38334c
📒 Files selected for processing (3)
pkg/statistics/BUILD.bazelpkg/statistics/estimate.gopkg/statistics/estimate_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #67593 +/- ##
================================================
+ Coverage 77.6342% 78.3342% +0.7000%
================================================
Files 1961 1974 +13
Lines 543867 552850 +8983
================================================
+ Hits 422227 433071 +10844
+ Misses 120830 119300 -1530
+ Partials 810 479 -331
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
/retest |
The algorithm is not region-specific; use the more general term "node" to describe each partition that contributes local sketches.
Signed-off-by: 0xPoe <techregister@pm.me>
0xPoe
left a comment
There was a problem hiding this comment.
🔢 Self-check (PR reviewed by myself and ready for feedback)
-
Code compiles successfully
-
Unit tests added
-
No AI-generated elegant nonsense in PR.
-
Bazel files updated
-
Comments added where necessary
-
PR title and description updated
-
Documentation PR created (or confirmed not needed)
-
PR size is reasonable
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/statistics/estimate.go (1)
157-161: Consider whether negative globalSingleton indicates a bug.The clamping to 0 silently handles cases where FM sketch estimation errors cause
ndvUnion - ndvOther < 0. This is reasonable for probabilistic sketches, but you might want to add a brief comment explaining why this can happen (sketch approximation errors) to avoid future confusion.📝 Optional: Add clarifying comment
+ // FM sketch estimates can have approximation errors; if the sum goes + // negative, clamp to zero. if globalSingleton < 0 { return 0 }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/estimate.go` around lines 157 - 161, The code currently clamps negative globalSingleton to zero in the function returning uint64(globalSingleton); add a brief clarifying comment immediately above this check explaining that negative values can arise from probabilistic sketch estimation error (e.g., FM sketch where ndvUnion - ndvOther may be slightly negative) and that clamping to 0 is intentional to avoid returning a negative count; reference the variable name globalSingleton and the surrounding logic that computes ndvUnion/ndvOther so future readers understand this is an approximation artifact, not a real negative cardinality.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/statistics/estimate.go`:
- Around line 157-161: The code currently clamps negative globalSingleton to
zero in the function returning uint64(globalSingleton); add a brief clarifying
comment immediately above this check explaining that negative values can arise
from probabilistic sketch estimation error (e.g., FM sketch where ndvUnion -
ndvOther may be slightly negative) and that clamping to 0 is intentional to
avoid returning a negative count; reference the variable name globalSingleton
and the surrounding logic that computes ndvUnion/ndvOther so future readers
understand this is an approximation artifact, not a real negative cardinality.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 4e46169a-3fce-4c20-90f5-e69a16e2f8fa
📒 Files selected for processing (2)
pkg/statistics/estimate.gopkg/statistics/estimate_test.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/statistics/estimate.go (1)
111-120: Defensive nil-entry handling is incomplete.The
intest.Assertcalls on lines 113-116 enforce that nil entries are disallowed, but the defensive runtime checks (lines 118-120) only cover empty slices and length mismatches—not nil entries. This means:
- In test builds (with
intestenabled): nil entries panic immediately.- In production builds: nil entries are silently skipped in the loop (lines 127, 148, 151), potentially masking caller bugs.
Consider adding a defensive runtime check that returns
0(or logs a warning) if any entry is nil, to match the assertion intent and maintain consistent behavior across build modes.💡 Optional fix to add defensive nil-entry check
// Defensive checks. if len(ndvSketches) == 0 || len(ndvSketches) != len(singletonSketches) { return 0 } + for i := range ndvSketches { + if ndvSketches[i] == nil || singletonSketches[i] == nil { + return 0 + } + }If this is added, the nil checks on lines 127, 148, and 151 become truly unreachable (dead code) and could be simplified, though keeping them as extra defense-in-depth is also acceptable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/statistics/estimate.go` around lines 111 - 120, The code currently asserts that ndvSketches and singletonSketches contain no nil entries via intest.Assert, but the runtime defensive checks only handle empty slices and length mismatches; add an explicit runtime nil-entry check over ndvSketches and singletonSketches (iterate and if any element is nil) and return 0 (or log and return 0) to mirror the assertion behavior for production builds; update any later branches in the same function that check for nil entries (e.g., the checks around the loops referencing ndvSketches[i] and singletonSketches[i]) as they become unreachable or can be simplified after this new guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/statistics/estimate.go`:
- Around line 111-120: The code currently asserts that ndvSketches and
singletonSketches contain no nil entries via intest.Assert, but the runtime
defensive checks only handle empty slices and length mismatches; add an explicit
runtime nil-entry check over ndvSketches and singletonSketches (iterate and if
any element is nil) and return 0 (or log and return 0) to mirror the assertion
behavior for production builds; update any later branches in the same function
that check for nil entries (e.g., the checks around the loops referencing
ndvSketches[i] and singletonSketches[i]) as they become unreachable or can be
simplified after this new guard.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1bc16bca-1c5c-4040-be59-fcf5e17a3be4
📒 Files selected for processing (2)
pkg/statistics/estimate.gopkg/statistics/estimate_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/statistics/estimate_test.go
| // Summing these per-node contributions gives the global singleton estimate. | ||
| // | ||
| // The current implementation rebuilds "all except i" from scratch for each | ||
| // node, which is O(k²) in the number of nodes. A prefix-suffix approach |
There was a problem hiding this comment.
For now, I believe because we don't really do real sampling, if I understand correctly, the number of sketches we get from TiKV equals the region numbers of the tables.
Because the region becomes bigger and bigger, I think the current performance would be acceptable, but we do need to re-evaluate in the future if we really want to do some random key range sampling specified within the coprocessor requests. For now, I think it would be good enough.
There was a problem hiding this comment.
So, if I understand correctly, there are sketches for all distinct values in the samples, which is needed for evaluating if a local singleton would add to the global singleton? I.e. not the nice feature of an FMSketch, where only a fixed level of distinct values are kept.
If so, then why not simply have a map[<column value or hash of column value>]uint64 directly then?
so if there are very few distinct values, the memory is small, and if all distinct values it will grow to the number of samples?
Easy to merge (and one can even delete from it) and you can just iterate over it once to see global singletons.
Even if we would re-use FMSketches for f1 values, we could add a counter to the FMSketch, to see if the 'survived' values was singletons or not. Maybe we could simply use that for TopN and Histograms as well as a unified model somehow? (we most likely would need to store the full column value for that, but that would replace the sample anyway, so nothing extra to be stored, but rather limited to a single copy of each distinct value. Not sure how correlation is calculated though).
There was a problem hiding this comment.
So, if I understand correctly, there are sketches for all distinct values in the samples, which is needed for evaluating if a local singleton would add to the global singleton? I.e. not the nice feature of an FMSketch, where only a fixed level of distinct values are kept.
No, we do use that feature. The single value remains the estimated value. We couldn't send all values back; for example, if the column contains almost unique values. In that case, for a table with 1 billion entries, we would need to send back 50 million hash values. Therefore, we have to use the FM sketch here.
There was a problem hiding this comment.
OK, so will it be two sets, one for singletons and one for multiton (> 1 entry)?
Why not generalize it to instead of a set (or map without value) be a map with bool multipleValues (false for singleton, true for multiton). Not sure if that would save much space or processing though :) Feel free to ignore, it is mostly for my understanding.
There was a problem hiding this comment.
OK, so will it be two sets, one for singletons and one for multiton (> 1 entry)?
Why not generalize it to instead of a set (or map without value) be a map with boolmultipleValues(false for singleton, true for multiton). Not sure if that would save much space or processing though :) Feel free to ignore, it is mostly for my understanding.
Yes, we can do that.
This calculation is on the TiKV side. I assume this is not related to the FMSketch itself?
Signed-off-by: 0xPoe <techregister@pm.me>
Signed-off-by: 0xPoe <techregister@pm.me>
Signed-off-by: 0xPoe <techregister@pm.me>
Signed-off-by: 0xPoe <techregister@pm.me>
Signed-off-by: 0xPoe <techregister@pm.me>
0xPoe
left a comment
There was a problem hiding this comment.
🔢 Self-check (PR reviewed by myself and ready for feedback)
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mjonss, time-and-fate The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
|
/test all |
|
/retest |
9 similar comments
|
/retest |
|
/retest |
|
/retest |
|
/retest |
|
/retest |
|
/retest |
|
/retest |
|
/retest |
|
/retest |
What problem does this PR solve?
Issue Number: ref #67449
Problem Summary:
Sample-based NDV collection needs a way to estimate how many sampled singleton values remain globally unique after data is merged across nodes.
What changed and how does it work?
EstimateGlobalSingletonBySketchesto estimate globally unique singleton values by merging each node's singleton sketch against the union of the other nodes' NDV sketches.Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
New Features
Tests
Chores