metrics: enhance diagnostic capabilities for gRPC network issues #67811

Open
zyguan wants to merge 9 commits into pingcap:master from zyguan:dev/bump-client-go

Conversation

Contributor

@zyguan zyguan commented Apr 16, 2026

What problem does this PR solve?

Issue Number: close #67810

Problem Summary: ref #67810

What changed and how does it work?

Bump client-go and register channelz collector.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • Chores

    • Updated pinned third-party Go/Bazel dependency version.
  • New Features

    • Added gRPC Channelz metrics collection to report channel/socket health.
  • Tests

    • Added unit tests for the Channelz collector and metrics gathering.
    • Improved test cleanup and harness (explicit collector teardown, added leak-ignore rules for gRPC/bufconn, increased some test shard counts).

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot Bot added the release-note-none Denotes a PR that doesn't merit a release note. label Apr 16, 2026

pantheon-ai Bot commented Apr 16, 2026

@zyguan I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 16, 2026

tiprow Bot commented Apr 16, 2026

Hi @zyguan. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


coderabbitai Bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a mutex-guarded gRPC Channelz Prometheus collector (with bufconn server/client and test cleanup), updates pinned github.com/tikv/client-go/v2 pseudo-version/sha256, adjusts Bazel build/test deps and sharding, and adds unit tests and goleak ignore entries across tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency Metadata: DEPS.bzl, go.mod | Bumped pinned pseudo-version for github.com/tikv/client-go/v2 (updated strip_prefix, urls, and sha256) and updated go.mod requirement to the new pseudo-version. |
| Metrics Implementation: pkg/metrics/metrics.go | Introduces Channelz collector: mutex-guarded singleton state, bufconn-based local gRPC server, client dialer, collector creation with filters, Prometheus registration, and stop/cleanup helpers (test short-circuit). |
| Metrics Tests & Hooks: pkg/metrics/main_test.go, pkg/metrics/metrics_internal_test.go | Adds goleak cleanup hook and new unit tests for singleton init, test-mode skipping, cleanup/reset, and Prometheus Gather assertions; includes helper functions to inspect metric families. |
| Build / Test Config: pkg/importsdk/BUILD.bazel, pkg/metrics/BUILD.bazel, br/pkg/metautil/BUILD.bazel | Added //pkg/parser/ast to importsdk_test; added //pkg/util/intest, tikv Channelz collectors and gRPC bufconn/insecure/channelz deps to pkg/metrics library; increased shard_count for metrics_test 5→8 and metautil_test 13→15. |
| goleak Ignore Additions: multiple test mains (e.g., br/cmd/br/main_test.go, pkg/server/.../main_test.go, pkg/server/tests/.../main_test.go) | Added goleak.IgnoreTopFunction entries for google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run and google.golang.org/grpc/test/bufconn.(*Listener).Accept across many TestMain files to suppress bufconn/grpc-related goroutine reports. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application / Test
    participant Metrics as pkg/metrics
    participant Mutex as grpcChannelzCollector.mu
    participant Server as bufconn gRPC Server
    participant Client as gRPC ClientConn
    participant Channelz as ChannelzCollector
    participant Prom as Prometheus Registry

    App->>Metrics: setupChannelzCollector()
    rect rgba(100,150,200,0.5)
        Metrics->>Mutex: Lock
        Metrics->>Metrics: check intest.InTest / registered
    end

    alt Not in test and not registered
        Metrics->>Server: start bufconn server + register Channelz service
        Metrics->>Client: dial via bufconn dialer
        Metrics->>Channelz: NewChannelzCollector(Client, opts)
        Metrics->>Prom: prometheus.MustRegister(Channelz)
        Metrics->>Metrics: set registered = true
    end

    rect rgba(100,150,200,0.5)
        Metrics->>Mutex: Unlock
    end

    App->>Prom: Gather()
    Prom->>Channelz: Collect()
    Channelz-->>Prom: MetricFamilies
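
The locking and "register once" flow in the diagram can be sketched with the standard library alone. This is a minimal illustration and not the PR's code: `collectorState`, its `setup` method, the `inTest` parameter, and the `starts` counter are hypothetical stand-ins for `grpcChannelzCollector`, `setupChannelzCollector`, the `intest.InTest` check, and the real bufconn/Prometheus initialization.

```go
package main

import (
	"fmt"
	"sync"
)

// collectorState is a hypothetical stand-in for the PR's grpcChannelzCollector.
type collectorState struct {
	mu         sync.Mutex
	registered bool
	starts     int // how many times the expensive init ran (illustration only)
}

// setup follows the diagram: take the lock, skip when running in test mode or
// when already registered, otherwise initialize once and set the flag.
// It reports whether this call performed the initialization.
func (c *collectorState) setup(inTest bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if inTest || c.registered {
		return false
	}
	// The real code would start the in-memory gRPC server, dial it,
	// build the collector, and register it with Prometheus here.
	c.starts++
	c.registered = true
	return true
}

func main() {
	var c collectorState
	var wg sync.WaitGroup
	// Concurrent callers race to set up the collector; only one wins.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.setup(false)
		}()
	}
	wg.Wait()
	fmt.Println("initializations:", c.starts)
}
```

Holding the mutex across both the check and the initialization is what makes repeated or concurrent calls safe; a bare boolean flag without the lock would allow two callers to initialize at once.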

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

component/statistics, ok-to-test

Suggested reviewers

  • yibin87

Poem

🐰 I tunneled bufconn lanes below,
I guarded metrics with a mutex glow,
Prom counts hops where diagnostics go,
I clean the burrow after tests run slow,
Hop — channelz stories start to show.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title is specific and directly related to the main changeset, which implements a gRPC channelz collector for improved diagnostics of network issues. |
| Description check | ✅ Passed | The description follows the template and includes a properly formatted issue reference (close #67810), a concise explanation of changes, and appropriate test/release note declarations. |
| Linked Issues check | ✅ Passed | The PR implements both key objectives from #67810: exporting gRPC internal metrics to Prometheus via channelz collector registration and improving observability for connection-level diagnostics. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the objectives: bumping client-go dependency and implementing channelz collector setup with supporting test infrastructure and goleak configuration updates. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/metrics/metrics.go (1)

499-531: Consider adding a brief startup synchronization or documenting the bufconn behavior.

The goroutine starting the gRPC server (line 509-513) runs asynchronously. While bufconn makes this safe because listener.DialContext will work immediately, a brief comment explaining why no explicit synchronization is needed would help future readers understand the design choice.

📝 Suggested documentation improvement
 	grpcChannelzCollector.server = grpc.NewServer()
 	service.RegisterChannelzServiceToServer(grpcChannelzCollector.server)
+	// The server is started asynchronously, but bufconn.Listener.DialContext works
+	// immediately without waiting for Serve() to be called, so no synchronization is needed.
 	go func(listener *bufconn.Listener, server *grpc.Server) {
 		if err := server.Serve(listener); err != nil {
 			logutil.BgLogger().Warn("internal channelz grpc server stopped", zap.Error(err))
 		}
 	}(grpcChannelzCollector.listener, grpcChannelzCollector.server)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/metrics.go` around lines 499 - 531, Add a short inline comment in
initGrpcChannelzCollectorLocked near the goroutine that starts the in-memory
gRPC server explaining that no explicit synchronization is required because
bufconn.Listen returns a ready listener and listener.DialContext will succeed
immediately (so dialing from the client goroutine is safe), and note that the
goroutine is only for Serve's lifecycle and errors are logged — reference the
goroutine that launches server.Serve(listener), the local variable listener
(grpcChannelzCollector.listener), and the DialContext usage in the
grpc.WithContextDialer closure to make the rationale easy to find.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 15a22892-57e7-441c-88eb-fb0d72cefd36

📥 Commits

Reviewing files that changed from the base of the PR and between 7762bc6 and 0a2b8ef.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
  • DEPS.bzl
  • go.mod
  • pkg/importsdk/BUILD.bazel
  • pkg/metrics/BUILD.bazel
  • pkg/metrics/main_test.go
  • pkg/metrics/metrics.go
  • pkg/metrics/metrics_internal_test.go


codecov Bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.1970%. Comparing base (84d8269) to head (d9bbcac).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67811        +/-   ##
================================================
+ Coverage   77.7288%   78.1970%   +0.4681%     
================================================
  Files          1990       1990                
  Lines        551970     552755       +785     
================================================
+ Hits         429040     432238      +3198     
+ Misses       122010     119515      -2495     
- Partials        920       1002        +82     
Flag Coverage Δ
integration 44.5976% <ø> (+4.7958%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 60.4888% <ø> (ø)
parser ∅ <ø> (∅)
br 65.2716% <ø> (+2.1781%) ⬆️

Contributor Author

zyguan commented Apr 16, 2026

/retest


tiprow Bot commented Apr 16, 2026

@zyguan: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor Author

zyguan commented Apr 16, 2026

/retest

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 16, 2026
Comment thread pkg/metrics/metrics.go
prometheus.MustRegister(StmtSummaryWindowEvictedCount)

// Channelz
setupChannelzCollector()
Contributor

Would other goleak-based suites that call metrics.RegisterMetrics() directly hit goleak check errors in some cases?

Contributor Author

resolved by c33b6e1

Contributor Author

In case some integration tests call RegisterMetrics but are NOT compiled with -tags=intest, a6303d8 added the goroutine to the goleak whitelist.

Comment thread pkg/metrics/metrics.go
Comment on lines +538 to +542
func channelzCollectorOpts() tikvcollectors.ChannelzCollectorOpts {
	return tikvcollectors.ChannelzCollectorOpts{
		Namespace:         namespace,
		DisableLocalLabel: true,
		Filter: func(node any) (collect bool, walkChildren bool) {
Contributor

This filter seems to include the collector’s own internal bufnet connection, so scraping
may inflate tidb_grpc_channelz_* by itself. Should we exclude it here?

Contributor Author

resolved by c5f3b5b

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 16, 2026
zyguan added 3 commits April 16, 2026 10:27
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Contributor

@cfzjywxk cfzjywxk left a comment

Mostly LGTM.

Comment thread pkg/metrics/metrics.go
grpcChannelzCollector.listener = bufconn.Listen(1 << 20)
grpcChannelzCollector.server = grpc.NewServer()
service.RegisterChannelzServiceToServer(grpcChannelzCollector.server)
go func(listener *bufconn.Listener, server *grpc.Server) {
Contributor

Does graceful shutdown of TiDB need to be considered here so that this background goroutine is closed properly?

Contributor Author

I don't think so. It only collects channelz data and won't block graceful shutdown.

Comment thread pkg/metrics/metrics.go
return target == "bufnet" || target == "passthrough:///bufnet"
}

func isInternalChannelzSocket(socket *grpc_channelz_v1.Socket) bool {
Contributor

It would be better to add comments explaining what an internal channel means.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Contributor

@cfzjywxk cfzjywxk left a comment

LGTM

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 17, 2026
Comment thread pkg/metrics/metrics.go
Comment on lines +565 to +567
if isInternalChannelzSocket(n) {
return false, false
}
Member

Is this check necessary?

Since the parent channel is already filtered by isInternalChannelzTarget (with walkChildren=false), child sockets should never be visited during the walk, making this socket-level filter redundant.

AI suggests that if it is kept, remote == nil && remoteName == "" could accidentally match legitimate sockets in transient states (e.g., connection handshaking).

Contributor Author

Yes, the leaf socket nodes should never be visited. I also think it is reasonable to filter out sockets in transient states; otherwise, we may still end up writing PromQL filters like ...{..., remote!=""}.

Contributor Author

zyguan commented Apr 22, 2026

/retest

Contributor Author

zyguan commented Apr 22, 2026

/retest

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Contributor Author

zyguan commented May 7, 2026

/retest

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan zyguan force-pushed the dev/bump-client-go branch from a4ff899 to d9bbcac Compare May 7, 2026 06:30

ti-chi-bot Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cfzjywxk, lcwangchao
Once this PR has been reviewed and has the lgtm label, please assign 3pointer, nolouch for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 7, 2026

ti-chi-bot Bot commented May 7, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-17 10:48:44.710005708 +0000 UTC m=+1730929.915365755: ☑️ agreed by cfzjywxk.
  • 2026-05-07 08:18:40.48480407 +0000 UTC m=+342193.358154052: ☑️ agreed by lcwangchao.


Labels

lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

metrics: enhance diagnostic capabilities for gRPC network issues

4 participants