Prometheus Metrics for SandboxManager #270
rishisulakhe wants to merge 1 commit into openkruise:master from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has not been approved by anyone yet. The full list of commands accepted by this bot can be found here. Details: approval is needed from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Welcome @rishisulakhe! It looks like this is your first PR to openkruise/agents 🎉
```diff
 func (s *Server) DeleteRoute(id string) {
-	s.routes.Delete(id)
+	if _, loaded := s.routes.LoadAndDelete(id); loaded {
+		RoutesTotal.Dec()
+	}
```
Now uses LoadAndDelete instead of Delete. sync.Map.LoadAndDelete reports whether the key existed, which is required to conditionally decrement RoutesTotal.
Hi @zmberg @AiRanthem, when you have time, please review it. The failing CI test seems to be flaky.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #270      +/-   ##
==========================================
+ Coverage   66.48%   67.17%   +0.69%
==========================================
  Files         112      114       +2
  Lines        7489     7577      +88
==========================================
+ Hits         4979     5090     +111
+ Misses       2222     2201      -21
+ Partials      288      286       -2
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Pull request overview
Adds Prometheus metrics across key SandboxManager lifecycle paths (pause/resume/claim/clone/route sync) and snapshot creation to improve observability (issue #263).
Changes:

- Introduces new Prometheus metrics (histograms/counters/gauges) for pause/resume/claim/clone/route sync operations in pkg/sandbox-manager.
- Adds snapshot metrics and instruments the snapshot creation handler in pkg/servers/e2b.
- Adds proxy routing table / peer count gauges, instruments the route add/delete and peer sync paths in pkg/proxy, and adds basic tests.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pkg/sandbox-manager/metrics.go | Defines and registers new SandboxManager metrics; adds observeMax helper. |
| pkg/sandbox-manager/api.go | Instruments claim/clone/pause/resume and route sync paths with the new metrics. |
| pkg/sandbox-manager/metrics_test.go | Adds unit tests for observeMax and non-nil metric declarations. |
| pkg/servers/e2b/metrics.go | Defines and registers snapshot metrics. |
| pkg/servers/e2b/snapshot.go | Instruments snapshot creation handler with snapshot metrics. |
| pkg/servers/e2b/metrics_test.go | Adds basic test ensuring snapshot metrics are defined. |
| pkg/proxy/metrics.go | Defines and registers proxy-level routes_total / peers_total gauges. |
| pkg/proxy/routes.go | Updates gauges on route creation/deletion and peer sync path. |
| pkg/proxy/metrics_test.go | Adds tests for route gauge inc/dec and peers gauge set behavior. |
```go
func observeMax(g prometheus.Gauge, val float64) {
	m := &dto.Metric{}
	_ = g.Write(m)
	if val > m.GetGauge().GetValue() {
		g.Set(val)
	}
```
observeMax does a non-atomic read-modify-write on the Gauge (Write() then Set()). Under concurrent calls, a smaller val can overwrite a larger max (lost update) if both goroutines read the same previous value. Also, Write() error is ignored and m.GetGauge() may be nil if Write fails, risking a panic. Consider tracking the max in an atomic/mutex-protected float64 (or a custom Collector/GaugeFunc) and setting the Gauge from that, while handling the Write() error if you keep this approach.
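One possible fix, sketched here as a standalone helper rather than the PR's actual code (the type and field names are illustrative): keep the maximum in a mutex-protected value and only ever write the Gauge from under that lock.

```go
import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// maxGauge tracks a running maximum under a mutex and mirrors it into a
// prometheus.Gauge, so a concurrent smaller observation can never overwrite
// a larger one.
type maxGauge struct {
	mu    sync.Mutex
	max   float64
	gauge prometheus.Gauge
}

// Observe updates the gauge only when val exceeds the recorded maximum.
func (m *maxGauge) Observe(val float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if val > m.max {
		m.max = val
		m.gauge.Set(m.max)
	}
}
```

An alternative is prometheus.NewGaugeFunc reading an atomically maintained maximum, which avoids calling Set entirely.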
```go
if !m.infra.HasTemplate(opts.Template) {
	// Requirement: Track failure in API layer
	SandboxCreationResponses.WithLabelValues("failure").Inc()
	SandboxClaimTotal.WithLabelValues("failure", "").Inc()
	return nil, errors.NewError(errors.ErrorNotFound, fmt.Sprintf("template %s not found", opts.Template))
```
SandboxClaimTotal is emitted with an empty lock_type label on this failure path. That creates an extra time series (lock_type="") and loses the lock strategy dimension, which can break dashboards/alerts expecting only {create,update,speculate}. Consider using a sentinel like "unknown" here (or, if available, the best-known lock type).
```diff
 if err != nil {
-	log.Error(err, "failed to claim sandbox", "metrics", metrics.String())
+	log.Error(err, "failed to claim sandbox", "metrics", claimMetrics.String())
+	// Requirement: Track failure in API layer
+	SandboxCreationResponses.WithLabelValues("failure").Inc()
+	SandboxClaimTotal.WithLabelValues("failure", "").Inc()
 	return nil, errors.NewError(errors.ErrorInternal, fmt.Sprintf("failed to claim sandbox: %v", err))
```
SandboxClaimTotal is also emitted with lock_type="" when infra.ClaimSandbox returns an error. Since claimMetrics is available here, prefer labeling the failure with string(claimMetrics.LockType) when it’s set (fallback to "unknown"), so failures can still be segmented by lock strategy.
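As a sketch of the fallback both comments suggest (the helper name is hypothetical; the lock-type values are taken from the review text, not from the PR's code):

```go
// lockTypeLabel maps a claim's lock type to a metric label, using a sentinel
// instead of an empty string on failure paths where the type is unknown.
func lockTypeLabel(lockType string) string {
	switch lockType {
	case "create", "update", "speculate":
		return lockType
	default:
		return "unknown"
	}
}
```

Both failure paths could then call `SandboxClaimTotal.WithLabelValues("failure", lockTypeLabel(string(claimMetrics.LockType))).Inc()` rather than passing an empty string.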
```go
if err != nil {
	log.Error(err, "failed to sync route with peers")
	RouteSyncTotal.WithLabelValues("sync_with_peers", "failure").Inc()
	return err
}
RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
```
RouteSyncDuration is only observed on the success path. Since duration is already computed before the error check, consider observing the histogram for failures too (and incrementing RouteSyncTotal with status="failure"), otherwise latency visibility is biased toward successful syncs.
Suggested change (observe the sync duration before the error return so failures are measured too):

```go
// current
if err != nil {
	log.Error(err, "failed to sync route with peers")
	RouteSyncTotal.WithLabelValues("sync_with_peers", "failure").Inc()
	return err
}
RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
```

```go
// suggested
RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
if err != nil {
	log.Error(err, "failed to sync route with peers")
	RouteSyncTotal.WithLabelValues("sync_with_peers", "failure").Inc()
	return err
}
```
```go
RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
RouteSyncTotal.WithLabelValues("sync_with_peers", "success").Inc()
RouteSyncDelay.Set(duration)
```
RouteSyncDelay is being set to the sync operation duration, which makes it effectively another latency metric (and overlaps with RouteSyncDuration). If the intent is “delay from route update to peer sync completion”, this should be computed from a route/update timestamp; otherwise consider renaming/adjusting the metric Help text to reflect it’s the last observed sync duration.
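If the metric should capture update-to-sync delay, a hedged sketch is to derive it from a timestamp carried on the route; the updatedAt value below is hypothetical and not a field shown in this PR:

```go
import "time"

// recordRouteSyncDelay reports how long after the route was last updated the
// peer sync completed, rather than reusing the sync duration itself.
// updatedAt would come from a (hypothetical) timestamp on the route.
func recordRouteSyncDelay(updatedAt time.Time) {
	if updatedAt.IsZero() {
		return
	}
	RouteSyncDelay.Set(float64(time.Since(updatedAt).Milliseconds()))
}
```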
```go
start := time.Now()
route := sbx.GetRoute()
m.proxy.SetRoute(ctx, route)
err := m.proxy.SyncRouteWithPeers(route)
```
Route sync metrics are declared with a type label (and the PR description mentions type values like "set" and "sync_with_peers"), but this function only records the "sync_with_peers" type. Consider instrumenting the SetRoute portion as type="set" (duration + total) so both operation types are observable as advertised.
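A rough sketch of recording the SetRoute portion as type="set" alongside the existing sync instrumentation; the millisecond conversion and label values are assumptions based on the metric names, not the PR's code:

```go
// Time the local route update separately from the peer sync so both
// advertised type values ("set" and "sync_with_peers") are observable.
route := sbx.GetRoute()

setStart := time.Now()
m.proxy.SetRoute(ctx, route)
RouteSyncDuration.WithLabelValues("set").Observe(float64(time.Since(setStart).Milliseconds()))
RouteSyncTotal.WithLabelValues("set", "success").Inc()

syncStart := time.Now()
err := m.proxy.SyncRouteWithPeers(route)
// ... the existing "sync_with_peers" duration/total handling follows,
// using time.Since(syncStart) and err as before.
```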
```go
var peerList []peers.Peer
if s.peersManager != nil {
	peerList = s.peersManager.GetPeers()
}
PeersTotal.Set(float64(len(peerList)))
```
PeersTotal is only updated when SyncRouteWithPeers is called. Since the metric Help says it’s the “current” number of connected peers, it may be stale if peers change without a route sync. Either update this gauge from peer connect/disconnect events (or a periodic poll) or adjust the metric Help/semantics to “last observed peer count during route sync”.
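One way to keep this gauge current is a GaugeFunc evaluated on every scrape; a sketch assuming the server can reach its peers manager at registration time (exact placement is illustrative):

```go
// A GaugeFunc reads the peer count at scrape time, so the value cannot go
// stale between route syncs.
prometheus.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
	Name: "sandboxmanager_peers_total",
	Help: "Current number of connected peers.",
}, func() float64 {
	if s.peersManager == nil {
		return 0
	}
	return float64(len(s.peersManager.GetPeers()))
}))
```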
```go
sandboxID := r.PathValue("sandboxID")
log := klog.FromContext(ctx)
start := time.Now()
request, parseErr := sc.parseCreateSnapshotRequest(r)
if parseErr != nil {
```
SnapshotDuration’s timer starts before request parsing, sandbox lookup, and state validation, so the histogram measures endpoint latency rather than just snapshot/checkpoint creation time. Consider starting the timer immediately before CreateCheckpoint (or updating the metric Help/name to reflect full handler latency).
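A sketch of the narrower measurement, starting the timer immediately before checkpoint creation; the CreateCheckpoint call shape and the assumption that SnapshotDuration is a plain histogram are placeholders, not the handler's actual code:

```go
// Time only the checkpoint step so the histogram reflects snapshot creation
// rather than full endpoint latency (parsing, lookup, validation excluded).
checkpointStart := time.Now()
err := sc.CreateCheckpoint(ctx, sandboxID, request) // placeholder signature
SnapshotDuration.Observe(float64(time.Since(checkpointStart).Milliseconds()))
if err != nil {
	log.Error(err, "failed to create snapshot")
	// existing error handling continues here
}
```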
@rishisulakhe: PR needs rebase. Details: instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@rishisulakhe thanks for your patch, but we recently merged a similar patch that is more complete, so I will close this one.
Ⅰ. Describe what this PR does
This PR adds comprehensive Prometheus metrics to the SandboxManager to improve observability across all major sandbox lifecycle operations. Prior to this change, only two metrics existed (sandbox_creation_latency_ms, sandbox_creation_responses), leaving pause/resume, claim stages, clone stages, snapshot creation, and route synchronization completely unobservable.

New metrics added (20 total across 6 categories), with an illustrative declaration sketch after the list:

- sandboxmanager_pause_duration_ms
- sandboxmanager_pause_max_duration_ms
- sandboxmanager_pause_total
- sandboxmanager_resume_duration_ms
- sandboxmanager_resume_max_duration_ms
- sandboxmanager_resume_total
- sandboxmanager_claim_duration_ms
- sandboxmanager_claim_stage_duration_ms
- sandboxmanager_claim_total
- sandboxmanager_claim_retries
- sandboxmanager_clone_duration_ms
- sandboxmanager_clone_stage_duration_ms
- sandboxmanager_clone_total
- sandboxmanager_snapshot_duration_ms
- sandboxmanager_snapshot_total
- sandboxmanager_route_sync_duration_ms
- sandboxmanager_route_sync_total
- sandboxmanager_route_sync_delay_ms
- sandboxmanager_routes_total
- sandboxmanager_peers_total
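For illustration only, a minimal sketch of how a few of these could be declared with client_golang; the help strings, buckets, and label sets are assumptions, not necessarily what the PR registers:

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	// Histogram for pause latency; the bucket layout here is an assumption.
	PauseDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "sandboxmanager_pause_duration_ms",
		Help:    "Time taken to pause a sandbox, in milliseconds.",
		Buckets: prometheus.ExponentialBuckets(10, 2, 12),
	})

	// Counter for claim outcomes, labeled by status and lock type.
	ClaimTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "sandboxmanager_claim_total",
		Help: "Total sandbox claim attempts by status and lock type.",
	}, []string{"status", "lock_type"})

	// Gauge for the current size of the proxy routing table.
	RoutesTotal = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sandboxmanager_routes_total",
		Help: "Current number of routes in the proxy routing table.",
	})
)

func init() {
	prometheus.MustRegister(PauseDuration, ClaimTotal, RoutesTotal)
}
```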
Ⅱ. Does this pull request fix one issue?

fixes #263
Ⅲ. Describe how to verify it
All unit tests are passing.
Example queries once scraped by Prometheus: