
Prometheus Metrics for SandboxManager #270

Closed
rishisulakhe wants to merge 1 commit into openkruise:master from rishisulakhe:feature-metrics

Conversation

@rishisulakhe

Ⅰ. Describe what this PR does

This PR adds comprehensive Prometheus metrics to the SandboxManager to improve observability across all major sandbox lifecycle operations. Prior to this change, only two metrics existed (sandbox_creation_latency_ms, sandbox_creation_responses), leaving pause/resume, claim stages, clone stages, snapshot creation, and route synchronization completely unobservable.

New metrics added (20 total across 6 categories):

Metric Type Description
sandboxmanager_pause_duration_ms Histogram Pause operation latency
sandboxmanager_pause_max_duration_ms Gauge All-time max pause duration
sandboxmanager_pause_total Counter Pause operation count by outcome
sandboxmanager_resume_duration_ms Histogram Resume operation latency
sandboxmanager_resume_max_duration_ms Gauge All-time max resume duration
sandboxmanager_resume_total Counter Resume operation count by outcome
sandboxmanager_claim_duration_ms Histogram Total claim operation latency
sandboxmanager_claim_stage_duration_ms Histogram Per-stage claim breakdown
sandboxmanager_claim_total Counter Claim count by outcome and lock strategy
sandboxmanager_claim_retries Histogram Retries per claim operation
sandboxmanager_clone_duration_ms Histogram Total clone operation latency
sandboxmanager_clone_stage_duration_ms Histogram Per-stage clone breakdown
sandboxmanager_clone_total Counter Clone count by outcome
sandboxmanager_snapshot_duration_ms Histogram Snapshot creation latency
sandboxmanager_snapshot_total Counter Snapshot count by outcome
sandboxmanager_route_sync_duration_ms Histogram Route sync latency
sandboxmanager_route_sync_total Counter Route sync count by type and outcome
sandboxmanager_route_sync_delay_ms Gauge Last observed route sync delay
sandboxmanager_routes_total Gauge Current number of managed routes
sandboxmanager_peers_total Gauge Current number of connected peers
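
For orientation, here is a minimal sketch of how one of the pause families could be declared and registered with client_golang's promauto helpers. The package name, variable names, and bucket layout are illustrative assumptions, not the PR's actual code.

package sandboxmanager

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Pause latency in milliseconds, labeled by outcome.
	pauseDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "sandboxmanager_pause_duration_ms",
		Help:    "Pause operation latency in milliseconds.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 14), // 1ms up to roughly 8s
	}, []string{"status"})

	// Pause operation count by outcome.
	pauseTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "sandboxmanager_pause_total",
		Help: "Pause operation count by outcome.",
	}, []string{"status"})
)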

Ⅱ. Does this pull request fix one issue?

fixes #263

Ⅲ. Describe how to verify it

All unit tests are passing.
Example queries once scraped by Prometheus:

# Pause success rate
sum(rate(sandboxmanager_pause_total{status="success"}[5m]))
  / sum(rate(sandboxmanager_pause_total[5m]))

# Claim p99 latency per stage
histogram_quantile(0.99,
  rate(sandboxmanager_claim_stage_duration_ms_bucket[5m]))

# Route sync p99
histogram_quantile(0.99,
  rate(sandboxmanager_route_sync_duration_ms_bucket[5m]))

# Current routing table size
sandboxmanager_routes_total

@kruise-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kruise-bot

Welcome @rishisulakhe! It looks like this is your first PR to openkruise/agents 🎉

Comment thread pkg/proxy/routes.go
Comment on lines 150 to +153
 func (s *Server) DeleteRoute(id string) {
-	s.routes.Delete(id)
+	if _, loaded := s.routes.LoadAndDelete(id); loaded {
+		RoutesTotal.Dec()
+	}
Author


Now uses LoadAndDelete instead of Delete. sync.Map.LoadAndDelete returns whether the key existed, which is required to conditionally decrement RoutesTotal.

@rishisulakhe
Author

Hi @zmberg @AiRanthem, when you have time, please review it. The failing CI test seems to be flaky.

@codecov

codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 98.78049% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 67.17%. Comparing base (30f0c9d) to head (f012607).
⚠️ Report is 40 commits behind head on master.

Files with missing lines Patch % Lines
pkg/sandbox-manager/api.go 97.77% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #270      +/-   ##
==========================================
+ Coverage   66.48%   67.17%   +0.69%     
==========================================
  Files         112      114       +2     
  Lines        7489     7577      +88     
==========================================
+ Hits         4979     5090     +111     
+ Misses       2222     2201      -21     
+ Partials      288      286       -2     
Flag Coverage Δ
unittests 67.17% <98.78%> (+0.69%) ⬆️




Copilot AI left a comment


Pull request overview

Adds Prometheus metrics across key SandboxManager lifecycle paths (pause/resume/claim/clone/route sync) and snapshot creation to improve observability (issue #263).

Changes:

  • Introduces new Prometheus metrics (histograms/counters/gauges) for pause/resume/claim/clone/route sync operations in pkg/sandbox-manager.
  • Adds snapshot metrics and instruments the snapshot creation handler in pkg/servers/e2b.
  • Adds proxy routing table / peer count gauges and instruments route add/delete and peer sync path in pkg/proxy, plus basic tests.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Summary per file
File Description
pkg/sandbox-manager/metrics.go Defines and registers new SandboxManager metrics; adds observeMax helper.
pkg/sandbox-manager/api.go Instruments claim/clone/pause/resume and route sync paths with the new metrics.
pkg/sandbox-manager/metrics_test.go Adds unit tests for observeMax and non-nil metric declarations.
pkg/servers/e2b/metrics.go Defines and registers snapshot metrics.
pkg/servers/e2b/snapshot.go Instruments snapshot creation handler with snapshot metrics.
pkg/servers/e2b/metrics_test.go Adds basic test ensuring snapshot metrics are defined.
pkg/proxy/metrics.go Defines and registers proxy-level routes_total / peers_total gauges.
pkg/proxy/routes.go Updates gauges on route creation/deletion and peer sync path.
pkg/proxy/metrics_test.go Adds tests for route gauge inc/dec and peers gauge set behavior.


Comment on lines +168 to +173
func observeMax(g prometheus.Gauge, val float64) {
	m := &dto.Metric{}
	_ = g.Write(m)
	if val > m.GetGauge().GetValue() {
		g.Set(val)
	}

Copilot AI Apr 22, 2026


observeMax does a non-atomic read-modify-write on the Gauge (Write() then Set()). Under concurrent calls, a smaller val can overwrite a larger max (lost update) if both goroutines read the same previous value. Also, Write() error is ignored and m.GetGauge() may be nil if Write fails, risking a panic. Consider tracking the max in an atomic/mutex-protected float64 (or a custom Collector/GaugeFunc) and setting the Gauge from that, while handling the Write() error if you keep this approach.
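
One possible shape for the mutex-protected alternative described above; this is a sketch with hypothetical names (and it assumes the sync and prometheus imports), not the PR's code.

// maxTracker keeps an all-time maximum and mirrors it into a Prometheus gauge.
// Because the comparison and the Set happen under the same lock, a smaller
// value can never overwrite a larger one.
type maxTracker struct {
	mu    sync.Mutex
	max   float64
	gauge prometheus.Gauge
}

func (t *maxTracker) Observe(val float64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if val > t.max {
		t.max = val
		t.gauge.Set(val)
	}
}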

Comment on lines 18 to 22
	if !m.infra.HasTemplate(opts.Template) {
		// Requirement: Track failure in API layer
		SandboxCreationResponses.WithLabelValues("failure").Inc()
		SandboxClaimTotal.WithLabelValues("failure", "").Inc()
		return nil, errors.NewError(errors.ErrorNotFound, fmt.Sprintf("template %s not found", opts.Template))

Copilot AI Apr 22, 2026


SandboxClaimTotal is emitted with an empty lock_type label on this failure path. That creates an extra time series (lock_type="") and loses the lock strategy dimension, which can break dashboards/alerts expecting only {create,update,speculate}. Consider using a sentinel like "unknown" here (or, if available, the best-known lock type).
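
A sketch of the sentinel approach; lockTypeLabel is a hypothetical helper, not something in the PR.

// lockTypeLabel falls back to a sentinel so failure paths never emit an
// empty lock_type series.
func lockTypeLabel(lt string) string {
	if lt == "" {
		return "unknown"
	}
	return lt
}

On the template-not-found path, where no lock strategy has been chosen yet, the call would simply become SandboxClaimTotal.WithLabelValues("failure", "unknown").Inc().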

Comment on lines 25 to 30
 	if err != nil {
-		log.Error(err, "failed to claim sandbox", "metrics", metrics.String())
+		log.Error(err, "failed to claim sandbox", "metrics", claimMetrics.String())
 		// Requirement: Track failure in API layer
 		SandboxCreationResponses.WithLabelValues("failure").Inc()
 		SandboxClaimTotal.WithLabelValues("failure", "").Inc()
 		return nil, errors.NewError(errors.ErrorInternal, fmt.Sprintf("failed to claim sandbox: %v", err))

Copilot AI Apr 22, 2026


SandboxClaimTotal is also emitted with lock_type="" when infra.ClaimSandbox returns an error. Since claimMetrics is available here, prefer labeling the failure with string(claimMetrics.LockType) when it’s set (fallback to "unknown"), so failures can still be segmented by lock strategy.
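
With the hypothetical lockTypeLabel helper from the previous comment, this failure path could preserve whatever lock strategy the claim attempt actually used (claimMetrics.LockType is referenced in the review comment; its exact type here is an assumption):

	SandboxClaimTotal.WithLabelValues("failure", lockTypeLabel(string(claimMetrics.LockType))).Inc()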

Comment on lines 179 to +184
	if err != nil {
		log.Error(err, "failed to sync route with peers")
		RouteSyncTotal.WithLabelValues("sync_with_peers", "failure").Inc()
		return err
	}
	RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)

Copilot AI Apr 22, 2026


RouteSyncDuration is only observed on the success path. Since duration is already computed before the error check, consider observing the histogram for failures too (and incrementing RouteSyncTotal with status="failure"), otherwise latency visibility is biased toward successful syncs.

Suggested change

-	if err != nil {
-		log.Error(err, "failed to sync route with peers")
-		RouteSyncTotal.WithLabelValues("sync_with_peers", "failure").Inc()
-		return err
-	}
-	RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
+	RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
+	if err != nil {
+		log.Error(err, "failed to sync route with peers")
+		RouteSyncTotal.WithLabelValues("sync_with_peers", "failure").Inc()
+		return err
+	}

Comment on lines +184 to +186
	RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
	RouteSyncTotal.WithLabelValues("sync_with_peers", "success").Inc()
	RouteSyncDelay.Set(duration)

Copilot AI Apr 22, 2026


RouteSyncDelay is being set to the sync operation duration, which makes it effectively another latency metric (and overlaps with RouteSyncDuration). If the intent is “delay from route update to peer sync completion”, this should be computed from a route/update timestamp; otherwise consider renaming/adjusting the metric Help text to reflect it’s the last observed sync duration.
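
A sketch of the "delay" interpretation the comment describes, i.e. the time from the last route update to the completion of the peer sync; the updatedAt parameter is hypothetical and would have to be recorded wherever the route is mutated.

// observeRouteSyncDelay reports how long a route update waited before its
// peer sync completed, which is distinct from the duration of the sync call.
func observeRouteSyncDelay(delay prometheus.Gauge, updatedAt time.Time) {
	delay.Set(float64(time.Since(updatedAt).Milliseconds()))
}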

Comment on lines 174 to 177
	start := time.Now()
	route := sbx.GetRoute()
	m.proxy.SetRoute(ctx, route)
	err := m.proxy.SyncRouteWithPeers(route)

Copilot AI Apr 22, 2026


Route sync metrics are declared with a type label (and the PR description mentions type values like "set" and "sync_with_peers"), but this function only records the "sync_with_peers" type. Consider instrumenting the SetRoute portion as type="set" (duration + total) so both operation types are observable as advertised.
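
A sketch of what instrumenting the SetRoute portion separately could look like, reusing the identifiers from the snippet above; splitting the timers this way and reporting durations in milliseconds are assumptions, not the PR's code.

	setStart := time.Now()
	route := sbx.GetRoute()
	m.proxy.SetRoute(ctx, route)
	RouteSyncDuration.WithLabelValues("set").Observe(float64(time.Since(setStart).Milliseconds()))
	RouteSyncTotal.WithLabelValues("set", "success").Inc()

	syncStart := time.Now()
	err := m.proxy.SyncRouteWithPeers(route)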

Comment thread pkg/proxy/routes.go
Comment on lines 81 to 86
	var peerList []peers.Peer
	if s.peersManager != nil {
		peerList = s.peersManager.GetPeers()
	}
	PeersTotal.Set(float64(len(peerList)))


Copilot AI Apr 22, 2026


PeersTotal is only updated when SyncRouteWithPeers is called. Since the metric Help says it’s the “current” number of connected peers, it may be stale if peers change without a route sync. Either update this gauge from peer connect/disconnect events (or a periodic poll) or adjust the metric Help/semantics to “last observed peer count during route sync”.
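
One way to keep the gauge current on every scrape instead of only on route syncs is a GaugeFunc that reads the peers manager lazily. This is a sketch under the assumption that registration can happen once the Server is constructed; the method name and registry wiring are hypothetical.

// registerPeersGauge exposes the live peer count at scrape time rather than
// caching it on the route-sync path.
func (s *Server) registerPeersGauge(reg prometheus.Registerer) {
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "sandboxmanager_peers_total",
			Help: "Current number of connected peers.",
		},
		func() float64 {
			if s.peersManager == nil {
				return 0
			}
			return float64(len(s.peersManager.GetPeers()))
		},
	))
}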

Comment on lines 18 to 22
	sandboxID := r.PathValue("sandboxID")
	log := klog.FromContext(ctx)
	start := time.Now()
	request, parseErr := sc.parseCreateSnapshotRequest(r)
	if parseErr != nil {

Copilot AI Apr 22, 2026


SnapshotDuration’s timer starts before request parsing, sandbox lookup, and state validation, so the histogram measures endpoint latency rather than just snapshot/checkpoint creation time. Consider starting the timer immediately before CreateCheckpoint (or updating the metric Help/name to reflect full handler latency).
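
A small sketch of scoping the timer to just the checkpoint call, as the comment suggests; the callback shape is hypothetical since the actual CreateCheckpoint signature isn't shown here.

// timeCheckpoint measures only the checkpoint creation itself, not request
// parsing or validation, and observes the histogram regardless of outcome.
func timeCheckpoint(hist prometheus.Observer, createCheckpoint func() error) error {
	start := time.Now()
	err := createCheckpoint()
	hist.Observe(float64(time.Since(start).Milliseconds()))
	return err
}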

@kruise-bot

@rishisulakhe: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@furykerry
Member

@rishisulakhe thanks for your patch, but we recently merged a similar patch that is more complete, so I will close this one.

@furykerry furykerry closed this Apr 28, 2026


Development

Successfully merging this pull request may close these issues.

[feature request] Add Comprehensive Prometheus Metrics for SandboxManager

4 participants