Prometheus Metrics for SandboxManager #270
rishisulakhe wants to merge 1 commit into openkruise:master from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has not been approved by anyone yet. The full list of commands accepted by this bot can be found here. Details: approval is needed from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Welcome @rishisulakhe! It looks like this is your first PR to openkruise/agents 🎉
```diff
 func (s *Server) DeleteRoute(id string) {
-	s.routes.Delete(id)
+	if _, loaded := s.routes.LoadAndDelete(id); loaded {
+		RoutesTotal.Dec()
+	}
```
Now uses LoadAndDelete instead of Delete. sync.Map.LoadAndDelete reports whether the key existed, which is required to conditionally decrement RoutesTotal.
Hi @zmberg @AiRanthem, when you have time, please review it. The failing CI test seems to be flaky.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #270      +/-   ##
==========================================
+ Coverage   66.48%   67.17%   +0.69%
==========================================
  Files         112      114       +2
  Lines        7489     7577      +88
==========================================
+ Hits         4979     5090     +111
+ Misses       2222     2201      -21
+ Partials      288      286       -2
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Pull request overview
Adds Prometheus metrics across key SandboxManager lifecycle paths (pause/resume/claim/clone/route sync) and snapshot creation to improve observability (issue #263).
Changes:

- Introduces new Prometheus metrics (histograms/counters/gauges) for pause/resume/claim/clone/route sync operations in pkg/sandbox-manager.
- Adds snapshot metrics and instruments the snapshot creation handler in pkg/servers/e2b.
- Adds proxy routing table / peer count gauges, instruments the route add/delete and peer sync paths in pkg/proxy, and adds basic tests.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pkg/sandbox-manager/metrics.go | Defines and registers new SandboxManager metrics; adds observeMax helper. |
| pkg/sandbox-manager/api.go | Instruments claim/clone/pause/resume and route sync paths with the new metrics. |
| pkg/sandbox-manager/metrics_test.go | Adds unit tests for observeMax and non-nil metric declarations. |
| pkg/servers/e2b/metrics.go | Defines and registers snapshot metrics. |
| pkg/servers/e2b/snapshot.go | Instruments snapshot creation handler with snapshot metrics. |
| pkg/servers/e2b/metrics_test.go | Adds basic test ensuring snapshot metrics are defined. |
| pkg/proxy/metrics.go | Defines and registers proxy-level routes_total / peers_total gauges. |
| pkg/proxy/routes.go | Updates gauges on route creation/deletion and peer sync path. |
| pkg/proxy/metrics_test.go | Adds tests for route gauge inc/dec and peers gauge set behavior. |
```go
func observeMax(g prometheus.Gauge, val float64) {
	m := &dto.Metric{}
	_ = g.Write(m)
	if val > m.GetGauge().GetValue() {
		g.Set(val)
	}
```
observeMax does a non-atomic read-modify-write on the Gauge (Write() then Set()). Under concurrent calls, a smaller val can overwrite a larger max (lost update) if both goroutines read the same previous value. Also, Write() error is ignored and m.GetGauge() may be nil if Write fails, risking a panic. Consider tracking the max in an atomic/mutex-protected float64 (or a custom Collector/GaugeFunc) and setting the Gauge from that, while handling the Write() error if you keep this approach.
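One possible fix, sketched here as a standalone helper rather than the PR's actual code (the type and field names are illustrative): keep the maximum in a mutex-protected value and only ever write the Gauge from under that lock.

```go
import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// maxGauge tracks a running maximum under a mutex and mirrors it into a
// prometheus.Gauge, so a concurrent smaller observation can never overwrite
// a larger one.
type maxGauge struct {
	mu    sync.Mutex
	max   float64
	gauge prometheus.Gauge
}

// Observe updates the gauge only when val exceeds the recorded maximum.
func (m *maxGauge) Observe(val float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if val > m.max {
		m.max = val
		m.gauge.Set(m.max)
	}
}
```

An alternative is prometheus.NewGaugeFunc reading an atomically maintained maximum, which avoids calling Set entirely.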
```go
if !m.infra.HasTemplate(opts.Template) {
	// Requirement: Track failure in API layer
	SandboxCreationResponses.WithLabelValues("failure").Inc()
	SandboxClaimTotal.WithLabelValues("failure", "").Inc()
	return nil, errors.NewError(errors.ErrorNotFound, fmt.Sprintf("template %s not found", opts.Template))
```
SandboxClaimTotal is emitted with an empty lock_type label on this failure path. That creates an extra time series (lock_type="") and loses the lock strategy dimension, which can break dashboards/alerts expecting only {create,update,speculate}. Consider using a sentinel like "unknown" here (or, if available, the best-known lock type).
```diff
 if err != nil {
-	log.Error(err, "failed to claim sandbox", "metrics", metrics.String())
+	log.Error(err, "failed to claim sandbox", "metrics", claimMetrics.String())
+	// Requirement: Track failure in API layer
+	SandboxCreationResponses.WithLabelValues("failure").Inc()
+	SandboxClaimTotal.WithLabelValues("failure", "").Inc()
 	return nil, errors.NewError(errors.ErrorInternal, fmt.Sprintf("failed to claim sandbox: %v", err))
```
SandboxClaimTotal is also emitted with lock_type="" when infra.ClaimSandbox returns an error. Since claimMetrics is available here, prefer labeling the failure with string(claimMetrics.LockType) when it’s set (fallback to "unknown"), so failures can still be segmented by lock strategy.
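As a sketch of the fallback both comments suggest (the helper name is hypothetical; the lock-type values are taken from the review text, not from the PR's code):

```go
// lockTypeLabel maps a claim's lock type to a metric label, using a sentinel
// instead of an empty string on failure paths where the type is unknown.
func lockTypeLabel(lockType string) string {
	switch lockType {
	case "create", "update", "speculate":
		return lockType
	default:
		return "unknown"
	}
}
```

Both failure paths could then call `SandboxClaimTotal.WithLabelValues("failure", lockTypeLabel(string(claimMetrics.LockType))).Inc()` rather than passing an empty string.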
```go
if err != nil {
	log.Error(err, "failed to sync route with peers")
	RouteSyncTotal.WithLabelValues("sync_with_peers", "failure").Inc()
	return err
}
RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
```
RouteSyncDuration is only observed on the success path. Since duration is already computed before the error check, consider observing the histogram for failures too (and incrementing RouteSyncTotal with status="failure"), otherwise latency visibility is biased toward successful syncs.
Suggested change (observe the sync duration before the error return so failures are measured too):

```go
// current
if err != nil {
	log.Error(err, "failed to sync route with peers")
	RouteSyncTotal.WithLabelValues("sync_with_peers", "failure").Inc()
	return err
}
RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
```

```go
// suggested
RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
if err != nil {
	log.Error(err, "failed to sync route with peers")
	RouteSyncTotal.WithLabelValues("sync_with_peers", "failure").Inc()
	return err
}
```
```go
RouteSyncDuration.WithLabelValues("sync_with_peers").Observe(duration)
RouteSyncTotal.WithLabelValues("sync_with_peers", "success").Inc()
RouteSyncDelay.Set(duration)
```
RouteSyncDelay is being set to the sync operation duration, which makes it effectively another latency metric (and overlaps with RouteSyncDuration). If the intent is “delay from route update to peer sync completion”, this should be computed from a route/update timestamp; otherwise consider renaming/adjusting the metric Help text to reflect it’s the last observed sync duration.
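If the metric should capture update-to-sync delay, a hedged sketch is to derive it from a timestamp carried on the route; the updatedAt value below is hypothetical and not a field shown in this PR:

```go
import "time"

// recordRouteSyncDelay reports how long after the route was last updated the
// peer sync completed, rather than reusing the sync duration itself.
// updatedAt would come from a (hypothetical) timestamp on the route.
func recordRouteSyncDelay(updatedAt time.Time) {
	if updatedAt.IsZero() {
		return
	}
	RouteSyncDelay.Set(float64(time.Since(updatedAt).Milliseconds()))
}
```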
```go
start := time.Now()
route := sbx.GetRoute()
m.proxy.SetRoute(ctx, route)
err := m.proxy.SyncRouteWithPeers(route)
```
Route sync metrics are declared with a type label (and the PR description mentions type values like "set" and "sync_with_peers"), but this function only records the "sync_with_peers" type. Consider instrumenting the SetRoute portion as type="set" (duration + total) so both operation types are observable as advertised.
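A rough sketch of recording the SetRoute portion as type="set" alongside the existing sync instrumentation; the millisecond conversion and label values are assumptions based on the metric names, not the PR's code:

```go
// Time the local route update separately from the peer sync so both
// advertised type values ("set" and "sync_with_peers") are observable.
route := sbx.GetRoute()

setStart := time.Now()
m.proxy.SetRoute(ctx, route)
RouteSyncDuration.WithLabelValues("set").Observe(float64(time.Since(setStart).Milliseconds()))
RouteSyncTotal.WithLabelValues("set", "success").Inc()

syncStart := time.Now()
err := m.proxy.SyncRouteWithPeers(route)
// ... the existing "sync_with_peers" duration/total handling follows,
// using time.Since(syncStart) and err as before.
```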
```go
var peerList []peers.Peer
if s.peersManager != nil {
	peerList = s.peersManager.GetPeers()
}
PeersTotal.Set(float64(len(peerList)))
```
PeersTotal is only updated when SyncRouteWithPeers is called. Since the metric Help says it’s the “current” number of connected peers, it may be stale if peers change without a route sync. Either update this gauge from peer connect/disconnect events (or a periodic poll) or adjust the metric Help/semantics to “last observed peer count during route sync”.
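One way to keep this gauge current is a GaugeFunc evaluated on every scrape; a sketch assuming the server can reach its peers manager at registration time (exact placement is illustrative):

```go
// A GaugeFunc reads the peer count at scrape time, so the value cannot go
// stale between route syncs.
prometheus.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
	Name: "sandboxmanager_peers_total",
	Help: "Current number of connected peers.",
}, func() float64 {
	if s.peersManager == nil {
		return 0
	}
	return float64(len(s.peersManager.GetPeers()))
}))
```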
```go
sandboxID := r.PathValue("sandboxID")
log := klog.FromContext(ctx)
start := time.Now()
request, parseErr := sc.parseCreateSnapshotRequest(r)
if parseErr != nil {
```
SnapshotDuration’s timer starts before request parsing, sandbox lookup, and state validation, so the histogram measures endpoint latency rather than just snapshot/checkpoint creation time. Consider starting the timer immediately before CreateCheckpoint (or updating the metric Help/name to reflect full handler latency).
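A sketch of the narrower measurement, starting the timer immediately before checkpoint creation; the CreateCheckpoint call shape and the assumption that SnapshotDuration is a plain histogram are placeholders, not the handler's actual code:

```go
// Time only the checkpoint step so the histogram reflects snapshot creation
// rather than full endpoint latency (parsing, lookup, validation excluded).
checkpointStart := time.Now()
err := sc.CreateCheckpoint(ctx, sandboxID, request) // placeholder signature
SnapshotDuration.Observe(float64(time.Since(checkpointStart).Milliseconds()))
if err != nil {
	log.Error(err, "failed to create snapshot")
	// existing error handling continues here
}
```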
@rishisulakhe: PR needs rebase. Details: instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@rishisulakhe thanks for your patch, but we recently merged a similar patch that is more complete, so I will close this one.
Ⅰ. Describe what this PR does
This PR adds comprehensive Prometheus metrics to the SandboxManager to improve observability across all major sandbox lifecycle operations. Prior to this change, only two metrics existed (sandbox_creation_latency_ms, sandbox_creation_responses), leaving pause/resume, claim stages, clone stages, snapshot creation, and route synchronization completely unobservable.

New metrics added (20 total across 6 categories), with an illustrative declaration sketch after the list:

- sandboxmanager_pause_duration_ms
- sandboxmanager_pause_max_duration_ms
- sandboxmanager_pause_total
- sandboxmanager_resume_duration_ms
- sandboxmanager_resume_max_duration_ms
- sandboxmanager_resume_total
- sandboxmanager_claim_duration_ms
- sandboxmanager_claim_stage_duration_ms
- sandboxmanager_claim_total
- sandboxmanager_claim_retries
- sandboxmanager_clone_duration_ms
- sandboxmanager_clone_stage_duration_ms
- sandboxmanager_clone_total
- sandboxmanager_snapshot_duration_ms
- sandboxmanager_snapshot_total
- sandboxmanager_route_sync_duration_ms
- sandboxmanager_route_sync_total
- sandboxmanager_route_sync_delay_ms
- sandboxmanager_routes_total
- sandboxmanager_peers_total
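For illustration only, a minimal sketch of how a few of these could be declared with client_golang; the help strings, buckets, and label sets are assumptions, not necessarily what the PR registers:

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	// Histogram for pause latency; the bucket layout here is an assumption.
	PauseDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "sandboxmanager_pause_duration_ms",
		Help:    "Time taken to pause a sandbox, in milliseconds.",
		Buckets: prometheus.ExponentialBuckets(10, 2, 12),
	})

	// Counter for claim outcomes, labeled by status and lock type.
	ClaimTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "sandboxmanager_claim_total",
		Help: "Total sandbox claim attempts by status and lock type.",
	}, []string{"status", "lock_type"})

	// Gauge for the current size of the proxy routing table.
	RoutesTotal = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "sandboxmanager_routes_total",
		Help: "Current number of routes in the proxy routing table.",
	})
)

func init() {
	prometheus.MustRegister(PauseDuration, ClaimTotal, RoutesTotal)
}
```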
Ⅱ. Does this pull request fix one issue?

fixes #263
Ⅲ. Describe how to verify it
All unit tests are passing.
Example queries once scraped by Prometheus: