add HTTP client request metrics for scaler metric requests#7644
add HTTP client request metrics for scaler metric requests#7644aliaqel-stripe wants to merge 25 commits intokedacore:mainfrom
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
|
Thank you for your contribution! 🙏 Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer as expected. While you are waiting, make sure to:
Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient. Learn more about our contribution guide. |
77741df to
82d1237
Compare
…quests Instruments all outbound HTTP calls made by KEDA scalers with keda_http_client_requests_total and keda_http_client_request_duration_seconds (Prometheus) and keda.http.client.requests.count / keda.http.client.request.duration.seconds (OTel) via an InstrumentedRoundTripper injected in CreateHTTPTransportWithTLSConfig. Closes kedacore#6600 Signed-off-by: Ali Aqel <aliaqel@stripe.com>
…tion and double instrumentation Signed-off-by: Ali Aqel <aliaqel@stripe.com>
…after refresh Signed-off-by: Ali Aqel <aliaqel@stripe.com>
…o remove cross-file coupling Signed-off-by: Ali Aqel <aliaqel@stripe.com>
…bel precedence, and histogram recording Signed-off-by: Ali Aqel <aliaqel@stripe.com>
…ey-value pairs Signed-off-by: Ali Aqel <aliaqel@stripe.com>
…g metric fetches Signed-off-by: Ali Aqel <aliaqel@stripe.com>
…on failures Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
The duration histogram keda_scaler_http_request_duration_seconds previously carried 6 label dimensions (namespace, scaled_resource, scaler, trigger_name, metric_name, status_code), creating high MTS cardinality. Latency by scaler type is what matters, so the histogram is reduced to 2 labels: scaler and status_code. The counter retains all 6 labels. Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
fdb37a6 to
9c79fce
Compare
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
There was a problem hiding this comment.
Pull request overview
Adds outbound HTTP client request metrics for scaler metric collection, enabling better observability of scaler-to-upstream HTTP success/error rates and latencies across both Prometheus and OpenTelemetry backends.
Changes:
- Introduces an instrumented
http.RoundTripperand wires it into shared HTTP client/transport helpers, with metric recording gated on scaler-context keys. - Injects scaler context into metric-collection request contexts (including retry/refresh path) so outbound requests can be labeled correctly.
- Adds Prometheus + OTel metric instruments/recording paths, plus unit and sequential/e2e-style assertions.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/sequential/prometheus_metrics/prometheus_metrics_test.go | Adds sequential test assertions for new Prometheus HTTP client metrics. |
| tests/sequential/opentelemetry_metrics/opentelemetry_metrics_test.go | Adds sequential test assertions for OTel-exported HTTP client metrics. |
| pkg/util/http_roundtripper.go | New InstrumentedRoundTripper that records per-request count + duration when scaler context is present. |
| pkg/util/http_roundtripper_test.go | Unit tests for round-tripper wrapping behavior and helper instrumentation. |
| pkg/util/http.go | Wraps shared transports with instrumentation; changes helper return types to http.RoundTripper. |
| pkg/scaling/cache/scalers_cache.go | Injects scaler context keys into metric-collection context; updates retry path label sourcing. |
| pkg/scaling/cache/scalers_cache_test.go | Unit test verifying scaler request context key injection. |
| pkg/scalers/aws/aws_sigv4.go | Reuses a single transport in SigV4 RT to restore pooling and avoid repeated wrapping. |
| pkg/metricscollector/metricscollectors.go | Extends metrics collector interface with RecordHTTPClientRequest and shared status-code labeling. |
| pkg/metricscollector/prommetrics.go | Adds Prometheus counter + histogram and implements RecordHTTPClientRequest. |
| pkg/metricscollector/prommetrics_test.go | Unit tests for status code label logic and Prometheus recording path. |
| pkg/metricscollector/opentelemetry.go | Adds OTel instruments and implements RecordHTTPClientRequest; adjusts an error log message. |
| pkg/metricscollector/opentelemetry_test.go | Adds unit test coverage for OTel HTTP client request recording. |
| CHANGELOG.md | Documents addition of scaler HTTP request metrics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
wozniakjan
left a comment
There was a problem hiding this comment.
Thanks, great addition! Just please see the copilot review above and my nits below
- Rename CreateHTTPTransport/CreateHTTPTransportWithTLSConfig to CreateRT/CreateRTWithTLSConfig to reflect they return http.RoundTripper - Fix OTel metric description: "labeled by status class" → "labeled by HTTP status code" - Fix doc comments on RecordHTTPClientRequest to accurately describe that context keys are extracted by InstrumentedRoundTripper, not the collector - Fix test comments and rename TestInstrumentedRoundTripper_ScalerContextKey to TestInstrumentedRoundTripper_AllContextKeys with all five required context keys set Signed-off-by: Ali Aqel <aliaqel@stripe.com>
The return type change (*http.Transport -> http.RoundTripper) is non-breaking for all callers since http.Client.Transport and all struct fields receiving the value are already typed as http.RoundTripper. No caller accesses *http.Transport-specific fields, so the rename to CreateRT* is unnecessary churn. Signed-off-by: Ali Aqel <aliaqel@stripe.com>
…ipper return type CreateHTTPTransport and CreateHTTPTransportWithTLSConfig now return http.RoundTripper (wrapping InstrumentedRoundTripper), so the *Transport suffix is misleading. Rename to CreateRT / CreateRTWithTLSConfig. All callers assign to http.Client.Transport or struct fields typed as http.RoundTripper, so the return type change is non-breaking. Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
|
/run-e2e |
|
Hello The potential cardinality that we can generate if we record metrics with all these labels is huge
|
|
worth discussing, with third-party tools we would lose granularity and per SO / trigger information but perhaps that is acceptable? There are imho two points where we currently lack consensus: A) gauge vs. histogram
B) custom HTTP metrics vs. third-party package
any other arguments worth adding?
theoretical cardinality yes, also amplified by buckets for sure. But practically, I think the estimates might be less dramatic and frequently similar to other KEDA metrics
|
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: aliaqel-stripe <120822631+aliaqel-stripe@users.noreply.github.com>
|
@JorTurFer @wozniakjan ready for second review |
Add HTTP client request metrics for scaler metric collection (
keda_scaler_http_requests_total/keda_scaler_http_request_duration_secondsfor Prometheus,keda.scaler.http.requests.count/keda.scaler.http.request.duration.secondsfor OTel). Recording is gated on all five scaler context keys being present in the request context, so only metric collection calls emitted throughGetMetricsAndActivityForScalerare counted — not scaler-initialization requests.The request counter metric (
keda_scaler_http_requests_total) is labelled bynamespace,scaled_resource,scaler,trigger_name,metric_name,status_code.The latency histogram (
keda_scaler_http_request_duration_seconds) uses onlyscalerandstatus_codeto keep MTS cardinality low — latency by scaler type probably matters most here.status_codeis the numeric HTTP status, or"error"for transport-level failures with no HTTP response.Changes
pkg/util/http_roundtripper.go(new):InstrumentedRoundTripperwraps anyhttp.RoundTripperand records metrics after each round-trip when all scaler context keys are present.pkg/util/http.go:CreateHTTPTransportWithTLSConfignow wraps the transport withNewInstrumentedRoundTripper, so all scalers using these helpers gain instrumentation automatically.pkg/scaling/cache/scalers_cache.go: NewbuildScalerRequestCtxinjects the five context keys before callingGetMetricsAndActivity. The retry path re-fetches theScalerBuilderafter refresh so labels reflect the current scaler config.pkg/scalers/aws/aws_sigv4.go: Transport is now initialised once inNewSigV4RoundTripperrather than per-request, eliminating double instrumentation and restoring connection pooling.pkg/metricscollector/: NewRecordHTTPClientRequestin the interface and both backends;httpStatusCodeLabelmoved to the shared file; OTel error log message corrected.http_roundtripperandprommetrics; e2e assertions added to both Prometheus and OTel sequential test suites.Checklist
Fixes #6600