feat(metrics): add histograms for scaler and internal-scale-loop latency #7682
Sanil2108 wants to merge 1 commit into kedacore:main
Conversation
Latency metrics are better served by histograms than gauges: gauges only surface the last observed value, so percentiles and averages over a scrape interval aren't available. Most comparable controller-runtime and workqueue latency metrics are histograms; these were the exception.

Add two new histograms alongside the existing gauges:

- keda_scaler_metrics_duration_seconds: histogram of the latency of retrieving the current metric from each scaler.
- keda_internal_scale_loop_duration_seconds: histogram of the deviation between expected and actual execution time for the scaling loop.

The existing gauges (keda_scaler_metrics_latency_seconds, keda_internal_scale_loop_latency_seconds) are kept and marked deprecated in their help text so dashboards and alerts don't break. Record functions now write to both streams. The same dual-write is applied to the OpenTelemetry exporter via keda.scaler.metrics.duration.seconds and keda.internal.scale.loop.duration.seconds, using synchronous Float64Histogram instruments rather than observable gauges.

Fixes kedacore#7675

Signed-off-by: Sanil2108 <sanilkhurana7@gmail.com>
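To make the dual-write concrete, here is a minimal sketch of how the Prometheus side might look. The metric names follow the description above, but the variable names, label set, and the `RecordScalerLatency` helper are illustrative assumptions rather than the actual KEDA code:

```go
// Illustrative sketch only: metric names follow the PR description, but the
// variable names, label set, and function signature are assumptions.
package metricscollector

import "github.com/prometheus/client_golang/prometheus"

var (
	// Existing gauge, kept and marked deprecated in its help text.
	scalerMetricsLatency = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "keda",
			Name:      "scaler_metrics_latency_seconds",
			Help:      "DEPRECATED: use keda_scaler_metrics_duration_seconds instead.",
		},
		[]string{"namespace", "scaledObject", "scaler", "metric"},
	)

	// New histogram added alongside the gauge.
	scalerMetricsDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "keda",
			Name:      "scaler_metrics_duration_seconds",
			Help:      "Latency of retrieving the current metric from each scaler.",
			Buckets:   prometheus.DefBuckets,
		},
		[]string{"namespace", "scaledObject", "scaler", "metric"},
	)
)

// RecordScalerLatency writes the same observation to both the deprecated gauge
// and the new histogram, so existing dashboards keep working during migration.
func RecordScalerLatency(namespace, scaledObject, scaler, metric string, seconds float64) {
	labels := prometheus.Labels{
		"namespace": namespace, "scaledObject": scaledObject, "scaler": scaler, "metric": metric,
	}
	scalerMetricsLatency.With(labels).Set(seconds)
	scalerMetricsDuration.With(labels).Observe(seconds)
}
```

One observation feeds both instruments, so gauge-based dashboards keep working while histogram consumers get full distributions.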
Thank you for your contribution! 🙏 Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected. While you are waiting, make sure to:
Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient. Learn more about our contribution guide.
✅ Snyk checks have passed. No issues have been found so far.
	},
	metricLabels,
)
scalerMetricsDuration = prometheus.NewHistogramVec(
The cardinality of this is going to be pretty high; see the discussion in #7644 (comment).
I think we want to add a flag that enables the metrics with higher cardinality, and if it's off, emit the metrics with reduced labels.
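Purely as a sketch of that suggestion (the flag name and the reduced/full label sets below are assumptions, not anything in this PR), the gating could look roughly like this:

```go
package metricscollector

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical flag (name assumed); in practice this would be wired to a CLI
// flag or environment variable on the operator.
var enableHighCardinalityMetrics bool

// newScalerMetricsDuration registers the histogram with a reduced label set by
// default and only adds the per-object/per-metric labels when the flag is on.
func newScalerMetricsDuration() *prometheus.HistogramVec {
	labels := []string{"namespace", "scaler"} // reduced, low-cardinality set
	if enableHighCardinalityMetrics {
		labels = []string{"namespace", "scaledObject", "scaler", "metric"} // full set
	}
	return prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "keda",
			Name:      "scaler_metrics_duration_seconds",
			Help:      "Latency of retrieving the current metric from each scaler.",
			Buckets:   prometheus.DefBuckets,
		},
		labels,
	)
}
```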
Summary
Fixes #7675 by adding histogram metrics alongside the two existing latency gauges. The gauges stay in place (marked deprecated in their help text) so nothing downstream breaks immediately — users can migrate dashboards/alerts to the histograms at their own pace.
Prometheus
New:

- `keda_scaler_metrics_duration_seconds`: latency of retrieving the current metric from each scaler.
- `keda_internal_scale_loop_duration_seconds`: deviation between expected and actual execution time for the scaling loop.

Deprecated (kept emitting for backward compatibility):

- `keda_scaler_metrics_latency_seconds`
- `keda_internal_scale_loop_latency_seconds`
Buckets use `prometheus.DefBuckets` — happy to tune these if maintainers have preferred ranges (e.g. sub-millisecond resolution for fast scalers, or a longer tail for slow scrapers).
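For illustration only, a tuned bucket layout along those lines might look like the following; the PR itself sticks with `prometheus.DefBuckets`, and these values are just an assumption of what sub-millisecond resolution plus a longer tail could mean:

```go
package metricscollector

import "github.com/prometheus/client_golang/prometheus"

// Illustration only: the PR uses prometheus.DefBuckets. These explicit buckets
// add sub-millisecond resolution at the low end (fast in-cluster scalers) and a
// longer tail (slow external scrapers); the exact values are assumptions.
var scalerMetricsDurationTuned = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "keda",
		Name:      "scaler_metrics_duration_seconds",
		Help:      "Latency of retrieving the current metric from each scaler.",
		Buckets:   []float64{0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30},
	},
	[]string{"namespace", "scaledObject", "scaler", "metric"},
)
```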
OpenTelemetry
The same pattern via synchronous `Float64Histogram` instruments:

- `keda.scaler.metrics.duration.seconds`
- `keda.internal.scale.loop.duration.seconds`
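A rough sketch of recording into one of these synchronous instruments with the OpenTelemetry Go SDK (the helper name, attribute set, and meter name are assumptions; only the instrument name comes from this PR):

```go
package otelmetrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordScalerDuration is an illustrative sketch (not the actual KEDA exporter
// code) of recording into a synchronous Float64Histogram instead of an
// observable gauge.
func recordScalerDuration(ctx context.Context, scaler string, seconds float64) error {
	meter := otel.Meter("keda")

	hist, err := meter.Float64Histogram(
		"keda.scaler.metrics.duration.seconds",
		metric.WithDescription("Latency of retrieving the current metric from each scaler"),
		metric.WithUnit("s"),
	)
	if err != nil {
		return err
	}

	// A synchronous instrument records each observation as it happens, so the
	// exporter can build a full distribution rather than sampling a last value.
	hist.Record(ctx, seconds, metric.WithAttributes(attribute.String("scaler", scaler)))
	return nil
}
```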
Why dual-write instead of changing the type in place
Prometheus treats a TYPE change on an existing metric name as a breaking change — it confuses scrapers and any dashboards that assume the type. Keeping the gauges and adding histograms under new names (the convention used elsewhere: `workqueue_work_duration_seconds`, `controller_runtime_webhook_latency_seconds`, etc.) lets us migrate without a disruptive release. The gauges can be removed in a later major release once enough users are on the histograms.
Testing
Out of scope
Fixes #7675