
feat(metrics): add histograms for scaler and internal-scale-loop latency #7682

Open
Sanil2108 wants to merge 1 commit into kedacore:main from Sanil2108:feat/latency-histograms

Conversation

@Sanil2108

Summary

Fixes #7675 by adding histogram metrics alongside the two existing latency gauges. The gauges stay in place (marked deprecated in their help text) so nothing downstream breaks immediately — users can migrate dashboards/alerts to the histograms at their own pace.

Prometheus

New:

  • `keda_scaler_metrics_duration_seconds` — histogram of scaler metric retrieval latency.
  • `keda_internal_scale_loop_duration_seconds` — histogram of scale-loop deviation.

Deprecated (kept emitting for backward compatibility):

  • `keda_scaler_metrics_latency_seconds` (gauge)
  • `keda_internal_scale_loop_latency_seconds` (gauge)

Buckets use `prometheus.DefBuckets` — happy to tune these if maintainers have preferred ranges (e.g. sub-millisecond resolution for fast scalers, or a longer tail for slow scrapers).

OpenTelemetry

Same pattern via synchronous `Float64Histogram` instruments:

  • `keda.scaler.metrics.duration.seconds` (new)
  • `keda.internal.scale.loop.duration.seconds` (new)
  • `keda.scaler.metrics.latency.seconds` / `keda.internal.scale.loop.latency.seconds` gauges marked deprecated in their descriptions.

Why dual-write instead of changing the type in place

Prometheus treats a TYPE change on an existing metric name as a breaking change — it confuses scrapers and any dashboards that assume the type. Keeping the gauges and adding histograms under new names (the convention used elsewhere: `workqueue_work_duration_seconds`, `controller_runtime_webhook_latency_seconds`, etc.) lets us migrate without a disruptive release. The gauges can be removed in a later major release once enough users are on the histograms.

Testing

  • `go build ./...` green.
  • `go test ./pkg/metricscollector/ -count=1` green.

Out of scope

  • Tuning histogram bucket boundaries — defaulted to Prometheus's standard `DefBuckets`.
  • Removing the deprecated gauges — follow-up once users have had a release cycle to migrate.

Fixes #7675

Latency metrics are better served by histograms than gauges — gauges only
surface the last observed value, so percentiles and averages over a scrape
interval aren't available. Most comparable controller/runtime and workqueue
latency metrics are histograms; these were the exception.

Add two new histograms alongside the existing gauges:

- keda_scaler_metrics_duration_seconds    — histogram of the latency of
  retrieving current metric from each scaler.
- keda_internal_scale_loop_duration_seconds — histogram of the deviation
  between expected and actual execution time for the scaling loop.

The existing gauges (keda_scaler_metrics_latency_seconds,
keda_internal_scale_loop_latency_seconds) are kept and marked deprecated
in their help text so dashboards and alerts don't break. Record functions
now write to both streams.

Same dual-write is applied to the OpenTelemetry exporter via
keda.scaler.metrics.duration.seconds and
keda.internal.scale.loop.duration.seconds, using synchronous
Float64Histogram instruments rather than observable gauges.

Fixes kedacore#7675

Signed-off-by: Sanil2108 <sanilkhurana7@gmail.com>
@Sanil2108 Sanil2108 requested a review from a team as a code owner April 24, 2026 13:26
@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link the related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team April 24, 2026 13:26
@snyk-io

snyk-io Bot commented Apr 24, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total |
|--------|----------------------|----------|------|--------|-----|----------|
|        | Open Source Security | 0        | 0    | 0      | 0   | 0 issues |


```go
	},
	metricLabels,
)
scalerMetricsDuration = prometheus.NewHistogramVec(
```
Contributor

The cardinality of this is going to be pretty high; see the discussion in #7644 (comment).

I think we want to add a flag that enables the metrics with higher cardinality and if it's off, emit the metrics with reduced labels.


Successfully merging this pull request may close these issues.

Use histograms for latency related metrics