
feat(metrics): add histograms for scaler and internal-scale-loop latency #7682

Open
Sanil2108 wants to merge 1 commit into kedacore:main from Sanil2108:feat/latency-histograms

Conversation

@Sanil2108

Summary

Fixes #7675 by adding histogram metrics alongside the two existing latency gauges. The gauges stay in place (marked deprecated in their help text) so nothing downstream breaks immediately — users can migrate dashboards/alerts to the histograms at their own pace.

Prometheus

New:

  • `keda_scaler_metrics_duration_seconds` — histogram of scaler metric retrieval latency.
  • `keda_internal_scale_loop_duration_seconds` — histogram of scale-loop deviation.

Deprecated (kept emitting for backward compatibility):

  • `keda_scaler_metrics_latency_seconds` (gauge)
  • `keda_internal_scale_loop_latency_seconds` (gauge)

Buckets use `prometheus.DefBuckets` — happy to tune these if maintainers have preferred ranges (e.g. sub-millisecond resolution for fast scalers, or a longer tail for slow scrapers).

OpenTelemetry

Same pattern via synchronous `Float64Histogram` instruments:

  • `keda.scaler.metrics.duration.seconds` (new)
  • `keda.internal.scale.loop.duration.seconds` (new)
  • `keda.scaler.metrics.latency.seconds` / `keda.internal.scale.loop.latency.seconds` gauges marked deprecated in their descriptions.

Why dual-write instead of changing the type in place

Prometheus treats a TYPE change on an existing metric name as a breaking change — it confuses scrapers and any dashboards that assume the type. Keeping the gauges and adding histograms under new names (the convention used elsewhere: `workqueue_work_duration_seconds`, `controller_runtime_webhook_latency_seconds`, etc.) lets us migrate without a disruptive release. The gauges can be removed in a later major release once enough users are on the histograms.

Testing

  • `go build ./...` green.
  • `go test ./pkg/metricscollector/ -count=1` green.

Out of scope

  • Tuning histogram bucket boundaries — defaulted to Prometheus's standard `DefBuckets`.
  • Removing the deprecated gauges — follow-up once users have had a release cycle to migrate.

Fixes #7675

Latency metrics are better served by histograms than gauges — gauges only
surface the last observed value, so percentiles and averages over a scrape
interval aren't available. Most comparable controller/runtime and workqueue
latency metrics are histograms; these were the exception.

Add two new histograms alongside the existing gauges:

- keda_scaler_metrics_duration_seconds    — histogram of the latency of
  retrieving current metric from each scaler.
- keda_internal_scale_loop_duration_seconds — histogram of the deviation
  between expected and actual execution time for the scaling loop.

The existing gauges (keda_scaler_metrics_latency_seconds,
keda_internal_scale_loop_latency_seconds) are kept and marked deprecated
in their help text so dashboards and alerts don't break. Record functions
now write to both streams.

Same dual-write is applied to the OpenTelemetry exporter via
keda.scaler.metrics.duration.seconds and
keda.internal.scale.loop.duration.seconds, using synchronous
Float64Histogram instruments rather than observable gauges.

Fixes kedacore#7675

Signed-off-by: Sanil2108 <sanilkhurana7@gmail.com>
@Sanil2108 Sanil2108 requested a review from a team as a code owner April 24, 2026 13:26
@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link the related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team April 24, 2026 13:26
@snyk-io

snyk-io Bot commented Apr 24, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total |
|--------|----------------------|----------|------|--------|-----|----------|
|        | Open Source Security | 0        | 0    | 0      | 0   | 0 issues |


```go
	},
	metricLabels,
)
scalerMetricsDuration = prometheus.NewHistogramVec(
```
Contributor

The cardinality of this is going to be pretty high; see the discussion in #7644 (comment).

I think we want to add a flag that enables the metrics with higher cardinality and if it's off, emit the metrics with reduced labels.


Successfully merging this pull request may close these issues.

Use histograms for latency related metrics