
feat: Exemplars for metric & PromQL charts#2536

Open
jordan-simonovski wants to merge 8 commits into main from feat/exemplars


@jordan-simonovski

@jordan-simonovski jordan-simonovski commented Jun 29, 2026

Contributor

Why

Engineers staring at a latency spike on a chart have no way to jump to a trace that caused it.
This change adds exemplars, clickable markers overlaid on time charts, each linking to a representative trace.
Works for metric and PromQL sources.

Also added some test-telemetry infrastructure to build and validate the feature against more complex data sets.

CleanShot.2026-06-29.at.13.50.52.mp4

What's included

Exemplar overlay (app)

  • Diamond markers on HDXMultiSeriesTimeChart, plotted at the trace's own value and thinned to a configurable target count across the visible range by keeping the slowest (most notable) trace per window.
  • Hover card showing trace metadata (service, span, duration, status) fetched from a configurable exemplar trace source, with an Inspect trace button that deep-links straight to that trace. The card flips/clamps to stay on-screen near chart edges.
  • Opt-in "Exemplars" toggle in the chart editor (next to "As Ratio" for builder/metric charts; in the PromQL editor for PromQL charts) — persisted on the chart config, not a runtime toggle.
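The flip/clamp behaviour of the hover card can be sketched as a pure positioning function. This is a minimal illustration under stated assumptions, not the PR's code: the name `placeHoverCard`, the 8px offset, and the point/size shapes are all hypothetical.

```typescript
// Hypothetical shapes; none of these names come from the PR.
interface Size { width: number; height: number; }
interface Point { x: number; y: number; }

const OFFSET = 8; // assumed gap between the marker and the card, in px

// Prefer placing the card right of and below the marker, flip to the
// opposite side on overflow, then clamp so it stays inside the viewport.
export function placeHoverCard(marker: Point, card: Size, viewport: Size): Point {
  let x = marker.x + OFFSET;
  if (x + card.width > viewport.width) {
    x = marker.x - OFFSET - card.width; // flip to the left of the marker
  }
  let y = marker.y + OFFSET;
  if (y + card.height > viewport.height) {
    y = marker.y - OFFSET - card.height; // flip above the marker
  }
  // Clamp for the case where neither side fully fits.
  x = Math.min(Math.max(x, 0), Math.max(viewport.width - card.width, 0));
  y = Math.min(Math.max(y, 0), Math.max(viewport.height - card.height, 0));
  return { x, y };
}
```

Keeping the placement logic pure (no DOM access) makes the edge cases easy to unit-test independently of the chart.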

Two data backends

  • Metric sources (ClickHouse): renderMetricExemplarsChartConfig reads the OTel metric tables' Exemplars.* columns, honoring the chart's time range, metric name, and filters.
  • PromQL sources (native Prometheus): new /v1/prometheus/query_exemplars proxy; Prometheus responses are normalized to a shared Exemplar shape (defensive about trace_id/traceID label spellings).
  • Fetched in parallel via a new useExemplars hook, gated so it's a no-op unless enabled and […].
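The PromQL normalization step can be sketched as follows. The input types follow the response format documented for Prometheus's `/api/v1/query_exemplars` endpoint (values as strings, timestamps in fractional Unix seconds). The output `Exemplar` shape is an assumption that mirrors the `timestamp`/`value`/`traceId`/`spanId` aliases in the ClickHouse query quoted later in this thread. `normalizeExemplars` is a hypothetical name, not the PR's `normalizePrometheusExemplars`, and accepting `traceId` and the span-id variants goes beyond the `trace_id`/`traceID` spellings the PR mentions.

```typescript
// Subset of the Prometheus /api/v1/query_exemplars response body.
interface PromExemplar {
  labels?: Record<string, string>;
  value: string;     // Prometheus encodes sample values as strings
  timestamp: number; // Unix seconds, possibly fractional
}
interface PromExemplarSeries {
  seriesLabels?: Record<string, string>;
  exemplars?: PromExemplar[];
}
interface PromExemplarResponse {
  status: string;
  data?: PromExemplarSeries[];
}

// Hypothetical shared shape, mirroring the ClickHouse column aliases.
export interface Exemplar {
  timestamp: number; // Unix milliseconds
  value: number;
  traceId: string;
  spanId?: string;
}

// Assumed label spellings, checked in order.
const TRACE_KEYS = ['trace_id', 'traceID', 'traceId'];
const SPAN_KEYS = ['span_id', 'spanID', 'spanId'];

function firstLabel(labels: Record<string, string>, keys: string[]): string | undefined {
  for (const key of keys) {
    if (labels[key]) return labels[key];
  }
  return undefined;
}

export function normalizeExemplars(resp: PromExemplarResponse): Exemplar[] {
  if (resp.status !== 'success' || !resp.data) return [];
  const out: Exemplar[] = [];
  for (const series of resp.data) {
    for (const ex of series.exemplars ?? []) {
      const labels = ex.labels ?? {};
      const traceId = firstLabel(labels, TRACE_KEYS);
      const value = Number(ex.value);
      // Drop exemplars that cannot be linked to a trace or plotted.
      if (!traceId || !Number.isFinite(value)) continue;
      out.push({
        timestamp: Math.round(ex.timestamp * 1000),
        value,
        traceId,
        spanId: firstLabel(labels, SPAN_KEYS),
      });
    }
  }
  return out;
}
```

Converting to milliseconds at the boundary lets the chart treat ClickHouse and Prometheus exemplars identically.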

Fully-OTLP coherent metrics (collector)

  • Added the spanmetrics connector to the collector build and wired it into the OpAMP-generated collector config (behind ENABLE_SPAN_METRICS, off by default). It derives traces.span.metrics.* (calls + duration histogram)
    with exemplars from spans, so the duration histogram lands in ClickHouse with Exemplars.* […] (no synthetic/seeded data).
  • Optionally remote-writes those metrics (with exemplars) to a Prometheus endpoint (ENABLE_[…]) so the native PromQL exemplar path is testable against the same real data.
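Since the collector config is generated in TypeScript by the OpAMP controller, the wiring can be sketched as a function over a config object. This is an illustration only: `addSpanMetrics`, the option names, and the separate `metrics/spanmetrics` pipeline are hypothetical. Only the `spanmetrics` connector keys (`namespace`, `exemplars.enabled`) and the `prometheusremotewrite` exporter's `endpoint` key come from the OpenTelemetry Collector contrib components.

```typescript
// Hypothetical options; the real plumbing in opampController.ts is not shown in the PR.
interface SpanMetricsOptions {
  enabled: boolean;             // e.g. driven by ENABLE_SPAN_METRICS
  remoteWriteEndpoint?: string; // Prometheus remote-write URL, if configured
}

type Pipeline = { receivers: string[]; processors: string[]; exporters: string[] };
interface CollectorConfig {
  connectors: Record<string, unknown>;
  exporters: Record<string, unknown>;
  service: { pipelines: Record<string, Pipeline> };
}

export function addSpanMetrics(base: CollectorConfig, opts: SpanMetricsOptions): CollectorConfig {
  if (!opts.enabled) return base; // off by default: config is unchanged
  // Deep copy so the caller's base config is left untouched.
  const cfg: CollectorConfig = JSON.parse(JSON.stringify(base));
  cfg.connectors['spanmetrics'] = {
    namespace: 'traces.span.metrics', // yields traces.span.metrics.* metric names
    exemplars: { enabled: true },     // attach trace/span ids to data points
  };
  const pipelines = cfg.service.pipelines;
  // A connector is an exporter in one pipeline and a receiver in another.
  pipelines['traces'].exporters.push('spanmetrics');
  // Send span metrics to the same sinks as other metrics (e.g. ClickHouse).
  const exporters = [...pipelines['metrics'].exporters];
  if (opts.remoteWriteEndpoint) {
    // Endpoint is inlined into the generated config; omitted when absent.
    cfg.exporters['prometheusremotewrite'] = { endpoint: opts.remoteWriteEndpoint };
    exporters.push('prometheusremotewrite');
  }
  pipelines['metrics/spanmetrics'] = { receivers: ['spanmetrics'], processors: [], exporters };
  return cfg;
}
```

Using a dedicated pipeline keeps remote write limited to the derived span metrics instead of forwarding every ingested metric to Prometheus.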

Telemetry generator (telemetry-generator/)

  • A Node service emitting realistic OTLP traces (6 services, weighted attribute pools, nest[…] several failure scenarios) with backfill + live emission. It replaces ad-hoc ClickHouse seeding and is wired into docker-compose.dev.yml. The spanmetrics connector turns its traces into coherent metric exemplars.
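A "weighted attribute pool" is commonly implemented as cumulative-weight sampling. The sketch below assumes that technique; `pickWeighted` and the `Weighted` type are hypothetical and not taken from telemetry-generator/.

```typescript
// A weighted pool: each value is picked with probability weight / totalWeight.
export interface Weighted<T> { value: T; weight: number; }

// `rand` returns a number in [0, 1); injectable so tests are deterministic.
export function pickWeighted<T>(pool: Weighted<T>[], rand: () => number = Math.random): T {
  const total = pool.reduce((sum, entry) => sum + entry.weight, 0);
  if (pool.length === 0 || total <= 0) throw new Error('pool needs a positive total weight');
  let r = rand() * total;
  for (const entry of pool) {
    if (r < entry.weight) return entry.value;
    r -= entry.weight;
  }
  return pool[pool.length - 1].value; // guard against floating-point drift
}
```

For example, a pool of `{ GET: 70, POST: 25, DELETE: 5 }` yields mostly GET spans with an occasional DELETE, which gives the generated traces a realistic attribute distribution.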

Team setting

  • maxExemplars (Team Settings → Chart Settings; 0 = unlimited) controls overlay density.
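The thinning behaviour described under "Exemplar overlay" can be sketched against this setting. The PR only states that the slowest/most-notable trace per window is kept. The equal-width windows, the max-value rule, and the name `thinExemplars` are assumptions made for this illustration.

```typescript
export interface Exemplar { timestamp: number; value: number; traceId: string; }

// Keep at most `maxExemplars` markers by splitting the visible range into
// equal-width windows and keeping the highest-value (slowest) exemplar in each.
// Assumes range.start < range.end.
export function thinExemplars(
  exemplars: Exemplar[],
  range: { start: number; end: number }, // visible range, Unix ms
  maxExemplars: number,                  // 0 = unlimited, as in Team Settings
): Exemplar[] {
  const visible = exemplars.filter(e => e.timestamp >= range.start && e.timestamp <= range.end);
  if (maxExemplars <= 0 || visible.length <= maxExemplars) return visible;
  const width = (range.end - range.start) / maxExemplars;
  const best = new Map<number, Exemplar>();
  for (const e of visible) {
    // Window index; an exemplar exactly at range.end joins the last window.
    const idx = Math.min(Math.floor((e.timestamp - range.start) / width), maxExemplars - 1);
    const current = best.get(idx);
    if (!current || e.value > current.value) best.set(idx, e);
  }
  return Array.from(best.values()).sort((a, b) => a.timestamp - b.timestamp);
}
```

Windowing by time rather than taking the global top-N keeps markers spread across the chart instead of clustering around a single spike.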

Scoping

  • Exemplars are restricted to single, non-ratio, histogram (latency) metric series: the only shape where the exemplar's value shares the chart's y-axis unit. Otherwise the toggle is hidden and the renderer returns null.
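The scoping rule for the ClickHouse metric path reduces to a single predicate that both the toggle and the renderer can share. The `ChartConfig` shape below is a simplified, hypothetical stand-in for HyperDX's real chart types, and `exemplarsSupported` is an illustrative name.

```typescript
// Hypothetical, simplified chart-config shape; not the real HyperDX types.
type MetricType = 'gauge' | 'sum' | 'histogram';
interface SeriesConfig { metricType?: MetricType; }
interface ChartConfig {
  sourceKind: 'metric' | 'log' | 'trace';
  asRatio: boolean;
  series: SeriesConfig[];
}

// Exemplar values are raw measurements (e.g. a latency), so they only share
// the y-axis unit with a single, non-ratio histogram series.
export function exemplarsSupported(cfg: ChartConfig): boolean {
  return (
    cfg.sourceKind === 'metric' &&
    !cfg.asRatio &&
    cfg.series.length === 1 &&
    cfg.series[0].metricType === 'histogram'
  );
}
```

Sharing one predicate keeps the two sides consistent: when it returns false, the editor hides the toggle and the SQL renderer returns null.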

Out of scope (separate tickets)

  • Heatmap exemplars — needs trace-source exemplar generation + a uPlot overlay.
  • Ratio + Group By — the ratio engine isn't group-aware; tracked as a separate bugfix PR of […]

Testing

  • common-utils: SQL renderer tests (metric exemplar query shape; null for ratio/multi-series/non-histogram/non-metric).
  • app: normalizePrometheusExemplars label-variant tests; DBTimeChart tests updated to mock the new […]
  • api: query_exemplars route integration tests (native proxy + ClickHouse-backed empty result).
  • Verified end-to-end in an isolated stack: the spanmetrics connector emits a traces.span.metrics.duration histogram with real exemplars (trace id + actual latency), 100% resolving back to seeded traces.
  • make ci-lint / per-package tsc + unit suites green.

Changesets

  • exemplar-mode-metrics.md — @hyperdx/common-utils, @hyperdx/api, @hyperdx/app (minor)
  • span-metrics-connector.md — @hyperdx/api, @hyperdx/otel-collector (minor)

Notes / caveats

  • ENABLE_SPAN_METRICS is off by default — no production behavior change; it's enabled in local dev.
  • HyperDX collectors enforce ingest auth with scheme: '' (raw token, no Bearer prefix); th[…]on: <INGESTION_API_KEY> accordingly.

@changeset-bot

changeset-bot Bot commented Jun 29, 2026


🦋 Changeset detected

Latest commit: d50b65f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
| Name | Type |
|---|---|
| @hyperdx/common-utils | Minor |
| @hyperdx/api | Minor |
| @hyperdx/app | Minor |
| @hyperdx/otel-collector | Minor |


@vercel

vercel Bot commented Jun 29, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| hyperdx-oss | Ready | Preview, Comment | Jun 30, 2026 2:01am |
| hyperdx-storybook | Ready | Preview, Comment | Jun 30, 2026 2:01am |


@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label Jun 29, 2026
@github-actions

github-actions Bot commented Jun 29, 2026

Contributor

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Critical-path files (3):
    • packages/api/src/config.ts
    • packages/api/src/models/team.ts
    • packages/otel-collector/builder-config.yaml
  • Cross-layer change: touches frontend (packages/app) + backend (packages/api) + shared utils (packages/common-utils)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 24
  • Production lines changed: 1610 (+ 274 in test files, excluded from tier calculation)
  • Branch: feat/exemplars
  • Author: jordan-simonovski

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@jordan-simonovski jordan-simonovski marked this pull request as draft June 29, 2026 05:57
@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Contributor

Greptile Summary

This PR adds exemplar overlays for metric and PromQL charts. The main changes are:

  • Clickable exemplar markers on time-series charts.
  • Prometheus query_exemplars proxy support.
  • ClickHouse metric exemplar SQL rendering.
  • Optional spanmetrics collector wiring for coherent exemplar data.
  • Local telemetry generation and dev Prometheus setup.

Confidence Score: 4/5

This is close, but the exemplar SQL scoping issue should be fixed before merging.

  • The remote-write endpoint fix now avoids the collector environment mismatch.
  • The exemplar SQL can still return markers outside the plotted metric when chart filters use OR semantics.
  • The Prometheus exemplar route follows the existing team-scoped connection pattern.

packages/common-utils/src/core/renderChartConfig.ts

Important Files Changed

| Filename | Overview |
|---|---|
| packages/api/src/config.ts | Adds API-side spanmetrics remote-write endpoint resolution and disables the exporter when the endpoint is absent. |
| packages/api/src/opamp/controllers/opampController.ts | Adds spanmetrics connector wiring and inlines the remote-write endpoint into the generated collector config. |
| packages/common-utils/src/core/renderChartConfig.ts | Adds ClickHouse exemplar query rendering, but the metric-name filter can be weakened by OR filter semantics. |
| packages/api/src/routers/api/prometheus.ts | Adds the Prometheus exemplar proxy route using the existing team-scoped connection lookup and proxy helper. |


Reviews (5): Last reviewed commit: "Merge branch 'main' into feat/exemplars"

Comment thread packages/api/src/opamp/controllers/opampController.ts Outdated
@github-actions

github-actions Bot commented Jun 29, 2026

Contributor

E2E Test Results

All tests passed • 225 passed • 3 skipped • 1454s

| Status | Count |
|---|---|
| ✅ Passed | 225 |
| ❌ Failed | 0 |
| ⚠️ Flaky | 1 |
| ⏭️ Skipped | 3 |

Tests ran across 4 shards in parallel.


@jordan-simonovski jordan-simonovski marked this pull request as ready for review June 29, 2026 20:58
Comment thread packages/api/src/config.ts Outdated
…ated collector config, so the collector container no longer needs SPAN_METRICS_PROM_RW_ENDPOINT in its own environment.
Comment on lines +2247 to +2256
```ts
const where = await renderWhere(whereConfig, metadata);
const from = renderFrom({ from: whereConfig.from });

return concatChSql(' ', [
  chSql`SELECT
    toUnixTimestamp64Milli(ex_TimeUnix) AS timestamp,
    ex_Value AS value,
    ex_TraceId AS traceId,
    ex_SpanId AS spanId`,
  chSql`FROM ${from}`,
```

Contributor


P1 Metric Filter Escapes

The metric-name condition is added to filters, so a chart using filtersLogicalOperator: 'OR' can turn the exemplar WHERE clause into userFilterA OR userFilterB OR MetricName = .... In that case, the exemplar scan can include other metrics whenever a user filter matches, or include this metric outside the intended user filter group. Keep the required metric-name check ANDed separately from the user filter group so exemplar markers stay scoped to the plotted series.
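The suggested fix can be illustrated with plain strings. This is a sketch only: `buildExemplarWhere` is hypothetical, and it assumes the inputs are already-rendered, trusted SQL fragments; the real code goes through renderWhere/chSql with parameterized values and should never interpolate raw user input.

```typescript
type LogicalOp = 'AND' | 'OR';

// Join the user filters with the chart's operator inside one parenthesized
// group, then AND the mandatory metric-name predicate outside that group so
// an OR can never widen the scan to other metrics.
export function buildExemplarWhere(
  userFilters: string[],
  op: LogicalOp,
  metricNamePredicate: string,
): string {
  if (userFilters.length === 0) return metricNamePredicate;
  const group = userFilters.map(f => `(${f})`).join(` ${op} `);
  return `(${group}) AND ${metricNamePredicate}`;
}
```

With two OR-ed service filters this renders `((ServiceName = 'api') OR (ServiceName = 'web')) AND MetricName = 'latency'`, whereas the flat form `A OR B OR MetricName = ...` would also match rows from any metric where A or B holds.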

