# Discarded Threads

## Considered: Testing with actual Honeycomb export

Considered setting up an integration test that exports to a mock OTLP endpoint and verifies the full proto payload. Decided this is overkill for an investigation — the unit tests on `convert_data/2` output are sufficient to confirm the issues at the exporter level. The percentile display is a Honeycomb-side concern.
# Open Questions

1. **Should min/max be tracked and exported?** The OTLP Histogram spec supports optional min/max fields. Adding them would require tracking min/max per tag set in the ETS table (not per bucket). This would be a feature addition, not just a bug fix.

2. **Should sum use float instead of round()?** Currently `round(value)` is used for the ETS counter operation which requires integers. An alternative would be to store the sum separately as a float, but this would require changing the ETS update pattern.

3. **Are the default buckets appropriate?** The default buckets `[0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]` have very wide gaps at higher values (e.g., 500-750 is a 250ms range). Users can customize via `reporter_options: [buckets: [...]]` but the defaults may be too coarse for many use cases.
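Open question 1 could be prototyped roughly like the sketch below. This is illustrative only: it assumes a hypothetical ETS table keyed by `{metric_name, tags}`, and `MinMaxSketch`, `record/3`, and `min_max/2` are invented names, not the actual `MetricStore` API.

```elixir
defmodule MinMaxSketch do
  @moduledoc false

  # Record a measurement, updating the running min/max for this key.
  # NOTE: lookup-then-insert is not atomic the way :ets.update_counter/3 is;
  # in the real store this read-modify-write would need to run in the
  # table-owning process (or use :ets.select_replace/2) to stay race-free,
  # which is why this is a feature addition rather than a drop-in change.
  def record(table, key, value) do
    case :ets.lookup(table, key) do
      [] -> :ets.insert(table, {key, value, value})
      [{^key, mn, mx}] -> :ets.insert(table, {key, min(mn, value), max(mx, value)})
    end
  end

  def min_max(table, key) do
    case :ets.lookup(table, key) do
      [{^key, mn, mx}] -> {mn, mx}
      [] -> nil
    end
  end
end
```

Unlike the current `{count, sum}` counters, min/max as floats cannot go through `:ets.update_counter/3` at all, so some serialization point is unavoidable.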
# Progress Log

## 2026-03-20

### Initial analysis

- Read the upstream issue (electric-sql/elixir-otel-metric-exporter#20)
- Analyzed the full distribution metric pipeline in the codebase
- Identified root causes (see findings below)

### Code analysis findings

1. **Min/Max never populated**: The `HistogramDataPoint` proto has optional `min` (field 11) and `max` (field 12) fields, but `convert_data/2` for `%Metrics.Distribution{}` never sets them. The ETS storage only tracks `{count, sum}` per bucket — individual min/max values are not tracked during `write_metric/5`.

2. **Values rounded to integers**: `write_metric/5` uses `round(value)` in `update_sum_op = {3, round(value)}` (line 107 of metric_store.ex), so fractional values are rounded to the nearest integer before accumulation. For a value of 668.7, the sum records 669.

3. **Percentiles are a receiver-side interpretation**: The exporter sends standard OTLP histogram data (bucket_counts + explicit_bounds). Honeycomb estimates percentiles from these bucket boundaries. With default buckets `[0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, ...]`, a value of 668 falls in the (500, 750] bucket. Honeycomb estimates percentiles as 500 (lower bound) since it can't know the actual value within the bucket.
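Finding 3 can be made concrete with a small estimator. This is a sketch of one plausible lower-bound strategy (Honeycomb's exact algorithm is not documented here); `PercentileSketch.estimate/3` is an invented name. It shows why a single sample of 668 reports every percentile as 500 with the default bounds:

```elixir
defmodule PercentileSketch do
  # bounds: explicit upper bounds; counts: length(bounds) + 1 entries
  # (the last entry is the overflow bucket).
  def estimate(bounds, counts, percentile) do
    total = Enum.sum(counts)
    target = percentile / 100 * total

    # Walk the buckets until the cumulative count reaches the target rank.
    {idx, _seen} =
      counts
      |> Enum.with_index()
      |> Enum.reduce_while(0, fn {count, idx}, acc ->
        if acc + count >= target, do: {:halt, {idx, acc}}, else: {:cont, acc + count}
      end)

    # Report the lower bound of the bucket containing the target rank;
    # the true sample position inside the bucket is unknowable.
    if idx == 0, do: 0, else: Enum.at(bounds, idx - 1)
  end
end
```

With one sample in the (500, 750] bucket, every percentile from p01 to p99 resolves to the same bucket, whose lower bound is 500, matching the issue report.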

### Writing tests

- Created `test/otel_metric_exporter/distribution_investigation_test.exs` with tests that:
- Confirm min/max are nil in exported HistogramDataPoint
- Confirm values are rounded (precision loss)
- Verify bucket assignment correctness
- Exercise the scenario from the original issue

### Operational issues

(none so far)
# Task Prompt

Issue #5 from electric-sql/alco-agent-tasks: "telemetry_metrics.distribution is not reporting sensible results"

[This issue](https://github.com/electric-sql/elixir-otel-metric-exporter/issues/20) needs an investigation to confirm it's a real problem.

Use synthetic data and write the necessary unit tests to either confirm that the distribution metric produces incorrect results or disprove that thesis.
# Task: Investigate Distribution Metric Reporting

## Problem Statement

The `telemetry_metrics.distribution` metric is reported to Honeycomb (via OTLP export) with:
- `min` and `max` always showing as 0
- Percentiles (p01, p50, p99, etc.) not matching actual values
- Only `avg` and `sum` appearing to be correct

## Reported Symptoms

Example 1 (single data point, value ~668):
- avg=668, sum=668, count=1
- min=0, max=0
- All percentiles show 500

Example 2 (two data points, values ~714 and ~1000):
- avg=857, sum=1714, count=2
- min=0, max=0
- Lower percentiles=500, higher percentiles=1000

## Investigation Approach

1. Write unit tests that exercise the full distribution metric pipeline: write → aggregate → export
2. Verify the exported HistogramDataPoint protobuf structure
3. Confirm whether min/max fields are populated
4. Confirm whether `round(value)` causes precision loss
5. Document whether percentile behavior is expected (bucket-based limitation) vs a bug
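Step 4 can be demonstrated in isolation: Elixir's `round/1` rounds to the nearest integer, with halves rounding away from zero, so fractional measurements lose precision before they ever reach the ETS counter.

```elixir
# Exact vs. rounded accumulation for a handful of fractional measurements.
values = [668.7, 1.5, 2.5, 3.5]

exact_sum = Enum.sum(values)
# 676.2 — what a float-preserving sum would export

rounded_sum = values |> Enum.map(&round/1) |> Enum.sum()
# 669 + 2 + 3 + 4 = 678 — what the current round-then-accumulate path yields
```

The absolute error grows with the number of samples, since each write can contribute up to 0.5 of drift.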
test/otel_metric_exporter/distribution_investigation_test.exs
defmodule OtelMetricExporter.DistributionInvestigationTest do
@moduledoc """
Investigation tests for electric-sql/elixir-otel-metric-exporter#20:
"telemetry_metrics.distribution is not reporting sensible results"

These tests use synthetic data to confirm that the distribution metric export
has specific issues with min/max fields and value precision.
"""

use ExUnit.Case, async: false

alias OtelMetricExporter.Opentelemetry.Proto.Collector.Metrics.V1.ExportMetricsServiceRequest
alias OtelMetricExporter.Opentelemetry.Proto.Metrics.V1.HistogramDataPoint
alias Telemetry.Metrics
alias OtelMetricExporter.MetricStore

@name :distribution_investigation_test
@default_buckets [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]

setup do
bypass = Bypass.open()
{:ok, _} = start_supervised({Finch, name: DistInvestigationFinch})

config = %{
otlp_protocol: :http_protobuf,
otlp_endpoint: "http://localhost:#{bypass.port}",
otlp_headers: %{},
otlp_compression: nil,
resource: %{instance: %{id: "test"}},
export_period: 60_000,
default_buckets: @default_buckets,
metrics: [],
finch_pool: DistInvestigationFinch,
retry: false,
name: @name
}

{:ok, bypass: bypass, store_config: config}
end

describe "issue #20: min/max not populated in exported histogram" do
test "min and max fields are nil in exported HistogramDataPoint", %{
bypass: bypass,
store_config: config
} do
metric = Metrics.distribution("test.gc.timeout", unit: :millisecond)
start_supervised!({MetricStore, %{config | metrics: [metric]}})

tags = %{process_type: "worker"}

# Write a single value of 668 (matching the issue report)
MetricStore.write_metric(@name, metric, 668, tags)

histogram_data_point = export_and_decode_histogram(bypass, @name)

# CONFIRMED: min and max are nil (not populated by the exporter)
assert histogram_data_point.min == nil,
"Expected min to be nil (not tracked), got: #{inspect(histogram_data_point.min)}"

assert histogram_data_point.max == nil,
"Expected max to be nil (not tracked), got: #{inspect(histogram_data_point.max)}"
# Review note (P2): these assertions encode the current bug as the expected
# contract. Any future fix that starts populating HistogramDataPoint.min/max
# will fail CI even though the exporter is behaving correctly. If this is
# investigation-only code, keep it out of the regular suite or assert the
# desired post-fix behavior instead.

# count and sum ARE correctly populated
assert histogram_data_point.count == 1
assert histogram_data_point.sum == 668.0
end

test "min and max remain nil even with multiple data points", %{
bypass: bypass,
store_config: config
} do
metric = Metrics.distribution("test.schedule.timeout", unit: :millisecond)
start_supervised!({MetricStore, %{config | metrics: [metric]}})

tags = %{process_type: "worker"}

# Write two values matching the second example from the issue (sum=1714, count=2)
MetricStore.write_metric(@name, metric, 714, tags)
MetricStore.write_metric(@name, metric, 1000, tags)

histogram_data_point = export_and_decode_histogram(bypass, @name)

# CONFIRMED: min and max still nil even with multiple values
assert histogram_data_point.min == nil
assert histogram_data_point.max == nil

# count and sum are correct
assert histogram_data_point.count == 2
assert histogram_data_point.sum == 1714.0
end
end

describe "issue #20: value precision loss due to round()" do
test "fractional values are rounded to integers in sum", %{
bypass: bypass,
store_config: config
} do
metric = Metrics.distribution("test.latency", unit: :millisecond)
start_supervised!({MetricStore, %{config | metrics: [metric]}})

tags = %{}

# Write a fractional value
MetricStore.write_metric(@name, metric, 668.7, tags)

histogram_data_point = export_and_decode_histogram(bypass, @name)

# CONFIRMED: the sum is rounded — 668.7 becomes 669
# This is because write_metric uses `round(value)` for the ETS counter update
assert histogram_data_point.sum == 669.0,
"Expected sum to be 669.0 (rounded from 668.7), got: #{histogram_data_point.sum}"
# Review note (P2): this expectation blesses the precision-loss bug by
# treating 668.7 -> 669.0 as correct. Once write_metric/5 preserves
# fractional measurements, this test (and the multi-value rounding test
# below) will reject the fix; assert the desired behavior instead if this
# outlives the investigation.

assert histogram_data_point.count == 1
end

test "rounding error accumulates across multiple fractional values", %{
bypass: bypass,
store_config: config
} do
metric = Metrics.distribution("test.latency", unit: :millisecond)
start_supervised!({MetricStore, %{config | metrics: [metric]}})

tags = %{}

# Write several fractional values
# Exact sum: 1.5 + 2.5 + 3.5 = 7.5
# Rounded sum: 2 + 3 + 4 = 9 (each value rounded before accumulation)
# Note: Elixir's round/1 rounds halves away from zero, so 2.5 rounds to 3
MetricStore.write_metric(@name, metric, 1.5, tags)
MetricStore.write_metric(@name, metric, 2.5, tags)
MetricStore.write_metric(@name, metric, 3.5, tags)

histogram_data_point = export_and_decode_histogram(bypass, @name)

# The sum reflects rounded values, not the true sum
# round(1.5) = 2, round(2.5) = 3 (halves round away from zero), round(3.5) = 4
expected_rounded_sum = round(1.5) + round(2.5) + round(3.5)

assert histogram_data_point.sum == expected_rounded_sum * 1.0,
"Expected sum to be #{expected_rounded_sum}.0 (rounded), got: #{histogram_data_point.sum}"
end
end

describe "issue #20: bucket assignment verification" do
test "values are assigned to correct buckets with default bounds", %{
bypass: bypass,
store_config: config
} do
metric = Metrics.distribution("test.gc.timeout", unit: :millisecond)
start_supervised!({MetricStore, %{config | metrics: [metric]}})

tags = %{process_type: "worker"}

# Value 668 should go into bucket index 9 (bound 750, i.e., 500 < 668 <= 750)
MetricStore.write_metric(@name, metric, 668, tags)

histogram_data_point = export_and_decode_histogram(bypass, @name)

# Default buckets: [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, ...]
# 668 <= 750, so bucket index 9 should have count=1
# There are 16 bucket counts (15 bounds + 1 overflow)
assert length(histogram_data_point.bucket_counts) == length(@default_buckets) + 1

# Only bucket index 9 (<=750) should have a count
assert Enum.at(histogram_data_point.bucket_counts, 9) == 1

# All other buckets should be 0
histogram_data_point.bucket_counts
|> Enum.with_index()
|> Enum.each(fn {count, idx} ->
if idx != 9 do
assert count == 0,
"Expected bucket #{idx} to be 0, got: #{count}"
end
end)

assert histogram_data_point.explicit_bounds == @default_buckets
end

test "reproduces issue scenario: single value 668 with default buckets", %{
bypass: bypass,
store_config: config
} do
# This test reproduces the exact scenario from issue #20, example 1
metric = Metrics.distribution("vm.monitor.long_gc.timeout", unit: :millisecond)
start_supervised!({MetricStore, %{config | metrics: [metric]}})

tags = %{process_type: "worker"}
MetricStore.write_metric(@name, metric, 668, tags)

histogram_data_point = export_and_decode_histogram(bypass, @name)

# What gets exported:
assert histogram_data_point.count == 1
assert histogram_data_point.sum == 668.0
assert histogram_data_point.min == nil # NOT POPULATED — reported as 0 by Honeycomb
assert histogram_data_point.max == nil # NOT POPULATED — reported as 0 by Honeycomb

# The value 668 is in the (500, 750] bucket.
# Honeycomb can only estimate percentiles from bucket boundaries,
# so it reports all percentiles as 500 (the lower bound of the bucket).
# This is an inherent limitation of histogram-based distribution metrics,
# but the missing min/max is a bug in the exporter.
bucket_index_for_750 = Enum.find_index(@default_buckets, &(&1 == 750))
assert Enum.at(histogram_data_point.bucket_counts, bucket_index_for_750) == 1
end

test "reproduces issue scenario: two values across different buckets", %{
bypass: bypass,
store_config: config
} do
# This test reproduces the exact scenario from issue #20, example 2
metric = Metrics.distribution("vm.monitor.long_schedule.timeout", unit: :millisecond)
start_supervised!({MetricStore, %{config | metrics: [metric]}})

tags = %{process_type: "worker"}

# Two values: one ~714 (bucket <=750) and one ~1000 (bucket <=1000)
MetricStore.write_metric(@name, metric, 714, tags)
MetricStore.write_metric(@name, metric, 1000, tags)

histogram_data_point = export_and_decode_histogram(bypass, @name)

assert histogram_data_point.count == 2
assert histogram_data_point.sum == 1714.0
assert histogram_data_point.min == nil
assert histogram_data_point.max == nil

# 714 goes into bucket <=750 (index 9)
# 1000 goes into bucket <=1000 (index 10)
bucket_index_for_750 = Enum.find_index(@default_buckets, &(&1 == 750))
bucket_index_for_1000 = Enum.find_index(@default_buckets, &(&1 == 1000))

assert Enum.at(histogram_data_point.bucket_counts, bucket_index_for_750) == 1
assert Enum.at(histogram_data_point.bucket_counts, bucket_index_for_1000) == 1

# Honeycomb estimates percentiles from these buckets:
# - Lower percentiles (p01-p25) → lower bound of first populated bucket → 500
# - Higher percentiles (p50-p999) → lower bound of second populated bucket → 750 or 1000
# This matches the issue report where lower percentiles=500, higher=1000
end
end

describe "issue #20: bucket boundary edge cases" do
test "value exactly on a bucket boundary goes into that bucket", %{
bypass: bypass,
store_config: config
} do
metric = Metrics.distribution("test.latency", reporter_options: [buckets: [10, 20, 30]])
start_supervised!({MetricStore, %{config | metrics: [metric]}})

tags = %{}

# Value exactly equal to a bound (10) should go into that bucket (<=10)
MetricStore.write_metric(@name, metric, 10, tags)

histogram_data_point = export_and_decode_histogram(bypass, @name)

# Bucket 0: <=10, should have count=1
assert Enum.at(histogram_data_point.bucket_counts, 0) == 1
assert histogram_data_point.count == 1
end

test "value exceeding all bucket bounds goes into overflow bucket", %{
bypass: bypass,
store_config: config
} do
metric = Metrics.distribution("test.latency", reporter_options: [buckets: [10, 20, 30]])
start_supervised!({MetricStore, %{config | metrics: [metric]}})

tags = %{}

MetricStore.write_metric(@name, metric, 100, tags)

histogram_data_point = export_and_decode_histogram(bypass, @name)

# Overflow bucket (index 3 for 3 bounds) should have count=1
assert Enum.at(histogram_data_point.bucket_counts, 3) == 1
assert histogram_data_point.count == 1
end
end

# Helper: exports metrics and returns the single HistogramDataPoint
defp export_and_decode_histogram(bypass, name) do
histogram_data_point_ref = make_ref()
test_pid = self()

Bypass.expect_once(bypass, "POST", "/v1/metrics", fn conn ->
{:ok, body, conn} = Plug.Conn.read_body(conn)
request = ExportMetricsServiceRequest.decode(body)

[resource_metrics] = request.resource_metrics
[scope_metrics] = resource_metrics.scope_metrics
[metric] = scope_metrics.metrics

{:histogram, histogram} = metric.data
[data_point] = histogram.data_points

send(test_pid, {histogram_data_point_ref, data_point})
Plug.Conn.resp(conn, 200, "")
end)

assert :ok = MetricStore.export_sync(name)

receive do
{^histogram_data_point_ref, %HistogramDataPoint{} = data_point} -> data_point
after
5000 -> flunk("Timed out waiting for histogram data point")
end
end
end