electric-sql · alco · Mar 20, 2026 · Mar 20, 2026 · chatgpt-codex-connector · Mar 20, 2026
diff --git a/.agent-tasks/2026-03-20--5--distribution-metric-investigation/discarded-threads.md b/.agent-tasks/2026-03-20--5--distribution-metric-investigation/discarded-threads.md
@@ -0,0 +1,5 @@
+# Discarded Threads
+
+## Considered: Testing with actual Honeycomb export
+
+Considered setting up an integration test that exports to a mock OTLP endpoint and verifies the full proto payload. Decided this is overkill for an investigation — the unit tests on `convert_data/2` output are sufficient to confirm the issues at the exporter level. The percentile display is a Honeycomb-side concern.
diff --git a/.agent-tasks/2026-03-20--5--distribution-metric-investigation/open-questions.md b/.agent-tasks/2026-03-20--5--distribution-metric-investigation/open-questions.md
@@ -0,0 +1,7 @@
+# Open Questions
+
+1. **Should min/max be tracked and exported?** The OTLP Histogram spec supports optional min/max fields. Adding them would require tracking min/max per tag set in the ETS table (not per bucket). This would be a feature addition, not just a bug fix.
+
+2. **Should sum use float instead of round()?** Currently `round(value)` is used for the ETS counter operation which requires integers. An alternative would be to store the sum separately as a float, but this would require changing the ETS update pattern.
+
+3. **Are the default buckets appropriate?** The default buckets `[0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]` have very wide gaps at higher values (e.g., 500-750 is a 250ms range). Users can customize via `reporter_options: [buckets: [...]]` but the defaults may be too coarse for many use cases.
diff --git a/.agent-tasks/2026-03-20--5--distribution-metric-investigation/progress.md b/.agent-tasks/2026-03-20--5--distribution-metric-investigation/progress.md
@@ -0,0 +1,29 @@
+# Progress Log
+
+## 2026-03-20
+
+### Initial analysis
+
+- Read the upstream issue (electric-sql/elixir-otel-metric-exporter#20)
+- Analyzed the full distribution metric pipeline in the codebase
+- Identified root causes (see findings below)
+
+### Code analysis findings
+
+1. **Min/Max never populated**: The `HistogramDataPoint` proto has optional `min` (field 11) and `max` (field 12) fields, but `convert_data/2` for `%Metrics.Distribution{}` never sets them. The ETS storage only tracks `{count, sum}` per bucket — individual min/max values are not tracked during `write_metric/5`.
+
+2. **Values rounded to integers**: `write_metric/5` uses `round(value)` in `update_sum_op = {3, round(value)}` (line 107 of metric_store.ex). This means fractional values are truncated. For a value of 668.7, the sum would record 669.
+
+3. **Percentiles are a receiver-side interpretation**: The exporter sends standard OTLP histogram data (bucket_counts + explicit_bounds). Honeycomb estimates percentiles from these bucket boundaries. With default buckets `[0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, ...]`, a value of 668 falls in the (500, 750] bucket. Honeycomb estimates percentiles as 500 (lower bound) since it can't know the actual value within the bucket.
+
+### Writing tests
+
+- Created `test/otel_metric_exporter/distribution_investigation_test.exs` with tests that:
+  - Confirm min/max are nil in exported HistogramDataPoint
+  - Confirm values are rounded (precision loss)
+  - Verify bucket assignment correctness
+  - Exercise the scenario from the original issue
+
+### Operational issues
+
+(none so far)
diff --git a/.agent-tasks/2026-03-20--5--distribution-metric-investigation/prompt.md b/.agent-tasks/2026-03-20--5--distribution-metric-investigation/prompt.md
@@ -0,0 +1,7 @@
+# Task Prompt
+
+Issue #5 from electric-sql/alco-agent-tasks: "telemetry_metrics.distribution is not reporting sensible results"
+
+[This issue](https://github.com/electric-sql/elixir-otel-metric-exporter/issues/20) needs an investigation to confirm it's a real problem.
+
+Use synthetic data and write the necessary unit tests to either confirm that the distribution metric produces incorrect results or disprove that thesis.
diff --git a/.agent-tasks/2026-03-20--5--distribution-metric-investigation/task.md b/.agent-tasks/2026-03-20--5--distribution-metric-investigation/task.md
@@ -0,0 +1,28 @@
+# Task: Investigate Distribution Metric Reporting
+
+## Problem Statement
+
+The `telemetry_metrics.distribution` metric is reported to Honeycomb (via OTLP export) with:
+- `min` and `max` always showing as 0
+- Percentiles (p01, p50, p99, etc.) not matching actual values
+- Only `avg` and `sum` appearing to be correct
+
+## Reported Symptoms
+
+Example 1 (single data point, value ~668):
+- avg=668, sum=668, count=1
+- min=0, max=0
+- All percentiles show 500
+
+Example 2 (two data points, values ~714 and ~1000):
+- avg=857, sum=1714, count=2
+- min=0, max=0
+- Lower percentiles=500, higher percentiles=1000
+
+## Investigation Approach
+
+1. Write unit tests that exercise the full distribution metric pipeline: write → aggregate → export
+2. Verify the exported HistogramDataPoint protobuf structure
+3. Confirm whether min/max fields are populated
+4. Confirm whether `round(value)` causes precision loss
+5. Document whether percentile behavior is expected (bucket-based limitation) vs a bug
diff --git a/test/otel_metric_exporter/distribution_investigation_test.exs b/test/otel_metric_exporter/distribution_investigation_test.exs
@@ -0,0 +1,311 @@
+defmodule OtelMetricExporter.DistributionInvestigationTest do
+  @moduledoc """
+  Investigation tests for electric-sql/elixir-otel-metric-exporter#20:
+  "telemetry_metrics.distribution is not reporting sensible results"
+
+  These tests use synthetic data to confirm that the distribution metric export
+  has specific issues with min/max fields and value precision.
+  """
+
+  use ExUnit.Case, async: false
+
+  alias OtelMetricExporter.Opentelemetry.Proto.Collector.Metrics.V1.ExportMetricsServiceRequest
+  alias OtelMetricExporter.Opentelemetry.Proto.Metrics.V1.HistogramDataPoint
+  alias Telemetry.Metrics
+  alias OtelMetricExporter.MetricStore
+
+  @name :distribution_investigation_test
+  @default_buckets [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]
+
+  setup do
+    bypass = Bypass.open()
+    {:ok, _} = start_supervised({Finch, name: DistInvestigationFinch})
+
+    config = %{
+      otlp_protocol: :http_protobuf,
+      otlp_endpoint: "http://localhost:#{bypass.port}",
+      otlp_headers: %{},
+      otlp_compression: nil,
+      resource: %{instance: %{id: "test"}},
+      export_period: 60_000,
+      default_buckets: @default_buckets,
+      metrics: [],
+      finch_pool: DistInvestigationFinch,
+      retry: false,
+      name: @name
+    }
+
+    {:ok, bypass: bypass, store_config: config}
+  end
+
+  describe "issue #20: min/max not populated in exported histogram" do
+    test "min and max fields are nil in exported HistogramDataPoint", %{
+      bypass: bypass,
+      store_config: config
+    } do
+      metric = Metrics.distribution("test.gc.timeout", unit: :millisecond)
+      start_supervised!({MetricStore, %{config | metrics: [metric]}})
+
+      tags = %{process_type: "worker"}
+
+      # Write a single value of 668 (matching the issue report)
+      MetricStore.write_metric(@name, metric, 668, tags)
+
+      histogram_data_point = export_and_decode_histogram(bypass, @name)
+
+      # CONFIRMED: min and max are nil (not populated by the exporter)
+      assert histogram_data_point.min == nil,
+             "Expected min to be nil (not tracked), got: #{inspect(histogram_data_point.min)}"
+
+      assert histogram_data_point.max == nil,
+             "Expected max to be nil (not tracked), got: #{inspect(histogram_data_point.max)}"
+
+      # count and sum ARE correctly populated
+      assert histogram_data_point.count == 1
+      assert histogram_data_point.sum == 668.0
+    end
+
+    test "min and max remain nil even with multiple data points", %{
+      bypass: bypass,
+      store_config: config
+    } do
+      metric = Metrics.distribution("test.schedule.timeout", unit: :millisecond)
+      start_supervised!({MetricStore, %{config | metrics: [metric]}})
+
+      tags = %{process_type: "worker"}
+
+      # Write two values matching the second example from the issue (sum=1714, count=2)
+      MetricStore.write_metric(@name, metric, 714, tags)
+      MetricStore.write_metric(@name, metric, 1000, tags)
+
+      histogram_data_point = export_and_decode_histogram(bypass, @name)
+
+      # CONFIRMED: min and max still nil even with multiple values
+      assert histogram_data_point.min == nil
+      assert histogram_data_point.max == nil
+
+      # count and sum are correct
+      assert histogram_data_point.count == 2
+      assert histogram_data_point.sum == 1714.0
+    end
+  end
+
+  describe "issue #20: value precision loss due to round()" do
+    test "fractional values are rounded to integers in sum", %{
+      bypass: bypass,
+      store_config: config
+    } do
+      metric = Metrics.distribution("test.latency", unit: :millisecond)
+      start_supervised!({MetricStore, %{config | metrics: [metric]}})
+
+      tags = %{}
+
+      # Write a fractional value
+      MetricStore.write_metric(@name, metric, 668.7, tags)
+
+      histogram_data_point = export_and_decode_histogram(bypass, @name)
+
+      # CONFIRMED: the sum is rounded — 668.7 becomes 669
+      # This is because write_metric uses `round(value)` for the ETS counter update
+      assert histogram_data_point.sum == 669.0,
+             "Expected sum to be 669.0 (rounded from 668.7), got: #{histogram_data_point.sum}"
+
+      assert histogram_data_point.count == 1
+    end
+
+    test "rounding error accumulates across multiple fractional values", %{
+      bypass: bypass,
+      store_config: config
+    } do
+      metric = Metrics.distribution("test.latency", unit: :millisecond)
+      start_supervised!({MetricStore, %{config | metrics: [metric]}})
+
+      tags = %{}
+
+      # Write several fractional values
+      # Exact sum: 1.5 + 2.5 + 3.5 = 7.5
+      # Rounded sum: 2 + 2 + 4 = 8 (each value rounded before accumulation)
+      # Note: round/1 uses banker's rounding, so 2.5 rounds to 2, not 3
+      MetricStore.write_metric(@name, metric, 1.5, tags)
+      MetricStore.write_metric(@name, metric, 2.5, tags)
+      MetricStore.write_metric(@name, metric, 3.5, tags)
+
+      histogram_data_point = export_and_decode_histogram(bypass, @name)
+
+      # The sum reflects rounded values, not the true sum
+      # round(1.5) = 2, round(2.5) = 2 (banker's rounding), round(3.5) = 4
+      expected_rounded_sum = round(1.5) + round(2.5) + round(3.5)
+
+      assert histogram_data_point.sum == expected_rounded_sum * 1.0,
+             "Expected sum to be #{expected_rounded_sum}.0 (rounded), got: #{histogram_data_point.sum}"
+    end
+  end
+
+  describe "issue #20: bucket assignment verification" do
+    test "values are assigned to correct buckets with default bounds", %{
+      bypass: bypass,
+      store_config: config
+    } do
+      metric = Metrics.distribution("test.gc.timeout", unit: :millisecond)
+      start_supervised!({MetricStore, %{config | metrics: [metric]}})
+
+      tags = %{process_type: "worker"}
+
+      # Value 668 should go into bucket index 9 (bound 750, i.e., 500 < 668 <= 750)
+      MetricStore.write_metric(@name, metric, 668, tags)
+
+      histogram_data_point = export_and_decode_histogram(bypass, @name)
+
+      # Default buckets: [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, ...]
+      # 668 <= 750, so bucket index 9 should have count=1
+      # There are 16 bucket counts (15 bounds + 1 overflow)
+      assert length(histogram_data_point.bucket_counts) == length(@default_buckets) + 1
+
+      # Only bucket index 9 (<=750) should have a count
+      assert Enum.at(histogram_data_point.bucket_counts, 9) == 1
+
+      # All other buckets should be 0
+      histogram_data_point.bucket_counts
+      |> Enum.with_index()
+      |> Enum.each(fn {count, idx} ->
+        if idx != 9 do
+          assert count == 0,
+                 "Expected bucket #{idx} to be 0, got: #{count}"
+        end
+      end)
+
+      assert histogram_data_point.explicit_bounds == @default_buckets
+    end
+
+    test "reproduces issue scenario: single value 668 with default buckets", %{
+      bypass: bypass,
+      store_config: config
+    } do
+      # This test reproduces the exact scenario from issue #20, example 1
+      metric = Metrics.distribution("vm.monitor.long_gc.timeout", unit: :millisecond)
+      start_supervised!({MetricStore, %{config | metrics: [metric]}})
+
+      tags = %{process_type: "worker"}
+      MetricStore.write_metric(@name, metric, 668, tags)
+
+      histogram_data_point = export_and_decode_histogram(bypass, @name)
+
+      # What gets exported:
+      assert histogram_data_point.count == 1
+      assert histogram_data_point.sum == 668.0
+      assert histogram_data_point.min == nil  # NOT POPULATED — reported as 0 by Honeycomb
+      assert histogram_data_point.max == nil  # NOT POPULATED — reported as 0 by Honeycomb
+
+      # The value 668 is in the (500, 750] bucket.
+      # Honeycomb can only estimate percentiles from bucket boundaries,
+      # so it reports all percentiles as 500 (the lower bound of the bucket).
+      # This is an inherent limitation of histogram-based distribution metrics,
+      # but the missing min/max is a bug in the exporter.
+      bucket_index_for_750 = Enum.find_index(@default_buckets, &(&1 == 750))
+      assert Enum.at(histogram_data_point.bucket_counts, bucket_index_for_750) == 1
+    end
+
+    test "reproduces issue scenario: two values across different buckets", %{
+      bypass: bypass,
+      store_config: config
+    } do
+      # This test reproduces the exact scenario from issue #20, example 2
+      metric = Metrics.distribution("vm.monitor.long_schedule.timeout", unit: :millisecond)
+      start_supervised!({MetricStore, %{config | metrics: [metric]}})
+
+      tags = %{process_type: "worker"}
+
+      # Two values: one ~714 (bucket <=750) and one ~1000 (bucket <=1000)
+      MetricStore.write_metric(@name, metric, 714, tags)
+      MetricStore.write_metric(@name, metric, 1000, tags)
+
+      histogram_data_point = export_and_decode_histogram(bypass, @name)
+
+      assert histogram_data_point.count == 2
+      assert histogram_data_point.sum == 1714.0
+      assert histogram_data_point.min == nil
+      assert histogram_data_point.max == nil
+
+      # 714 goes into bucket <=750 (index 9)
+      # 1000 goes into bucket <=1000 (index 10)
+      bucket_index_for_750 = Enum.find_index(@default_buckets, &(&1 == 750))
+      bucket_index_for_1000 = Enum.find_index(@default_buckets, &(&1 == 1000))
+
+      assert Enum.at(histogram_data_point.bucket_counts, bucket_index_for_750) == 1
+      assert Enum.at(histogram_data_point.bucket_counts, bucket_index_for_1000) == 1
+
+      # Honeycomb estimates percentiles from these buckets:
+      # - Lower percentiles (p01-p25) → lower bound of first populated bucket → 500
+      # - Higher percentiles (p50-p999) → lower bound of second populated bucket → 750 or 1000
+      # This matches the issue report where lower percentiles=500, higher=1000
+    end
+  end
+
+  describe "issue #20: bucket boundary edge cases" do
+    test "value exactly on a bucket boundary goes into that bucket", %{
+      bypass: bypass,
+      store_config: config
+    } do
+      metric = Metrics.distribution("test.latency", reporter_options: [buckets: [10, 20, 30]])
+      start_supervised!({MetricStore, %{config | metrics: [metric]}})
+
+      tags = %{}
+
+      # Value exactly equal to a bound (10) should go into that bucket (<=10)
+      MetricStore.write_metric(@name, metric, 10, tags)
+
+      histogram_data_point = export_and_decode_histogram(bypass, @name)
+
+      # Bucket 0: <=10, should have count=1
+      assert Enum.at(histogram_data_point.bucket_counts, 0) == 1
+      assert histogram_data_point.count == 1
+    end
+
+    test "value exceeding all bucket bounds goes into overflow bucket", %{
+      bypass: bypass,
+      store_config: config
+    } do
+      metric = Metrics.distribution("test.latency", reporter_options: [buckets: [10, 20, 30]])
+      start_supervised!({MetricStore, %{config | metrics: [metric]}})
+
+      tags = %{}
+
+      MetricStore.write_metric(@name, metric, 100, tags)
+
+      histogram_data_point = export_and_decode_histogram(bypass, @name)
+
+      # Overflow bucket (index 3 for 3 bounds) should have count=1
+      assert Enum.at(histogram_data_point.bucket_counts, 3) == 1
+      assert histogram_data_point.count == 1
+    end
+  end
+
+  # Helper: exports metrics and returns the single HistogramDataPoint
+  defp export_and_decode_histogram(bypass, name) do
+    histogram_data_point_ref = make_ref()
+    test_pid = self()
+
+    Bypass.expect_once(bypass, "POST", "/v1/metrics", fn conn ->
+      {:ok, body, conn} = Plug.Conn.read_body(conn)
+      request = ExportMetricsServiceRequest.decode(body)
+
+      [resource_metrics] = request.resource_metrics
+      [scope_metrics] = resource_metrics.scope_metrics
+      [metric] = scope_metrics.metrics
+
+      {:histogram, histogram} = metric.data
+      [data_point] = histogram.data_points
+
+      send(test_pid, {histogram_data_point_ref, data_point})
+      Plug.Conn.resp(conn, 200, "")
+    end)
+
+    assert :ok = MetricStore.export_sync(name)
+
+    receive do
+      {^histogram_data_point_ref, %HistogramDataPoint{} = data_point} -> data_point
+    after
+      5000 -> flunk("Timed out waiting for histogram data point")
+    end
+  end
+end