refactor(core): rip telemetry wrappers, use logfire directly #754
phernandez merged 6 commits into main
basic_memory.telemetry had grown a layer of span/metric/contextualize
wrappers (scope, operation, span, started_span, contextualize,
add_counter, record_histogram) that added indirection without value.
Call sites nested the wrappers with overlapping attributes, producing
two- and three-level span trees for single operations.
This strip removes the wrappers entirely. Call sites use
logfire.span(...) and logfire.metric_counter(...) directly. The
logfire↔loguru integration propagates span attributes to logs without
the manual logger.contextualize bind that previously wrapped every
span.
Summary:
- telemetry.py: 246 → 64 lines. Only configure_telemetry() +
get_logfire_handler() remain. TelemetryState dataclass, _STATE,
reset_telemetry_state, pop_telemetry_warnings, _filter_attributes,
_METRICS cache, and _get_metric helper all deleted (dead code after
the wrapper removal).
- logfire: moved from optional [telemetry] extra to required
dependency so call sites can import it directly.
- search_service.py: nested search.execute → search.repository_query
spans flattened into one. Dropped trivial wrapping spans
(search.index.read_content, search.index.build_rows,
search.index.bulk_upsert) and redundant phase=/result_count=1
filler attributes.
- search_repository_base.py: metrics use
logfire.metric_counter(name).add(amount, attributes={...})
directly (11 sites).
- mcp/tools/*: the contextualize + started_span pair collapsed to a
single logfire.span(...) — loguru propagation happens via the
handler configured in bootstrap.
- utils.py: dropped the pop_telemetry_warnings logging loop (always
empty after the warnings path was removed).
- Tests updated to patch logfire.span / logfire.metric_* directly;
removed tests for deleted helpers.
Verified with just fast-check: 2756 passed, 34 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 2m 59s
PR Review: refactor(core): rip telemetry wrappers, use logfire directly
Summary
This is a solid, well-executed simplification. The removal of the custom
Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Issues
1. Stale comment in
"aiofiles>=24.1.0", # Optional observability (disabled by default via config)
This comment was previously associated with the
2. Sync context managers wrapping async code [ All
3. Instrument creation on each call [
logfire.metric_counter("vector_sync_entities_total").add(result.entities_total, ...)
Called on every batch. The old code had an explicit
Verdict
Approve with minor fix. The stale comment on
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0b496661ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    logfire.configure(**kwargs)
except TypeError:
    configure_kwargs.pop("send_to_logfire", None)
    logfire.configure(**configure_kwargs)
except Exception as exc:  # pragma: no cover
    _STATE.enabled = False  # pragma: no cover
    _STATE.warnings.append(f"Failed to configure Logfire telemetry: {exc}")  # pragma: no cover
    return False  # pragma: no cover
    # Older logfire releases don't accept send_to_logfire as a keyword.
    kwargs.pop("send_to_logfire", None)
    logfire.configure(**kwargs)
Guard Logfire bootstrap failures from crashing startup
configure_telemetry() now only handles TypeError around logfire.configure, so any other Logfire initialization failure will propagate and abort entrypoint startup when telemetry is enabled. Since init_cli_logging/init_api_logging call this during process boot, a telemetry-only misconfiguration or transient SDK error can now take down CLI/API startup instead of degrading gracefully as before.
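One possible shape for the graceful degradation this comment asks for is sketched below. It is not the project's code: `configure_with_fallback` is a hypothetical name, and the `configure` callable is injected as a stand-in for `logfire.configure` so the sketch runs without the SDK installed.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def configure_with_fallback(configure: Callable[..., None], **kwargs: Any) -> bool:
    """Return True if telemetry was configured, False if it was skipped.

    `configure` is a stand-in for logfire.configure (assumption for this sketch).
    """
    try:
        try:
            configure(**kwargs)
        except TypeError:
            # Older releases may reject send_to_logfire; retry without it.
            kwargs.pop("send_to_logfire", None)
            configure(**kwargs)
    except Exception as exc:
        # Telemetry is treated as best-effort: log and continue so that
        # process startup is not aborted by a telemetry-only failure.
        logger.warning("Telemetry disabled, configuration failed: %s", exc)
        return False
    return True
```

Whether a broad `except Exception` is acceptable here is a judgment call for the maintainers, since the repository's guidance elsewhere in this thread favors failing fast.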
with logfire.span("mcp.http.request", **attrs) as active_span:
    yield _RequestSpan(active_span)
Preserve request log context when telemetry is off
This scope now creates only a Logfire span and no longer binds request attributes into Loguru context. With logfire_enabled defaulting to False and the Logfire sink only attached when telemetry is configured, the common local/default path loses method/client_name/operation metadata on HTTP helper logs, which is a regression in debuggability for request failures.
This comment was attached to the logfire line in the previous PR state.
When logfire moved from optional-dependencies to required dependencies,
the comment ended up floating next to aiofiles, where it's misleading:
aiofiles is an async I/O library, not observability.
Flagged in PR #754 review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 1m 50s
PR Review: refactor(core): rip telemetry wrappers, use logfire directly
Summary
This is a clean simplification that removes a custom abstraction layer that had outlived its purpose. The net result is a smaller, more readable, and more direct codebase. The stale
Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Issues
1. Speculative
def _set_attributes(self, attrs: dict[str, typing.Any]) -> None:
    if self._active_span is None:
        return
    set_attributes = getattr(self._active_span, "set_attributes", None)
    if callable(set_attributes):
        set_attributes(attrs)
        return
    set_attribute = getattr(self._active_span, "set_attribute", None)
    if callable(set_attribute):
        for key, value in attrs.items():
            set_attribute(key, value)
CLAUDE.md explicitly forbids this pattern: "Never use
2. Codex P1: Narrow
try:
    logfire.configure(**kwargs)
except TypeError:
    kwargs.pop("send_to_logfire", None)
    logfire.configure(**kwargs)
Only
3. Sync context managers in async functions [ All
Verdict
Approve with one nit. Issue 1 (speculative
Feedback from PR #754 review: _RequestSpan used speculative getattr to
probe for set_attributes / set_attribute on its stored span, which
CLAUDE.md explicitly forbids. The check existed because the span was
typed as typing.Any | None, hiding the fact that logfire.span always
yields a LogfireSpan with a stable .set_attributes method.
This strip:
- Deletes the _RequestSpan adapter class entirely.
- _request_scope now yields the logfire span itself; callers hold a
logfire.LogfireSpan | None and call .set_attributes directly.
- Extracts two one-liner helpers (_response_span_attrs,
_transport_error_span_attrs) for the attribute dicts, preserving the
dedup across the 5 call_{get,put,post,patch,delete} sites.
- Test fakes yield a _NoopSpan stub so patched `logfire.span` stays
compatible with the new direct `.set_attributes` calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 1m 45s
PR Review: refactor(core): rip telemetry wrappers, use logfire directly
Summary
Three commits; each one closes a gap from the previous review pass. The end state is clean and the direction is correct. Previous issues are fully resolved.
Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Issues
1. Minor: Loguru context lost when telemetry is disabled [ The old In practice,
2. Nit: Sync context managers in async functions [ All
Verdict
Approve. All previously flagged issues are addressed:
The simplification is materially positive:
Two more simplifications in utils.py the reviewer/PM flagged:
1. _request_scope was a no-op wrapper around logfire.span that added
nothing — same name, same attributes, one call deep. Inlined at all
5 call sites (call_get/put/post/patch/delete). The helper is gone.
2. _extract_response_data used `except Exception` to swallow any JSON
decode failure. That's defensive code for upstream gateway HTML
error pages (Fly/Cloudflare 5xx); FastAPI itself always emits JSON.
Replaced with a content-type check: if the response doesn't claim
`application/json`, return None; otherwise let any decode error
surface. Matches CLAUDE.md's fail-fast guidance.
Tests updated: MockResponse fixtures now include a
`headers = {"content-type": "application/json"}` attribute so they
pass the new gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
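The content-type gate described in this commit can be sketched roughly as follows. `FakeResponse` and `extract_response_data` are hypothetical stand-ins written for illustration (the real helper is `_extract_response_data` and operates on an HTTP client response); only the two-line gate mirrors what the commit message describes.

```python
import json
from typing import Any, Mapping, Optional


class FakeResponse:
    """Hypothetical minimal response object with headers and a json() method."""

    def __init__(self, body: str, content_type: str) -> None:
        self.text = body
        self.headers: Mapping[str, str] = {"content-type": content_type}

    def json(self) -> Any:
        return json.loads(self.text)


def extract_response_data(response: FakeResponse) -> Optional[Any]:
    # Responses that do not claim to be JSON (for example a gateway HTML
    # error page) yield None instead of being parsed.
    if "application/json" not in response.headers.get("content-type", ""):
        return None
    # A response that claims JSON but fails to decode raises, so the
    # failure surfaces rather than being swallowed.
    return response.json()
```

With this shape, an HTML 5xx page returns None, while a malformed body labelled `application/json` raises `json.JSONDecodeError`.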
Claude finished @phernandez's task in 1m 26s
PR Review: refactor(core): rip telemetry wrappers, use logfire directly
Summary
Fourth review pass, triggered by
Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Issues
1. Minor inconsistency in
2. Loguru context lost when telemetry is disabled [ Acknowledged trade-off from the previous review pass. The old
Verdict
Approve. All code-level issues from prior review passes are closed. The two items above are a pre-existing style divergence and an acknowledged intentional trade-off. The refactor is materially positive:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7ce47d7113
logfire.metric_histogram(
    "vector_sync_prepare_seconds",
    prepare_seconds,
    unit="s",
    backend=backend_name,
    skip_only_entity=entity_skipped and embedding_jobs_count == 0,
).record(
Reuse metric instruments instead of recreating in hot sync path
_log_vector_sync_complete() runs for every synced entity, but each invocation now calls logfire.metric_histogram(...).record(...) directly, which re-enters instrument registration/lookup on every sample. In large sync runs this adds avoidable lock/contention and CPU overhead in a hot loop (the previous telemetry helper cached metric handles once per process). Hoisting/caching these metric instruments restores the prior constant-cost recording behavior for high-entity indexing workloads.
Codex review on PR #754 flagged that _log_vector_sync_complete runs per
entity in large sync batches, and each call now goes through
logfire.metric_histogram(...).record(...); OTel's MeterProvider lookup
is O(1) but still non-free under load. The original telemetry wrapper
cached instruments in a process-wide dict; my strip lost that.
Restores the cache as two small module-level helpers in
search_repository_base.py (_metric_histogram / _metric_counter) so
instruments are resolved once and reused. Callers switch to the helpers
(11 sites). Tests clear _METRIC_INSTRUMENTS via monkeypatch so fake
factories win on the first lookup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 1m 48s
PR Review: refactor(core): rip telemetry wrappers, use logfire directly
Summary
Fifth and final review pass, triggered by
Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Issues
1. The caching helpers are typed
2.
3. Loguru context lost when telemetry is off (acknowledged trade-off) [ Carried from prior passes. The old
Verdict
Approve. All previously flagged issues are closed:
The three items above are: a typing nit (defensible), a pre-existing style divergence in
The previous fix introduced a _METRIC_INSTRUMENTS cache + helper
functions to work around instrument re-creation in the per-entity hot
path. That was the same kind of wrapper we just deleted from
telemetry.py; a cache to work around a wrapper around logfire is not the
right shape.
Proper fix: record these metrics where they belong, once per batch,
using the totals VectorSyncBatchResult already accumulates
(prepare_seconds_total, queue_wait_seconds_total, etc.). The per-entity
histogram calls in _log_vector_sync_complete are gone; that function now
only emits the slow-entity warning log.
Batch-level block now:
- shares one `batch_attrs` dict across all recordings (was repeated 7x)
- records 5 histograms + 6 counters per batch with direct
logfire.metric_* calls, no cache, no helpers
- _METRIC_INSTRUMENTS + _metric_histogram + _metric_counter all removed
Tests updated: per-entity histogram counts (was 2 per 2-entity batch)
are now 1 per batch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
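The per-batch recording pattern can be sketched with a fake histogram in place of the real instrument. Everything here is hypothetical scaffolding: `FakeHistogram`, `BatchResult`, and `record_batch` are illustration-only names, and `vector_sync_queue_wait_seconds` is an assumed metric name (only the `queue_wait_seconds_total` field is named in the commit).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FakeHistogram:
    """Stand-in for a metric histogram; stores what would be recorded."""

    name: str
    samples: List[Tuple[float, Dict[str, str]]] = field(default_factory=list)

    def record(self, value: float, attributes: Dict[str, str]) -> None:
        self.samples.append((value, attributes))


@dataclass
class BatchResult:
    """Hypothetical subset of the totals a batch result accumulates."""

    prepare_seconds_total: float = 0.0
    queue_wait_seconds_total: float = 0.0


def record_batch(
    result: BatchResult, backend: str, histograms: Dict[str, FakeHistogram]
) -> None:
    # One attribute dict is built per batch and shared by every recording.
    batch_attrs = {"backend": backend}
    histograms["vector_sync_prepare_seconds"].record(
        result.prepare_seconds_total, batch_attrs
    )
    histograms["vector_sync_queue_wait_seconds"].record(
        result.queue_wait_seconds_total, batch_attrs
    )
```

Each histogram receives exactly one sample per batch regardless of how many entities the batch contained, which is the count change the updated tests reflect.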
Claude finished @phernandez's task in 1m 51s
PR Review: refactor(core): rip telemetry wrappers, use logfire directly
Summary
Sixth and final review pass, triggered by
Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Issues
1. Metric name semantics shifted silently [
2.
3. Loguru context lost when telemetry is off (acknowledged trade-off) [ Carried from prior passes. The old
Verdict
Approve. Summary across all six commits:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 009d735d75
logfire.metric_histogram("vector_sync_prepare_seconds", unit="s").record(
    result.prepare_seconds_total, attributes=batch_attrs
Split batch totals into distinct vector timing metrics
This now records result.prepare_seconds_total (and the other *_seconds_total values) under metric names like vector_sync_prepare_seconds that previously represented per-entity timings. _log_vector_sync_complete no longer emits per-entity histograms, so existing dashboards/alerts using these names will silently change units based on batch size and produce misleading latency distributions. Please either keep per-entity recording for these names or publish the aggregated values under new metric names.
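A small worked example may make the concern concrete. The timings are made-up values, and `vector_sync_batch_prepare_seconds` is a hypothetical new metric name used only to illustrate the "publish under new names" option.

```python
from typing import Dict, List


def per_entity_samples(timings: List[float]) -> Dict[str, List[float]]:
    # Old behavior: one histogram sample per entity under the original name.
    return {"vector_sync_prepare_seconds": list(timings)}


def per_batch_samples(timings: List[float]) -> Dict[str, List[float]]:
    # Aggregated behavior: one sample per batch, published under a distinct
    # (hypothetical) name so existing dashboards keep their meaning.
    return {"vector_sync_batch_prepare_seconds": [sum(timings)]}


# Made-up per-entity prepare timings for a four-entity batch, in seconds.
timings = [0.25, 0.5, 0.25, 1.0]
entity_view = per_entity_samples(timings)
batch_view = per_batch_samples(timings)
```

Here the largest per-entity sample is 1.0 s while the single batch sample is 2.0 s; recorded under the same name, the second would read as a slower entity rather than as a larger batch.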
"""
if "application/json" not in response.headers.get("content-type", ""):
    return None
return response.json()
Preserve ToolError wrapping on invalid JSON responses
Returning response.json() directly here allows JSONDecodeError to escape from call_get/call_post/... before they can raise ToolError, which breaks the helper contract that callers rely on for consistent error handling. This is user-visible when upstream intermediaries or misbehaving servers send malformed payloads with a JSON content type, because tool calls now raise raw parser exceptions instead of actionable ToolError messages.
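One way to keep the fail-fast behavior while preserving the error contract is to catch only the decode error and re-raise it as the tool-level error type. This is a sketch under assumptions: the `ToolError` class below is a local stand-in for whatever error type the helpers raise, and `parse_json_or_tool_error` is a hypothetical name.

```python
import json
from typing import Any, Optional


class ToolError(Exception):
    """Local stand-in for the tool-level error type callers expect."""


def parse_json_or_tool_error(body: str, content_type: str) -> Optional[Any]:
    # Non-JSON responses are not parsed at all.
    if "application/json" not in content_type:
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # Only the decode failure is translated; other errors still propagate.
        raise ToolError(f"Invalid JSON in response body: {exc.msg}") from exc
```

Chaining with `from exc` keeps the original parser error available for debugging while callers see a single error type.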
Summary
basic_memory.telemetry had grown a layer of span/metric/context wrappers (scope, operation, span, started_span, contextualize, add_counter, record_histogram) that produced nested span trees for single operations (e.g. search.execute → search.repository_query for one repo call) and made every call site tall and hard to read. This removes the wrappers entirely.
- telemetry.py: 246 → 64 lines. Only configure_telemetry() + get_logfire_handler() remain. TelemetryState dataclass, _STATE, reset_telemetry_state, pop_telemetry_warnings, _filter_attributes, _METRICS cache, and _get_metric were all dead after the wrappers came out.
- logfire: now a required dependency (previously the optional [telemetry] extra) so call sites can import logfire directly. Added ~4MB to the base install; the wrapper layer only existed to handle the absent-package case.
Call-site changes
- search_service.py: nested search.execute → search.repository_query spans flattened to one. Dropped trivial wrapping spans (search.index.read_content, search.index.build_rows, search.index.bulk_upsert) and redundant phase= / result_count=1 filler.
- search_repository_base.py: metrics now call logfire.metric_counter(name).add(amount, attributes={...}) directly (11 sites).
- mcp/tools/*: the contextualize + started_span pair collapsed into a single logfire.span(...). The logfire↔loguru integration propagates span attributes to logs automatically via the handler wired up in bootstrap, so the manual logger.contextualize bind that paired with every span is gone.
- utils.py: dropped the pop_telemetry_warnings logging loop (warnings path was dead after logfire became required).
Verification
- just fast-check locally: 2756 passed, 34 skipped, 0 failed (26:32).
- tests/**/*telemetry*.py suites rewritten to patch logfire.span / logfire.metric_* directly; tests for deleted helpers removed.
Test plan
- LOGFIRE_ENABLED=true: spans should appear as a flatter tree than before
🤖 Generated with Claude Code