Skip to content

refactor(core): rip telemetry wrappers, use logfire directly#754

Merged
phernandez merged 6 commits intomainfrom
refactor/telemetry-direct-logfire
Apr 17, 2026
Merged

refactor(core): rip telemetry wrappers, use logfire directly#754
phernandez merged 6 commits intomainfrom
refactor/telemetry-direct-logfire

Conversation

@phernandez
Copy link
Copy Markdown
Member

Summary

  • basic_memory.telemetry had grown a layer of span/metric/context wrappers (scope, operation, span, started_span, contextualize, add_counter, record_histogram) that produced nested span trees for single operations (e.g. search.executesearch.repository_query for one repo call) and made every site tall and hard to read. This removes the wrappers entirely.
  • telemetry.py: 246 → 64 lines. Only configure_telemetry() + get_logfire_handler() remain. TelemetryState dataclass, _STATE, reset_telemetry_state, pop_telemetry_warnings, _filter_attributes, _METRICS cache, and _get_metric were all dead after the wrappers came out.
  • logfire is now a required dependency (moved out of the optional [telemetry] extra) so call sites can import logfire directly. Added ~4MB to the base install; the wrapper layer only existed to handle the absent-package case.

Call-site changes

  • search_service.py: nested search.executesearch.repository_query spans flattened to one. Dropped trivial wrapping spans (search.index.read_content, search.index.build_rows, search.index.bulk_upsert) and redundant phase=/result_count=1 filler.
  • search_repository_base.py: metrics now call logfire.metric_counter(name).add(amount, attributes={...}) directly — 11 sites.
  • mcp/tools/*: the contextualize + started_span pair collapsed into a single logfire.span(...). The logfire↔loguru integration propagates span attributes to logs automatically via the handler wired up in bootstrap, so the manual logger.contextualize bind that paired with every span is gone.
  • utils.py: dropped the pop_telemetry_warnings logging loop (warnings path was dead after logfire became required).

Verification

  • just fast-check locally: 2756 passed, 34 skipped, 0 failed (26:32).
  • All tests/**/*telemetry*.py suites rewritten to patch logfire.span / logfire.metric_* directly; tests for deleted helpers removed.

Test plan

  • CI passes (SQLite + Postgres)
  • Verify no regression in trace output when running with LOGFIRE_ENABLED=true — spans should appear as a flatter tree than before

🤖 Generated with Claude Code

basic_memory.telemetry had grown a layer of span/metric/contextualize
wrappers (scope, operation, span, started_span, contextualize,
add_counter, record_histogram) that added indirection without value.
Call sites nested the wrappers with overlapping attributes, producing
two- and three-level span trees for single operations.

This strip removes the wrappers entirely. Call sites use
logfire.span(...) and logfire.metric_counter(...) directly. The
logfire↔loguru integration propagates span attributes to logs without
the manual logger.contextualize bind that previously wrapped every
span.

Summary:
- telemetry.py: 246 → 64 lines. Only configure_telemetry() +
  get_logfire_handler() remain. TelemetryState dataclass, _STATE,
  reset_telemetry_state, pop_telemetry_warnings, _filter_attributes,
  _METRICS cache, and _get_metric helper all deleted (dead code after
  the wrapper removal).
- logfire: moved from optional [telemetry] extra to required
  dependency so call sites can import it directly.
- search_service.py: nested search.execute → search.repository_query
  spans flattened into one. Dropped trivial wrapping spans
  (search.index.read_content, search.index.build_rows,
  search.index.bulk_upsert) and redundant phase=/result_count=1
  filler attributes.
- search_repository_base.py: metrics use
  logfire.metric_counter(name).add(amount, attributes={...})
  directly (11 sites).
- mcp/tools/*: the contextualize + started_span pair collapsed to a
  single logfire.span(...) — loguru propagation happens via the
  handler configured in bootstrap.
- utils.py: dropped the pop_telemetry_warnings logging loop (always
  empty after the warnings path was removed).
- Tests updated to patch logfire.span / logfire.metric_* directly;
  removed tests for deleted helpers.

Verified with just fast-check: 2756 passed, 34 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 17, 2026

Claude finished @phernandez's task in 2m 59s —— View job


PR Review: refactor(core): rip telemetry wrappers, use logfire directly

Tasks:

  • Gather context and understand the PR
  • Review code quality & standards
  • Review testing & documentation
  • Review architecture concerns
  • Review security & performance
  • Review compatibility
  • Post final review summary

Summary

This is a solid, well-executed simplification. The removal of the custom telemetry.* wrapper layer in favor of direct logfire calls is clearly the right call — the wrappers added nesting and cognitive overhead without meaningful value. The diff is net negative in lines, the patterns are consistent throughout, and tests are cleanly updated. A few minor issues below.


Checklist

Code Quality & Standards

  • Follows Basic Memory's coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • SQLAlchemy 2.0 best practices
  • FastAPI and Typer conventions followed
  • 100-character line length limit maintained
  • No commented-out code blocks

Testing & Documentation

  • Unit tests updated for deleted helpers
  • Telemetry tests rewritten to patch logfire.span / logfire.metric_* directly
  • Test coverage maintained at 100%
  • No documentation changes needed (internal refactor)
  • Manual verification required: PR checklist item "Verify no regression in trace output when running with LOGFIRE_ENABLED=true" is still open — no automated test covers actual span output in a live logfire session

Basic Memory Architecture

  • MCP tools follow atomic, composable design
  • No database changes
  • Local-first architecture preserved
  • Knowledge graph operations unaffected

Security & Performance

  • No secrets or credentials
  • No new input validation gaps
  • Error handling preserved
  • No sensitive data in logs

Compatibility

  • No path separator assumptions (Windows-safe)
  • No emoji or unicode in console/log output

Issues

1. Stale comment in pyproject.toml:36 [pyproject.toml]

"aiofiles>=24.1.0", # Optional observability (disabled by default via config)

This comment was previously associated with the logfire optional extra below it. After moving logfire to required deps, the comment now incorrectly decorates aiofiles (an async file I/O library, not observability). Should be removed or corrected.

Fix this →


2. Sync context managers wrapping async code [search_service.py:190-212, search_service.py:232-252, search_service.py:386, search_service.py:706, all MCP tools]

All with logfire.span(...) usages are sync context managers wrapping await calls. This works correctly (Python runs __exit__ after the await resolves), and logfire supports it. The idiomatic form for async functions is async with logfire.span(...). Not a bug, but worth aligning with logfire's recommended async pattern in a follow-up if the team wants consistency with the async codebase style.


3. Instrument creation on each call [search_repository_base.py:1072-1127]

logfire.metric_counter("vector_sync_entities_total").add(result.entities_total, ...)

Called on every batch. The old code had an explicit _METRICS: dict cache for this. logfire does cache instruments internally by name (backed by the OTel MeterProvider), so this is functionally safe. But the old pattern was more explicit about intent. If logfire ever drops internal caching or this code is ported elsewhere, it could silently degrade. Minor nit — not a bug.


Verdict

Approve with minor fix. The stale comment on pyproject.toml:36 is the only change worth making before merge. Everything else is clean. The simplification improves readability materially — the before/after on write_note.py and search_service.py in particular shows how much indentation and noise the old wrapper pair (contextualize + started_span) added.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0b496661ec

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +49 to +53
logfire.configure(**kwargs)
except TypeError:
configure_kwargs.pop("send_to_logfire", None)
logfire.configure(**configure_kwargs)
except Exception as exc: # pragma: no cover
_STATE.enabled = False # pragma: no cover
_STATE.warnings.append(f"Failed to configure Logfire telemetry: {exc}") # pragma: no cover
return False # pragma: no cover
# Older logfire releases don't accept send_to_logfire as a keyword.
kwargs.pop("send_to_logfire", None)
logfire.configure(**kwargs)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard Logfire bootstrap failures from crashing startup

configure_telemetry() now only handles TypeError around logfire.configure, so any other Logfire initialization failure will propagate and abort entrypoint startup when telemetry is enabled. Since init_cli_logging/init_api_logging call this during process boot, a telemetry-only misconfiguration or transient SDK error can now take down CLI/API startup instead of degrading gracefully as before.

Useful? React with 👍 / 👎.

Comment thread src/basic_memory/mcp/tools/utils.py Outdated
Comment on lines +212 to +213
with logfire.span("mcp.http.request", **attrs) as active_span:
yield _RequestSpan(active_span)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve request log context when telemetry is off

This scope now creates only a Logfire span and no longer binds request attributes into Loguru context. With logfire_enabled defaulting to False and the Logfire sink only attached when telemetry is configured, the common local/default path loses method/client_name/operation metadata on HTTP helper logs, which is a regression in debuggability for request failures.

Useful? React with 👍 / 👎.

This comment was attached to the logfire line in the previous PR
state. When logfire moved from optional-dependencies to required
dependencies, the comment ended up floating next to aiofiles, where
it's misleading — aiofiles is an async I/O library, not observability.

Flagged in PR #754 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 17, 2026

Claude finished @phernandez's task in 1m 50s —— View job


PR Review: refactor(core): rip telemetry wrappers, use logfire directly

Tasks:

  • Gather context and understand the PR
  • Review key changed files
  • Review code quality & standards
  • Review testing & documentation
  • Review architecture concerns
  • Review security & performance
  • Post final review summary

Summary

This is a clean simplification that removes a custom abstraction layer that had outlived its purpose. The net result is a smaller, more readable, and more direct codebase. The stale pyproject.toml comment flagged in the previous review pass is already resolved in aaf4fcd. Below is the full checklist assessment and remaining issues.


Checklist

Code Quality & Standards

  • Follows Basic Memory's coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • SQLAlchemy 2.0 best practices
  • FastAPI and Typer conventions followed
  • 100-character line length limit maintained
  • No commented-out code blocks
  • Speculative getattrmcp/tools/utils.py:68–80 violates CLAUDE.md's "No speculative getattr" guideline (see Issue 1)

Testing & Documentation

  • Unit tests for deleted helpers removed
  • Telemetry tests rewritten to patch logfire.span / logfire.metric_* directly
  • 100% test coverage maintained
  • No documentation changes needed (internal refactor)
  • Manual verification still open: "Verify no regression in trace output when running with LOGFIRE_ENABLED=true" remains an unchecked checklist item in the PR description

Basic Memory Architecture

  • MCP tools follow atomic, composable design
  • No database changes
  • Local-first architecture preserved
  • Knowledge graph operations unaffected

Security & Performance

  • No hardcoded secrets or credentials
  • No new input validation gaps
  • Error handling preserved
  • No sensitive data in logs or commits
  • logfire.metric_counter(name) creates instruments per-call — logfire caches internally by name, so this is safe

Compatibility

  • No path separator assumptions
  • No emoji or unicode in console/log output

Issues

1. Speculative getattr in _RequestSpan._set_attributes [mcp/tools/utils.py:68–80]

def _set_attributes(self, attrs: dict[str, typing.Any]) -> None:
    if self._active_span is None:
        return

    set_attributes = getattr(self._active_span, "set_attributes", None)
    if callable(set_attributes):
        set_attributes(attrs)
        return

    set_attribute = getattr(self._active_span, "set_attribute", None)
    if callable(set_attribute):
        for key, value in attrs.items():
            set_attribute(key, value)

CLAUDE.md explicitly forbids this pattern: "Never use getattr(obj, 'attr', default) when unsure about attribute names. Check the class definition or source code first." The span yielded by with logfire.span(...) as active_span: is a logfire SpanContext object which reliably provides set_attributes. The secondary set_attribute fallback is speculative dead code. The active_span: typing.Any | None typing hides the concrete type and enables this pattern — the type should be logfire.SpanContext | None so the method can be called directly without inspection.

Fix this →


2. Codex P1: Narrow TypeError catch in configure_telemetry [telemetry.py:48–53]

try:
    logfire.configure(**kwargs)
except TypeError:
    kwargs.pop("send_to_logfire", None)
    logfire.configure(**kwargs)

Only TypeError (from older logfire releases not accepting send_to_logfire) is caught. Any other logfire initialization failure (e.g. bad code_source config, OTel backend errors, import-time issues) will propagate and abort CLI/API startup when logfire_enabled=True. This is consistent with the project's "fail fast" principle, so it is a deliberate trade-off — but worth documenting with a comment so future maintainers don't widen the catch accidentally.


3. Sync context managers in async functions [search_service.py:190–212, search_service.py:232–252, all MCP tool files]

All with logfire.span(...) usages are sync context managers wrapping await calls. This works correctly (Python executes __exit__ after the await resolves), and logfire supports both forms. The idiomatic pattern for async code is async with logfire.span(...). Not a bug, but flagging for consistency with the async codebase style.


Verdict

Approve with one nit. Issue 1 (speculative getattr violating an explicit CLAUDE.md guideline) is the only item worth addressing before merge — the fix is small. Issue 2 is intentional by design. Issue 3 is a style nit for a follow-up. The simplification is materially positive: telemetry.py dropped from 246 to 64 lines, tool files lost the noisy contextualize + started_span pairing, and tests are cleanly updated to patch logfire directly.

Feedback from PR #754 review: _RequestSpan used speculative getattr
to probe for set_attributes / set_attribute on its stored span, which
CLAUDE.md explicitly forbids. The check existed because the span was
typed as typing.Any | None, hiding the fact that logfire.span always
yields a LogfireSpan with a stable .set_attributes method.

This strip:
- Deletes the _RequestSpan adapter class entirely.
- _request_scope now yields the logfire span itself; callers hold a
  logfire.LogfireSpan | None and call .set_attributes directly.
- Extracts two one-liner helpers (_response_span_attrs,
  _transport_error_span_attrs) for the attribute dicts, preserving
  the dedup across the 5 call_{get,put,post,patch,delete} sites.
- Test fakes yield a _NoopSpan stub so patched `logfire.span` stays
  compatible with the new direct `.set_attributes` calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 17, 2026

Claude finished @phernandez's task in 1m 45s —— View job


PR Review: refactor(core): rip telemetry wrappers, use logfire directly

Tasks:

  • Gather context and understand the PR
  • Review key changed files
  • Review code quality & standards
  • Review testing & documentation
  • Review architecture concerns
  • Review security & performance
  • Post final review summary

Summary

Three commits; each one closes a gap from the previous review pass. The end state is clean and the direction is correct. Previous issues are fully resolved.


Checklist

Code Quality & Standards

  • Follows Basic Memory's coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • No speculative getattrresolved in 85ce163: _RequestSpan adapter deleted entirely; _request_scope now yields a logfire.LogfireSpan directly; all 5 call sites use .set_attributes() with the concrete type
  • request_span: logfire.LogfireSpan | None typed correctly at every call site
  • 100-character line length limit maintained
  • No commented-out code blocks

Testing & Documentation

  • Telemetry tests rewritten to patch logfire.span / logfire.metric_* directly
  • Test fakes updated in test_tool_telemetry.py to yield a _NoopSpan stub with .set_attributes, compatible with the new direct call pattern
  • 100% test coverage maintained
  • Stale pyproject.toml comment removed — resolved in aaf4fcd
  • Manual verification still open: PR checklist item "Verify no regression in trace output when running with LOGFIRE_ENABLED=true" is unchecked — not a blocker, but worth noting

Basic Memory Architecture

  • MCP tools follow atomic, composable design
  • No database changes
  • Local-first architecture preserved
  • Knowledge graph operations unaffected

Security & Performance

  • No secrets or credentials
  • No new input validation gaps
  • Error handling preserved — transport errors still call request_span.set_attributes(_transport_error_span_attrs(e)) before re-raising
  • logfire.metric_counter(name) called per-batch — safe, logfire caches instruments by name internally (OTel MeterProvider)

Compatibility

  • No path separator assumptions
  • No emoji or unicode in console/log output

Issues

1. Minor: Loguru context lost when telemetry is disabled [mcp/tools/utils.py:171–192]

The old _request_scope ran telemetry.contextualize(method=..., client_name=..., operation=..., path_template=...) alongside the span. That bound those fields into Loguru's structural context, making them appear in local dev logs even when logfire was off. The new version only opens a logfire.span, which only propagates context to Loguru via the logfire handler — a handler that isn't wired when logfire_enabled=False (the default).

In practice, logger.debug(f"Calling GET '{url}' params: '{params}'") still captures the URL inline, so the most actionable fields survive. This is a minor debuggability regression for local runs. The PR description explicitly calls this out as intentional trade-off ("the manual logger.contextualize bind ... is gone"), so no fix is required — flagging for awareness only.


2. Nit: Sync context managers in async functions [search_service.py:190–252, write_note.py:152, all MCP tool files]

All with logfire.span(...) usages are sync context managers wrapping await calls. This is correct — Python runs __exit__ after the await resolves — and logfire supports it. The idiomatic form for async code is async with logfire.span(...). Not a bug; surface for a follow-up if the team wants consistency.


Verdict

Approve. All previously flagged issues are addressed:

  • _RequestSpan + speculative getattr → deleted (85ce163)
  • Stale pyproject.toml comment → removed (aaf4fcd)

The simplification is materially positive: telemetry.py 246 → 64 lines, MCP tools lost the noisy contextualize + started_span pairing, and tests cleanly patch logfire directly. Issue 1 above is an acknowledged trade-off; Issue 2 is a style note for later.

Two more simplifications in utils.py the reviewer/PM flagged:

1. _request_scope was a no-op wrapper around logfire.span that added
   nothing — same name, same attributes, one call deep. Inlined at all
   5 call sites (call_get/put/post/patch/delete). The helper is gone.

2. _extract_response_data used `except Exception` to swallow any JSON
   decode failure. That's defensive code for upstream gateway HTML
   error pages (Fly/Cloudflare 5xx); FastAPI itself always emits JSON.
   Replaced with a content-type check: if the response doesn't claim
   `application/json`, return None; otherwise let any decode error
   surface. Matches CLAUDE.md's fail-fast guidance.

Tests updated: MockResponse fixtures now include a
`headers = {"content-type": "application/json"}` attribute so they
pass the new gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 17, 2026

Claude finished @phernandez's task in 1m 26s —— View job


PR Review: refactor(core): rip telemetry wrappers, use logfire directly

Tasks:

  • Gather context and understand the PR
  • Review key changed files
  • Review code quality & standards
  • Review testing & documentation
  • Review architecture concerns
  • Review security & performance
  • Post final review summary

Summary

Fourth review pass, triggered by 7ce47d7 which inlined _request_scope into the five call_* helpers and tightened _extract_response_data. All previously flagged issues are closed. The code is in good shape.


Checklist

Code Quality & Standards

  • Follows Basic Memory's coding conventions in CLAUDE.md
  • Python 3.12+ type annotations — logfire.LogfireSpan | None typed correctly at every call site
  • No speculative getattr_RequestSpan fully deleted, all spans use .set_attributes() on the concrete type
  • SQLAlchemy 2.0 best practices — no ORM changes
  • FastAPI and Typer conventions followed
  • 100-character line length maintained
  • No commented-out code blocks
  • Style nit (carry-forward): with logfire.span(...) in call_get/put/patch/post/delete and all MCP tool files are sync context managers wrapping await calls — correct behavior, but async with logfire.span(...) is idiomatic for async code. Not a bug; noted for a future cleanup pass.

Testing & Documentation

  • Telemetry tests rewritten to patch logfire.span / logfire.metric_* directly
  • _NoopSpan stub in test_tool_telemetry.py yields a concrete .set_attributes() compatible with the new direct-call pattern
  • Stale pyproject.toml comment removed (aaf4fcd)
  • 100% test coverage maintained
  • Manual verification still open: PR checklist item "Verify no regression in trace output when running with LOGFIRE_ENABLED=true" — not a blocker

Basic Memory Architecture

  • MCP tools follow atomic, composable design
  • No database changes
  • Local-first architecture preserved
  • Knowledge graph operations unaffected

Security & Performance

  • No hardcoded secrets or credentials
  • No new input validation gaps
  • Error handling preserved — transport errors still call request_span.set_attributes(_transport_error_span_attrs(e)) before re-raising
  • logfire.metric_counter(name) called per-batch — safe, logfire caches internally by name (OTel MeterProvider)

Compatibility

  • No path separator assumptions (Windows-safe)
  • No emoji or unicode in console/log output

Issues

1. Minor inconsistency in call_patch error handling [mcp/tools/utils.py:469–475]

call_patch doesn't initialize error_message = None at the top of the function like the other four call_* helpers, and its except HTTPStatusError block re-derives response_data and error_message from e.response rather than using the pre-computed variable. This is functionally correct — the exception can only arrive from raise_for_status() where the variables would already be set — but the divergence from the consistent pattern across call_get/put/post/delete is a readability nit. Pre-existing issue, not introduced by this PR.

2. Loguru context lost when telemetry is disabled [mcp/tools/utils.py:176–265]

Acknowledged trade-off from the previous review pass. The old _request_scope bound method/client_name/operation into Loguru's structural context even when logfire was off. The new per-method spans only attach to the logfire handler, which isn't wired when logfire_enabled=False (the default local dev mode). Inline logger.debug(f"Calling GET '{url}' ...") preserves the URL, so the most actionable field survives. Flagged for awareness, no action needed.


Verdict

Approve. All code-level issues from prior review passes are closed. The two items above are a pre-existing style divergence and an acknowledged intentional trade-off. The refactor is materially positive: telemetry.py 246 → 64 lines, MCP tools shed the noisy contextualize + started_span pairing, direct logfire.span / logfire.metric_counter calls are easy to read and test, and the logfire.LogfireSpan type is correctly threaded through all call sites.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7ce47d7113

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1697 to +1700
logfire.metric_histogram(
"vector_sync_prepare_seconds",
prepare_seconds,
unit="s",
backend=backend_name,
skip_only_entity=entity_skipped and embedding_jobs_count == 0,
).record(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reuse metric instruments instead of recreating in hot sync path

_log_vector_sync_complete() runs for every synced entity, but each invocation now calls logfire.metric_histogram(...).record(...) directly, which re-enters instrument registration/lookup on every sample. In large sync runs this adds avoidable lock/contention and CPU overhead in a hot loop (the previous telemetry helper cached metric handles once per process). Hoisting/caching these metric instruments restores the prior constant-cost recording behavior for high-entity indexing workloads.

Useful? React with 👍 / 👎.

Codex review on PR #754 flagged that _log_vector_sync_complete runs
per entity in large sync batches, and each call now goes through
logfire.metric_histogram(...).record(...) — OTel's MeterProvider
lookup is O(1) but still non-free under load. The original
telemetry wrapper cached instruments in a process-wide dict; my
strip lost that.

Restores the cache as two small module-level helpers in
search_repository_base.py (_metric_histogram / _metric_counter)
so instruments are resolved once and reused. Callers switch to
the helpers (11 sites).

Tests clear _METRIC_INSTRUMENTS via monkeypatch so fake factories
win on the first lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 17, 2026

Claude finished @phernandez's task in 1m 48s —— View job


PR Review: refactor(core): rip telemetry wrappers, use logfire directly

Tasks:

  • Gather context and understand the PR
  • Review latest commit (08e06e6: perf/search metric caching)
  • Check all prior issues are resolved
  • Run checklist against current branch state
  • Post final review summary

Summary

Fifth and final review pass, triggered by 08e06e6 which adds module-level metric instrument caching in search_repository_base.py to address the Codex P2 feedback from the previous review. This closes the last substantive code issue. All prior flagged items are now resolved.


Checklist

Code Quality & Standards

  • Follows Basic Memory's coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • No speculative getattr_RequestSpan deleted in 85ce163; all spans call .set_attributes() on the concrete logfire.LogfireSpan type
  • SQLAlchemy 2.0 best practices — no ORM changes
  • FastAPI and Typer conventions followed
  • 100-character line length maintained
  • No commented-out code blocks
  • Style carry-forward: with logfire.span(...) in call_get/put/patch/post/delete and all MCP tool files are sync context managers wrapping await calls — correct behavior, but async with logfire.span(...) is idiomatic for async code. Not a bug; noted for a future cleanup pass.

Testing & Documentation

  • Telemetry tests rewritten to patch logfire.span / logfire.metric_* directly
  • _METRIC_INSTRUMENTS cache correctly cleared in tests via monkeypatch.setattr(search_repository_base_module, "_METRIC_INSTRUMENTS", {}) so fake factories win on first lookup
  • Stale pyproject.toml comment removed (aaf4fcd)
  • 100% test coverage maintained
  • Manual verification still open: PR checklist item "Verify no regression in trace output when running with LOGFIRE_ENABLED=true" — not a blocker

Basic Memory Architecture

  • MCP tools follow atomic, composable design
  • No database changes
  • Local-first architecture preserved
  • Knowledge graph operations unaffected

Security & Performance

  • No hardcoded secrets or credentials
  • No new input validation gaps
  • Error handling preserved — transport errors call request_span.set_attributes(_transport_error_span_attrs(e)) before re-raising
  • 08e06e6 restores constant-cost metric recording: _METRIC_INSTRUMENTS dict caches logfire.metric_histogram/metric_counter lookups; the per-entity hot path in _log_vector_sync_complete no longer enters the OTel MeterProvider on every sample

Compatibility

  • No path separator assumptions (Windows-safe)
  • No emoji or unicode in console/log output

Issues

1. _metric_histogram / _metric_counter return type is Any [search_repository_base.py:49–64]

The caching helpers are typed -> Any because logfire doesn't expose a stable public type for metric instruments in its public API surface. This is defensible — the alternative is importing a private type — but it means callers lose type-checked access to .record() / .add(). Minor nit; not a blocker.

2. call_patch inconsistency (pre-existing) [mcp/tools/utils.py:416, 469–475]

call_patch doesn't initialize error_message = None at the top of the function like the other four call_* helpers, and the except HTTPStatusError block re-derives response_data + error_message from e.response rather than using the pre-computed variables set in the success path. Functionally correct (the exception can only come from raise_for_status() where those variables are already set), but diverges from the consistent pattern across call_get/put/post/delete. Pre-existing; not introduced by this PR.

3. Loguru context lost when telemetry is off (acknowledged trade-off) [mcp/tools/utils.py:176–265]

Carried from prior passes. The old _request_scope bound method/client_name/operation into Loguru's structural context even when logfire_enabled=False (the default). The new per-method spans only attach to the logfire handler, which isn't wired in default local dev mode. Inline logger.debug(f"Calling GET '{url}' ...") preserves the URL, so the most actionable field survives. PR description explicitly calls this out as an intentional trade-off.


Verdict

Approve. All previously flagged issues are closed:

  • Stale pyproject.toml comment → removed (aaf4fcd)
  • Speculative getattr in _RequestSpan → deleted entirely (85ce163)
  • Per-call instrument registration in the hot sync path → restored as _METRIC_INSTRUMENTS cache (08e06e6)

The three items above are: a typing nit (defensible), a pre-existing style divergence in call_patch, and an acknowledged intentional trade-off. None warrant blocking merge. The refactor is materially positive: telemetry.py 246 → 64 lines, MCP tools shed the contextualize + started_span pairing, metric instruments are cached correctly for high-entity sync workloads, and logfire.LogfireSpan is correctly threaded through all call sites.

The previous fix introduced a _METRIC_INSTRUMENTS cache + helper
functions to work around instrument re-creation in the per-entity
hot path. That was the same kind of wrapper we just deleted from
telemetry.py — cache to work around a wrapper around logfire is
not the right shape.

Proper fix: record these metrics where they belong — once per
batch, using the totals VectorSyncBatchResult already accumulates
(prepare_seconds_total, queue_wait_seconds_total, etc.). The
per-entity histogram calls in _log_vector_sync_complete are gone;
that function now only emits the slow-entity warning log.

Batch-level block now:
- shares one `batch_attrs` dict across all recordings (was
  repeated 7x)
- records 5 histograms + 6 counters per batch with direct
  logfire.metric_* calls, no cache, no helpers
- _METRIC_INSTRUMENTS + _metric_histogram + _metric_counter all
  removed

Tests updated: per-entity histogram counts (was 2 per 2-entity
batch) are now 1 per batch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 17, 2026

Claude finished @phernandez's task in 1m 51s —— View job


PR Review: refactor(core): rip telemetry wrappers, use logfire directly

Tasks:

  • Gather context and understand the PR
  • Review latest commit (009d735: batch-level vector-sync metrics)
  • Run checklist against current branch state
  • Post final review summary

Summary

Sixth and final review pass, triggered by 009d735 which moves vector-sync metric recording from the per-entity hot path to once-per-batch using the totals already accumulated in VectorSyncBatchResult. This is a better fix than the _METRIC_INSTRUMENTS cache introduced in 08e06e6 — instead of caching to work around per-entity overhead, it eliminates the overhead entirely by recording at the right level. All prior issues remain closed.


Checklist

Code Quality & Standards

  • Follows Basic Memory's coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • No speculative getattr_RequestSpan deleted in 85ce163; all spans use .set_attributes() on the concrete logfire.LogfireSpan type
  • SQLAlchemy 2.0 best practices — no ORM changes
  • FastAPI and Typer conventions followed
  • 100-character line length maintained
  • No commented-out code blocks
  • Style carry-forward: with logfire.span(...) in call_get/put/patch/post/delete and MCP tool files are sync context managers wrapping await calls — correct behavior, but async with logfire.span(...) is idiomatic for async code. Not a bug; noted for a future cleanup pass.

Testing & Documentation

  • Telemetry tests rewritten to patch logfire.span / logfire.metric_* directly
  • Per-entity histogram count assertions updated from 2→1 to reflect batch-level recording — test comment explains the design decision clearly
  • _METRIC_INSTRUMENTS monkeypatch removed from test (cache no longer exists)
  • Stale pyproject.toml comment removed (aaf4fcd)
  • 100% test coverage maintained
  • Manual verification still open: PR checklist item "Verify no regression in trace output when running with LOGFIRE_ENABLED=true" — not a blocker

Basic Memory Architecture

  • MCP tools follow atomic, composable design
  • No database changes
  • Local-first architecture preserved
  • Knowledge graph operations unaffected

Security & Performance

  • No hardcoded secrets or credentials
  • No new input validation gaps
  • Error handling preserved
  • 009d735 resolves the hot-path overhead correctly: metrics are now O(1) per batch regardless of entity count — no cache needed because recording happens once after the batch loop, not once per entity
  • batch_attrs dict is shared across all 11 metric recordings (was repeated 7× before)

Compatibility

  • No path separator assumptions (Windows-safe)
  • No emoji or unicode in console/log output

Issues

1. Metric name semantics shifted silently [search_repository_base.py:1079–1090]

vector_sync_prepare_seconds (and the other per-phase histograms) now record the batch total (result.prepare_seconds_total) rather than per-entity timing. The metric names don't reflect this change. A consumer looking at vector_sync_prepare_seconds in a dashboard would previously see a per-entity latency distribution; they now see per-batch accumulated totals. Functionally fine — batch totals are arguably more useful for capacity planning — but the name mismatch could confuse future readers. A follow-up rename to vector_sync_batch_prepare_seconds (matching the existing vector_sync_batch_total_seconds) would make this consistent. Not a blocker.

2. call_patch inconsistency (pre-existing) [mcp/tools/utils.py]

call_patch doesn't initialize error_message = None at the top of the function like the other four call_* helpers. Pre-existing; not introduced by this PR.

3. Loguru context lost when telemetry is off (acknowledged trade-off) [mcp/tools/utils.py:176–265]

Carried from prior passes. The old _request_scope bound method/client_name/operation into Loguru's structural context even when logfire_enabled=False (the default). PR description explicitly calls this out as intentional.


Verdict

Approve. 009d735 is the correct fix: moving metrics to batch level is semantically right, eliminates the per-entity overhead without a cache, and removes _METRIC_INSTRUMENTS + its two helper functions. The commit message says it well — "cache to work around a wrapper around logfire is not the right shape." Issue 1 above is a naming nit for a follow-up; Issues 2 and 3 are pre-existing / acknowledged.

Summary across all six commits:

  • aaf4fcd — stale pyproject.toml comment removed ✓
  • 85ce163_RequestSpan + speculative getattr deleted ✓
  • 7ce47d7_request_scope inlined, _extract_response_data tightened ✓
  • 08e06e6_METRIC_INSTRUMENTS cache (intermediate) ✓
  • 009d735 — batch-level metrics replace the cache entirely ✓

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 009d735d75

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1079 to +1080
logfire.metric_histogram("vector_sync_prepare_seconds", unit="s").record(
result.prepare_seconds_total, attributes=batch_attrs
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Split batch totals into distinct vector timing metrics

This now records result.prepare_seconds_total (and the other *_seconds_total values) under metric names like vector_sync_prepare_seconds that previously represented per-entity timings. _log_vector_sync_complete no longer emits per-entity histograms, so existing dashboards/alerts using these names will silently change units based on batch size and produce misleading latency distributions. Please either keep per-entity recording for these names or publish the aggregated values under new metric names.

Useful? React with 👍 / 👎.

"""
if "application/json" not in response.headers.get("content-type", ""):
return None
return response.json()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve ToolError wrapping on invalid JSON responses

Returning response.json() directly here allows JSONDecodeError to escape from call_get/call_post/... before they can raise ToolError, which breaks the helper contract that callers rely on for consistent error handling. This is user-visible when upstream intermediaries or misbehaving servers send malformed payloads with a JSON content type, because tool calls now raise raw parser exceptions instead of actionable ToolError messages.

Useful? React with 👍 / 👎.

@phernandez phernandez merged commit 1b39062 into main Apr 17, 2026
28 checks passed
@phernandez phernandez deleted the refactor/telemetry-direct-logfire branch April 17, 2026 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant