
perf(sync): batch file indexing in core#726

Merged
phernandez merged 9 commits into main from perf/sync-batch-indexing-core
Apr 8, 2026

Conversation

@phernandez
Member

Summary

  • add a typed, reusable core indexing layer with batch planning and bounded-parallel batch execution
  • refactor SyncService.sync() to batch changed file indexing, emit typed progress, and keep forward-ref/vector phases coarse and explicit
  • add bounded concurrent request handling for API-backed embedding batches while keeping fastembed/local flows intact

Testing

  • uv run pytest tests/indexing/test_batching.py tests/indexing/test_batch_indexer.py tests/sync/test_sync_service_batching.py tests/sync/test_sync_service_telemetry.py tests/repository/test_openai_provider.py -q
  • uv run ruff check src/basic_memory/indexing src/basic_memory/sync/sync_service.py src/basic_memory/config.py src/basic_memory/repository/openai_provider.py src/basic_memory/repository/embedding_provider_factory.py src/basic_memory/cli/commands/db.py tests/indexing tests/sync/test_sync_service_batching.py tests/sync/test_sync_service_telemetry.py tests/repository/test_openai_provider.py
  • uv run pyright
  • just fast-check

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 35s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Read new indexing layer files
  • Read modified sync_service.py
  • Read modified config/provider files
  • Read test files
  • Evaluate against review checklist
  • Post final review

Summary

This is a well-architected PR that introduces a typed, reusable batch indexing layer. The separation of concerns is excellent — indexing/models.py, batching.py, and batch_indexer.py form a clean core layer decoupled from filesystem concerns via the IndexFileWriter protocol. The literate programming style (section headers, decision comments) is consistently applied. Several issues worth addressing before merge:


Issues Found

🔴 Dead code: progress_callback in BatchIndexer.index_files is never invoked in the multi-batch sync loop

BatchIndexer.index_files accepts a progress_callback parameter but _index_changed_files in sync_service.py never passes one:

# sync_service.py:505-509
batch_result = await self.batch_indexer.index_files(
    loaded_files,
    max_concurrent=self.app_config.index_entity_max_concurrent,
    parse_max_concurrent=self.app_config.index_parse_max_concurrent,
    # progress_callback is NOT passed here
)

The per-batch IndexProgress(batches_total=1, batches_completed=1) callback in index_files (lines 173–186 of batch_indexer.py) will never fire during sync — progress is emitted at the higher level by _emit_index_progress. Either remove the progress_callback parameter from index_files to reduce confusion, or document that it's only for single-batch direct callers.



🔴 get_file_path_to_permalink_map() is called once per batch, not once per sync

_normalize_markdown_batch calls entity_repository.get_file_path_to_permalink_map() on every batch execution (batch_indexer.py:228). For a project that produces 10 batches of 32 files each, that's 10 full table scans of all existing entities. The reserved permalink set should be initialized once and threaded across batches at the _index_changed_files level, or the method should accept the pre-built map as a parameter.
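The restructuring the review asks for can be sketched as follows. The repository method name mirrors the snippet above; everything else (the stub repository, the batch helpers) is hypothetical scaffolding for illustration:

```python
import asyncio


class FakeEntityRepository:
    """Stand-in repository that counts permalink-map fetches."""

    def __init__(self) -> None:
        self.map_fetches = 0

    async def get_file_path_to_permalink_map(self) -> dict[str, str]:
        self.map_fetches += 1
        return {"notes/a.md": "notes/a"}


async def index_batch(paths: list[str], permalink_by_path: dict[str, str]) -> int:
    # Newly assigned permalinks go into the shared dict so batch N+1 sees them.
    for path in paths:
        permalink_by_path.setdefault(path, path.removesuffix(".md"))
    return len(paths)


async def index_changed_files(repo: FakeEntityRepository, batches: list[list[str]]) -> int:
    # Fetch the reserved-permalink map ONCE per sync, not once per batch.
    permalink_by_path = await repo.get_file_path_to_permalink_map()
    indexed = 0
    for batch in batches:
        # Every batch receives the pre-built map instead of re-scanning the table.
        indexed += await index_batch(batch, permalink_by_path)
    return indexed


repo = FakeEntityRepository()
total = asyncio.run(index_changed_files(repo, [["notes/b.md"], ["notes/c.md"]]))
print(total, repo.map_fetches)  # 2 1: two files indexed, one table scan
```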



🟡 Sequential await loop in _resolve_batch_relations for relation fetching

The unresolved-relations fetch phase is sequential:

# batch_indexer.py:424-430
for entity_id in entity_ids:
    unresolved_relations.extend(
        await self.relation_repository.find_unresolved_relations_for_entity(entity_id)
    )

Resolution uses asyncio.gather (correctly), but the fetch phase does not. For a batch with many entity IDs this adds N sequential DB round-trips before any concurrent work begins. The fetch can also be done with asyncio.gather.
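A sketch of the gather-based fetch. The repository method name follows the snippet above; the stub class is hypothetical:

```python
import asyncio


class StubRelationRepository:
    async def find_unresolved_relations_for_entity(self, entity_id: int) -> list[str]:
        # Pretend each entity has one unresolved relation.
        return [f"rel-{entity_id}"]


async def fetch_unresolved(repo: StubRelationRepository, entity_ids: list[int]) -> list[str]:
    # Issue all fetches concurrently instead of awaiting them one by one.
    per_entity = await asyncio.gather(
        *(repo.find_unresolved_relations_for_entity(eid) for eid in entity_ids)
    )
    # Flatten the per-entity lists; gather preserves entity order.
    return [rel for rels in per_entity for rel in rels]


print(asyncio.run(fetch_unresolved(StubRelationRepository(), [1, 2, 3])))
# ['rel-1', 'rel-2', 'rel-3']
```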



🟡 include_created_at=existing is None captures mutated variable

In _upsert_regular_file, existing is reassigned inside the except IntegrityError block. The entity_repository.update() call after the try/except evaluates existing is None based on the post-mutation value, which happens to produce correct behavior — but it's fragile:

# batch_indexer.py:398-404
# At this point `existing` may have been reassigned in the except block above
updated = await self.entity_repository.update(
    entity_id,
    self._entity_metadata_updates(file, checksum, include_created_at=existing is None),
)

Consider capturing a boolean flag (is_new_entity = existing is None) before the try block and using that instead.
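A runnable miniature of that suggestion. The IntegrityError stand-in and the upsert shape are illustrative, not the project's code; the point is that is_new_entity is frozen before the except block can reassign existing:

```python
class IntegrityError(Exception):
    """Stand-in for the DB driver's unique-constraint error."""


def classify_upsert(existing, race_loser: bool = False):
    """Shows freezing `is_new_entity` before `existing` can be reassigned."""
    is_new_entity = existing is None  # captured BEFORE any mutation
    try:
        if existing is None:
            if race_loser:
                raise IntegrityError  # a concurrent insert won the race
            existing = {"id": 1}
    except IntegrityError:
        existing = {"id": 2}  # re-fetch the row the winner created
    # include_created_at reflects the original decision, not the mutated variable.
    return {"entity": existing, "include_created_at": is_new_entity}


print(classify_upsert(None, race_loser=True))
# {'entity': {'id': 2}, 'include_created_at': True}
```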



🟡 Missing integration tests in test-int/

All new tests are unit tests under tests/. Per CLAUDE.md and project conventions, integration tests (real implementations, no mocks) belong in test-int/. The sync batching behavior — particularly the full multi-batch flow with real files, forward-ref resolution across batches, and circuit-breaker interaction — is not covered by an integration test. Only tests/sync/test_sync_service_batching.py exists, and it contains a single test.

Key missing scenarios:

  • Files that fail loading (FileNotFoundError) are handled, circuit breaker fires after threshold
  • Files larger than index_batch_max_bytes routed to their own single-file batch
  • _load_index_file_metadata and _load_index_batch_files load error paths

🟡 test_sync_service_batching.py is sparse

The single test covers the happy path. There are no tests for:

  • Empty changed_paths list (the early-return branch in _index_changed_files)
  • Files that were pre-skipped by the circuit breaker
  • Metadata load errors
  • Batch load errors

🟢 Minor: batching.py line 55 redundant flush condition

After the overflow flush on lines 38–42, line 55 also checks current_bytes >= max_bytes. A file that brings total bytes exactly to max_bytes would not flush on line 39 (> max_bytes is strict) but would flush on line 55. This is correct behavior, but consider >= on line 39 to unify both checks and make the intent clearer.
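A runnable sketch of the planner logic with >= in both places, as suggested. Function and parameter names are illustrative, not the project's:

```python
def plan_batches(sizes: list[int], max_files: int = 32, max_bytes: int = 1000) -> list[list[int]]:
    """Greedy planner over file sizes; flushes when a budget is met or exceeded."""
    batches: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    for size in sizes:
        # Pre-add overflow flush: `>=` means a file landing exactly on the byte
        # budget flushes here, in the same place as one that overshoots it.
        if current and current_bytes + size >= max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
        # Post-add flush: both budget checks now use `>=`, so intent reads uniformly.
        if len(current) >= max_files or current_bytes >= max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        batches.append(current)
    return batches


print(plan_batches([600, 400, 600], max_bytes=1000))  # [[600], [400], [600]]
```

Note that with >= on the pre-add check, a file landing exactly on the byte budget starts a fresh batch rather than completing the current one; either grouping is valid, and the gain is that both flush conditions read the same.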


Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | slots=True dataclasses, type alias in factory |
| Async patterns / SQLAlchemy 2.0 | | Correct asyncio.gather + Semaphore usage |
| FastAPI/Typer conventions | | |
| 100-char line length | | |
| No commented-out code | | |
| Unit tests for new functions | | |
| Integration tests for new paths | ⚠️ | Missing test-int/ coverage for batch sync |
| 100% test coverage | ⚠️ | Some _upsert_regular_file branches likely uncovered |
| MCP tools atomic/composable | N/A | No MCP changes |
| No DB migration needed | | Config fields only |
| Local-first architecture preserved | | |
| No hardcoded secrets/credentials | | |
| Input validation | | max_files <= 0 / max_bytes <= 0 guards |
| No sensitive data in logs | | |
| Windows path compatibility | | as_posix() used consistently in scan paths |
| No emojis/unicode in console/log output | | New code uses ASCII only; the flagged character in db.py:251 is pre-existing |


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f8d017529

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +545 to +547
except Exception as exc:
errors[path] = str(exc)
logger.warning("Batch indexing failed", path=path, error=str(exc))


P1 Badge Re-raise fatal sync exceptions in batch workers

_run_bounded currently downgrades every worker exception to a recoverable per-file error, which swallows SyncFatalError from repository writes (for example when a project is deleted and FK checks fail during sync). The pre-batch path explicitly propagated fatal sync errors to stop the run, but this code now lets sync continue and potentially update scan watermark after partial failures, leaving state inconsistent. Please allow fatal exceptions to propagate instead of handling them as normal batch errors.
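The requested behavior can be sketched like this. SyncFatalError and the worker shape are modeled on the review's description, not copied from the codebase:

```python
import asyncio
from collections.abc import Awaitable


class SyncFatalError(Exception):
    """Unrecoverable sync error (e.g. FK failure after a project deletion)."""


async def run_bounded(tasks: dict[str, Awaitable[None]], max_concurrent: int = 4) -> dict[str, str]:
    semaphore = asyncio.Semaphore(max_concurrent)
    errors: dict[str, str] = {}

    async def worker(path: str, awaitable: Awaitable[None]) -> None:
        async with semaphore:
            try:
                await awaitable
            except SyncFatalError:
                raise  # fatal: let it abort the whole sync run
            except Exception as exc:
                errors[path] = str(exc)  # recoverable: record and continue

    await asyncio.gather(*(worker(p, a) for p, a in tasks.items()))
    return errors


async def ok() -> None:
    return None


async def bad() -> None:
    raise ValueError("parse failure")


async def fatal() -> None:
    raise SyncFatalError("project deleted mid-sync")


print(asyncio.run(run_bounded({"a.md": ok(), "b.md": bad()})))  # {'b.md': 'parse failure'}
try:
    asyncio.run(run_bounded({"c.md": fatal()}))
except SyncFatalError as exc:
    print(f"aborted: {exc}")  # aborted: project deleted mid-sync
```

The __cause__ check that the eventual fix adds (for fatal errors wrapped by an intermediate layer) would slot into the first except clause.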


Comment on lines +590 to +594
files[path] = IndexInputFile(
path=metadata.path,
size=metadata.size,
checksum=metadata.checksum,
content_type=metadata.content_type,


P2 Badge Recompute checksum from loaded bytes before upsert

This carries forward the checksum captured during scan() instead of deriving it from the bytes just read for indexing. If a file changes between scan and batch load, the upsert path can write new content/mtime while persisting an old checksum; because mtime and size may already match afterward, later scans may never correct the checksum. That leaves persistent checksum drift and can break checksum-based move detection. Compute checksum from content (or recompute at write time) rather than trusting scan-time checksum here.
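A minimal sketch of the suggested fix, assuming the checksum helper hashes the raw bytes (SHA-256 here is an assumption; the thread does not show the project's actual hash):

```python
import hashlib


def compute_checksum(content: bytes) -> str:
    # Hypothetical stand-in for the project's checksum helper (SHA-256 assumed).
    return hashlib.sha256(content).hexdigest()


def build_index_input(path: str, scan_checksum: str, content: bytes) -> dict[str, str]:
    # Derive the checksum from the bytes just loaded, not the earlier scan();
    # a file edited between scan and batch load would otherwise persist a stale hash.
    loaded_checksum = compute_checksum(content)
    return {"path": path, "checksum": loaded_checksum}


stale = compute_checksum(b"old contents")
record = build_index_input("notes/a.md", stale, b"new contents")
print(record["checksum"] == stale)  # False: the stored checksum tracks the loaded bytes
```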


Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 16s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit changes (2f3cf95)
  • Read updated files
  • Evaluate against review checklist
  • Post final review

New Commit: 2f3cf95 — Return final markdown from batch indexer

This follow-up commit is clean and well-motivated. It closes a real gap: before this change, callers of BatchIndexer.index_files could not get the final on-disk content of a file after frontmatter was rewritten, forcing a re-read. The fix is correct — _normalize_markdown_file now threads final_content alongside final_checksum so both reflect the actual written state.

What looks good

  • Typed IndexFrontmatterWriteResult replaces the raw str return from write_frontmatter. Protocol signature is now self-documenting.
  • final_content tracked through rewrite chain — both the no-frontmatter enforcement and permalink update branches correctly set final_checksum and final_content together. Returning stale content after a frontmatter rewrite is now impossible.
  • update_frontmatter backward-compat shim — delegating to update_frontmatter_with_result is the right approach; existing callers don't need to change.
  • search_content=remove_frontmatter(prepared.content) — using the final (post-rewrite) content for search indexing is more correct than prepared.markdown.content, which reflected the pre-rewrite parse.
  • Tests cover both cases — no-rewrite test asserts markdown_content == original_content, and the permalink-conflict test asserts markdown_content equals the file's actual on-disk content after rewrite.

Issues in this commit

🟡 Dual identical result types with a manual conversion

FrontmatterUpdateResult (file_service.py:30) and IndexFrontmatterWriteResult (models.py:58) are structurally identical — both have checksum: str and content: str. The adapter converts between them:

# sync_service.py:135-140
result = await self.file_service.update_frontmatter_with_result(update.path, update.metadata)
return IndexFrontmatterWriteResult(checksum=result.checksum, content=result.content)

This boilerplate exists to keep the indexing/ layer free of a FileService import. That's a valid layering concern, but a # Why: keeps indexing/ decoupled from file_service comment here would make the intent obvious rather than looking like a redundant copy.


🟡 remove_frontmatter called redundantly on every markdown file

search_content=remove_frontmatter(prepared.content) is called even when no frontmatter rewrite occurred and prepared.markdown.content (the already-parsed body, set by EntityMarkdown) is available. The markdown parser already stripped frontmatter during _prepare_markdown_file. Prefer prepared.markdown.content when it is populated, falling back to remove_frontmatter(prepared.content) only when the parsed body is unavailable, to avoid the re-parse on the common path.


Still-open issues from the first review

The following are not addressed by this commit but remain relevant before merge:

| Severity | Issue | Status |
| --- | --- | --- |
| 🔴 | progress_callback in index_files is never invoked from _index_changed_files | Open |
| 🔴 | get_file_path_to_permalink_map() called once per batch instead of once per sync | Open |
| 🟡 | Sequential await loop in _resolve_batch_relations fetch phase (lines 434–437) | Open |
| 🟡 | existing is None evaluated after except IntegrityError mutates existing (line 411) | Open |
| 🟡 | No test-int/ integration coverage for multi-batch sync | Open |

Still-open Codex review issues

| Priority | Issue | Status |
| --- | --- | --- |
| P1 | _run_bounded catches all exceptions, swallowing SyncFatalError from repository writes (line 554) | Open |
| P2 | Checksum stored in IndexedEntity comes from scan time; if the file changes between scan and load, the checksum can drift | Open |

Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | slots=True dataclasses throughout |
| Async patterns / SQLAlchemy 2.0 | | |
| 100-char line length | | |
| No commented-out code | | |
| Unit tests for new behavior | | No-rewrite and rewrite cases covered |
| Integration tests | ⚠️ | Still missing test-int/ batch sync coverage |
| 100% test coverage | ⚠️ | # pragma: no cover on except in update_frontmatter_with_result — acceptable |
| No hardcoded secrets | | |
| No sensitive data in logs | | |
| Backward compatibility of update_frontmatter | | Shim delegates correctly |
| IndexFileWriter protocol updated consistently | | Test writer, prod writer, and protocol all updated |


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f3cf950c0


entity_id=updated.id,
checksum=prepared.final_checksum,
content_type=prepared.file.content_type,
search_content=remove_frontmatter(prepared.content),


P1 Badge Derive markdown search content from parsed markdown

Building search_content with remove_frontmatter(prepared.content) can fail on files that the parser already accepted (for example, markdown starting with --- but malformed/non-frontmatter delimiters). EntityParser.parse_markdown_content() is explicitly tolerant of malformed YAML and still produces usable EntityMarkdown, but this later remove_frontmatter() call can raise and mark the file as a batch error after the entity was already upserted. That regresses sync robustness for malformed-but-indexable notes; use the parsed markdown body (prepared.markdown.content) instead of reparsing raw text.


Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 3m 33s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Read new commit changes (49a40be, d986c4d)
  • Check which open issues from prior reviews are addressed
  • Evaluate against review checklist
  • Post final review

New Commits: 49a40be + d986c4d — Address batch indexing review issues

These two commits together close nearly all outstanding issues from both Claude and Codex reviews. The fixes are targeted and correct. A few minor points below.


Issues Resolved ✅

| Prior Issue | Resolution |
| --- | --- |
| 🔴 progress_callback dead code in index_files | Removed; replaced with existing_permalink_by_path parameter |
| 🔴 get_file_path_to_permalink_map() called once per batch | Fetched once before the loop in _index_changed_files; passed as a shared mutable dict |
| 🟡 Sequential await loop in _resolve_batch_relations (both fetch + count phases) | Both replaced with asyncio.gather |
| 🟡 existing is None evaluated after IntegrityError mutates existing | is_new_entity flag captured before the try block |
| 🟡 Missing test-int/ integration coverage | test_sync_batching_integration.py added with 2 real-service scenarios |
| P1: _run_bounded swallows SyncFatalError | Re-raises when isinstance(exc, SyncFatalError) or isinstance(exc.__cause__, SyncFatalError) |
| P2: Checksum stored from scan time | Now recomputed via compute_checksum(content) from loaded bytes |
| 🟡 remove_frontmatter called redundantly on every markdown file | Replaced with prepared.markdown.content if prepared.markdown.content is not None else prepared.content |
| 🟡 FrontmatterUpdateResult/IndexFrontmatterWriteResult adapter lacks explanation | # Why: comment added to _FileServiceIndexWriter.write_frontmatter |
| P1 Codex: remove_frontmatter can raise on malformed-but-parseable files | Addressed by the markdown content fallback; unit test added |

Also: the shared_permalink_by_path dict is passed by reference and mutated in-place by _normalize_markdown_batch (via existing_permalink_by_path[path] = normalized[path].markdown.frontmatter.permalink), an elegant way to propagate newly assigned permalinks across batches without an extra DB round-trip per batch.


New Issues Found

🟡 except FileNotFoundError on sync_service.py:629 is dead code

After d986c4d reverted the FileNotFoundError re-raise from file_service.py, read_file_bytes now wraps all exceptions — including FileNotFoundError — in FileOperationError. The handler at line 629 can never fire from read_file_bytes. The subsequent except FileOperationError (line 631) is the correct path, and it already handles the FileNotFoundError case via isinstance(exc.__cause__, FileNotFoundError).

# sync_service.py:629-638
except FileNotFoundError:          # ← dead: read_file_bytes always wraps in FileOperationError
    await self.handle_delete(path)
except FileOperationError as exc:  # ← this is the actual path for missing files
    if isinstance(exc.__cause__, FileNotFoundError):
        await self.handle_delete(path)
    ...

The dead handler is harmless but misleads readers into thinking FileNotFoundError can surface unwrapped. Remove it, or add a comment like # safety net: kept in case callers other than read_file_bytes are added here.



🟢 Minor: batching.py line 55 — == is correct but relies on unstated invariant

The change from current_bytes >= max_bytes to current_bytes == max_bytes is logically equivalent given that the overflow-flush on lines 38–43 ensures current_bytes <= max_bytes always holds after adding a file. However, the reviewer's suggestion was to change > to >= on line 39 (to unify both checks), not to make line 55 more restrictive. The == check is less defensive: if the invariant ever breaks (e.g., a file with negative size), an over-budget batch would not flush. The prior >= was a free safety net. Consider reverting line 55 to >= or adding a comment:

# current_bytes <= max_bytes is invariant here (overflow case handled above),
# so == is equivalent to >=, but >= is kept for defensive clarity.
if len(current_paths) >= max_files or current_bytes >= max_bytes:

Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | |
| Async patterns / SQLAlchemy 2.0 | | asyncio.gather used correctly throughout |
| 100-char line length | | |
| No commented-out code | | |
| Unit tests for new behavior | | Malformed frontmatter + fatal error re-raise tests added |
| Integration tests | | test-int/test_sync_batching_integration.py added with circuit-breaker and cross-batch forward-ref tests |
| 100% coverage | | Dead except FileNotFoundError may show as uncovered — worth checking |
| SyncFatalError propagation | | Correct; handles both direct and __cause__ wrapping |
| Checksum from loaded bytes | | Scan-time drift fixed |
| Permalink map fetched once per sync | | Shared mutable dict threaded across batches correctly |
| search_content from parsed body | | Malformed-frontmatter regression closed |
| is_new_entity flag | | Post-mutation evaluation eliminated |
| No hardcoded secrets | | |
| Windows path compatibility | | |
| No emojis/unicode in logs | | |

Overall: This is in good shape. The one actionable item is the dead except FileNotFoundError handler — it should be removed to keep the error-handling contract clear. The batching.py == vs >= is a style nit. Ready to merge once the dead handler is cleaned up.

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 3m 29s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commits (49a40be, d986c4d)
  • Read updated files
  • Evaluate against review checklist
  • Post final review

New Commits: 49a40be + d986c4d

These two follow-up commits address nearly every issue from the previous reviews. The implementation is now in significantly better shape.


What was resolved

| Issue | Status |
| --- | --- |
| 🔴 progress_callback dead code in index_files | ✅ Removed entirely |
| 🔴 get_file_path_to_permalink_map() called once per batch | ✅ Pre-fetched once in _index_changed_files, passed as existing_permalink_by_path |
| 🟡 Sequential await loop in _resolve_batch_relations | ✅ Both fetch and remaining-count phases now use asyncio.gather |
| 🟡 existing is None evaluated after except IntegrityError mutates existing | is_new_entity flag captured before the try block |
| 🟡 No test-int/ integration coverage | ✅ Added test-int/test_sync_batching_integration.py with two integration tests |
| 🟡 test_sync_service_batching.py sparse coverage | ✅ Added tests for empty changed_paths, circuit breaker, metadata errors, batch load errors |
| 🔴 P1: _run_bounded swallows SyncFatalError | ✅ Fixed — SyncFatalError and __cause__ re-raised |
| 🔴 P2: Checksum from scan time can drift | ✅ Fixed — loaded_checksum = await compute_checksum(content) computed from actual bytes |
| 🟡 Codex P1: remove_frontmatter on parsed body | ✅ Fixed — now uses prepared.markdown.content with fallback |
| 🟡 Adapter comment missing | ✅ Added # Why: comment on the FrontmatterUpdateResult → IndexFrontmatterWriteResult conversion |

One new issue

🟡 except FileNotFoundError at sync_service.py:629 is now dead code

Commit d986c4d changed file_service.read_file_bytes to wrap ALL errors (including FileNotFoundError) in FileOperationError. FileOperationError is not a subclass of FileNotFoundError, so the bare handler on line 629 can never be reached:

# sync_service.py:629-630
except FileNotFoundError:           # ← dead: read_file_bytes never raises this bare
    await self.handle_delete(path)
except FileOperationError as exc:   # ← the actual path
    if isinstance(exc.__cause__, FileNotFoundError):
        await self.handle_delete(path)

Either remove the dead handler or add a comment explaining it's a defensive net.



Architecture observation (no action needed)

The shared_permalink_by_path dict is passed by reference to _normalize_markdown_batch and mutated in-place (line 228 of batch_indexer.py). This correctly propagates new permalinks from batch N into the reserved set for batch N+1. The in-place mutation is intentional — but a one-line comment at the call site in _index_changed_files noting "mutated in-place to accumulate cross-batch permalink reservations" would help future readers understand why the dict isn't re-fetched between batches.


Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | slots=True dataclasses, type alias, full annotations |
| Async patterns / SQLAlchemy 2.0 | | asyncio.gather used consistently now in both relation fetch phases |
| FastAPI/Typer conventions | | |
| 100-char line length | | |
| No commented-out code | | |
| Unit tests for new functions | | Comprehensive coverage of edge cases added |
| Integration tests | | Two integration tests in test-int/ covering multi-batch + circuit breaker |
| 100% test coverage | | Codex P1 search_content fix covered by new test_batch_indexer_uses_parsed_markdown_body_for_malformed_frontmatter_delimiters test |
| MCP tools atomic/composable | N/A | No MCP changes |
| No DB migration needed | | Config fields only |
| Local-first architecture preserved | | |
| No hardcoded secrets/credentials | | |
| Input validation | | Guards for max_files <= 0, max_bytes <= 0 |
| No sensitive data in logs | | Large per-entity logger.info blocks removed |
| Windows path compatibility | | as_posix() used consistently |
| No emojis/unicode in console/log output | | |
| SyncFatalError propagation | | Both isinstance(exc, SyncFatalError) and isinstance(exc.__cause__, SyncFatalError) checked |
| Checksum consistency | | Derived from actual loaded bytes, not scan-time metadata |

This PR is in good shape. The one dead-code handler is a minor cleanup item, not a blocker.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 24s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit (cdffd87 — feat(core): support note-level embedding opt-out)
  • Read updated/added files
  • Check status of previously-open issues
  • Evaluate against review checklist
  • Post final review

New Commit: cdffd87 — Support note-level embedding opt-out

This commit adds per-note opt-out from semantic embedding via embed: false in frontmatter metadata. The core design is solid: a static _entity_embeddings_enabled helper with careful string/bool/int coercion, a _clear_entity_vectors cleanup path, and the opt-out gate wired into both single-entity and batch sync paths. Tests cover single sync, batch sync with mixed opt-in/opt-out, and FTS participation after opt-out.
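The coercion described above might look like the following sketch. The rules beyond what the review states (bool, normalized str, int/float, unknown defaults to enabled) are assumptions, and the function name is illustrative (the project's helper is _entity_embeddings_enabled):

```python
def embeddings_enabled(value: object) -> bool:
    """Interpret a frontmatter `embed:` value; anything unrecognized stays opted in."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        # YAML may deliver "false", "no", "off", "0" etc. as plain strings.
        return value.strip().lower() not in {"false", "no", "off", "0"}
    if isinstance(value, (int, float)):
        return bool(value)
    return True  # unknown types default to enabled


print([embeddings_enabled(v) for v in (False, "False", "no", 0, 1, None, "yes")])
# [False, False, False, False, True, True, True]
```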

One new issue

🔴 reindex_vectors bypasses the service-level opt-out

reindex_vectors() calls self.repository.sync_entity_vectors_batch(entity_ids) directly — the repository layer, which knows nothing about embed: false. The service-level opt-out in SearchService.sync_entity_vectors_batch is never invoked:

# search_service.py:505 — calls repository directly, skips opt-out filter
batch_result = await self.repository.sync_entity_vectors_batch(
    entity_ids,
    progress_callback=progress_callback,
)

A user who sets embed: false on a note and later runs basic-memory db reindex-vectors (or triggers CLI reindex) will have that note re-embedded, silently ignoring the opt-out. The fix is one word:

batch_result = await self.sync_entity_vectors_batch(
    entity_ids,
    progress_callback=progress_callback,
)



Minor issues

🟡 Sequential _clear_entity_vectors loop in sync_entity_vectors_batch

# search_service.py:465-466
for entity_id in opted_out_ids:
    await self._clear_entity_vectors(entity_id)

This is the same sequential-await pattern fixed in _resolve_batch_relations in the prior round. For a project with many opted-out notes this adds N sequential DB round-trips. Replace with asyncio.gather:

await asyncio.gather(*(self._clear_entity_vectors(eid) for eid in opted_out_ids))


🟡 db.py:128 startup backfill also bypasses opt-out

db.py:128 constructs a search_repository directly and calls sync_entity_vectors_batch without going through SearchService. This is the startup embedding backfill path. Unlike reindex_vectors, SearchService may not be available there, but the behavior is worth documenting: opted-out notes will be re-embedded on every startup backfill cycle if they don't already have vectors. Consider either threading the entity metadata check into the backfill, or adding a comment acknowledging the known limitation.


Still-open issue from prior reviews

🟡 Dead except FileNotFoundError at sync_service.py:629 — still not addressed

This handler was flagged in the third review and remains in d986c4d. The read_file_bytes contract wraps all exceptions (including FileNotFoundError) in FileOperationError, making the bare handler unreachable:

# sync_service.py:629-630 — unreachable
except FileNotFoundError:
    await self.handle_delete(path)
except FileOperationError as exc:   # <- actual path for missing files
    if isinstance(exc.__cause__, FileNotFoundError):
        await self.handle_delete(path)

Remove it, or add a # Safety net: ... comment.



Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | @staticmethod, clean type guards |
| Async patterns / SQLAlchemy 2.0 | | Correct execute_query usage |
| 100-char line length | | Line 599 (if isinstance(self.repository, SearchRepositoryBase) and not self.repository._semantic_enabled) is ~100 chars — borderline, within limit |
| No commented-out code | | |
| Unit tests for new behavior | | Single sync, batch mixed, FTS-not-excluded all covered |
| Integration tests | | Prior integration tests intact |
| 100% test coverage | ⚠️ | reindex_vectors with an opted-out entity has no test — will show as a gap once opt-out is threaded through |
| embed: false coercion correctness | | bool, str (normalized), int/float, unknown-default-true |
| Postgres/SQLite parity in _clear_entity_vectors | | SQLite explicitly deletes the vec table; Postgres relies on ON DELETE CASCADE — correct |
| No sensitive data in logs | | |
| Local-first architecture preserved | | |

Overall: One clear bug (reindex_vectors bypassing opt-out) that should be fixed before merge. The sequential clear loop and db.py backfill gap are lower priority. The dead except FileNotFoundError handler is a cleanup carry-over from the previous round.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 17s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit (1207e8c — fix(core): respect embedding opt-outs during reindex)
  • Read updated files
  • Check status of previously-open issues
  • Evaluate against review checklist
  • Post final review

New Commit: 1207e8c — Respect embedding opt-outs during reindex

This commit closes three of the four open issues from the prior round cleanly.


Issues Resolved ✅

| Prior Issue | Resolution |
| --- | --- |
| 🔴 reindex_vectors calls self.repository.sync_entity_vectors_batch directly, bypassing opt-out | Fixed: changed to self.sync_entity_vectors_batch — routes through the service-level filter |
| 🟡 Sequential _clear_entity_vectors loop in sync_entity_vectors_batch | Fixed: replaced with asyncio.gather |
| 🟡 Dead except FileNotFoundError handler in sync_service.py | Fixed: handler removed from _load_index_batch_files |

The fix to reindex_vectors is exactly the one-word change suggested in the prior review. The new test test_reindex_vectors_respects_embed_opt_out correctly validates the delegation: it mocks sync_entity_vectors_batch on the service instance and asserts that reindex_vectors invokes it with all entity IDs (the opt-out filtering is delegated to sync_entity_vectors_batch, which has its own test coverage). The assertion sync_batch.assert_awaited_once_with([41, 42], progress_callback=None) correctly captures the routing.


Still-Open Issue

🟡 Startup backfill at db.py:128 still calls the repository directly

The _run_semantic_embedding_backfill function in src/basic_memory/db.py constructs a SearchRepository and calls sync_entity_vectors_batch without going through SearchService:

# src/basic_memory/db.py:123-128
search_repository = repository_class(
    session_maker,
    project_id=project_id,
    app_config=app_config,
)
batch_result = await search_repository.sync_entity_vectors_batch(entity_ids)  # ← bypasses opt-out

This is the startup backfill that fires when search_vector_chunks is empty. SearchService is not available at this call site, so opted-out notes will be embedded during the first-run backfill even if they have embed: false. On subsequent syncs, sync_entity_vectors_batch (service-level) would clear those vectors — so the drift is eventually corrected, but there's a window where opted-out notes have stale vectors.

This was flagged as a known limitation in the prior review. The options remain: thread an entity-metadata check into the backfill, or add a # Why: startup backfill has no SearchService; opted-out entities will be corrected on the next sync comment to document the limitation explicitly.
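The first option could look roughly like the following pre-backfill filter. The `entity_metadata` attribute and the `"embed"` key are assumptions inferred from the opt-out semantics described in this review, not the project's actual schema:

```python
from types import SimpleNamespace

# Hypothetical pre-backfill filter: drop entities whose metadata opts out
# of embedding (embed: false) before handing IDs to the repository.
def eligible_for_backfill(entities) -> list[int]:
    return [
        e.id
        for e in entities
        if (e.entity_metadata or {}).get("embed", True) is not False
    ]

entities = [
    SimpleNamespace(id=1, entity_metadata={"embed": False}),  # opted out
    SimpleNamespace(id=2, entity_metadata=None),              # default: embed
    SimpleNamespace(id=3, entity_metadata={"embed": True}),
]
print(eligible_for_backfill(entities))  # → [2, 3]
```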




Checklist Results

Category Status Notes
Python 3.12+ type annotations
Async patterns / SQLAlchemy 2.0 asyncio.gather in _clear_entity_vectors loop is correct
100-char line length
No commented-out code
Unit tests for new behavior test_reindex_vectors_respects_embed_opt_out correctly validates routing via service
Integration tests Prior test-int/ coverage intact
100% test coverage
embed: false opt-out respected in reindex_vectors Fixed
_clear_entity_vectors concurrent Fixed
Dead except FileNotFoundError removed Fixed
Startup backfill opt-out ⚠️ db.py:128 still bypasses SearchService; known limitation

Overall: All 🔴 issues are closed. This PR is effectively ready to merge. The db.py:128 backfill gap is a pre-existing limitation with a bounded impact (vectors will be corrected on the next sync) — a comment documenting the trade-off is sufficient.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1207e8c664

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +471 to +475
eligible_entity_ids = [
entity_id
for entity_id in entity_ids
if entity_id in entities_by_id and entity_id not in opted_out_ids
]


P2 Badge Preserve cleanup for missing IDs in vector batch sync

sync_entity_vectors_batch() now drops IDs that are not returned by entity_repository.find_by_ids() before delegating to repository.sync_entity_vectors_batch(). This skips the repository’s missing-entity cleanup path (which deletes stale chunk/embedding rows when no search_index rows exist), so a stale/deleted entity ID in the input list will no longer have its vector rows cleaned up. This regression appears when callers pass outdated IDs (for example from stale queues or races), and leaves orphaned vector data accumulating over time.

Useful? React with 👍 / 👎.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 1m 58s —— View job


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit (fe9fc7e — fix(core): remove automatic embedding backfill)
  • Read updated files
  • Check status of previously-open issues
  • Evaluate against review checklist
  • Post final review

New Commit: fe9fc7e — Remove automatic embedding backfill

This commit closes the last open issue from the previous round with a clean, decisive approach: instead of threading the SearchService opt-out check into the startup backfill, the backfill is removed entirely.

What was removed

  • _needs_semantic_embedding_backfill() and _run_semantic_embedding_backfill() from db.py
  • The migration-time backfill check from run_migrations()
  • _background_embedding_backfill() and its asyncio.create_task() call from mcp/server.py
  • The shutdown-time backfill task cancellation
  • 225 lines of test coverage for the now-deleted code

What replaced it

The warning message in _log_embedding_status now says:

"Run 'bm reindex --embeddings' to build them."

This is a better design than patching the backfill: simpler startup path, no background task lifecycle to manage, no bypass of the opt-out check, and a clear user-facing action when embeddings are missing.

Remaining code is clean

  • run_migrations() ends correctly at the init_search_index() call; the except block is intact
  • lifespan() in server.py is now 20 lines shorter with no dangling task references
  • _log_embedding_status() still runs synchronously at startup as a status check (correct)

Previously-Open Issue Resolution

Prior Issue Status
🟡 db.py:128 startup backfill bypasses SearchService opt-out ✅ Fixed — entire backfill removed; users directed to bm reindex --embeddings

One Still-Open Item (Codex P2, from 1207e8c)

sync_entity_vectors_batch() silently drops stale IDs without cleanup

The service-level filter in sync_entity_vectors_batch (search_service.py:471–475) builds eligible_entity_ids from IDs that exist in find_by_ids(). Entity IDs not returned by find_by_ids() (i.e., already deleted from the entity table) are silently dropped — neither opted-out nor delegated to the repository. The repository-level sync_entity_vectors_batch previously handled this by cleaning up orphaned search_index and search_vector_chunks rows for missing entities.

# search_service.py:471-475
eligible_entity_ids = [
    entity_id
    for entity_id in entity_ids
    if entity_id in entities_by_id and entity_id not in opted_out_ids
    # IDs not in entities_by_id are silently discarded — no stale row cleanup
]

Practical impact is low given actual call sites:

  • reindex_vectors() seeds its IDs from find_all() — only live entities — so stale IDs can't appear there
  • BatchIndexer calls sync_entity_vectors_batch only with freshly-upserted entity IDs

However, direct callers of the service method with stale IDs (e.g. from a race condition between delete and reindex) would accumulate orphaned vector rows indefinitely. A defensive fix would pass unknown IDs through to the repository for cleanup:

unknown_ids = [eid for eid in entity_ids if eid not in entities_by_id]
if unknown_ids:
    await self.repository.sync_entity_vectors_batch(unknown_ids)  # cleans up stale rows

This is not a blocker given the current call sites, but worth noting for a follow-up.



Checklist Results

Category Status Notes
Python 3.12+ type annotations
Async patterns / SQLAlchemy 2.0
100-char line length
No commented-out code
Unit tests for removed behavior 225 lines of now-irrelevant tests correctly deleted
Integration tests Prior test-int/ coverage intact
100% test coverage No dead code introduced
Startup backfill opt-out bypass Fixed by removing backfill
_log_embedding_status still functional Synchronous status check retained
run_migrations() structurally correct except block intact after removal
No sensitive data in logs
Local-first architecture preserved

Overall: All 🔴 issues are resolved. The Codex P2 stale-ID cleanup gap has low practical impact but is worth a follow-up. This PR is ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 58s —— View job


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit (a83bee9 — fix(core): clean up stale vector ids)
  • Read updated files
  • Check status of previously-open issues
  • Evaluate against review checklist
  • Post final review

New Commit: a83bee9 — Clean up stale vector IDs

This commit closes the last open issue from the prior round: the Codex P2 finding that sync_entity_vectors_batch() silently dropped unknown IDs without triggering the repository's stale-row cleanup path.


Issue Resolved ✅

Prior Issue Status
Codex P2: sync_entity_vectors_batch() drops stale/deleted IDs without cleanup ✅ Fixed — unknown IDs now routed to repository.sync_entity_vectors_batch for cleanup

The fix correctly:

  1. Collects unknown_ids — IDs passed in that are absent from find_by_ids() result
  2. Routes them to self.repository.sync_entity_vectors_batch(unknown_ids) so the repository layer can delete orphaned search_index and search_vector_chunks rows
  3. Builds a merged VectorSyncBatchResult by summing across both the cleanup batch and the eligible-embedding batch, with opted_out_ids added into entities_skipped
  4. The decision-point comment (# Trigger / Why / Outcome) is exactly right

The result merging is also a nice refactor from the prior mutable batch_result.entities_total = ... mutation pattern to an immutable construction — clearer and less fragile.
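The immutable merge described above can be sketched as a fresh construction over both batch results. The dataclass and field names here are assumptions based on the counters mentioned in this review, not the PR's actual `VectorSyncBatchResult`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorSyncBatchResult:
    entities_total: int
    entities_skipped: int

def merge_results(
    repository_results: list[VectorSyncBatchResult], opted_out: int
) -> VectorSyncBatchResult:
    # Build a fresh result rather than mutating one batch's counters in place:
    # opted-out entities are folded into the skipped count.
    return VectorSyncBatchResult(
        entities_total=sum(r.entities_total for r in repository_results),
        entities_skipped=opted_out
        + sum(r.entities_skipped for r in repository_results),
    )

merged = merge_results(
    [VectorSyncBatchResult(2, 0), VectorSyncBatchResult(5, 1)], opted_out=3
)
print(merged.entities_total, merged.entities_skipped)  # → 7 4
```

Because the dataclass is frozen, an accidental in-place update raises immediately instead of silently corrupting the stats.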


New Issues Found

🟢 Minor: unknown_ids cleanup and eligible_entity_ids embedding are sequential

Both repository.sync_entity_vectors_batch(unknown_ids) and repository.sync_entity_vectors_batch(eligible_entity_ids, ...) are await-ed in sequence. Since neither depends on the other's result, they could be done concurrently:

tasks = []
if unknown_ids:
    tasks.append(self.repository.sync_entity_vectors_batch(unknown_ids))
if eligible_entity_ids:
    tasks.append(self.repository.sync_entity_vectors_batch(
        eligible_entity_ids, progress_callback=progress_callback))
results = await asyncio.gather(*tasks)

That said, unknown_ids is empty in every real call site (reindex_vectors seeds from find_all(), BatchIndexer uses freshly-upserted IDs), so the practical impact is zero. Not a blocker.


🟢 Minor: entities_skipped double-counts repository skips for unknown IDs

When unknown_ids are routed through repository.sync_entity_vectors_batch, the repository finds no source rows for those IDs and likely reports them as entities_skipped in its result. The merge aggregation adds len(opted_out_ids) + sum(result.entities_skipped ...), which means unknown entities count as both "deleted" and "skipped". This is a subtle accounting edge case with no practical impact given the real call sites — but worth noting if the entities_skipped field is ever surfaced in a user-visible stats page.
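Plugging hypothetical numbers into the aggregation shows the double count; the values below are illustrative, not taken from the PR:

```python
# Two stale IDs are routed to the cleanup batch; assume the repository
# reports them as skipped because they have no source rows to embed.
opted_out_ids: list[int] = []   # no opt-outs in this sync
unknown_ids = [7, 8]            # stale IDs, already deleted as entities
repository_skipped = [2, 0]     # cleanup batch skips both; eligible batch skips none

entities_skipped = len(opted_out_ids) + sum(repository_skipped)
# The two unknown entities were already counted as cleaned-up deletions by
# the cleanup batch, yet they inflate entities_skipped here as well.
print(entities_skipped)  # → 2
```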


All Previously-Open Issues

Severity Issue Status
🔴 progress_callback dead code ✅ Resolved (49a40be)
🔴 get_file_path_to_permalink_map() per-batch ✅ Resolved (49a40be)
🟡 Sequential fetch in _resolve_batch_relations ✅ Resolved (49a40be)
🟡 existing is None post-mutation ✅ Resolved (49a40be)
🟡 Missing test-int/ integration coverage ✅ Resolved (49a40be)
🟡 Dual identical result types without comment ✅ Resolved (49a40be)
🔴 P1 Codex: _run_bounded swallows SyncFatalError ✅ Resolved (49a40be)
🔴 P2 Codex: Checksum drift from scan time ✅ Resolved (49a40be)
🟡 remove_frontmatter redundant call ✅ Resolved (49a40be)
🔴 reindex_vectors bypasses embed: false opt-out ✅ Resolved (1207e8c)
🟡 Sequential _clear_entity_vectors loop ✅ Resolved (1207e8c)
🟡 Dead except FileNotFoundError handler ✅ Resolved (1207e8c)
🟡 Startup backfill bypasses opt-out ✅ Resolved (fe9fc7e — backfill removed entirely)
Codex P2: sync_entity_vectors_batch drops stale IDs ✅ Resolved (a83bee9)

Checklist Results

Category Status Notes
Python 3.12+ type annotations list[VectorSyncBatchResult] typed correctly
Async patterns / SQLAlchemy 2.0 Correct; sequential awaits are fine given real call sites
100-char line length prepare_seconds_total=sum(...) line at 90 chars
No commented-out code
Unit tests for new behavior test_semantic_vector_sync_batch_cleans_up_unknown_ids validates both paths called, counts correct
Integration tests Prior test-int/ coverage intact
100% test coverage Unknown-IDs path now exercised
Result merging correctness Immutable construction replaces mutable mutation; cleaner
opted_out_ids correctly excluded from eligible_entity_ids
progress_callback only forwarded to eligible batch Unknown cleanup call correctly omits it
No hardcoded secrets
No sensitive data in logs

Overall: All 🔴 issues are closed. No new 🔴 or 🟡 issues introduced. The two 🟢 observations above are non-blocking style/perf nits with zero practical impact at current call sites. This PR is ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@phernandez phernandez merged commit 540da41 into main Apr 8, 2026
38 of 41 checks passed
@phernandez phernandez deleted the perf/sync-batch-indexing-core branch April 8, 2026 06:21

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1fee913950


Comment on lines +511 to +515
entities_skipped=(
len(opted_out_ids)
+ sum(result.entities_skipped for result in repository_results)
- len(unknown_ids)
),


P2 Badge Do not subtract unknown IDs from skipped vector count

When entity_ids contains stale/unknown IDs (for example, a sync race where an entity is deleted before embedding), repository.sync_entity_vectors_batch(unknown_ids) performs cleanup and typically reports entities_skipped == 0; subtracting len(unknown_ids) here can therefore make entities_skipped negative. This produces invalid embedding stats in sync/reindex reporting and can mislead operators about what was actually skipped.
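With hypothetical values where the cleanup batch reports zero skips, the subtraction goes negative; the numbers are illustrative, not taken from the PR:

```python
opted_out_ids: list[int] = []   # no opt-outs this round
unknown_ids = [7, 8]            # stale IDs: cleaned up, reported as skipped == 0
repository_skipped = [0, 0]     # neither batch reports any skips

entities_skipped = (
    len(opted_out_ids) + sum(repository_skipped) - len(unknown_ids)
)
print(entities_skipped)  # → -2, an invalid stat
```

Clamping with `max(0, ...)` would hide the bug rather than fix it; dropping the subtraction, as the Codex comment suggests, keeps the count honest.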

