
perf(sync): batch file indexing in core#726

Merged
phernandez merged 9 commits into main from perf/sync-batch-indexing-core
Apr 8, 2026

Conversation

@phernandez
Member

Summary

  • add a typed, reusable core indexing layer with batch planning and bounded-parallel batch execution
  • refactor SyncService.sync() to batch changed file indexing, emit typed progress, and keep forward-ref/vector phases coarse and explicit
  • add bounded concurrent request handling for API-backed embedding batches while keeping fastembed/local flows intact

Testing

  • uv run pytest tests/indexing/test_batching.py tests/indexing/test_batch_indexer.py tests/sync/test_sync_service_batching.py tests/sync/test_sync_service_telemetry.py tests/repository/test_openai_provider.py -q
  • uv run ruff check src/basic_memory/indexing src/basic_memory/sync/sync_service.py src/basic_memory/config.py src/basic_memory/repository/openai_provider.py src/basic_memory/repository/embedding_provider_factory.py src/basic_memory/cli/commands/db.py tests/indexing tests/sync/test_sync_service_batching.py tests/sync/test_sync_service_telemetry.py tests/repository/test_openai_provider.py
  • uv run pyright
  • just fast-check

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 35s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Read new indexing layer files
  • Read modified sync_service.py
  • Read modified config/provider files
  • Read test files
  • Evaluate against review checklist
  • Post final review

Summary

This is a well-architected PR that introduces a typed, reusable batch indexing layer. The separation of concerns is excellent — indexing/models.py, batching.py, and batch_indexer.py form a clean core layer decoupled from filesystem concerns via the IndexFileWriter protocol. The literate programming style (section headers, decision comments) is consistently applied. Several issues worth addressing before merge:


Issues Found

🔴 Dead code: progress_callback in BatchIndexer.index_files is never invoked in the multi-batch sync loop

BatchIndexer.index_files accepts a progress_callback parameter but _index_changed_files in sync_service.py never passes one:

# sync_service.py:505-509
batch_result = await self.batch_indexer.index_files(
    loaded_files,
    max_concurrent=self.app_config.index_entity_max_concurrent,
    parse_max_concurrent=self.app_config.index_parse_max_concurrent,
    # progress_callback is NOT passed here
)

The per-batch IndexProgress(batches_total=1, batches_completed=1) callback in index_files (lines 173–186 of batch_indexer.py) will never fire during sync — progress is emitted at the higher level by _emit_index_progress. Either remove the progress_callback parameter from index_files to reduce confusion, or document that it's only for single-batch direct callers.



🔴 get_file_path_to_permalink_map() is called once per batch, not once per sync

_normalize_markdown_batch calls entity_repository.get_file_path_to_permalink_map() on every batch execution (batch_indexer.py:228). For a project that produces 10 batches of 32 files each, that's 10 full table scans of all existing entities. The reserved permalink set should be initialized once and threaded across batches at the _index_changed_files level, or the method should accept the pre-built map as a parameter.
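The restructuring the review asks for can be sketched as follows. The repository method name mirrors the snippet above; everything else (the stub repository, the batch helpers) is hypothetical scaffolding for illustration:

```python
import asyncio


class FakeEntityRepository:
    """Stand-in repository that counts permalink-map fetches."""

    def __init__(self) -> None:
        self.map_fetches = 0

    async def get_file_path_to_permalink_map(self) -> dict[str, str]:
        self.map_fetches += 1
        return {"notes/a.md": "notes/a"}


async def index_batch(paths: list[str], permalink_by_path: dict[str, str]) -> int:
    # Newly assigned permalinks go into the shared dict so batch N+1 sees them.
    for path in paths:
        permalink_by_path.setdefault(path, path.removesuffix(".md"))
    return len(paths)


async def index_changed_files(repo: FakeEntityRepository, batches: list[list[str]]) -> int:
    # Fetch the reserved-permalink map ONCE per sync, not once per batch.
    permalink_by_path = await repo.get_file_path_to_permalink_map()
    indexed = 0
    for batch in batches:
        # Every batch receives the pre-built map instead of re-scanning the table.
        indexed += await index_batch(batch, permalink_by_path)
    return indexed


repo = FakeEntityRepository()
total = asyncio.run(index_changed_files(repo, [["notes/b.md"], ["notes/c.md"]]))
print(total, repo.map_fetches)  # 2 1: two files indexed, one table scan
```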



🟡 Sequential await loop in _resolve_batch_relations for relation fetching

The unresolved-relations fetch phase is sequential:

# batch_indexer.py:424-430
for entity_id in entity_ids:
    unresolved_relations.extend(
        await self.relation_repository.find_unresolved_relations_for_entity(entity_id)
    )

Resolution uses asyncio.gather (correctly), but the fetch phase does not. For a batch with many entity IDs this adds N sequential DB round-trips before any concurrent work begins. The fetch can also be done with asyncio.gather.
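A sketch of the gather-based fetch. The repository method name follows the snippet above; the stub class is hypothetical:

```python
import asyncio


class StubRelationRepository:
    async def find_unresolved_relations_for_entity(self, entity_id: int) -> list[str]:
        # Pretend each entity has one unresolved relation.
        return [f"rel-{entity_id}"]


async def fetch_unresolved(repo: StubRelationRepository, entity_ids: list[int]) -> list[str]:
    # Issue all fetches concurrently instead of awaiting them one by one.
    per_entity = await asyncio.gather(
        *(repo.find_unresolved_relations_for_entity(eid) for eid in entity_ids)
    )
    # Flatten the per-entity lists; gather preserves entity order.
    return [rel for rels in per_entity for rel in rels]


print(asyncio.run(fetch_unresolved(StubRelationRepository(), [1, 2, 3])))
# ['rel-1', 'rel-2', 'rel-3']
```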



🟡 include_created_at=existing is None captures mutated variable

In _upsert_regular_file, existing is reassigned inside the except IntegrityError block. The entity_repository.update() call after the try/except evaluates existing is None based on the post-mutation value, which happens to produce correct behavior — but it's fragile:

# batch_indexer.py:398-404
# At this point `existing` may have been reassigned in the except block above
updated = await self.entity_repository.update(
    entity_id,
    self._entity_metadata_updates(file, checksum, include_created_at=existing is None),
)

Consider capturing a boolean flag (is_new_entity = existing is None) before the try block and using that instead.
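A runnable miniature of that suggestion. The IntegrityError stand-in and the upsert shape are illustrative, not the project's code; the point is that is_new_entity is frozen before the except block can reassign existing:

```python
class IntegrityError(Exception):
    """Stand-in for the DB driver's unique-constraint error."""


def classify_upsert(existing, race_loser: bool = False):
    """Shows freezing `is_new_entity` before `existing` can be reassigned."""
    is_new_entity = existing is None  # captured BEFORE any mutation
    try:
        if existing is None:
            if race_loser:
                raise IntegrityError  # a concurrent insert won the race
            existing = {"id": 1}
    except IntegrityError:
        existing = {"id": 2}  # re-fetch the row the winner created
    # include_created_at reflects the original decision, not the mutated variable.
    return {"entity": existing, "include_created_at": is_new_entity}


print(classify_upsert(None, race_loser=True))
# {'entity': {'id': 2}, 'include_created_at': True}
```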



🟡 Missing integration tests in test-int/

All new tests are unit tests under tests/. Per CLAUDE.md and project conventions, integration tests (real implementations, no mocks) belong in test-int/. The sync batching behavior — particularly the full multi-batch flow with real files, forward-ref resolution across batches, and circuit-breaker interaction — is not covered by an integration test. Only tests/sync/test_sync_service_batching.py exists, and it contains a single test.

Key missing scenarios:

  • Files that fail loading (FileNotFoundError) are handled, circuit breaker fires after threshold
  • Files larger than index_batch_max_bytes routed to their own single-file batch
  • _load_index_file_metadata and _load_index_batch_files load error paths

🟡 test_sync_service_batching.py is sparse

The single test covers the happy path. There are no tests for:

  • Empty changed_paths list (the early-return branch in _index_changed_files)
  • Files that were pre-skipped by the circuit breaker
  • Metadata load errors
  • Batch load errors

🟢 Minor: batching.py line 55 redundant flush condition

After the overflow flush on lines 38–42, line 55 also checks current_bytes >= max_bytes. A file that brings total bytes exactly to max_bytes would not flush on line 39 (> max_bytes is strict) but would flush on line 55. This is correct behavior, but consider >= on line 39 to unify both checks and make the intent clearer.
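A runnable sketch of the planner logic with >= in both places, as suggested. Function and parameter names are illustrative, not the project's:

```python
def plan_batches(sizes: list[int], max_files: int = 32, max_bytes: int = 1000) -> list[list[int]]:
    """Greedy planner over file sizes; flushes when a budget is met or exceeded."""
    batches: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    for size in sizes:
        # Pre-add overflow flush: `>=` means a file landing exactly on the byte
        # budget flushes here, in the same place as one that overshoots it.
        if current and current_bytes + size >= max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
        # Post-add flush: both budget checks now use `>=`, so intent reads uniformly.
        if len(current) >= max_files or current_bytes >= max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        batches.append(current)
    return batches


print(plan_batches([600, 400, 600], max_bytes=1000))  # [[600], [400], [600]]
```

Note that with >= on the pre-add check, a file landing exactly on the byte budget starts a fresh batch rather than completing the current one; either grouping is valid, and the gain is that both flush conditions read the same.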


Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | slots=True dataclasses, type alias in factory |
| Async patterns / SQLAlchemy 2.0 | | Correct asyncio.gather + Semaphore usage |
| FastAPI/Typer conventions | | |
| 100-char line length | | |
| No commented-out code | | |
| Unit tests for new functions | | |
| Integration tests for new paths | ⚠️ | Missing test-int/ coverage for batch sync |
| 100% test coverage | ⚠️ | Some _upsert_regular_file branches likely uncovered |
| MCP tools atomic/composable | N/A | No MCP changes |
| No DB migration needed | | Config fields only |
| Local-first architecture preserved | | |
| No hardcoded secrets/credentials | | |
| Input validation | | max_files <= 0 / max_bytes <= 0 guards |
| No sensitive data in logs | | |
| Windows path compatibility | | as_posix() used consistently in scan paths |
| No emojis/unicode in console/log output | | New code uses ASCII only; the flagged character in db.py:251 is pre-existing |


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f8d017529

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +545 to +547
except Exception as exc:
errors[path] = str(exc)
logger.warning("Batch indexing failed", path=path, error=str(exc))


P1 Badge Re-raise fatal sync exceptions in batch workers

_run_bounded currently downgrades every worker exception to a recoverable per-file error, which swallows SyncFatalError from repository writes (for example when a project is deleted and FK checks fail during sync). The pre-batch path explicitly propagated fatal sync errors to stop the run, but this code now lets sync continue and potentially update scan watermark after partial failures, leaving state inconsistent. Please allow fatal exceptions to propagate instead of handling them as normal batch errors.
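The requested behavior can be sketched like this. SyncFatalError and the worker shape are modeled on the review's description, not copied from the codebase:

```python
import asyncio
from collections.abc import Awaitable


class SyncFatalError(Exception):
    """Unrecoverable sync error (e.g. FK failure after a project deletion)."""


async def run_bounded(tasks: dict[str, Awaitable[None]], max_concurrent: int = 4) -> dict[str, str]:
    semaphore = asyncio.Semaphore(max_concurrent)
    errors: dict[str, str] = {}

    async def worker(path: str, awaitable: Awaitable[None]) -> None:
        async with semaphore:
            try:
                await awaitable
            except SyncFatalError:
                raise  # fatal: let it abort the whole sync run
            except Exception as exc:
                errors[path] = str(exc)  # recoverable: record and continue

    await asyncio.gather(*(worker(p, a) for p, a in tasks.items()))
    return errors


async def ok() -> None:
    return None


async def bad() -> None:
    raise ValueError("parse failure")


async def fatal() -> None:
    raise SyncFatalError("project deleted mid-sync")


print(asyncio.run(run_bounded({"a.md": ok(), "b.md": bad()})))  # {'b.md': 'parse failure'}
try:
    asyncio.run(run_bounded({"c.md": fatal()}))
except SyncFatalError as exc:
    print(f"aborted: {exc}")  # aborted: project deleted mid-sync
```

The __cause__ check that the eventual fix adds (for fatal errors wrapped by an intermediate layer) would slot into the first except clause.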


Comment on lines +590 to +594
files[path] = IndexInputFile(
path=metadata.path,
size=metadata.size,
checksum=metadata.checksum,
content_type=metadata.content_type,


P2 Badge Recompute checksum from loaded bytes before upsert

This carries forward the checksum captured during scan() instead of deriving it from the bytes just read for indexing. If a file changes between scan and batch load, the upsert path can write new content/mtime while persisting an old checksum; because mtime and size may already match afterward, later scans may never correct the checksum. That leaves persistent checksum drift and can break checksum-based move detection. Compute checksum from content (or recompute at write time) rather than trusting scan-time checksum here.
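A minimal sketch of the suggested fix, assuming the checksum helper hashes the raw bytes (SHA-256 here is an assumption; the thread does not show the project's actual hash):

```python
import hashlib


def compute_checksum(content: bytes) -> str:
    # Hypothetical stand-in for the project's checksum helper (SHA-256 assumed).
    return hashlib.sha256(content).hexdigest()


def build_index_input(path: str, scan_checksum: str, content: bytes) -> dict[str, str]:
    # Derive the checksum from the bytes just loaded, not the earlier scan();
    # a file edited between scan and batch load would otherwise persist a stale hash.
    loaded_checksum = compute_checksum(content)
    return {"path": path, "checksum": loaded_checksum}


stale = compute_checksum(b"old contents")
record = build_index_input("notes/a.md", stale, b"new contents")
print(record["checksum"] == stale)  # False: the stored checksum tracks the loaded bytes
```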


Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 16s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit changes (2f3cf95)
  • Read updated files
  • Evaluate against review checklist
  • Post final review

New Commit: 2f3cf95 — Return final markdown from batch indexer

This follow-up commit is clean and well-motivated. It closes a real gap: before this change, callers of BatchIndexer.index_files could not get the final on-disk content of a file after frontmatter was rewritten, forcing a re-read. The fix is correct — _normalize_markdown_file now threads final_content alongside final_checksum so both reflect the actual written state.

What looks good

  • Typed IndexFrontmatterWriteResult replaces the raw str return from write_frontmatter. Protocol signature is now self-documenting.
  • final_content tracked through rewrite chain — both the no-frontmatter enforcement and permalink update branches correctly set final_checksum and final_content together. Returning stale content after a frontmatter rewrite is now impossible.
  • update_frontmatter backward-compat shim — delegating to update_frontmatter_with_result is the right approach; existing callers don't need to change.
  • search_content=remove_frontmatter(prepared.content) — using the final (post-rewrite) content for search indexing is more correct than prepared.markdown.content, which reflected the pre-rewrite parse.
  • Tests cover both cases — no-rewrite test asserts markdown_content == original_content, and the permalink-conflict test asserts markdown_content equals the file's actual on-disk content after rewrite.

Issues in this commit

🟡 Dual identical result types with a manual conversion

FrontmatterUpdateResult (file_service.py:30) and IndexFrontmatterWriteResult (models.py:58) are structurally identical — both have checksum: str and content: str. The adapter converts between them:

# sync_service.py:135-140
result = await self.file_service.update_frontmatter_with_result(update.path, update.metadata)
return IndexFrontmatterWriteResult(checksum=result.checksum, content=result.content)

This boilerplate exists to keep the indexing/ layer free of a FileService import. That's a valid layering concern, but a # Why: keeps indexing/ decoupled from file_service comment here would make the intent obvious rather than looking like a redundant copy.


🟡 remove_frontmatter called redundantly on every markdown file

search_content=remove_frontmatter(prepared.content) is called even when no frontmatter rewrite occurred and prepared.markdown.content (the already-parsed body, set by EntityMarkdown) is available. The markdown parser already stripped frontmatter during _prepare_markdown_file. Prefer prepared.markdown.content when it is populated, falling back to remove_frontmatter(prepared.content) only when the parsed body is unavailable, to avoid the re-parse on the common path.


Still-open issues from the first review

The following are not addressed by this commit but remain relevant before merge:

| Severity | Issue | Status |
| --- | --- | --- |
| 🔴 | progress_callback in index_files is never invoked from _index_changed_files | Open |
| 🔴 | get_file_path_to_permalink_map() called once per batch instead of once per sync | Open |
| 🟡 | Sequential await loop in _resolve_batch_relations fetch phase (lines 434–437) | Open |
| 🟡 | existing is None evaluated after except IntegrityError mutates existing (line 411) | Open |
| 🟡 | No test-int/ integration coverage for multi-batch sync | Open |

Still-open Codex review issues

| Priority | Issue | Status |
| --- | --- | --- |
| P1 | _run_bounded catches all exceptions, swallowing SyncFatalError from repository writes (line 554) | Open |
| P2 | Checksum stored in IndexedEntity comes from scan time; if the file changes between scan and load, the checksum can drift | Open |

Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | slots=True dataclasses throughout |
| Async patterns / SQLAlchemy 2.0 | | |
| 100-char line length | | |
| No commented-out code | | |
| Unit tests for new behavior | | No-rewrite and rewrite cases covered |
| Integration tests | ⚠️ | Still missing test-int/ batch sync coverage |
| 100% test coverage | ⚠️ | # pragma: no cover on except in update_frontmatter_with_result — acceptable |
| No hardcoded secrets | | |
| No sensitive data in logs | | |
| Backward compatibility of update_frontmatter | | Shim delegates correctly |
| IndexFileWriter protocol updated consistently | | Test writer, prod writer, and protocol all updated |


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f3cf950c0


entity_id=updated.id,
checksum=prepared.final_checksum,
content_type=prepared.file.content_type,
search_content=remove_frontmatter(prepared.content),


P1 Badge Derive markdown search content from parsed markdown

Building search_content with remove_frontmatter(prepared.content) can fail on files that the parser already accepted (for example, markdown starting with --- but malformed/non-frontmatter delimiters). EntityParser.parse_markdown_content() is explicitly tolerant of malformed YAML and still produces usable EntityMarkdown, but this later remove_frontmatter() call can raise and mark the file as a batch error after the entity was already upserted. That regresses sync robustness for malformed-but-indexable notes; use the parsed markdown body (prepared.markdown.content) instead of reparsing raw text.


Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 3m 33s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Read new commit changes (49a40be, d986c4d)
  • Check which open issues from prior reviews are addressed
  • Evaluate against review checklist
  • Post final review

New Commits: 49a40be + d986c4d — Address batch indexing review issues

These two commits together close nearly all outstanding issues from both Claude and Codex reviews. The fixes are targeted and correct. A few minor points below.


Issues Resolved ✅

| Prior Issue | Resolution |
| --- | --- |
| 🔴 progress_callback dead code in index_files | Removed; replaced with existing_permalink_by_path parameter |
| 🔴 get_file_path_to_permalink_map() called once per batch | Fetched once before the loop in _index_changed_files; passed as a shared mutable dict |
| 🟡 Sequential await loop in _resolve_batch_relations (both fetch + count phases) | Both replaced with asyncio.gather |
| 🟡 existing is None evaluated after IntegrityError mutates existing | is_new_entity flag captured before the try block |
| 🟡 Missing test-int/ integration coverage | test_sync_batching_integration.py added with 2 real-service scenarios |
| P1: _run_bounded swallows SyncFatalError | Re-raises when isinstance(exc, SyncFatalError) or isinstance(exc.__cause__, SyncFatalError) |
| P2: Checksum stored from scan time | Now recomputed via compute_checksum(content) from loaded bytes |
| 🟡 remove_frontmatter called redundantly on every markdown file | Replaced with prepared.markdown.content if prepared.markdown.content is not None else prepared.content |
| 🟡 FrontmatterUpdateResult/IndexFrontmatterWriteResult adapter lacks explanation | # Why: comment added to _FileServiceIndexWriter.write_frontmatter |
| P1 Codex: remove_frontmatter can raise on malformed-but-parseable files | Addressed by the markdown content fallback; unit test added |

Also: the shared_permalink_by_path dict is passed by reference and mutated in-place by _normalize_markdown_batch (via existing_permalink_by_path[path] = normalized[path].markdown.frontmatter.permalink), an elegant way to propagate newly assigned permalinks across batches without an extra DB round-trip per batch.


New Issues Found

🟡 except FileNotFoundError on sync_service.py:629 is dead code

After d986c4d reverted the FileNotFoundError re-raise from file_service.py, read_file_bytes now wraps all exceptions — including FileNotFoundError — in FileOperationError. The handler at line 629 can never fire from read_file_bytes. The subsequent except FileOperationError (line 631) is the correct path, and it already handles the FileNotFoundError case via isinstance(exc.__cause__, FileNotFoundError).

# sync_service.py:629-638
except FileNotFoundError:          # ← dead: read_file_bytes always wraps in FileOperationError
    await self.handle_delete(path)
except FileOperationError as exc:  # ← this is the actual path for missing files
    if isinstance(exc.__cause__, FileNotFoundError):
        await self.handle_delete(path)
    ...

The dead handler is harmless but misleads readers into thinking FileNotFoundError can surface unwrapped. Remove it, or add a comment like # safety net: kept in case callers other than read_file_bytes are added here.



🟢 Minor: batching.py line 55 — == is correct but relies on unstated invariant

The change from current_bytes >= max_bytes to current_bytes == max_bytes is logically equivalent given that the overflow-flush on lines 38–43 ensures current_bytes <= max_bytes always holds after adding a file. However, the reviewer's suggestion was to change > to >= on line 39 (to unify both checks), not to make line 55 more restrictive. The == check is less defensive: if the invariant ever breaks (e.g., a file with negative size), an over-budget batch would not flush. The prior >= was a free safety net. Consider reverting line 55 to >= or adding a comment:

# current_bytes <= max_bytes is invariant here (overflow case handled above),
# so == is equivalent to >=, but >= is kept for defensive clarity.
if len(current_paths) >= max_files or current_bytes >= max_bytes:

Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | |
| Async patterns / SQLAlchemy 2.0 | | asyncio.gather used correctly throughout |
| 100-char line length | | |
| No commented-out code | | |
| Unit tests for new behavior | | Malformed frontmatter + fatal error re-raise tests added |
| Integration tests | | test-int/test_sync_batching_integration.py added with circuit-breaker and cross-batch forward-ref tests |
| 100% coverage | | Dead except FileNotFoundError may show as uncovered — worth checking |
| SyncFatalError propagation | | Correct; handles both direct and __cause__ wrapping |
| Checksum from loaded bytes | | Scan-time drift fixed |
| Permalink map fetched once per sync | | Shared mutable dict threaded across batches correctly |
| search_content from parsed body | | Malformed-frontmatter regression closed |
| is_new_entity flag | | Post-mutation evaluation eliminated |
| No hardcoded secrets | | |
| Windows path compatibility | | |
| No emojis/unicode in logs | | |

Overall: This is in good shape. The one actionable item is the dead except FileNotFoundError handler — it should be removed to keep the error-handling contract clear. The batching.py == vs >= is a style nit. Ready to merge once the dead handler is cleaned up.

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 3m 29s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commits (49a40be, d986c4d)
  • Read updated files
  • Evaluate against review checklist
  • Post final review

New Commits: 49a40be + d986c4d

These two follow-up commits address nearly every issue from the previous reviews. The implementation is now in significantly better shape.


What was resolved

| Issue | Status |
| --- | --- |
| 🔴 progress_callback dead code in index_files | ✅ Removed entirely |
| 🔴 get_file_path_to_permalink_map() called once per batch | ✅ Pre-fetched once in _index_changed_files, passed as existing_permalink_by_path |
| 🟡 Sequential await loop in _resolve_batch_relations | ✅ Both fetch and remaining-count phases now use asyncio.gather |
| 🟡 existing is None evaluated after except IntegrityError mutates existing | is_new_entity flag captured before the try block |
| 🟡 No test-int/ integration coverage | ✅ Added test-int/test_sync_batching_integration.py with two integration tests |
| 🟡 test_sync_service_batching.py sparse coverage | ✅ Added tests for empty changed_paths, circuit breaker, metadata errors, batch load errors |
| 🔴 P1: _run_bounded swallows SyncFatalError | ✅ Fixed — SyncFatalError and __cause__ re-raised |
| 🔴 P2: Checksum from scan time can drift | ✅ Fixed — loaded_checksum = await compute_checksum(content) computed from actual bytes |
| 🟡 Codex P1: remove_frontmatter on parsed body | ✅ Fixed — now uses prepared.markdown.content with fallback |
| 🟡 Adapter comment missing | ✅ Added # Why: comment on the FrontmatterUpdateResult → IndexFrontmatterWriteResult conversion |

One new issue

🟡 except FileNotFoundError at sync_service.py:629 is now dead code

Commit d986c4d changed file_service.read_file_bytes to wrap ALL errors (including FileNotFoundError) in FileOperationError. FileOperationError is not a subclass of FileNotFoundError, so the bare handler on line 629 can never be reached:

# sync_service.py:629-630
except FileNotFoundError:           # ← dead: read_file_bytes never raises this bare
    await self.handle_delete(path)
except FileOperationError as exc:   # ← the actual path
    if isinstance(exc.__cause__, FileNotFoundError):
        await self.handle_delete(path)

Either remove the dead handler or add a comment explaining it's a defensive net.



Architecture observation (no action needed)

The shared_permalink_by_path dict is passed by reference to _normalize_markdown_batch and mutated in-place (line 228 of batch_indexer.py). This correctly propagates new permalinks from batch N into the reserved set for batch N+1. The in-place mutation is intentional — but a one-line comment at the call site in _index_changed_files noting "mutated in-place to accumulate cross-batch permalink reservations" would help future readers understand why the dict isn't re-fetched between batches.


Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | slots=True dataclasses, type alias, full annotations |
| Async patterns / SQLAlchemy 2.0 | | asyncio.gather used consistently now in both relation fetch phases |
| FastAPI/Typer conventions | | |
| 100-char line length | | |
| No commented-out code | | |
| Unit tests for new functions | | Comprehensive coverage of edge cases added |
| Integration tests | | Two integration tests in test-int/ covering multi-batch + circuit breaker |
| 100% test coverage | | Codex P1 search_content fix covered by new test_batch_indexer_uses_parsed_markdown_body_for_malformed_frontmatter_delimiters test |
| MCP tools atomic/composable | N/A | No MCP changes |
| No DB migration needed | | Config fields only |
| Local-first architecture preserved | | |
| No hardcoded secrets/credentials | | |
| Input validation | | Guards for max_files <= 0, max_bytes <= 0 |
| No sensitive data in logs | | Large per-entity logger.info blocks removed |
| Windows path compatibility | | as_posix() used consistently |
| No emojis/unicode in console/log output | | |
| SyncFatalError propagation | | Both isinstance(exc, SyncFatalError) and isinstance(exc.__cause__, SyncFatalError) checked |
| Checksum consistency | | Derived from actual loaded bytes, not scan-time metadata |

This PR is in good shape. The one dead-code handler is a minor cleanup item, not a blocker.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 24s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit (cdffd87 — feat(core): support note-level embedding opt-out)
  • Read updated/added files
  • Check status of previously-open issues
  • Evaluate against review checklist
  • Post final review

New Commit: cdffd87 — Support note-level embedding opt-out

This commit adds per-note opt-out from semantic embedding via embed: false in frontmatter metadata. The core design is solid: a static _entity_embeddings_enabled helper with careful string/bool/int coercion, a _clear_entity_vectors cleanup path, and the opt-out gate wired into both single-entity and batch sync paths. Tests cover single sync, batch sync with mixed opt-in/opt-out, and FTS participation after opt-out.
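The coercion described above might look like the following sketch. The rules beyond what the review states (bool, normalized str, int/float, unknown defaults to enabled) are assumptions, and the function name is illustrative (the project's helper is _entity_embeddings_enabled):

```python
def embeddings_enabled(value: object) -> bool:
    """Interpret a frontmatter `embed:` value; anything unrecognized stays opted in."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        # YAML may deliver "false", "no", "off", "0" etc. as plain strings.
        return value.strip().lower() not in {"false", "no", "off", "0"}
    if isinstance(value, (int, float)):
        return bool(value)
    return True  # unknown types default to enabled


print([embeddings_enabled(v) for v in (False, "False", "no", 0, 1, None, "yes")])
# [False, False, False, False, True, True, True]
```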

One new issue

🔴 reindex_vectors bypasses the service-level opt-out

reindex_vectors() calls self.repository.sync_entity_vectors_batch(entity_ids) directly — the repository layer, which knows nothing about embed: false. The service-level opt-out in SearchService.sync_entity_vectors_batch is never invoked:

# search_service.py:505 — calls repository directly, skips opt-out filter
batch_result = await self.repository.sync_entity_vectors_batch(
    entity_ids,
    progress_callback=progress_callback,
)

A user who sets embed: false on a note and later runs basic-memory db reindex-vectors (or triggers CLI reindex) will have that note re-embedded, silently ignoring the opt-out. The fix is one word:

batch_result = await self.sync_entity_vectors_batch(
    entity_ids,
    progress_callback=progress_callback,
)



Minor issues

🟡 Sequential _clear_entity_vectors loop in sync_entity_vectors_batch

# search_service.py:465-466
for entity_id in opted_out_ids:
    await self._clear_entity_vectors(entity_id)

This is the same sequential-await pattern fixed in _resolve_batch_relations in the prior round. For a project with many opted-out notes this adds N sequential DB round-trips. Replace with asyncio.gather:

await asyncio.gather(*(self._clear_entity_vectors(eid) for eid in opted_out_ids))


🟡 db.py:128 startup backfill also bypasses opt-out

db.py:128 constructs a search_repository directly and calls sync_entity_vectors_batch without going through SearchService. This is the startup embedding backfill path. Unlike reindex_vectors, SearchService may not be available there, but the behavior is worth documenting: opted-out notes will be re-embedded on every startup backfill cycle if they don't already have vectors. Consider either threading the entity metadata check into the backfill, or adding a comment acknowledging the known limitation.


Still-open issue from prior reviews

🟡 Dead except FileNotFoundError at sync_service.py:629 — still not addressed

This handler was flagged in the third review and remains in d986c4d. The read_file_bytes contract wraps all exceptions (including FileNotFoundError) in FileOperationError, making the bare handler unreachable:

# sync_service.py:629-630 — unreachable
except FileNotFoundError:
    await self.handle_delete(path)
except FileOperationError as exc:   # <- actual path for missing files
    if isinstance(exc.__cause__, FileNotFoundError):
        await self.handle_delete(path)

Remove it, or add a # Safety net: ... comment.



Checklist Results

| Category | Status | Notes |
| --- | --- | --- |
| Python 3.12+ type annotations | | @staticmethod, clean type guards |
| Async patterns / SQLAlchemy 2.0 | | Correct execute_query usage |
| 100-char line length | | Line 599 (if isinstance(self.repository, SearchRepositoryBase) and not self.repository._semantic_enabled) is ~100 chars — borderline, within limit |
| No commented-out code | | |
| Unit tests for new behavior | | Single sync, batch mixed, FTS-not-excluded all covered |
| Integration tests | | Prior integration tests intact |
| 100% test coverage | ⚠️ | reindex_vectors with an opted-out entity has no test — will show as a gap once opt-out is threaded through |
| embed: false coercion correctness | | bool, str (normalized), int/float, unknown-default-true |
| Postgres/SQLite parity in _clear_entity_vectors | | SQLite explicitly deletes the vec table; Postgres relies on ON DELETE CASCADE — correct |
| No sensitive data in logs | | |
| Local-first architecture preserved | | |

Overall: One clear bug (reindex_vectors bypassing opt-out) that should be fixed before merge. The sequential clear loop and db.py backfill gap are lower priority. The dead except FileNotFoundError handler is a cleanup carry-over from the previous round.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 17s


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit (1207e8c — fix(core): respect embedding opt-outs during reindex)
  • Read updated files
  • Check status of previously-open issues
  • Evaluate against review checklist
  • Post final review

New Commit: 1207e8c — Respect embedding opt-outs during reindex

This commit closes three of the four open issues from the prior round cleanly.


Issues Resolved ✅

| Prior Issue | Resolution |
| --- | --- |
| 🔴 reindex_vectors calls self.repository.sync_entity_vectors_batch directly, bypassing opt-out | Fixed: changed to self.sync_entity_vectors_batch — routes through the service-level filter |
| 🟡 Sequential _clear_entity_vectors loop in sync_entity_vectors_batch | Fixed: replaced with asyncio.gather |
| 🟡 Dead except FileNotFoundError handler in sync_service.py | Fixed: handler removed from _load_index_batch_files |

The fix to reindex_vectors is exactly the one-word change suggested in the prior review. The new test test_reindex_vectors_respects_embed_opt_out correctly validates the delegation: it mocks sync_entity_vectors_batch on the service instance and asserts that reindex_vectors invokes it with all entity IDs (the opt-out filtering is delegated to sync_entity_vectors_batch, which has its own test coverage). The assertion sync_batch.assert_awaited_once_with([41, 42], progress_callback=None) correctly captures the routing.


Still-Open Issue

🟡 Startup backfill at db.py:128 still calls the repository directly

The _run_semantic_embedding_backfill function in src/basic_memory/db.py constructs a SearchRepository and calls sync_entity_vectors_batch without going through SearchService:

# src/basic_memory/db.py:123-128
search_repository = repository_class(
    session_maker,
    project_id=project_id,
    app_config=app_config,
)
batch_result = await search_repository.sync_entity_vectors_batch(entity_ids)  # ← bypasses opt-out

This is the startup backfill that fires when search_vector_chunks is empty. SearchService is not available at this call site, so opted-out notes will be embedded during the first-run backfill even if they have embed: false. On subsequent syncs, sync_entity_vectors_batch (service-level) would clear those vectors — so the drift is eventually corrected, but there's a window where opted-out notes have stale vectors.

This was flagged as a known limitation in the prior review. The options remain: thread an entity-metadata check into the backfill, or add a # Why: startup backfill has no SearchService; opted-out entities will be corrected on the next sync comment to document the limitation explicitly.
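The first option could look roughly like the following pre-backfill filter. The `entity_metadata` attribute and the `"embed"` key are assumptions inferred from the opt-out semantics described in this review, not the project's actual schema:

```python
from types import SimpleNamespace

# Hypothetical pre-backfill filter: drop entities whose metadata opts out
# of embedding (embed: false) before handing IDs to the repository.
def eligible_for_backfill(entities) -> list[int]:
    return [
        e.id
        for e in entities
        if (e.entity_metadata or {}).get("embed", True) is not False
    ]

entities = [
    SimpleNamespace(id=1, entity_metadata={"embed": False}),  # opted out
    SimpleNamespace(id=2, entity_metadata=None),              # default: embed
    SimpleNamespace(id=3, entity_metadata={"embed": True}),
]
print(eligible_for_backfill(entities))  # → [2, 3]
```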




Checklist Results

Category Status Notes
Python 3.12+ type annotations
Async patterns / SQLAlchemy 2.0 asyncio.gather in _clear_entity_vectors loop is correct
100-char line length
No commented-out code
Unit tests for new behavior test_reindex_vectors_respects_embed_opt_out correctly validates routing via service
Integration tests Prior test-int/ coverage intact
100% test coverage
embed: false opt-out respected in reindex_vectors Fixed
_clear_entity_vectors concurrent Fixed
Dead except FileNotFoundError removed Fixed
Startup backfill opt-out ⚠️ db.py:128 still bypasses SearchService; known limitation

Overall: All 🔴 issues are closed. This PR is effectively ready to merge. The db.py:128 backfill gap is a pre-existing limitation with a bounded impact (vectors will be corrected on the next sync) — a comment documenting the trade-off is sufficient.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1207e8c664

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +471 to +475
eligible_entity_ids = [
entity_id
for entity_id in entity_ids
if entity_id in entities_by_id and entity_id not in opted_out_ids
]


P2 Badge Preserve cleanup for missing IDs in vector batch sync

sync_entity_vectors_batch() now drops IDs that are not returned by entity_repository.find_by_ids() before delegating to repository.sync_entity_vectors_batch(). This skips the repository’s missing-entity cleanup path (which deletes stale chunk/embedding rows when no search_index rows exist), so a stale/deleted entity ID in the input list will no longer have its vector rows cleaned up. This regression appears when callers pass outdated IDs (for example from stale queues or races), and leaves orphaned vector data accumulating over time.

Useful? React with 👍 / 👎.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 1m 58s —— View job


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit (fe9fc7e — fix(core): remove automatic embedding backfill)
  • Read updated files
  • Check status of previously-open issues
  • Evaluate against review checklist
  • Post final review

New Commit: fe9fc7e — Remove automatic embedding backfill

This commit closes the last open issue from the previous round with a clean, decisive approach: instead of threading the SearchService opt-out check into the startup backfill, the backfill is removed entirely.

What was removed

  • _needs_semantic_embedding_backfill() and _run_semantic_embedding_backfill() from db.py
  • The migration-time backfill check from run_migrations()
  • _background_embedding_backfill() and its asyncio.create_task() call from mcp/server.py
  • The shutdown-time backfill task cancellation
  • 225 lines of test coverage for the now-deleted code

What replaced it

The warning message in _log_embedding_status now says:

"Run 'bm reindex --embeddings' to build them."

This is a better design than patching the backfill: simpler startup path, no background task lifecycle to manage, no bypass of the opt-out check, and a clear user-facing action when embeddings are missing.

Remaining code is clean

  • run_migrations() ends correctly at the init_search_index() call; the except block is intact
  • lifespan() in server.py is now 20 lines shorter with no dangling task references
  • _log_embedding_status() still runs synchronously at startup as a status check (correct)

Previously-Open Issue Resolution

Prior Issue Status
🟡 db.py:128 startup backfill bypasses SearchService opt-out ✅ Fixed — entire backfill removed; users directed to bm reindex --embeddings

One Still-Open Item (Codex P2, from 1207e8c)

sync_entity_vectors_batch() silently drops stale IDs without cleanup

The service-level filter in sync_entity_vectors_batch (search_service.py:471–475) builds eligible_entity_ids from IDs that exist in find_by_ids(). Entity IDs not returned by find_by_ids() (i.e., already deleted from the entity table) are silently dropped — neither opted-out nor delegated to the repository. The repository-level sync_entity_vectors_batch previously handled this by cleaning up orphaned search_index and search_vector_chunks rows for missing entities.

# search_service.py:471-475
eligible_entity_ids = [
    entity_id
    for entity_id in entity_ids
    if entity_id in entities_by_id and entity_id not in opted_out_ids
    # IDs not in entities_by_id are silently discarded — no stale row cleanup
]

Practical impact is low given actual call sites:

  • reindex_vectors() seeds its IDs from find_all() — only live entities — so stale IDs can't appear there
  • BatchIndexer calls sync_entity_vectors_batch only with freshly-upserted entity IDs

However, direct callers of the service method with stale IDs (e.g. from a race condition between delete and reindex) would accumulate orphaned vector rows indefinitely. A defensive fix would pass unknown IDs through to the repository for cleanup:

unknown_ids = [eid for eid in entity_ids if eid not in entities_by_id]
if unknown_ids:
    await self.repository.sync_entity_vectors_batch(unknown_ids)  # cleans up stale rows

This is not a blocker given the current call sites, but worth noting for a follow-up.



Checklist Results

Category Status Notes
Python 3.12+ type annotations
Async patterns / SQLAlchemy 2.0
100-char line length
No commented-out code
Unit tests for removed behavior 225 lines of now-irrelevant tests correctly deleted
Integration tests Prior test-int/ coverage intact
100% test coverage No dead code introduced
Startup backfill opt-out bypass Fixed by removing backfill
_log_embedding_status still functional Synchronous status check retained
run_migrations() structurally correct except block intact after removal
No sensitive data in logs
Local-first architecture preserved

Overall: All 🔴 issues are resolved. The Codex P2 stale-ID cleanup gap has low practical impact but is worth a follow-up. This PR is ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 58s —— View job


PR Review: perf(sync): batch file indexing in core

Tasks:

  • Gather context and understand PR
  • Review new commit (a83bee9 — fix(core): clean up stale vector ids)
  • Read updated files
  • Check status of previously-open issues
  • Evaluate against review checklist
  • Post final review

New Commit: a83bee9 — Clean up stale vector IDs

This commit closes the last open issue from the prior round: the Codex P2 finding that sync_entity_vectors_batch() silently dropped unknown IDs without triggering the repository's stale-row cleanup path.


Issue Resolved ✅

Prior Issue Status
Codex P2: sync_entity_vectors_batch() drops stale/deleted IDs without cleanup ✅ Fixed — unknown IDs now routed to repository.sync_entity_vectors_batch for cleanup

The fix correctly:

  1. Collects unknown_ids — IDs passed in that are absent from find_by_ids() result
  2. Routes them to self.repository.sync_entity_vectors_batch(unknown_ids) so the repository layer can delete orphaned search_index and search_vector_chunks rows
  3. Builds a merged VectorSyncBatchResult by summing across both the cleanup batch and the eligible-embedding batch, with opted_out_ids added into entities_skipped
  4. The decision-point comment (# Trigger / Why / Outcome) is exactly right

The result merging is also a nice refactor from the prior mutable batch_result.entities_total = ... mutation pattern to an immutable construction — clearer and less fragile.
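The immutable merge described above can be sketched as a fresh construction over both batch results. The dataclass and field names here are assumptions based on the counters mentioned in this review, not the PR's actual `VectorSyncBatchResult`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VectorSyncBatchResult:
    entities_total: int
    entities_skipped: int

def merge_results(
    repository_results: list[VectorSyncBatchResult], opted_out: int
) -> VectorSyncBatchResult:
    # Build a fresh result rather than mutating one batch's counters in place:
    # opted-out entities are folded into the skipped count.
    return VectorSyncBatchResult(
        entities_total=sum(r.entities_total for r in repository_results),
        entities_skipped=opted_out
        + sum(r.entities_skipped for r in repository_results),
    )

merged = merge_results(
    [VectorSyncBatchResult(2, 0), VectorSyncBatchResult(5, 1)], opted_out=3
)
print(merged.entities_total, merged.entities_skipped)  # → 7 4
```

Because the dataclass is frozen, an accidental in-place update raises immediately instead of silently corrupting the stats.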


New Issues Found

🟢 Minor: unknown_ids cleanup and eligible_entity_ids embedding are sequential

Both repository.sync_entity_vectors_batch(unknown_ids) and repository.sync_entity_vectors_batch(eligible_entity_ids, ...) are await-ed in sequence. Since neither depends on the other's result, they could be done concurrently:

tasks = []
if unknown_ids:
    tasks.append(self.repository.sync_entity_vectors_batch(unknown_ids))
if eligible_entity_ids:
    tasks.append(self.repository.sync_entity_vectors_batch(
        eligible_entity_ids, progress_callback=progress_callback))
results = await asyncio.gather(*tasks)

That said, unknown_ids is empty in every real call site (reindex_vectors seeds from find_all(), BatchIndexer uses freshly-upserted IDs), so the practical impact is zero. Not a blocker.


🟢 Minor: entities_skipped double-counts repository skips for unknown IDs

When unknown_ids are routed through repository.sync_entity_vectors_batch, the repository finds no source rows for those IDs and likely reports them as entities_skipped in its result. The merge aggregation adds len(opted_out_ids) + sum(result.entities_skipped ...), which means unknown entities count as both "deleted" and "skipped". This is a subtle accounting edge case with no practical impact given the real call sites — but worth noting if the entities_skipped field is ever surfaced in a user-visible stats page.
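Plugging hypothetical numbers into the aggregation shows the double count; the values below are illustrative, not taken from the PR:

```python
# Two stale IDs are routed to the cleanup batch; assume the repository
# reports them as skipped because they have no source rows to embed.
opted_out_ids: list[int] = []   # no opt-outs in this sync
unknown_ids = [7, 8]            # stale IDs, already deleted as entities
repository_skipped = [2, 0]     # cleanup batch skips both; eligible batch skips none

entities_skipped = len(opted_out_ids) + sum(repository_skipped)
# The two unknown entities were already counted as cleaned-up deletions by
# the cleanup batch, yet they inflate entities_skipped here as well.
print(entities_skipped)  # → 2
```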


All Previously-Open Issues

Severity Issue Status
🔴 progress_callback dead code ✅ Resolved (49a40be)
🔴 get_file_path_to_permalink_map() per-batch ✅ Resolved (49a40be)
🟡 Sequential fetch in _resolve_batch_relations ✅ Resolved (49a40be)
🟡 existing is None post-mutation ✅ Resolved (49a40be)
🟡 Missing test-int/ integration coverage ✅ Resolved (49a40be)
🟡 Dual identical result types without comment ✅ Resolved (49a40be)
🔴 P1 Codex: _run_bounded swallows SyncFatalError ✅ Resolved (49a40be)
🔴 P2 Codex: Checksum drift from scan time ✅ Resolved (49a40be)
🟡 remove_frontmatter redundant call ✅ Resolved (49a40be)
🔴 reindex_vectors bypasses embed: false opt-out ✅ Resolved (1207e8c)
🟡 Sequential _clear_entity_vectors loop ✅ Resolved (1207e8c)
🟡 Dead except FileNotFoundError handler ✅ Resolved (1207e8c)
🟡 Startup backfill bypasses opt-out ✅ Resolved (fe9fc7e — backfill removed entirely)
Codex P2: sync_entity_vectors_batch drops stale IDs ✅ Resolved (a83bee9)

Checklist Results

Category Status Notes
Python 3.12+ type annotations list[VectorSyncBatchResult] typed correctly
Async patterns / SQLAlchemy 2.0 Correct; sequential awaits are fine given real call sites
100-char line length prepare_seconds_total=sum(...) line at 90 chars
No commented-out code
Unit tests for new behavior test_semantic_vector_sync_batch_cleans_up_unknown_ids validates both paths called, counts correct
Integration tests Prior test-int/ coverage intact
100% test coverage Unknown-IDs path now exercised
Result merging correctness Immutable construction replaces mutable mutation; cleaner
opted_out_ids correctly excluded from eligible_entity_ids
progress_callback only forwarded to eligible batch Unknown cleanup call correctly omits it
No hardcoded secrets
No sensitive data in logs

Overall: All 🔴 issues are closed. No new 🔴 or 🟡 issues introduced. The two 🟢 observations above are non-blocking style/perf nits with zero practical impact at current call sites. This PR is ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@phernandez phernandez merged commit 540da41 into main Apr 8, 2026
38 of 41 checks passed
@phernandez phernandez deleted the perf/sync-batch-indexing-core branch April 8, 2026 06:21

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1fee913950


Comment on lines +511 to +515
entities_skipped=(
len(opted_out_ids)
+ sum(result.entities_skipped for result in repository_results)
- len(unknown_ids)
),


P2 Badge Do not subtract unknown IDs from skipped vector count

When entity_ids contains stale/unknown IDs (for example, a sync race where an entity is deleted before embedding), repository.sync_entity_vectors_batch(unknown_ids) performs cleanup and typically reports entities_skipped == 0; subtracting len(unknown_ids) here can therefore make entities_skipped negative. This produces invalid embedding stats in sync/reindex reporting and can mislead operators about what was actually skipped.
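With hypothetical values where the cleanup batch reports zero skips, the subtraction goes negative; the numbers are illustrative, not taken from the PR:

```python
opted_out_ids: list[int] = []   # no opt-outs this round
unknown_ids = [7, 8]            # stale IDs: cleaned up, reported as skipped == 0
repository_skipped = [0, 0]     # neither batch reports any skips

entities_skipped = (
    len(opted_out_ids) + sum(repository_skipped) - len(unknown_ids)
)
print(entities_skipped)  # → -2, an invalid stat
```

Clamping with `max(0, ...)` would hide the bug rather than fix it; dropping the subtraction, as the Codex comment suggests, keeps the count honest.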

