Reliability hardening suite#261
Conversation
Add typed timeout errors for LLM and embedding calls and wrap async provider operations with asyncio.wait_for. Cover base, LiteLLM, and OpenRouter async paths with regression tests.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add typed wrapping and error-level logging around high-risk broad exception paths while preserving debug tracebacks. Cover connection, provider, loader, pipeline, retrieval, and history validation error behavior.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add regression tests for async context manager cleanup, close failure propagation, and inner-exception preservation.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a typed latency budget error and enforce Context budgets before retrieval phases, helper I/O, graph config probes, Cypher calls, and completion LLM calls. Cover propagation and phase gating with regression tests.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an explicit integration marker, run marked real-FalkorDB tests in CI, document docker-compose usage, and expose the FalkorDB browser port in local compose.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validate Python distributions before trusted PyPI publishing, upload release artifacts, enable manual docs deploys, and add Dependabot coverage for actions and Python dependencies.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR implements a comprehensive latency budget enforcement system across the SDK. It introduces new exception types for timeout and budget scenarios, adds timeout parameters throughout provider APIs, integrates budget checkpoints into all retrieval strategies, and strengthens error handling throughout the codebase. Supporting changes include CI/infrastructure updates for integration testing and release automation. ChangesLatency Budget Enforcement System
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Deep-dive review of the reliability hardening suite. Solid intent and real value, but three load-bearing issues need addressing before merge. Posted as line comments — summary below.
Blocking (Critical):
- C1:
connection.pysilently wraps every DB error inDatabaseError— breaking change for any downstreamexcept falkordb.<Specific>Errorclause, andFalkorDBConnectionis a re-exported public class. - C2: the latency budget is never wired into provider
timeout=. Once a slow LLM call is in flight, the budget can fly past the threshold and nothing aborts it. - C3:
aembed_documents(..., timeout=10)applies the timeout per batch (and per binary-split sub-call), not as an overall deadline — the public contract reads as 10s but wall-clock total can be 50s+.
Wins worth keeping: the MENTIONED_IN cosine-rank merge from #259, the history-validation refactor in api/main.py (error messages were wrong before, are right now), the secret-leakage test in test_providers.py, the PdfLoader open/finally fix. Tests pass cleanly (393/393 on touched files).
| exc, | ||
| ) | ||
| logger.debug("Non-transient FalkorDB query failure details", exc_info=True) | ||
| raise DatabaseError(f"FalkorDB query failed: {exc}") from exc |
There was a problem hiding this comment.
🔴 C1 — Breaking change to a public class, undocumented.
Previously the original exception (falkordb.ResponseError, redis.ConnectionError, etc.) propagated untouched. Now every non-transient error is wrapped in DatabaseError. Downstream code like:
try:
await conn.query(...)
except falkordb.ResponseError as e: # ← silently stops matching
if 'index' in str(e): ...…breaks. FalkorDBConnection is re-exported from graphrag_sdk/__init__.py:21, so this is a public-class contract change.
Either (a) preserve the original type (just raise it untouched and emit the structured ERROR log alongside), or (b) advertise as breaking — major bump + CHANGELOG migration note. As written it's an opaque API change buried in a 1300-line PR.
| "FalkorDB query failure details", | ||
| exc_info=(type(last_exc), last_exc, last_exc.__traceback__), | ||
| ) | ||
| raise DatabaseError(f"FalkorDB query failed: {last_exc}") from last_exc |
There was a problem hiding this comment.
🔴 C1 (same issue) — retries-exhausted path.
raise last_exc → raise DatabaseError(...) from last_exc is the same contract change in the transient-retry-exhausted path. Same fix applies: re-raise the original.
| ctx.log(f"MultiPath [1/9]: {len(all_keywords)} keywords extracted") | ||
|
|
||
| # 2. Embed question only | ||
| ctx.ensure_budget("MultiPath question embedding") |
There was a problem hiding this comment.
🔴 C2 — Budget never reaches the network call.
This (and ~25 other ctx.ensure_budget(...) sites) only check at phase boundaries. Once await self._embedder.aembed_query(query) is in flight with timeout=None, an unresponsive provider can hang for minutes — the budget check on the next phase only notices after the call returns.
This is the load-bearing missing wire of the whole PR. One-liner fix at each provider call site:
remaining_s = (ctx.remaining_budget_ms or 0) / 1000.0 or None
query_vector = await self._embedder.aembed_query(query, timeout=remaining_s)The timeout= plumbing is already in place on every provider method — just connect the two halves. Without this, the PR enforces a 'phase-boundary budget,' which is useful but not what the description claims.
| batch = texts[start : start + self.batch_size] | ||
| results.extend(await binary_split_retry_async(self._raw_embed_async, batch, **kwargs)) | ||
| results.extend( | ||
| await binary_split_retry_async( |
There was a problem hiding this comment.
🔴 C3 — timeout is per-batch, not a deadline.
With aembed_documents(texts=[100], timeout=10) and batch_size=20, the public API reads as 10s, but wall-clock total can reach 50s+ (5 batches × 10s), plus extra sub-calls from binary_split_retry_async. The new test test_aembed_documents_timeout_is_not_binary_split only proves 'no retry on timeout' on a single batch.
Options:
# Option A: rename to clarify
async def aembed_documents(self, texts, *, per_batch_timeout=None, ...):
# Option B: implement an actual deadline
deadline = time.monotonic() + timeout if timeout else None
for start in range(0, len(texts), self.batch_size):
remaining = deadline - time.monotonic() if deadline else None
if remaining is not None and remaining <= 0:
raise EmbeddingTimeoutError(...)
await binary_split_retry_async(..., timeout=remaining)| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| async def _wait_for_provider_call( |
There was a problem hiding this comment.
🟠 H1 — _-prefixed helpers imported across modules.
Leading underscore = module-private in Python. But litellm.py:21 and openrouter.py:21 both do:
from graphrag_sdk.core.providers.base import (
_validate_timeout,
_wait_for_provider_call,
)Reads as a smell every time. Either move to a providers/_timeout.py private module (cross-module use stays internal to the package), or drop the underscore and treat them as a documented internal contract.
| **kwargs, | ||
| ) | ||
|
|
||
| async def astream(self, prompt: str, **kwargs: Any) -> AsyncIterator[str]: |
There was a problem hiding this comment.
🟡 M2 — astream has no timeout= parameter.
async def astream(self, prompt: str, **kwargs: Any) -> AsyncIterator[str]:PR claims 'enforce LLM and embedding timeouts' but streaming is exempt. Streaming hangs are a real failure mode — a server can dribble tokens forever or stall mid-stream. Either expose timeout= here (and apply per-chunk), or document the exemption.
| logger.debug("Failed to write graph config node", exc_info=True) | ||
|
|
||
| async def _validate_graph_config(self) -> None: | ||
| async def _validate_graph_config(self, ctx: Context | None = None) -> None: |
There was a problem hiding this comment.
🟡 M3 — Positional ctx adds subclass-break risk.
async def _validate_graph_config(self, ctx: Context | None = None) -> None:This is a private method by name but it's on a public class. Any subclass overriding _validate_graph_config(self) will break silently if a future caller passes ctx positionally. Cheap fix:
async def _validate_graph_config(self, *, ctx: Context | None = None) -> None:Keyword-only matches the rest of the codebase's *, ctx=... convention.
| if operation == "graph config query": | ||
| raise LatencyBudgetExceededError("budget exhausted before config query") | ||
|
|
||
| ctx.ensure_budget = exhaust_budget # type: ignore[method-assign] |
There was a problem hiding this comment.
🟡 M4 — Monkey-patching ctx.ensure_budget per-instance is brittle.
ctx.ensure_budget = exhaust_budget # type: ignore[method-assign]Also at test_retrieval.py:179, 195 and similar pattern in test_multi_path_retrieval.py. If ensure_budget becomes a cached_property or moves to a parent class, these tests silently no-op (the patch lands on the instance, the call site reaches the class method).
Cleaner:
monkeypatch.setattr(Context, 'ensure_budget', exhaust_budget)
# or a Context subclass with overridden behavior| github-actions: | ||
| patterns: | ||
| - "*" | ||
|
|
There was a problem hiding this comment.
🟡 M6 — Verify dependabot pip ecosystem works with PEP-621 pyproject.toml.
- package-ecosystem: "pip"
directory: "/graphrag_sdk"Dependabot's pip parser handles [project.dependencies] but has known gaps with [project.optional-dependencies] groups (it tends to only update [project.dependencies] and miss the dev/extras extras). Worth a dry run before relying on it, or switch to uv ecosystem if your uv.lock is the source of truth.
| image: falkordb/falkordb:v4.18.0 | ||
| ports: | ||
| - "6379:6379" | ||
| - "3000:3000" |
There was a problem hiding this comment.
🟡 M7 — Browser UI binds to all interfaces by default.
ports:
- "3000:3000"Fine on a laptop, surprising on a shared dev host or CI runner — exposes the FalkorDB Browser UI on 0.0.0.0:3000. Safer default:
ports:
- "127.0.0.1:3000:3000"Or at minimum a one-line note in CONTRIBUTING.md so contributors aren't surprised.
Summary
Review
Testing
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores
Tests