Reliability hardening suite by Naseem77 · Pull Request #261 · FalkorDB/GraphRAG-SDK

Naseem77 · 2026-05-18T16:54:49Z

Summary

enforce LLM and embedding timeouts with typed timeout errors
improve error visibility for broad exception handling paths
cover GraphRAG async context manager cleanup behavior
enforce latency budgets across retrieval and LLM phases
mark and wire real FalkorDB integration tests
tighten release automation for docs, PyPI checks, artifacts, and Dependabot

Review

ran the local review process after each task and only committed approved fixes

Testing

targeted provider, facade, retrieval, integration-marker, YAML, and build checks were run during each task

Summary by CodeRabbit

New Features
- Timeout support for LLM and embedding calls; latency-budget enforcement across retrieval flows; new latency/timeout error types.
Bug Fixes
- More robust loader/database error logging and typed error handling.
Documentation
- Updated development setup to use Docker Compose and guidance for running integration tests.
Chores
- CI/CD/workflow improvements (publish artifacts, docs manual trigger) and Dependabot weekly updates.
Tests
- Expanded test coverage for timeouts, budgets, and related behaviors.

Add typed timeout errors for LLM and embedding calls and wrap async provider operations with asyncio.wait_for. Cover base, LiteLLM, and OpenRouter async paths with regression tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
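The commit message above describes wrapping async provider calls with asyncio.wait_for and surfacing a typed error. A minimal sketch of that pattern follows; `validate_timeout`, `wait_for_call`, and the fake LLM coroutines are hypothetical names for illustration, and the local `LLMTimeoutError` class is a stand-in, not the SDK's actual class.

```python
import asyncio
from typing import Any, Awaitable, Optional


class LLMTimeoutError(Exception):
    """Stand-in for the SDK's typed LLM timeout error (assumed shape)."""


def validate_timeout(timeout: Optional[float]) -> Optional[float]:
    """Accept None (no limit) or a positive number of seconds."""
    if timeout is None:
        return None
    if timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return float(timeout)


async def wait_for_call(call: Awaitable[Any], timeout: Optional[float], operation: str) -> Any:
    """Await `call`; map asyncio's timeout to the typed error."""
    try:
        return await asyncio.wait_for(call, timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise LLMTimeoutError(f"{operation} timed out after {timeout}s") from exc


async def fast_llm(prompt: str) -> str:
    # Hypothetical provider call that returns promptly.
    await asyncio.sleep(0)
    return prompt.upper()


async def slow_llm(prompt: str) -> str:
    # Hypothetical provider call that stalls for one second.
    await asyncio.sleep(1.0)
    return prompt.upper()
```

Passing `timeout=None` leaves the call unbounded, which is why the budget discussion later in this thread matters.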

Add typed wrapping and error-level logging around high-risk broad exception paths while preserving debug tracebacks. Cover connection, provider, loader, pipeline, retrieval, and history validation error behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add regression tests for async context manager cleanup, close failure propagation, and inner-exception preservation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
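The three behaviors named in this commit message can be illustrated with a small stand-in class. This is only one plausible reading, under the assumption that an exception raised inside the `async with` body takes precedence over a failure in `close()`; `FakeGraphRAG` is hypothetical and is not the SDK's facade.

```python
import asyncio


class FakeGraphRAG:
    """Hypothetical stand-in for the facade; not the SDK's class."""

    def __init__(self, fail_on_close: bool = False) -> None:
        self.fail_on_close = fail_on_close
        self.closed = False

    async def close(self) -> None:
        self.closed = True
        if self.fail_on_close:
            raise RuntimeError("close failed")

    async def __aenter__(self) -> "FakeGraphRAG":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        try:
            await self.close()
        except Exception:
            if exc is None:
                raise  # body succeeded: the close failure propagates
            # Body already failed: assume the inner exception wins.
        return False  # never suppress the body's exception


async def body_fails(rag: FakeGraphRAG) -> None:
    async with rag:
        raise ValueError("inner")


async def body_ok(rag: FakeGraphRAG) -> str:
    async with rag:
        return "ok"
```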

Add a typed latency budget error and enforce Context budgets before retrieval phases, helper I/O, graph config probes, Cypher calls, and completion LLM calls. Cover propagation and phase gating with regression tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
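A rough sketch of what a phase-gating budget check could look like is shown below. The names `ensure_budget`, `remaining_budget_ms`, and `LatencyBudgetExceededError` appear in this PR; the `latency_budget_ms` field, the monotonic-clock bookkeeping, and the error message are assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


class LatencyBudgetExceededError(Exception):
    """Stand-in for the SDK's typed budget error."""


@dataclass
class Context:
    # Hypothetical field name: total budget in milliseconds, None = unlimited.
    latency_budget_ms: Optional[float] = None
    _start: float = field(default_factory=time.monotonic)

    @property
    def remaining_budget_ms(self) -> Optional[float]:
        if self.latency_budget_ms is None:
            return None
        elapsed_ms = (time.monotonic() - self._start) * 1000.0
        return max(0.0, self.latency_budget_ms - elapsed_ms)

    def ensure_budget(self, operation: str) -> None:
        """Raise before starting `operation` if no budget is left."""
        remaining = self.remaining_budget_ms
        if remaining is not None and remaining <= 0:
            raise LatencyBudgetExceededError(
                f"latency budget exhausted before {operation}"
            )
```

A check like this only fires between phases; it cannot interrupt a call that is already awaiting a response.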

Add an explicit integration marker, run marked real-FalkorDB tests in CI, document docker-compose usage, and expose the FalkorDB browser port in local compose.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Validate Python distributions before trusted PyPI publishing, upload release artifacts, enable manual docs deploys, and add Dependabot coverage for actions and Python dependencies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai · 2026-05-18T16:55:04Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4c39b1a3-35ea-4dee-a0dc-5eefa0c9d43d

📥 Commits

Reviewing files that changed from the base of the PR and between 9dfa72b and 3eca0b4.

📒 Files selected for processing (1)

graphrag_sdk/tests/test_facade.py

🚧 Files skipped from review as they are similar to previous changes (1)

graphrag_sdk/tests/test_facade.py

Walkthrough

This PR implements a comprehensive latency budget enforcement system across the SDK. It introduces new exception types for timeout and budget scenarios, adds timeout parameters throughout provider APIs, integrates budget checkpoints into all retrieval strategies, and strengthens error handling throughout the codebase. Supporting changes include CI/infrastructure updates for integration testing and release automation.

Changes

Latency Budget Enforcement System

| Layer / File(s) | Summary |
|---|---|
| Exception hierarchy and context budget types `graphrag_sdk/src/graphrag_sdk/core/exceptions.py`, `graphrag_sdk/src/graphrag_sdk/core/context.py`, `graphrag_sdk/src/graphrag_sdk/__init__.py` | `LatencyBudgetExceededError`, `LLMTimeoutError`, and `EmbeddingTimeoutError` are added to the exception hierarchy. `Context.ensure_budget(operation: str)` method checks remaining latency budget and raises the typed exception when exhausted. The new exception is exported in package `__all__`. |
| Provider base timeout infrastructure `graphrag_sdk/src/graphrag_sdk/core/providers/base.py`, `graphrag_sdk/src/graphrag_sdk/core/providers/_retry.py` | Shared `_validate_timeout()` and `_wait_for_provider_call()` helpers handle timeout enforcement across providers. `Embedder.aembed_query` and `aembed_documents` accept optional `timeout` and wrap sync implementations with timeout guards raising `EmbeddingTimeoutError`. `LLMInterface.ainvoke`, `ainvoke_messages`, `ainvoke_with_model`, and `abatch_invoke` accept optional `timeout` and use the timeout wrapper for re-raising `LLMTimeoutError` while allowing other exceptions through retry loops. Retry logic for embeddings explicitly re-raises `EmbeddingTimeoutError`. |
| LiteLLM and OpenRouter timeout implementation `graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py`, `graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py` | Both providers add `timeout` parameters to `ainvoke`, `ainvoke_messages`, and embedder methods. Timeout calls are wrapped with `_wait_for_provider_call`, mapping expiry to typed timeout errors. Post-retry error logging summarizes exceptions without leaking secrets. |
| Retrieval strategies context propagation `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/chunk_retrieval.py`, `cypher_generation.py`, `entity_discovery.py`, `local.py`, `multi_path.py`, `relationship_expansion.py`, `result_assembly.py`, `base.py` | All retrieval functions accept optional `Context` parameter. `ctx.ensure_budget()` is called before each major async operation (graph queries, vector searches, LLM calls). `LatencyBudgetExceededError` is explicitly re-raised without wrapping in retrieval errors. `MultiPathRetrieval._execute` orchestrates budget enforcement across all phases and handles budget failures distinctly in parallel paths. |
| GraphRAG facade integration `graphrag_sdk/src/graphrag_sdk/api/main.py` | `retrieve()` validates graph config and passes `Context` with budget checkpoints for config validation, retrieval strategy search, and reranking. `completion()` adds budget checkpoints before question rewriting, retrieval, and final LLM call. `_validate_graph_config()` now accepts optional `ctx` and enforces budgets around graph-config queries and embedder dimension probes. History validation stricter for dict entries: validates role values and requires `content` to be string. |
| Database and error handling standardization `graphrag_sdk/src/graphrag_sdk/core/connection.py`, `graphrag_sdk/src/graphrag_sdk/ingestion/loaders/*.py`, `graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py`, `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/base.py` | FalkorDB queries now raise `DatabaseError` on both non-transient failures and after retry exhaustion, with detailed logging including attempt count. Loaders (PDF/text) and ingestion pipeline add structured error logging (error message + debug stack trace) before raising typed exceptions. Retrieval strategy base class explicitly handles `LatencyBudgetExceededError` separately from generic exceptions. |
| CI, Docker, and infrastructure `.github/dependabot.yml`, `.github/workflows/ci.yml`, `.github/workflows/docs.yml`, `.github/workflows/pypi-publish.yaml`, `docker-compose.yml`, `graphrag_sdk/pyproject.toml`, `CONTRIBUTING.md` | Dependabot enabled for GitHub Actions and Python dependencies with weekly schedule. CI job runs real FalkorDB integration tests via pytest marker. Docs workflow adds manual dispatch trigger. PyPI workflow installs `twine`, runs build validation, uploads artifacts. Docker Compose exposes port 3000 and adds healthcheck start period. Contributing docs updated with Docker Compose FalkorDB setup and integration test instructions. Pytest integration marker added. |
| Comprehensive test coverage `graphrag_sdk/tests/test_*.py` | Tests verify exception hierarchy, `Context.ensure_budget()` behavior, provider timeout handling (LiteLLM/OpenRouter), retrieval strategy budget enforcement, facade integration with multiple budget checkpoints, database error typing, loader error logging, pipeline exception wrapping, and integration test classification. Tests assert which downstream calls are skipped when budget exhausted early. |
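The walkthrough says retrieval strategies re-raise `LatencyBudgetExceededError` without wrapping it, and that parallel paths treat budget failures distinctly. The sketch below shows one way that ordering of `except` clauses might look; `RetrievalError`, `run_strategy`, `run_paths`, and the fake path coroutines are hypothetical names, and tolerating non-budget failures in parallel paths is an assumption, not confirmed SDK behavior.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


class LatencyBudgetExceededError(Exception):
    """Stand-in for the SDK's typed budget error."""


class RetrievalError(Exception):
    """Hypothetical wrapper for generic retrieval failures."""


async def run_strategy(search: Callable[[], Awaitable[Any]]) -> Any:
    try:
        return await search()
    except LatencyBudgetExceededError:
        raise  # budget errors pass through untouched
    except Exception as exc:
        raise RetrievalError(f"retrieval failed: {exc}") from exc


async def run_paths(*paths: Callable[[], Awaitable[Any]]) -> List[Any]:
    # Collect every outcome so one failing path does not cancel the others.
    results = await asyncio.gather(*(p() for p in paths), return_exceptions=True)
    for result in results:
        if isinstance(result, LatencyBudgetExceededError):
            raise result  # a budget failure aborts the whole retrieval
    # Assumption: other failures are dropped and remaining paths still count.
    return [r for r in results if not isinstance(r, BaseException)]


async def ok_path() -> List[str]:
    return ["chunk"]


async def broken_path() -> List[str]:
    raise ValueError("db down")


async def over_budget_path() -> List[str]:
    raise LatencyBudgetExceededError("budget exhausted")
```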

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Production Hardening & Release Readiness #257: This PR directly implements Production Hardening items from the roadmap including typed LLM/embedding timeouts, latency-budget system with Context.ensure_budget() and LatencyBudgetExceededError, improved error logging with stack traces, and CI/release automation.

Possibly related PRs

FalkorDB/GraphRAG-SDK#253: Modifies GraphRAG.completion flow; related surface-area with this PR's facade/completion changes.
FalkorDB/GraphRAG-SDK#247: Previously adjusted the integration CI job; closely related to this PR’s CI integration-test change.

Poem

🐰 Whiskers twitch with budget delight,

Timeouts counted in millisecond flight,
Retrievals pause when limits are near,
Errors logged clearly, nothing to fear,
Hops and tests align — the rabbit cheers!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.47% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Reliability hardening suite' is directly related to the main objectives of the PR, which include timeout enforcement, error handling improvements, latency budget enforcement, and enhanced test coverage. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


galshubeli

Deep-dive review of the reliability hardening suite. Solid intent and real value, but three load-bearing issues need addressing before merge. Posted as line comments — summary below.

Blocking (Critical):

C1: connection.py silently wraps every DB error in DatabaseError. This is a breaking change for any downstream code with an except falkordb.<Specific>Error clause, and FalkorDBConnection is a re-exported public class.
C2: the latency budget is never wired into provider timeout=. Once a slow LLM call is in flight, the budget can fly past the threshold and nothing aborts it.
C3: aembed_documents(..., timeout=10) applies the timeout per batch (and per binary-split sub-call), not as an overall deadline — the public contract reads as 10s but wall-clock total can be 50s+.

Wins worth keeping: the MENTIONED_IN cosine-rank merge from #259, the history-validation refactor in api/main.py (error messages were wrong before, are right now), the secret-leakage test in test_providers.py, the PdfLoader open/finally fix. Tests pass cleanly (393/393 on touched files).

galshubeli · 2026-05-19T08:46:23Z

+                        exc,
+                    )
+                    logger.debug("Non-transient FalkorDB query failure details", exc_info=True)
+                    raise DatabaseError(f"FalkorDB query failed: {exc}") from exc


🔴 C1 — Breaking change to a public class, undocumented.

Previously the original exception (falkordb.ResponseError, redis.ConnectionError, etc.) propagated untouched. Now every non-transient error is wrapped in DatabaseError. Downstream code like:

    try:
        await conn.query(...)
    except falkordb.ResponseError as e:  # ← silently stops matching
        if 'index' in str(e):
            ...

…breaks. FalkorDBConnection is re-exported from graphrag_sdk/__init__.py:21, so this is a public-class contract change.

Either (a) preserve the original type (just raise it untouched and emit the structured ERROR log alongside), or (b) advertise as breaking — major bump + CHANGELOG migration note. As written it's an opaque API change buried in a 1300-line PR.
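Option (a) could be sketched as follows: log at ERROR, keep the traceback at DEBUG, and let the original exception propagate with a bare `raise`. `ResponseError` here is a local stand-in for a driver error such as falkordb.ResponseError, and `query_preserving_type`, `failing_query`, and `caller` are hypothetical names, not code from the PR.

```python
import asyncio
import logging

logger = logging.getLogger("connection_sketch")


class ResponseError(Exception):
    """Stand-in for a driver-specific error such as falkordb.ResponseError."""


async def query_preserving_type(run_query, cypher: str):
    try:
        return await run_query(cypher)
    except Exception as exc:
        # Structured logging without changing what callers can catch.
        logger.error("FalkorDB query failed: %s", exc)
        logger.debug("Non-transient FalkorDB query failure details", exc_info=True)
        raise  # bare raise keeps the original type and traceback


async def failing_query(cypher: str):
    # Hypothetical driver call that always fails.
    raise ResponseError("index already exists")


async def caller() -> bool:
    try:
        await query_preserving_type(failing_query, "CREATE INDEX ...")
    except ResponseError as e:  # still matches, unlike the wrapped version
        return "index" in str(e)
    return False
```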

galshubeli · 2026-05-19T08:46:24Z

+                "FalkorDB query failure details",
+                exc_info=(type(last_exc), last_exc, last_exc.__traceback__),
+            )
+        raise DatabaseError(f"FalkorDB query failed: {last_exc}") from last_exc


🔴 C1 (same issue) — retries-exhausted path.

Changing raise last_exc to raise DatabaseError(...) from last_exc is the same contract change, this time in the transient-retry-exhausted path. The same fix applies: re-raise the original.

galshubeli · 2026-05-19T08:46:24Z

        ctx.log(f"MultiPath [1/9]: {len(all_keywords)} keywords extracted")

        # 2. Embed question only
+        ctx.ensure_budget("MultiPath question embedding")


🔴 C2 — Budget never reaches the network call.

This (and ~25 other ctx.ensure_budget(...) sites) only check at phase boundaries. Once await self._embedder.aembed_query(query) is in flight with timeout=None, an unresponsive provider can hang for minutes — the budget check on the next phase only notices after the call returns.

This is the load-bearing missing wire of the whole PR. One-liner fix at each provider call site:

    remaining_s = (ctx.remaining_budget_ms or 0) / 1000.0 or None
    query_vector = await self._embedder.aembed_query(query, timeout=remaining_s)

The timeout= plumbing is already in place on every provider method — just connect the two halves. Without this, the PR enforces a 'phase-boundary budget,' which is useful but not what the description claims.
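A slightly more defensive variant of that wiring is sketched below. Note that the one-liner as written evaluates to `None` (no timeout) when the remaining budget is exactly 0; this sketch instead treats 0 as exhausted. `timeout_from_budget`, `embed_within_budget`, and `fake_aembed_query` are hypothetical helpers, and raising at this point rather than relying on a preceding `ensure_budget` call is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class LatencyBudgetExceededError(Exception):
    """Stand-in for the SDK's typed budget error."""


def timeout_from_budget(remaining_budget_ms: Optional[float]) -> Optional[float]:
    """Convert a remaining budget in ms into a provider timeout in seconds."""
    if remaining_budget_ms is None:
        return None  # no budget configured: leave the call unbounded
    if remaining_budget_ms <= 0:
        raise LatencyBudgetExceededError("latency budget exhausted before provider call")
    return remaining_budget_ms / 1000.0


async def embed_within_budget(
    aembed_query: Callable[..., Awaitable[Any]],
    query: str,
    remaining_budget_ms: Optional[float],
) -> Any:
    # The provider's own timeout= now bounds the in-flight call.
    timeout = timeout_from_budget(remaining_budget_ms)
    return await aembed_query(query, timeout=timeout)


async def fake_aembed_query(query: str, timeout: Optional[float] = None) -> dict:
    # Hypothetical embedder that just echoes what it received.
    return {"query": query, "timeout": timeout}
```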

galshubeli · 2026-05-19T08:46:24Z

            batch = texts[start : start + self.batch_size]
-            results.extend(await binary_split_retry_async(self._raw_embed_async, batch, **kwargs))
+            results.extend(
+                await binary_split_retry_async(


🔴 C3 — timeout is per-batch, not a deadline.

With aembed_documents(texts=[100], timeout=10) and batch_size=20, the public API reads as 10s, but wall-clock total can reach 50s+ (5 batches × 10s), plus extra sub-calls from binary_split_retry_async. The new test test_aembed_documents_timeout_is_not_binary_split only proves 'no retry on timeout' on a single batch.

Options:

    # Option A: rename to clarify
    async def aembed_documents(self, texts, *, per_batch_timeout=None, ...):

    # Option B: implement an actual deadline
    deadline = time.monotonic() + timeout if timeout else None
    for start in range(0, len(texts), self.batch_size):
        remaining = deadline - time.monotonic() if deadline else None
        if remaining is not None and remaining <= 0:
            raise EmbeddingTimeoutError(...)
        await binary_split_retry_async(..., timeout=remaining)
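Option B can be made concrete with a self-contained sketch in which a single deadline is shared by all batches, so each batch only gets whatever time is left. `embed_with_deadline` and `fake_embed_batch` are hypothetical, the binary-split retry is omitted for brevity, and the `EmbeddingTimeoutError` class is a local stand-in.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Optional


class EmbeddingTimeoutError(Exception):
    """Stand-in for the SDK's typed embedding timeout error."""


async def embed_with_deadline(
    embed_batch: Callable[[List[str]], Awaitable[List[List[float]]]],
    texts: List[str],
    batch_size: int,
    timeout: Optional[float] = None,
) -> List[List[float]]:
    # One deadline for the whole call, not one timeout per batch.
    deadline = None if timeout is None else time.monotonic() + timeout
    results: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        remaining = None if deadline is None else deadline - time.monotonic()
        if remaining is not None and remaining <= 0:
            raise EmbeddingTimeoutError("embedding deadline exceeded")
        batch = texts[start : start + batch_size]
        try:
            results.extend(await asyncio.wait_for(embed_batch(batch), timeout=remaining))
        except asyncio.TimeoutError as exc:
            raise EmbeddingTimeoutError("embedding deadline exceeded") from exc
    return results


async def fake_embed_batch(batch: List[str], delay: float = 0.0) -> List[List[float]]:
    # Hypothetical provider: one 1-d vector per text after an optional delay.
    await asyncio.sleep(delay)
    return [[float(len(text))] for text in batch]
```

With this shape, `timeout=10` bounds the wall-clock total regardless of how many batches the input is split into.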

galshubeli · 2026-05-19T08:46:24Z

 logger = logging.getLogger(__name__)


+async def _wait_for_provider_call(


🟠 H1 — _-prefixed helpers imported across modules.

Leading underscore = module-private in Python. But litellm.py:21 and openrouter.py:21 both do:

    from graphrag_sdk.core.providers.base import (
        _validate_timeout,
        _wait_for_provider_call,
    )

Reads as a smell every time. Either move to a providers/_timeout.py private module (cross-module use stays internal to the package), or drop the underscore and treat them as a documented internal contract.

galshubeli · 2026-05-19T08:46:24Z

+            **kwargs,
+        )

    async def astream(self, prompt: str, **kwargs: Any) -> AsyncIterator[str]:


🟡 M2 — astream has no timeout= parameter.

async def astream(self, prompt: str, **kwargs: Any) -> AsyncIterator[str]:

PR claims 'enforce LLM and embedding timeouts' but streaming is exempt. Streaming hangs are a real failure mode — a server can dribble tokens forever or stall mid-stream. Either expose timeout= here (and apply per-chunk), or document the exemption.
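A per-chunk timeout for streaming might look like the sketch below: each wait for the next token is bounded, so a stream that stalls mid-way fails fast while a healthy stream of any length is unaffected. `stream_with_chunk_timeout` and the two fake streams are hypothetical, and `LLMTimeoutError` is a local stand-in.

```python
import asyncio
from typing import AsyncIterator, List, Optional


class LLMTimeoutError(Exception):
    """Stand-in for the SDK's typed LLM timeout error."""


async def stream_with_chunk_timeout(
    stream: AsyncIterator[str], timeout: Optional[float]
) -> AsyncIterator[str]:
    iterator = stream.__aiter__()
    while True:
        try:
            # Bound the wait for each individual chunk.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=timeout)
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError as exc:
            raise LLMTimeoutError(f"no chunk received within {timeout}s") from exc
        yield chunk


async def healthy_stream() -> AsyncIterator[str]:
    for token in ("a", "b"):
        await asyncio.sleep(0)
        yield token


async def stalling_stream() -> AsyncIterator[str]:
    yield "a"
    await asyncio.sleep(1.0)  # simulated mid-stream stall
    yield "b"


async def collect(stream: AsyncIterator[str], timeout: Optional[float]) -> List[str]:
    return [chunk async for chunk in stream_with_chunk_timeout(stream, timeout)]
```

A per-chunk bound does not cap total stream duration; a server that dribbles tokens slowly but steadily would still need an overall deadline.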

galshubeli · 2026-05-19T08:46:24Z

            logger.debug("Failed to write graph config node", exc_info=True)

-    async def _validate_graph_config(self) -> None:
+    async def _validate_graph_config(self, ctx: Context | None = None) -> None:


🟡 M3 — Positional ctx adds subclass-break risk.

async def _validate_graph_config(self, ctx: Context | None = None) -> None:

This is a private method by name but it's on a public class. Any subclass overriding _validate_graph_config(self) will break silently if a future caller passes ctx positionally. Cheap fix:

async def _validate_graph_config(self, *, ctx: Context | None = None) -> None:

Keyword-only matches the rest of the codebase's *, ctx=... convention.
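The effect of the bare `*` can be seen in a small stand-alone example: with a keyword-only parameter, a positional `ctx` is rejected at the call site, so callers are forced to spell out `ctx=`. `Facade`, `Context`, and both method names are hypothetical stand-ins, not SDK code.

```python
from typing import Optional


class Context:
    """Hypothetical stand-in for the SDK's Context."""


class Facade:
    def validate_positional(self, ctx: Optional[Context] = None) -> str:
        # ctx may be passed positionally or by keyword.
        return "with ctx" if ctx is not None else "no ctx"

    def validate_keyword_only(self, *, ctx: Optional[Context] = None) -> str:
        # The bare * makes ctx keyword-only.
        return "with ctx" if ctx is not None else "no ctx"


def positional_call_rejected() -> bool:
    try:
        Facade().validate_keyword_only(Context())  # type: ignore[misc]
    except TypeError:
        return True
    return False
```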

galshubeli · 2026-05-19T08:46:24Z

+            if operation == "graph config query":
+                raise LatencyBudgetExceededError("budget exhausted before config query")
+
+        ctx.ensure_budget = exhaust_budget  # type: ignore[method-assign]


🟡 M4 — Monkey-patching ctx.ensure_budget per-instance is brittle.

ctx.ensure_budget = exhaust_budget # type: ignore[method-assign]

Also at test_retrieval.py:179, 195 and similar pattern in test_multi_path_retrieval.py. If ensure_budget becomes a cached_property or moves to a parent class, these tests silently no-op (the patch lands on the instance, the call site reaches the class method).

Cleaner:

    monkeypatch.setattr(Context, 'ensure_budget', exhaust_budget)
    # or a Context subclass with overridden behavior
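The same class-level patching idea can be shown without pytest by using the standard library's `unittest.mock.patch.object`, which also restores the original method on exit. The `Context` class, `exhaust_budget`, and `probe` below are simplified stand-ins for illustration, not the SDK's code or the PR's tests.

```python
from unittest import mock


class LatencyBudgetExceededError(Exception):
    """Stand-in for the SDK's typed budget error."""


class Context:
    """Hypothetical stand-in whose budget check never fails."""

    def ensure_budget(self, operation: str) -> None:
        return None


def exhaust_budget(self: Context, operation: str) -> None:
    # Replacement method: fail only for the phase under test.
    if operation == "graph config query":
        raise LatencyBudgetExceededError("budget exhausted before config query")


def probe(ctx: Context) -> str:
    try:
        ctx.ensure_budget("graph config query")
    except LatencyBudgetExceededError:
        return "blocked"
    return "ran"


# Patch on the class so every instance sees the replacement, then restore.
with mock.patch.object(Context, "ensure_budget", exhaust_budget):
    patched_outcome = probe(Context())
restored_outcome = probe(Context())
```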

galshubeli · 2026-05-19T08:46:24Z

+      github-actions:
+        patterns:
+          - "*"
+


🟡 M6 — Verify dependabot pip ecosystem works with PEP-621 pyproject.toml.

    - package-ecosystem: "pip"
      directory: "/graphrag_sdk"

Dependabot's pip parser handles [project.dependencies] but has known gaps with [project.optional-dependencies] groups (it tends to update only [project.dependencies] and miss the dev/extras groups). Worth a dry run before relying on it, or switch to the uv ecosystem if your uv.lock is the source of truth.

galshubeli · 2026-05-19T08:46:24Z

    image: falkordb/falkordb:v4.18.0
    ports:
      - "6379:6379"
+      - "3000:3000"


🟡 M7 — Browser UI binds to all interfaces by default.

    ports:
      - "3000:3000"

Fine on a laptop, surprising on a shared dev host or CI runner — exposes the FalkorDB Browser UI on 0.0.0.0:3000. Safer default:

    ports:
      - "127.0.0.1:3000:3000"

Or at minimum a one-line note in CONTRIBUTING.md so contributors aren't surprised.

Naseem77 added 6 commits May 18, 2026 16:54

Enforce provider async timeouts

a6f7fe8

Cover GraphRAG async context cleanup

7e7ec1d

Mark real FalkorDB integration tests

cd760dc

Tighten release automation

8002ada

github-code-quality Bot found potential problems May 18, 2026

Comment thread graphrag_sdk/tests/test_facade.py Fixed

Address facade context manager review warning

9dfa72b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-code-quality Bot found potential problems May 18, 2026

Comment thread graphrag_sdk/tests/test_facade.py Fixed

Address remaining facade test reachability warning

3eca0b4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Naseem77 requested a review from galshubeli May 18, 2026 17:19

Naseem77 linked an issue May 18, 2026 that may be closed by this pull request

Production Hardening & Release Readiness #257

Open

6 tasks

Naseem77 mentioned this pull request May 18, 2026

Production Hardening & Release Readiness #257

Open

6 tasks

galshubeli reviewed May 19, 2026

2 participants
