
Reliability hardening suite#261

Open
Naseem77 wants to merge 8 commits into
mainfrom
reliability-hardening-suite

Conversation

@Naseem77
Contributor

@Naseem77 Naseem77 commented May 18, 2026

Summary

  • enforce LLM and embedding timeouts with typed timeout errors
  • improve error visibility for broad exception handling paths
  • cover GraphRAG async context manager cleanup behavior
  • enforce latency budgets across retrieval and LLM phases
  • mark and wire real FalkorDB integration tests
  • tighten release automation for docs, PyPI checks, artifacts, and Dependabot

Review

  • ran the local review process after each task and only committed approved fixes

Testing

  • targeted provider, facade, retrieval, integration-marker, YAML, and build checks were run during each task

Summary by CodeRabbit

  • New Features

    • Timeout support for LLM and embedding calls; latency-budget enforcement across retrieval flows; new latency/timeout error types.
  • Bug Fixes

    • More robust loader/database error logging and typed error handling.
  • Documentation

    • Updated development setup to use Docker Compose and guidance for running integration tests.
  • Chores

    • CI/CD/workflow improvements (publish artifacts, docs manual trigger) and Dependabot weekly updates.
  • Tests

    • Expanded test coverage for timeouts, budgets, and related behaviors.

Naseem77 added 6 commits May 18, 2026 16:54
Add typed timeout errors for LLM and embedding calls and wrap async provider operations with asyncio.wait_for. Cover base, LiteLLM, and OpenRouter async paths with regression tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add typed wrapping and error-level logging around high-risk broad exception paths while preserving debug tracebacks. Cover connection, provider, loader, pipeline, retrieval, and history validation error behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add regression tests for async context manager cleanup, close failure propagation, and inner-exception preservation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a typed latency budget error and enforce Context budgets before retrieval phases, helper I/O, graph config probes, Cypher calls, and completion LLM calls. Cover propagation and phase gating with regression tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an explicit integration marker, run marked real-FalkorDB tests in CI, document docker-compose usage, and expose the FalkorDB browser port in local compose. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validate Python distributions before trusted PyPI publishing, upload release artifacts, enable manual docs deploys, and add Dependabot coverage for actions and Python dependencies. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@coderabbitai

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4c39b1a3-35ea-4dee-a0dc-5eefa0c9d43d

📥 Commits

Reviewing files that changed from the base of the PR and between 9dfa72b and 3eca0b4.

📒 Files selected for processing (1)
  • graphrag_sdk/tests/test_facade.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • graphrag_sdk/tests/test_facade.py

Walkthrough

This PR implements a comprehensive latency budget enforcement system across the SDK. It introduces new exception types for timeout and budget scenarios, adds timeout parameters throughout provider APIs, integrates budget checkpoints into all retrieval strategies, and strengthens error handling throughout the codebase. Supporting changes include CI/infrastructure updates for integration testing and release automation.

Changes

Latency Budget Enforcement System

| Layer | File(s) | Summary |
|---|---|---|
| Exception hierarchy and context budget types | graphrag_sdk/src/graphrag_sdk/core/exceptions.py, graphrag_sdk/src/graphrag_sdk/core/context.py, graphrag_sdk/src/graphrag_sdk/__init__.py | LatencyBudgetExceededError, LLMTimeoutError, and EmbeddingTimeoutError are added to the exception hierarchy. Context.ensure_budget(operation: str) method checks remaining latency budget and raises the typed exception when exhausted. The new exception is exported in package __all__. |
| Provider base timeout infrastructure | graphrag_sdk/src/graphrag_sdk/core/providers/base.py, graphrag_sdk/src/graphrag_sdk/core/providers/_retry.py | Shared _validate_timeout() and _wait_for_provider_call() helpers handle timeout enforcement across providers. Embedder.aembed_query and aembed_documents accept optional timeout and wrap sync implementations with timeout guards raising EmbeddingTimeoutError. LLMInterface.ainvoke, ainvoke_messages, ainvoke_with_model, and abatch_invoke accept optional timeout and use the timeout wrapper for re-raising LLMTimeoutError while allowing other exceptions through retry loops. Retry logic for embeddings explicitly re-raises EmbeddingTimeoutError. |
| LiteLLM and OpenRouter timeout implementation | graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py, graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py | Both providers add timeout parameters to ainvoke, ainvoke_messages, and embedder methods. Timeout calls are wrapped with _wait_for_provider_call, mapping expiry to typed timeout errors. Post-retry error logging summarizes exceptions without leaking secrets. |
| Retrieval strategies context propagation | graphrag_sdk/src/graphrag_sdk/retrieval/strategies/chunk_retrieval.py, cypher_generation.py, entity_discovery.py, local.py, multi_path.py, relationship_expansion.py, result_assembly.py, base.py | All retrieval functions accept optional Context parameter. ctx.ensure_budget() is called before each major async operation (graph queries, vector searches, LLM calls). LatencyBudgetExceededError is explicitly re-raised without wrapping in retrieval errors. MultiPathRetrieval._execute orchestrates budget enforcement across all phases and handles budget failures distinctly in parallel paths. |
| GraphRAG facade integration | graphrag_sdk/src/graphrag_sdk/api/main.py | retrieve() validates graph config and passes Context with budget checkpoints for config validation, retrieval strategy search, and reranking. completion() adds budget checkpoints before question rewriting, retrieval, and final LLM call. _validate_graph_config() now accepts optional ctx and enforces budgets around graph-config queries and embedder dimension probes. History validation stricter for dict entries: validates role values and requires content to be string. |
| Database and error handling standardization | graphrag_sdk/src/graphrag_sdk/core/connection.py, graphrag_sdk/src/graphrag_sdk/ingestion/loaders/*.py, graphrag_sdk/src/graphrag_sdk/ingestion/pipeline.py, graphrag_sdk/src/graphrag_sdk/retrieval/strategies/base.py | FalkorDB queries now raise DatabaseError on both non-transient failures and after retry exhaustion, with detailed logging including attempt count. Loaders (PDF/text) and ingestion pipeline add structured error logging (error message + debug stack trace) before raising typed exceptions. Retrieval strategy base class explicitly handles LatencyBudgetExceededError separately from generic exceptions. |
| CI, Docker, and infrastructure | .github/dependabot.yml, .github/workflows/ci.yml, .github/workflows/docs.yml, .github/workflows/pypi-publish.yaml, docker-compose.yml, graphrag_sdk/pyproject.toml, CONTRIBUTING.md | Dependabot enabled for GitHub Actions and Python dependencies with weekly schedule. CI job runs real FalkorDB integration tests via pytest marker. Docs workflow adds manual dispatch trigger. PyPI workflow installs twine, runs build validation, uploads artifacts. Docker Compose exposes port 3000 and adds healthcheck start period. Contributing docs updated with Docker Compose FalkorDB setup and integration test instructions. Pytest integration marker added. |
| Comprehensive test coverage | graphrag_sdk/tests/test_*.py | Tests verify exception hierarchy, Context.ensure_budget() behavior, provider timeout handling (LiteLLM/OpenRouter), retrieval strategy budget enforcement, facade integration with multiple budget checkpoints, database error typing, loader error logging, pipeline exception wrapping, and integration test classification. Tests assert which downstream calls are skipped when budget exhausted early. |
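
The Context.ensure_budget(operation) checkpoint described in the table might behave roughly like the sketch below. BudgetContext, its constructor argument, and the error message are hypothetical stand-ins written for illustration; the SDK's real Context and exception hierarchy may differ.

```python
from __future__ import annotations

import time


class LatencyBudgetExceededError(Exception):
    """Stand-in for the SDK's typed budget error (real hierarchy not shown)."""


class BudgetContext:
    """Hypothetical stand-in for the SDK's Context; only budget logic is sketched."""

    def __init__(self, latency_budget_ms: float | None = None) -> None:
        self._budget_ms = latency_budget_ms
        self._started = time.monotonic()

    @property
    def remaining_budget_ms(self) -> float | None:
        # None means no budget was configured, so nothing is enforced.
        if self._budget_ms is None:
            return None
        elapsed_ms = (time.monotonic() - self._started) * 1000.0
        return self._budget_ms - elapsed_ms

    def ensure_budget(self, operation: str) -> None:
        # Checkpoint: raise before starting a phase if the budget is already spent.
        remaining = self.remaining_budget_ms
        if remaining is not None and remaining <= 0:
            raise LatencyBudgetExceededError(
                f"Latency budget exhausted before {operation}"
            )
```

A checkpoint of this shape only runs between phases, so it cannot interrupt a call that is already in flight.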

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • Production Hardening & Release Readiness #257: This PR directly implements Production Hardening items from the roadmap including typed LLM/embedding timeouts, latency-budget system with Context.ensure_budget() and LatencyBudgetExceededError, improved error logging with stack traces, and CI/release automation.

Possibly related PRs

  • FalkorDB/GraphRAG-SDK#253: Modifies GraphRAG.completion flow; related surface-area with this PR's facade/completion changes.
  • FalkorDB/GraphRAG-SDK#247: Previously adjusted the integration CI job; closely related to this PR’s CI integration-test change.

Poem

🐰 Whiskers twitch with budget delight,

Timeouts counted in millisecond flight,
Retrievals pause when limits are near,
Errors logged clearly, nothing to fear,
Hops and tests align — the rabbit cheers!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.47% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Reliability hardening suite' is directly related to the main objectives of the PR, which include timeout enforcement, error handling improvements, latency budget enforcement, and enhanced test coverage. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Comment thread graphrag_sdk/tests/test_facade.py Fixed
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread graphrag_sdk/tests/test_facade.py Fixed
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Naseem77 Naseem77 requested a review from galshubeli May 18, 2026 17:19
@Naseem77 Naseem77 linked an issue May 18, 2026 that may be closed by this pull request
6 tasks
Collaborator

@galshubeli galshubeli left a comment

Deep-dive review of the reliability hardening suite. Solid intent and real value, but three load-bearing issues need addressing before merge. Posted as line comments — summary below.

Blocking (Critical):

  • C1: connection.py silently wraps every DB error in DatabaseError — breaking change for any downstream except falkordb.<Specific>Error clause, and FalkorDBConnection is a re-exported public class.
  • C2: the latency budget is never wired into provider timeout=. Once a slow LLM call is in flight, the budget can fly past the threshold and nothing aborts it.
  • C3: aembed_documents(..., timeout=10) applies the timeout per batch (and per binary-split sub-call), not as an overall deadline — the public contract reads as 10s but wall-clock total can be 50s+.

Wins worth keeping: the MENTIONED_IN cosine-rank merge from #259, the history-validation refactor in api/main.py (error messages were wrong before, are right now), the secret-leakage test in test_providers.py, the PdfLoader open/finally fix. Tests pass cleanly (393/393 on touched files).

exc,
)
logger.debug("Non-transient FalkorDB query failure details", exc_info=True)
raise DatabaseError(f"FalkorDB query failed: {exc}") from exc

🔴 C1 — Breaking change to a public class, undocumented.

Previously the original exception (falkordb.ResponseError, redis.ConnectionError, etc.) propagated untouched. Now every non-transient error is wrapped in DatabaseError. Downstream code like:

try:
    await conn.query(...)
except falkordb.ResponseError as e:    # ← silently stops matching
    if 'index' in str(e): ...

…breaks. FalkorDBConnection is re-exported from graphrag_sdk/__init__.py:21, so this is a public-class contract change.

Either (a) preserve the original type (just raise it untouched and emit the structured ERROR log alongside), or (b) advertise as breaking — major bump + CHANGELOG migration note. As written it's an opaque API change buried in a 1300-line PR.
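
Option (a) could look roughly like the following sketch. It is synchronous for brevity (the SDK's query path is async), and ResponseError, run_query, and failing_execute are hypothetical stand-ins, not the driver's or the SDK's real names.

```python
import logging

logger = logging.getLogger(__name__)


class ResponseError(Exception):
    """Stand-in for a driver-specific error such as falkordb.ResponseError."""


def run_query(execute, cypher: str):
    """Run a query; log failures at ERROR level but re-raise the original exception."""
    try:
        return execute(cypher)
    except Exception as exc:
        logger.error("FalkorDB query failed: %s", exc)
        logger.debug("FalkorDB query failure details", exc_info=True)
        raise  # bare raise keeps the original exception type and traceback


def failing_execute(cypher: str):
    """Simulates a driver call that fails with a specific error type."""
    raise ResponseError("index already exists")
```

With this shape, a downstream except ResponseError clause keeps matching while the structured log line is still emitted.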

"FalkorDB query failure details",
exc_info=(type(last_exc), last_exc, last_exc.__traceback__),
)
raise DatabaseError(f"FalkorDB query failed: {last_exc}") from last_exc

🔴 C1 (same issue) — retries-exhausted path.

Changing raise last_exc to raise DatabaseError(...) from last_exc is the same contract change in the transient-retry-exhausted path. Same fix applies: re-raise the original.

ctx.log(f"MultiPath [1/9]: {len(all_keywords)} keywords extracted")

# 2. Embed question only
ctx.ensure_budget("MultiPath question embedding")

🔴 C2 — Budget never reaches the network call.

This site (and ~25 other ctx.ensure_budget(...) sites) only checks at phase boundaries. Once await self._embedder.aembed_query(query) is in flight with timeout=None, an unresponsive provider can hang for minutes; the budget check on the next phase only notices after the call returns.

This is the load-bearing missing wire of the whole PR. One-liner fix at each provider call site:

remaining_s = (ctx.remaining_budget_ms or 0) / 1000.0 or None
query_vector = await self._embedder.aembed_query(query, timeout=remaining_s)

The timeout= plumbing is already in place on every provider method — just connect the two halves. Without this, the PR enforces a 'phase-boundary budget,' which is useful but not what the description claims.
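
One possible way to centralize the conversion is sketched below. Note that in the one-liner above a remaining budget of exactly 0 evaluates to None (no timeout), which is presumably only safe because ensure_budget runs first; the helper here maps an exhausted budget to 0.0 instead. budget_to_timeout_s, slow_embed, and embed_within_budget are hypothetical names, and asyncio.wait_for stands in for the provider's own timeout= handling.

```python
from __future__ import annotations

import asyncio


def budget_to_timeout_s(remaining_budget_ms: float | None) -> float | None:
    """Convert a remaining budget in milliseconds to a timeout in seconds.

    None means no budget is configured and maps to no timeout. An exhausted
    budget maps to 0.0 (fail immediately) rather than to None.
    """
    if remaining_budget_ms is None:
        return None
    return max(remaining_budget_ms, 0.0) / 1000.0


async def slow_embed(query: str) -> list[float]:
    """Hypothetical provider call that takes about 200 ms."""
    await asyncio.sleep(0.2)
    return [0.0]


async def embed_within_budget(query: str, remaining_budget_ms: float | None) -> list[float]:
    """Bound the in-flight call by whatever budget is left."""
    timeout_s = budget_to_timeout_s(remaining_budget_ms)
    return await asyncio.wait_for(slow_embed(query), timeout=timeout_s)
```

Calling embed_within_budget("q", 50) raises asyncio.TimeoutError here because the simulated call needs about 200 ms; in the SDK the provider wrapper would presumably translate that into its typed timeout error.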

batch = texts[start : start + self.batch_size]
results.extend(await binary_split_retry_async(self._raw_embed_async, batch, **kwargs))
results.extend(
await binary_split_retry_async(

🔴 C3 — timeout is per-batch, not a deadline.

With aembed_documents(texts, timeout=10) on 100 texts and batch_size=20, the public API reads as 10s, but wall-clock total can reach 50s+ (5 batches × 10s), plus extra sub-calls from binary_split_retry_async. The new test test_aembed_documents_timeout_is_not_binary_split only proves 'no retry on timeout' on a single batch.

Options:

# Option A: rename to clarify
async def aembed_documents(self, texts, *, per_batch_timeout=None, ...):

# Option B: implement an actual deadline
deadline = time.monotonic() + timeout if timeout else None
for start in range(0, len(texts), self.batch_size):
    remaining = deadline - time.monotonic() if deadline else None
    if remaining is not None and remaining <= 0:
        raise EmbeddingTimeoutError(...)
    await binary_split_retry_async(..., timeout=remaining)
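
A runnable version of Option B might look like the sketch below. It leaves out binary_split_retry_async and uses a hypothetical embed_batch in place of the real provider call; embed_documents and its parameters are illustrative names only. It tests timeout is None rather than truthiness, so timeout=0 fails immediately instead of disabling the deadline.

```python
from __future__ import annotations

import asyncio
import time


class EmbeddingTimeoutError(Exception):
    """Stand-in for the SDK's typed embedding timeout error."""


async def embed_batch(batch: list[str]) -> list[list[float]]:
    """Hypothetical provider call; sleeps to simulate network latency."""
    await asyncio.sleep(0.05)
    return [[float(len(text))] for text in batch]


async def embed_documents(texts, *, batch_size=2, timeout=None):
    """Embed in batches under one overall deadline instead of a per-batch timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    results = []
    for start in range(0, len(texts), batch_size):
        # Each batch only gets whatever time is left before the shared deadline.
        remaining = None if deadline is None else deadline - time.monotonic()
        if remaining is not None and remaining <= 0:
            raise EmbeddingTimeoutError(f"deadline of {timeout}s exceeded")
        batch = texts[start : start + batch_size]
        try:
            results.extend(await asyncio.wait_for(embed_batch(batch), timeout=remaining))
        except asyncio.TimeoutError as exc:
            raise EmbeddingTimeoutError(f"deadline of {timeout}s exceeded") from exc
    return results
```

With ten single-item batches and timeout=0.12, this raises EmbeddingTimeoutError after roughly 0.12s of wall-clock time rather than letting every batch use the full timeout.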

logger = logging.getLogger(__name__)


async def _wait_for_provider_call(

🟠 H1 — _-prefixed helpers imported across modules.

Leading underscore = module-private in Python. But litellm.py:21 and openrouter.py:21 both do:

from graphrag_sdk.core.providers.base import (
    _validate_timeout,
    _wait_for_provider_call,
)

Reads as a smell every time. Either move to a providers/_timeout.py private module (cross-module use stays internal to the package), or drop the underscore and treat them as a documented internal contract.
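
As a rough illustration of the first option, a providers/_timeout.py module could hold both helpers under non-underscored names. The signatures below are assumptions made for this sketch (the PR's actual helper signatures are not shown in this thread), and error_factory is a hypothetical parameter for choosing the typed error.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def validate_timeout(timeout: float | None) -> float | None:
    """Return the timeout as a float, or None when no timeout is requested."""
    if timeout is None:
        return None
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)) or timeout <= 0:
        raise ValueError(f"timeout must be a positive number, got {timeout!r}")
    return float(timeout)


async def wait_for_provider_call(
    awaitable: Awaitable[T],
    *,
    timeout: float | None,
    error_factory: Callable[[float], Exception],
) -> T:
    """Await a provider call and map timeout expiry to a typed error."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except asyncio.TimeoutError as exc:
        # error_factory builds the typed error, e.g. an LLM or embedding timeout.
        raise error_factory(timeout) from exc
```

litellm.py and openrouter.py would then import from the private module, which keeps the cross-module use visibly internal to the package.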

**kwargs,
)

async def astream(self, prompt: str, **kwargs: Any) -> AsyncIterator[str]:

🟡 M2 — astream has no timeout= parameter.

async def astream(self, prompt: str, **kwargs: Any) -> AsyncIterator[str]:

PR claims 'enforce LLM and embedding timeouts' but streaming is exempt. Streaming hangs are a real failure mode — a server can dribble tokens forever or stall mid-stream. Either expose timeout= here (and apply per-chunk), or document the exemption.
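
A per-chunk timeout could be layered over any async iterator along the lines of this sketch. with_chunk_timeout is a hypothetical helper, not part of the PR, and LLMTimeoutError is a local stand-in for the SDK's typed error.

```python
from __future__ import annotations

import asyncio
from typing import AsyncIterator


class LLMTimeoutError(Exception):
    """Stand-in for the SDK's typed LLM timeout error."""


async def with_chunk_timeout(
    stream: AsyncIterator[str], timeout: float | None
) -> AsyncIterator[str]:
    """Yield chunks from stream, failing if any single chunk takes too long."""
    iterator = stream.__aiter__()
    while True:
        try:
            # The timeout restarts for every chunk, so a stalled stream is caught
            # even when earlier chunks arrived quickly.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=timeout)
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError as exc:
            raise LLMTimeoutError(f"no chunk received within {timeout}s") from exc
        yield chunk
```

This bounds the gap between chunks but not the total stream duration; a server that dribbles tokens just under the limit would still need an overall deadline.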

logger.debug("Failed to write graph config node", exc_info=True)

async def _validate_graph_config(self) -> None:
async def _validate_graph_config(self, ctx: Context | None = None) -> None:

🟡 M3 — Positional ctx adds subclass-break risk.

async def _validate_graph_config(self, ctx: Context | None = None) -> None:

This is a private method by name but it's on a public class. Any subclass overriding _validate_graph_config(self) will break silently if a future caller passes ctx positionally. Cheap fix:

async def _validate_graph_config(self, *, ctx: Context | None = None) -> None:

Keyword-only matches the rest of the codebase's *, ctx=... convention.

if operation == "graph config query":
raise LatencyBudgetExceededError("budget exhausted before config query")

ctx.ensure_budget = exhaust_budget # type: ignore[method-assign]

🟡 M4 — Monkey-patching ctx.ensure_budget per-instance is brittle.

ctx.ensure_budget = exhaust_budget  # type: ignore[method-assign]

Also at test_retrieval.py:179, 195 and similar pattern in test_multi_path_retrieval.py. If ensure_budget becomes a cached_property or moves to a parent class, these tests silently no-op (the patch lands on the instance, the call site reaches the class method).

Cleaner:

monkeypatch.setattr(Context, 'ensure_budget', exhaust_budget)
# or a Context subclass with overridden behavior
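
The subclass variant might be sketched as follows. Context here is a minimal hypothetical stand-in, and ExhaustedAt and validate_graph_config are illustrative names rather than code from the PR.

```python
class LatencyBudgetExceededError(Exception):
    """Stand-in for the SDK's typed budget error."""


class Context:
    """Hypothetical minimal Context with a no-op budget check."""

    def ensure_budget(self, operation: str) -> None:
        return None


class ExhaustedAt(Context):
    """Test double that exhausts the budget at one named operation."""

    def __init__(self, operation: str) -> None:
        self._exhausted_operation = operation
        self.seen = []  # records every checkpoint reached, in order

    def ensure_budget(self, operation: str) -> None:
        self.seen.append(operation)
        if operation == self._exhausted_operation:
            raise LatencyBudgetExceededError(f"budget exhausted before {operation}")


def validate_graph_config(ctx: Context) -> str:
    """Stand-in for the code under test: one checkpoint, then the work."""
    ctx.ensure_budget("graph config query")
    return "validated"
```

Because the override lives on a class, the test keeps exercising normal method lookup, and the seen list lets a test assert which checkpoints were reached before the failure.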

Comment thread .github/dependabot.yml
github-actions:
patterns:
- "*"


🟡 M6 — Verify dependabot pip ecosystem works with PEP-621 pyproject.toml.

- package-ecosystem: "pip"
  directory: "/graphrag_sdk"

Dependabot's pip parser handles [project.dependencies] but has known gaps with [project.optional-dependencies] groups (it tends to only update [project.dependencies] and miss the dev/extras groups). Worth a dry run before relying on it, or switch to uv ecosystem if your uv.lock is the source of truth.

Comment thread docker-compose.yml
image: falkordb/falkordb:v4.18.0
ports:
- "6379:6379"
- "3000:3000"

🟡 M7 — Browser UI binds to all interfaces by default.

ports:
  - "3000:3000"

Fine on a laptop, surprising on a shared dev host or CI runner — exposes the FalkorDB Browser UI on 0.0.0.0:3000. Safer default:

ports:
  - "127.0.0.1:3000:3000"

Or at minimum a one-line note in CONTRIBUTING.md so contributors aren't surprised.


Development

Successfully merging this pull request may close these issues.

Production Hardening & Release Readiness

2 participants