docs(retrieval): recommend SentenceTokenCapChunking for CypherFirst by galshubeli · Pull Request #253 · FalkorDB/GraphRAG-SDK

galshubeli · 2026-05-14T13:26:50Z

Summary

Adds a "Recommended ingestion config" section to CypherFirstAggregationStrategy's docstring pointing users at SentenceTokenCapChunking. Cosmetic / discoverability — the strategy itself works the same.

Why

The default chunker (FixedSizeChunking) splits on a character window with no awareness of sentence or word boundaries, so entity names get truncated at chunk boundaries ("Wayne Enterprises" → "Wayne En"). The stub entities never merge with their full forms during resolution, so cypher counts come back inflated.

On the internal 7-question aggregation benchmark, switching the benchmark's ingestion config from FixedSizeChunking(chunk_size=1000) to SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2) moved the score from 6/7 (intermittent) to 7/7 (stable across three runs).

Stacking

This PR is stacked on #255 (feat/cypher-aggregation-accuracy). After #255 merges, GitHub will auto-retarget this to main and the diff stays a 33-line docstring change.

Related: #254 makes SentenceTokenCapChunking the SDK's default chunker — that's a behavior change for all callers, kept separate from this docs-only PR.

Test plan

All existing tests pass on the parent branch
No new tests needed (docstring only)

Adds a "Recommended ingestion config" section to CypherFirstAggregationStrategy's docstring. The default chunker (FixedSizeChunking) splits on a character window with no sentence/paragraph awareness, so entity names get truncated at chunk boundaries — "Wayne Enterprises" becomes "Wayne En" in one chunk and "terprises..." in the next. The resulting stub entities never merge with their full forms during resolution, so cypher counts come back inflated and "which X" lists pick up phantoms. On the internal 7-question aggregation benchmark, switching from FixedSizeChunking(chunk_size=1000) to SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2) moved the score from 6/7 (intermittent) to 7/7 (stable across three runs). The post-ingest graph state went from 11-14 organization nodes including "Glo" / "Initech System" / "Wayne En" stubs to exactly 10 clean orgs, and from 66-80 Person nodes (with Carla / Carla Okafor duplicates) to exactly 56 distinct persons. The short-form duplicate fix is a side benefit: sentence-aware chunks follow natural prose boundaries with overlap, so a person's first mention almost always lands in the same chunk as their later short-form references — per-chunk FastCoref then has the antecedent it needs. This docstring is the smallest useful change. A separate PR will propose changing the SDK default chunker; that's a larger conversation since it affects every existing caller. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-14T13:29:39Z

📝 Walkthrough

Walkthrough

This PR introduces CypherFirstAggregationStrategy, a new retrieval strategy that routes aggregation-style questions deterministically through database queries. It enhances Cypher generation, result assembly, system-prompt handling, and adds the complete strategy with supporting utilities and tests.

Changes

Cypher-First Aggregation Strategy

Layer / File(s)	Summary
Package exports and public API `graphrag_sdk/src/graphrag_sdk/__init__.py`, `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/__init__.py`	New strategy and phrase-extractor interfaces exported at package façade level for public consumption.
System prompt authority rule and detection `graphrag_sdk/src/graphrag_sdk/api/main.py`	Conditional cypher-authority instruction (`_CYPHER_AUTH_RULE`) detects authoritative cypher-derived content in retriever results and injects rule into system prompt during completion to prefer deterministic quantitative answers.
Cypher query generation and validation enhancements `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py`	Pure-aggregation detection gates conditional LIMIT injection; unsupported namespaced functions (apoc., gds., db.*) are rejected during validation; schema prompts updated with BOTH/intersection guidance; zero-row fallback retries with widened typed labels (except for pure aggregations); metadata tracked for fallback status and row counts.
Result assembly and metadata flow `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.py`, `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py`	Cypher result cap increased to 100 rows with explicit truncation messaging; result metadata merged to expose fallback behavior and truncation flags; MultiPathRetrieval wired to pass cypher metadata through orchestration layer.
CypherFirstAggregationStrategy core implementation `graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py`	Intent classification (numeric\_math, aggregation, rag) routes queries to four sub-paths: numeric extraction and Python computation, shared-property hybrid via fuzzy intersection, multi-candidate Cypher table with row-count scoring, or RAG fallback. Includes intent detectors, phrase extractors (pluggable interface with default regex-based English impl), fuzzy matching, markdown table formatting, numeric coercion, path metadata tagging, and negation-existential "No" handling.
Comprehensive test coverage `graphrag_sdk/tests/test_cypher_first.py`, `graphrag_sdk/tests/test_cypher_generation.py`, `graphrag_sdk/tests/test_facade.py`	Unit tests for intent detection, shape predicates, extraction, fuzzy intersection, formatting; end-to-end routing scenarios with mocks; Cypher validation/sanitization coverage including LIMIT injection and label-widening fallback; system-prompt authority rule detection.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

Naseem77

Poem

🐰 A strategy hops through the query tree,
Finding aggregates deterministically,
Cypher paths bloom, fuzzy intersections dance,
While fallbacks waltz if numbers don't prance. ✨📊

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 19.17% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	⚠️ Warning	The title describes a documentation update recommending SentenceTokenCapChunking for CypherFirst, but the raw_summary shows this PR implements a major Cypher-first aggregation retrieval strategy (~1,200 lines), extensive bug fixes, and helper functions—far beyond a simple documentation change.	Update the title to accurately reflect the primary change: 'feat(retrieval): add CypherFirstAggregationStrategy with intelligent query routing' or similar, since the documentation recommendation is incidental to the feature implementation.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch docs/cypher-first-recommend-sentence-chunker

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR introduces a “Cypher-first” aggregation-focused retrieval strategy and extends existing Cypher retrieval plumbing so deterministic graph-query results can be surfaced (and treated as authoritative) in the final LLM prompt when applicable.

Changes:

Add CypherFirstAggregationStrategy with intent routing (RAG vs aggregation vs numeric-math) plus helpers (table formatting, fuzzy intersect, phrase extractors).
Enhance text-to-Cypher validation/sanitization (row-limit behavior, reject dotted namespaces, typed-label widening fallback) and propagate cypher execution metadata through MultiPathRetrieval → assembled results.
Conditionally append a cypher-authority rule to the system prompt only when authoritative cypher results are present; add/expand tests for these behaviors.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
graphrag_sdk/tests/test_facade.py	Adds tests ensuring the cypher-authority prompt rule is injected only when authoritative cypher context is present.
graphrag_sdk/tests/test_cypher_generation.py	Expands validator/sanitizer tests; updates execute-cypher retrieval tests for new metadata + fallback behavior.
graphrag_sdk/tests/test_cypher_first.py	New tests covering cypher-first helper functions and mocked end-to-end routing behavior.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.py	Renames/expands cypher results section, raises cypher context cap, and merges cypher diagnostics into result metadata.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py	Updates cypher execution integration to accept and pass through cypher metadata into assembled results.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py	Adds pure-aggregation detection for LIMIT injection, dotted-namespace rejection, and typed-label widening retry with diagnostics.
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py	New cypher-first aggregation strategy implementation (intent detection, hybrid paths, numeric math path, table formatting, phrase extraction).
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/init.py	Exposes the new cypher-first strategy and extractor interfaces from the strategies package.
graphrag_sdk/src/graphrag_sdk/api/main.py	Adds conditional system-prompt addendum injection based on presence of authoritative cypher results.
graphrag_sdk/src/graphrag_sdk/init.py	Exports the new cypher-first strategy and phrase extractor types at the top-level package.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+                cypher = None
+            else:
+                cypher = _sanitize_cypher(cypher)
+                result = await self._s._graph.query_raw(cypher)
+                for row in (result.result_set or []):


+        if intent == "rag":
+            # Non-aggregation questions don't benefit from any of the
+            # cypher-first mechanics — hand straight to the fallback.
+            return await self._rag_path.maybe_handle(query, ctx)
+


+    r"\bboth\s+\w+(?:\s+\S+)*?\s+and\s+\w+\b",
+    r"\bbetween\s+\w+\s+and\s+\w+\b",
+    r"\bexactly\s+\d+\b", r"\bat\s+least\s+\d+\b",
+    r"\b(?:does|do|has|have|is|are)\s+(?:[A-Z]\w*\s+){1,4}(?:have|has|contain|own|run|host)\b",


+        batch_cypher = (
+            "MATCH (o:Organization)<-[:RELATES]-(p:Person) "
+            "OPTIONAL MATCH (p)-[:MENTIONED_IN]->(c:Chunk) "
+            "RETURN o.name AS org, p.name AS person, "
+            "  p.description AS desc, collect(DISTINCT c.text) AS chunks"


+# Default row cap auto-injected when the LLM's query lacks a LIMIT.
+# Pure aggregations (count/sum/avg over no group-by) return one row, so
+# they skip injection entirely; group-by lists need enough rows that a 10-org
+# breakdown isn't truncated.
+_DEFAULT_ROW_LIMIT = 100


+# Optional addendum appended to the system prompt only when the retriever
+# produced an "Authoritative Graph Query Results" section (currently emitted
+# by ``CypherFirstAggregationStrategy`` and by ``MultiPathRetrieval`` with
+# ``enable_cypher=True``). Kept out of the base prompts so callers who never
+# use cypher retrieval don't pay the prompt-token cost or risk subtle


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py`:
- Around line 509-514: The loop over result.result_set currently appends every
numeric-looking cell per row, which can overcount; change the logic in the block
after result = await self._s._graph.query_raw(cypher) so that for each row you
scan cells and append only the single intended numeric value (e.g., the first
non-None result of _coerce_number(cell)) instead of every numeric cell—update
the nested loops that call _coerce_number and push into values to break after
the first numeric is found (or otherwise select by expected column index/name)
to ensure one numeric aggregation per row.
- Around line 541-544: The current debug string construction in cypher_first.py
that builds f"source_values ({len(values)} rows): {', '.join(... for v in
values)}" can produce huge payloads; modify the formatter used in the
function/method that contains variables op, ans_str, values, cypher (the block
building that f-string) to cap how many values are serialized (e.g., limit to
first N rows or first M characters), and when truncated append a clear indicator
like "… (truncated X rows)" so the log includes the count but not all data;
ensure the same truncated representation is used wherever source_values is
rendered to avoid oversized retrieval contexts.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eeba0db7-0789-4b0e-a209-87fe58171da6

📥 Commits

Reviewing files that changed from the base of the PR and between a90ad18 and 37e233a.

📒 Files selected for processing (10)

graphrag_sdk/src/graphrag_sdk/__init__.py
graphrag_sdk/src/graphrag_sdk/api/main.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/__init__.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py
graphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.py
graphrag_sdk/tests/test_cypher_first.py
graphrag_sdk/tests/test_cypher_generation.py
graphrag_sdk/tests/test_facade.py

coderabbitai · 2026-05-14T13:34:15Z

+                result = await self._s._graph.query_raw(cypher)
+                for row in (result.result_set or []):
+                    for cell in row:
+                        v = _coerce_number(cell)
+                        if v is not None:
+                            values.append(v)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restrict numeric aggregation to one intended value per row.

This currently aggregates every numeric-looking cell in each row, which can silently skew results when the generated Cypher returns extra columns.

💡 Proposed fix

- for row in (result.result_set or []): - for cell in row: - v = _coerce_number(cell) - if v is not None: - values.append(v) + for row in result.result_set: + if not row: + continue + # Numeric path expects one value per row; use the first column only. + v = _coerce_number(row[0]) + if v is not None: + values.append(v)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py` around lines 509 - 514, The loop over result.result_set currently appends every numeric-looking cell per row, which can overcount; change the logic in the block after result = await self._s._graph.query_raw(cypher) so that for each row you scan cells and append only the single intended numeric value (e.g., the first non-None result of _coerce_number(cell)) instead of every numeric cell—update the nested loops that call _coerce_number and push into values to break after the first numeric is found (or otherwise select by expected column index/name) to ensure one numeric aggregation per row.

coderabbitai · 2026-05-14T13:34:16Z

+            f"computed {op} = {ans_str}\n"
+            f"source_values ({len(values)} rows): "
+            f"{', '.join(str(int(v)) if v == int(v) else f'{v:.2f}' for v in values)}\n"
+            f"cypher: {cypher}"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cap echoed source_values to prevent oversized retrieval context.

Serializing all extracted values into one line can produce very large context payloads on high-row queries and hurt completion reliability.

💡 Proposed fix

- body = ( - f"computed {op} = {ans_str}\n" - f"source_values ({len(values)} rows): " - f"{', '.join(str(int(v)) if v == int(v) else f'{v:.2f}' for v in values)}\n" - f"cypher: {cypher}" - ) + preview_cap = 50 + preview = values[:preview_cap] + preview_suffix = ( + f", ... (+{len(values) - preview_cap} more)" if len(values) > preview_cap else "" + ) + body = ( + f"computed {op} = {ans_str}\n" + f"source_values ({len(values)} rows): " + f"{', '.join(str(int(v)) if v == int(v) else f'{v:.2f}' for v in preview)}" + f"{preview_suffix}\n" + f"cypher: {cypher}" + )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py` around lines 541 - 544, The current debug string construction in cypher_first.py that builds f"source_values ({len(values)} rows): {', '.join(... for v in values)}" can produce huge payloads; modify the formatter used in the function/method that contains variables op, ans_str, values, cypher (the block building that f-string) to cap how many values are serialized (e.g., limit to first N rows or first M characters), and when truncated append a clear indicator like "… (truncated X rows)" so the log includes the count but not all data; ensure the same truncated representation is used wherever source_values is rendered to avoid oversized retrieval contexts.

galshubeli requested a review from Copilot May 14, 2026 13:26

Copilot started reviewing on behalf of galshubeli May 14, 2026 13:27 View session

Copilot AI reviewed May 14, 2026

View reviewed changes

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

galshubeli mentioned this pull request May 14, 2026

feat(retrieval): CypherFirstAggregationStrategy + cypher-path accuracy fixes #255

Open

5 tasks

galshubeli changed the title ~~Docs/cypher first recommend sentence chunker~~ docs(retrieval): recommend SentenceTokenCapChunking for CypherFirst May 14, 2026

galshubeli changed the base branch from main to feat/cypher-aggregation-accuracy May 14, 2026 14:17

galshubeli merged commit ef42a09 into feat/cypher-aggregation-accuracy May 17, 2026
9 checks passed

galshubeli deleted the docs/cypher-first-recommend-sentence-chunker branch May 17, 2026 14:09

This was referenced May 18, 2026

feat(schema): schema-driven attributes end-to-end with persistent ontology #256

Open

Reliability hardening suite #261

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(retrieval): recommend SentenceTokenCapChunking for CypherFirst#253

docs(retrieval): recommend SentenceTokenCapChunking for CypherFirst#253
galshubeli merged 1 commit into
feat/cypher-aggregation-accuracyfrom
docs/cypher-first-recommend-sentence-chunker

galshubeli commented May 14, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 14, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (2 warnings)

Uh oh!

Copilot AI left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

galshubeli commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Stacking

Test plan

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (2 warnings)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

galshubeli commented May 14, 2026 •

edited

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading