docs(retrieval): recommend SentenceTokenCapChunking for CypherFirst#253
Conversation
Adds a "Recommended ingestion config" section to CypherFirstAggregationStrategy's docstring. The default chunker (FixedSizeChunking) splits on a character window with no sentence/paragraph awareness, so entity names get truncated at chunk boundaries — "Wayne Enterprises" becomes "Wayne En" in one chunk and "terprises..." in the next. The resulting stub entities never merge with their full forms during resolution, so cypher counts come back inflated and "which X" lists pick up phantoms. On the internal 7-question aggregation benchmark, switching from FixedSizeChunking(chunk_size=1000) to SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2) moved the score from 6/7 (intermittent) to 7/7 (stable across three runs). The post-ingest graph state went from 11-14 organization nodes including "Glo" / "Initech System" / "Wayne En" stubs to exactly 10 clean orgs, and from 66-80 Person nodes (with Carla / Carla Okafor duplicates) to exactly 56 distinct persons. The short-form duplicate fix is a side benefit: sentence-aware chunks follow natural prose boundaries with overlap, so a person's first mention almost always lands in the same chunk as their later short-form references — per-chunk FastCoref then has the antecedent it needs. This docstring is the smallest useful change. A separate PR will propose changing the SDK default chunker; that's a larger conversation since it affects every existing caller. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR introduces ChangesCypher-First Aggregation Strategy
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR introduces a “Cypher-first” aggregation-focused retrieval strategy and extends existing Cypher retrieval plumbing so deterministic graph-query results can be surfaced (and treated as authoritative) in the final LLM prompt when applicable.
Changes:
- Add
CypherFirstAggregationStrategywith intent routing (RAG vs aggregation vs numeric-math) plus helpers (table formatting, fuzzy intersect, phrase extractors). - Enhance text-to-Cypher validation/sanitization (row-limit behavior, reject dotted namespaces, typed-label widening fallback) and propagate cypher execution metadata through
MultiPathRetrieval→ assembled results. - Conditionally append a cypher-authority rule to the system prompt only when authoritative cypher results are present; add/expand tests for these behaviors.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| graphrag_sdk/tests/test_facade.py | Adds tests ensuring the cypher-authority prompt rule is injected only when authoritative cypher context is present. |
| graphrag_sdk/tests/test_cypher_generation.py | Expands validator/sanitizer tests; updates execute-cypher retrieval tests for new metadata + fallback behavior. |
| graphrag_sdk/tests/test_cypher_first.py | New tests covering cypher-first helper functions and mocked end-to-end routing behavior. |
| graphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.py | Renames/expands cypher results section, raises cypher context cap, and merges cypher diagnostics into result metadata. |
| graphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.py | Updates cypher execution integration to accept and pass through cypher metadata into assembled results. |
| graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.py | Adds pure-aggregation detection for LIMIT injection, dotted-namespace rejection, and typed-label widening retry with diagnostics. |
| graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py | New cypher-first aggregation strategy implementation (intent detection, hybrid paths, numeric math path, table formatting, phrase extraction). |
| graphrag_sdk/src/graphrag_sdk/retrieval/strategies/init.py | Exposes the new cypher-first strategy and extractor interfaces from the strategies package. |
| graphrag_sdk/src/graphrag_sdk/api/main.py | Adds conditional system-prompt addendum injection based on presence of authoritative cypher results. |
| graphrag_sdk/src/graphrag_sdk/init.py | Exports the new cypher-first strategy and phrase extractor types at the top-level package. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cypher = None | ||
| else: | ||
| cypher = _sanitize_cypher(cypher) | ||
| result = await self._s._graph.query_raw(cypher) | ||
| for row in (result.result_set or []): |
| if intent == "rag": | ||
| # Non-aggregation questions don't benefit from any of the | ||
| # cypher-first mechanics — hand straight to the fallback. | ||
| return await self._rag_path.maybe_handle(query, ctx) | ||
|
|
| r"\bboth\s+\w+(?:\s+\S+)*?\s+and\s+\w+\b", | ||
| r"\bbetween\s+\w+\s+and\s+\w+\b", | ||
| r"\bexactly\s+\d+\b", r"\bat\s+least\s+\d+\b", | ||
| r"\b(?:does|do|has|have|is|are)\s+(?:[A-Z]\w*\s+){1,4}(?:have|has|contain|own|run|host)\b", |
| batch_cypher = ( | ||
| "MATCH (o:Organization)<-[:RELATES]-(p:Person) " | ||
| "OPTIONAL MATCH (p)-[:MENTIONED_IN]->(c:Chunk) " | ||
| "RETURN o.name AS org, p.name AS person, " | ||
| " p.description AS desc, collect(DISTINCT c.text) AS chunks" |
| # Default row cap auto-injected when the LLM's query lacks a LIMIT. | ||
| # Pure aggregations (count/sum/avg over no group-by) return one row, so | ||
| # they skip injection entirely; group-by lists need enough rows that a 10-org | ||
| # breakdown isn't truncated. | ||
| _DEFAULT_ROW_LIMIT = 100 |
| # Optional addendum appended to the system prompt only when the retriever | ||
| # produced an "Authoritative Graph Query Results" section (currently emitted | ||
| # by ``CypherFirstAggregationStrategy`` and by ``MultiPathRetrieval`` with | ||
| # ``enable_cypher=True``). Kept out of the base prompts so callers who never | ||
| # use cypher retrieval don't pay the prompt-token cost or risk subtle |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py`:
- Around line 509-514: The loop over result.result_set currently appends every
numeric-looking cell per row, which can overcount; change the logic in the block
after result = await self._s._graph.query_raw(cypher) so that for each row you
scan cells and append only the single intended numeric value (e.g., the first
non-None result of _coerce_number(cell)) instead of every numeric cell—update
the nested loops that call _coerce_number and push into values to break after
the first numeric is found (or otherwise select by expected column index/name)
to ensure one numeric aggregation per row.
- Around line 541-544: The current debug string construction in cypher_first.py
that builds f"source_values ({len(values)} rows): {', '.join(... for v in
values)}" can produce huge payloads; modify the formatter used in the
function/method that contains variables op, ans_str, values, cypher (the block
building that f-string) to cap how many values are serialized (e.g., limit to
first N rows or first M characters), and when truncated append a clear indicator
like "… (truncated X rows)" so the log includes the count but not all data;
ensure the same truncated representation is used wherever source_values is
rendered to avoid oversized retrieval contexts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: eeba0db7-0789-4b0e-a209-87fe58171da6
📒 Files selected for processing (10)
graphrag_sdk/src/graphrag_sdk/__init__.pygraphrag_sdk/src/graphrag_sdk/api/main.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/__init__.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.pygraphrag_sdk/tests/test_cypher_first.pygraphrag_sdk/tests/test_cypher_generation.pygraphrag_sdk/tests/test_facade.py
| result = await self._s._graph.query_raw(cypher) | ||
| for row in (result.result_set or []): | ||
| for cell in row: | ||
| v = _coerce_number(cell) | ||
| if v is not None: | ||
| values.append(v) |
There was a problem hiding this comment.
Restrict numeric aggregation to one intended value per row.
This currently aggregates every numeric-looking cell in each row, which can silently skew results when the generated Cypher returns extra columns.
💡 Proposed fix
- for row in (result.result_set or []):
- for cell in row:
- v = _coerce_number(cell)
- if v is not None:
- values.append(v)
+ for row in result.result_set:
+ if not row:
+ continue
+ # Numeric path expects one value per row; use the first column only.
+ v = _coerce_number(row[0])
+ if v is not None:
+ values.append(v)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py` around
lines 509 - 514, The loop over result.result_set currently appends every
numeric-looking cell per row, which can overcount; change the logic in the block
after result = await self._s._graph.query_raw(cypher) so that for each row you
scan cells and append only the single intended numeric value (e.g., the first
non-None result of _coerce_number(cell)) instead of every numeric cell—update
the nested loops that call _coerce_number and push into values to break after
the first numeric is found (or otherwise select by expected column index/name)
to ensure one numeric aggregation per row.
| f"computed {op} = {ans_str}\n" | ||
| f"source_values ({len(values)} rows): " | ||
| f"{', '.join(str(int(v)) if v == int(v) else f'{v:.2f}' for v in values)}\n" | ||
| f"cypher: {cypher}" |
There was a problem hiding this comment.
Cap echoed source_values to prevent oversized retrieval context.
Serializing all extracted values into one line can produce very large context payloads on high-row queries and hurt completion reliability.
💡 Proposed fix
- body = (
- f"computed {op} = {ans_str}\n"
- f"source_values ({len(values)} rows): "
- f"{', '.join(str(int(v)) if v == int(v) else f'{v:.2f}' for v in values)}\n"
- f"cypher: {cypher}"
- )
+ preview_cap = 50
+ preview = values[:preview_cap]
+ preview_suffix = (
+ f", ... (+{len(values) - preview_cap} more)" if len(values) > preview_cap else ""
+ )
+ body = (
+ f"computed {op} = {ans_str}\n"
+ f"source_values ({len(values)} rows): "
+ f"{', '.join(str(int(v)) if v == int(v) else f'{v:.2f}' for v in preview)}"
+ f"{preview_suffix}\n"
+ f"cypher: {cypher}"
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py` around
lines 541 - 544, The current debug string construction in cypher_first.py that
builds f"source_values ({len(values)} rows): {', '.join(... for v in values)}"
can produce huge payloads; modify the formatter used in the function/method that
contains variables op, ans_str, values, cypher (the block building that
f-string) to cap how many values are serialized (e.g., limit to first N rows or
first M characters), and when truncated append a clear indicator like "…
(truncated X rows)" so the log includes the count but not all data; ensure the
same truncated representation is used wherever source_values is rendered to
avoid oversized retrieval contexts.
Summary
Adds a "Recommended ingestion config" section to
CypherFirstAggregationStrategy's docstring pointing users atSentenceTokenCapChunking. Cosmetic / discoverability — the strategy itself works the same.Why
The default chunker (
FixedSizeChunking) splits on a character window with no awareness of sentence or word boundaries, so entity names get truncated at chunk boundaries ("Wayne Enterprises"→"Wayne En"). The stub entities never merge with their full forms during resolution, so cypher counts come back inflated.On the internal 7-question aggregation benchmark, switching the benchmark's ingestion config from
FixedSizeChunking(chunk_size=1000)toSentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)moved the score from 6/7 (intermittent) to 7/7 (stable across three runs).Stacking
This PR is stacked on #255 (
feat/cypher-aggregation-accuracy). After #255 merges, GitHub will auto-retarget this tomainand the diff stays a 33-line docstring change.Related: #254 makes
SentenceTokenCapChunkingthe SDK's default chunker — that's a behavior change for all callers, kept separate from this docs-only PR.Test plan