From 37e233ab21a7e0a144db4f5567c2dd2bf55fbedd Mon Sep 17 00:00:00 2001
From: Gal Shubeli <galshubeli93@gmail.com>
Date: Thu, 14 May 2026 16:19:05 +0300
Subject: [PATCH] docs(retrieval): recommend SentenceTokenCapChunking for
 cypher-first
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a "Recommended ingestion config" section to
CypherFirstAggregationStrategy's docstring.

The default chunker (FixedSizeChunking) splits on a character window
with no sentence/paragraph awareness, so entity names get truncated at
chunk boundaries: "Wayne Enterprises" becomes "Wayne En" in one chunk
and "terprises..." in the next. The resulting stub entities never merge
with their full forms during resolution, so Cypher counts come back
inflated and "which X" lists pick up phantom entries.
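
The truncation is easy to reproduce outside the SDK. The sketch below is
a standalone illustration, not FixedSizeChunking itself: the helper
fixed_size_chunks, the sample sentence, and the window of 39 characters
are all hypothetical, chosen so the boundary falls inside the name.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive character windows, ignoring sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample text; the window size is picked so the cut lands mid-name.
sample = "Bruce Wayne is the chairman of Wayne Enterprises in Gotham."
fixed_chunks = fixed_size_chunks(sample, chunk_size=39)

for chunk in fixed_chunks:
    print(repr(chunk))

# Neither chunk contains the full entity name, so an extractor that sees
# one chunk at a time can only emit the "Wayne En" stub or the orphaned tail.
name_survives = any("Wayne Enterprises" in chunk for chunk in fixed_chunks)
print("full name present in some chunk:", name_survives)
```

With a 1000-character window the same cut happens less often, but any
name that straddles a boundary is still split the same way.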

On the internal 7-question aggregation benchmark, switching from
FixedSizeChunking(chunk_size=1000) to
SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2) moved the
score from 6/7 (intermittent) to 7/7 (stable across three runs). The
post-ingest graph state went from 11-14 organization nodes including
"Glo" / "Initech System" / "Wayne En" stubs to exactly 10 clean orgs,
and from 66-80 Person nodes (with Carla / Carla Okafor duplicates) to
exactly 56 distinct persons.

The short-form duplicate fix is a side benefit: sentence-aware chunks
follow natural prose boundaries with overlap, so a person's first
mention almost always lands in the same chunk as their later short-form
references — per-chunk FastCoref then has the antecedent it needs.
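
For intuition, sentence-aware chunking with a token cap and sentence
overlap can be sketched as below. This is a simplified stand-in, not the
SentenceTokenCapChunking implementation: the function sentence_chunks,
the regex sentence splitter, and counting whitespace-separated words as
"tokens" are all assumptions made for illustration.

```python
import re


def sentence_chunks(text: str, max_tokens: int, overlap_sentences: int) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens words.

    Each new chunk is seeded with up to overlap_sentences trailing sentences
    of the previous chunk. Sentences are never split, so a single sentence
    longer than max_tokens becomes its own oversized chunk.
    """

    def n_tokens(sentences: list[str]) -> int:
        # Assumption: whitespace-separated words stand in for real tokens.
        return sum(len(s.split()) for s in sentences)

    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and n_tokens(current + [sentence]) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the trailing sentences over as overlap, trimming from the
            # front if the overlap plus the new sentence would exceed the cap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            while current and n_tokens(current + [sentence]) > max_tokens:
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical sample text with a full-name mention followed by short forms.
prose = (
    "Bruce Wayne chairs Wayne Enterprises. "
    "He founded the company's charity arm. "
    "Wayne also funds Gotham clinics."
)
sentence_aware = sentence_chunks(prose, max_tokens=11, overlap_sentences=1)
for chunk in sentence_aware:
    print(repr(chunk))
```

Every chunk here ends on a sentence boundary, so no name is cut in half,
and the overlapping sentence gives a per-chunk coref pass more context
for short-form references than a hard character cut would.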

This docstring is the smallest useful change. A separate PR will propose
changing the SDK default chunker; that's a larger conversation since it
affects every existing caller.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../retrieval/strategies/cypher_first.py      | 33 +++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py b/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py
index 7f10be38..24cb0037 100644
--- a/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py
+++ b/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py
@@ -927,6 +927,39 @@ class CypherFirstAggregationStrategy(RetrievalStrategy):
     extraction-quality issues — not strategy bugs — and should be
     addressed in the ingestion pipeline (resolver, coref, dedup).
 
+    Recommended ingestion config
+    ----------------------------
+    For aggregation accuracy, ingest with a **sentence- or paragraph-aware
+    chunker**, not the default :class:`FixedSizeChunking`. Character-window
+    chunking can split entity names mid-token at chunk boundaries
+    (``"Wayne Enterprises"`` → ``"Wayne En"``), and the resulting stub
+    entities never merge with their full forms during resolution. On the
+    internal 7-question aggregation benchmark, switching from
+    ``FixedSizeChunking(chunk_size=1000)`` to
+    ``SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)``
+    moved the score from 6/7 (intermittent) to 7/7 (stable across runs)
+    by eliminating every truncation stub and almost every short-form
+    duplicate; the duplicates drop because per-chunk FastCoref can bind
+    references to antecedents that now reliably land in the same chunk::
+
+        from graphrag_sdk import (
+            CypherFirstAggregationStrategy,
+            FastCorefResolver,
+            GraphExtraction,
+            LLMVerifiedResolution,
+            SentenceTokenCapChunking,
+        )
+
+        chunker = SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)
+        extractor = GraphExtraction(llm=llm, coref_resolver=FastCorefResolver())
+        resolver = LLMVerifiedResolution(llm=llm, embedder=embedder)
+
+        await rag.ingest(text=doc, chunker=chunker,
+                         extractor=extractor, resolver=resolver)
+
+    ``StructuralChunking`` is a reasonable alternative when the source
+    text has explicit paragraph/header structure (e.g. Markdown).
+
     Args:
         graph_store: Required for Cypher execution.
         vector_store: Required for the RAG fallback.