From 37e233ab21a7e0a144db4f5567c2dd2bf55fbedd Mon Sep 17 00:00:00 2001
From: Gal Shubeli
Date: Thu, 14 May 2026 16:19:05 +0300
Subject: [PATCH] docs(retrieval): recommend SentenceTokenCapChunking for cypher-first
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a "Recommended ingestion config" section to
CypherFirstAggregationStrategy's docstring.

The default chunker (FixedSizeChunking) splits on a character window
with no sentence/paragraph awareness, so entity names get truncated at
chunk boundaries: "Wayne Enterprises" becomes "Wayne En" in one chunk
and "terprises..." in the next. The resulting stub entities never merge
with their full forms during resolution, so Cypher counts come back
inflated and "which X" lists pick up phantoms.

On the internal 7-question aggregation benchmark, switching from
FixedSizeChunking(chunk_size=1000) to
SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2) moved the
score from 6/7 (intermittent) to 7/7 (stable across three runs). The
post-ingest graph state went from 11-14 organization nodes, including
"Glo" / "Initech System" / "Wayne En" stubs, to exactly 10 clean orgs,
and from 66-80 Person nodes (with Carla / Carla Okafor duplicates) to
exactly 56 distinct persons.

The short-form duplicate fix is a side benefit: sentence-aware chunks
follow natural prose boundaries with overlap, so a person's first
mention almost always lands in the same chunk as their later short-form
references, and per-chunk FastCoref then has the antecedent it needs.

This docstring is the smallest useful change. A separate PR will propose
changing the SDK default chunker; that's a larger conversation since it
affects every existing caller.
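For reviewers who want to see the truncation mechanism in isolation, here
is a minimal sketch in plain Python. It does not use the SDK:
fixed_size_chunks and sentence_chunks are hypothetical stand-ins for
FixedSizeChunking and SentenceTokenCapChunking, the cap is counted in
characters rather than tokens, and the sample text is made up for
illustration.

```python
import re


def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Split on a raw character window, ignoring sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def sentence_chunks(text: str, max_chars: int, overlap_sentences: int = 0) -> list[str]:
    """Pack whole sentences into chunks, carrying trailing sentences over.

    Toy assumption: the cap is in characters, and carried-over sentences
    may push a chunk past the cap. A real chunker would cap on tokens.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Start the next chunk with the last N sentences as overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Carla Okafor joined Wayne Enterprises in 2019. "
    "Carla now leads the audit team."
)

fixed = fixed_size_chunks(text, chunk_size=28)
by_sentence = sentence_chunks(text, max_chars=60)
with_overlap = sentence_chunks(text, max_chars=60, overlap_sentences=1)

print(fixed)         # the entity name is cut across the first two chunks
print(by_sentence)   # each chunk holds whole sentences
print(with_overlap)  # second chunk also carries the first mention
```

With a 28-character window the first chunk ends in "Wayne En" and the
second begins with "terprises", which is the stub pattern described
above. With one sentence of overlap, the chunk containing the short form
"Carla" also contains "Carla Okafor", which is the condition a per-chunk
coreference pass needs.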

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../retrieval/strategies/cypher_first.py | 33 +++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py b/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py
index 7f10be38..24cb0037 100644
--- a/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py
+++ b/graphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_first.py
@@ -927,6 +927,39 @@ class CypherFirstAggregationStrategy(RetrievalStrategy):
     extraction-quality issues — not strategy bugs — and should be addressed
     in the ingestion pipeline (resolver, coref, dedup).
 
+    Recommended ingestion config
+    ----------------------------
+    For aggregation accuracy, ingest with a **sentence- or paragraph-aware
+    chunker**, not the default :class:`FixedSizeChunking`. Character-window
+    chunking can split entity names mid-token at chunk boundaries
+    (``"Wayne Enterprises"`` → ``"Wayne En"``), and the resulting stub
+    entities never merge with their full forms during resolution.
+    On the internal 7-question aggregation benchmark, switching from
+    ``FixedSizeChunking(chunk_size=1000)`` to
+    ``SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)``
+    moved the score from 6/7 (intermittent) to 7/7 (stable across three
+    runs) by eliminating every truncation stub and almost every
+    short-form duplicate (the latter because per-chunk FastCoref binds
+    references to antecedents that now reliably land in the same chunk)::
+
+        from graphrag_sdk import (
+            CypherFirstAggregationStrategy,
+            FastCorefResolver,
+            GraphExtraction,
+            LLMVerifiedResolution,
+            SentenceTokenCapChunking,
+        )
+
+        chunker = SentenceTokenCapChunking(max_tokens=512, overlap_sentences=2)
+        extractor = GraphExtraction(llm=llm, coref_resolver=FastCorefResolver())
+        resolver = LLMVerifiedResolution(llm=llm, embedder=embedder)
+
+        await rag.ingest(text=doc, chunker=chunker,
+                         extractor=extractor, resolver=resolver)
+
+    ``StructuralChunking`` is a reasonable alternative when the source
+    text has explicit paragraph/header structure (e.g. Markdown).
+
     Args:
         graph_store: Required for Cypher execution.
         vector_store: Required for the RAG fallback.