feat(ingestion): default to SentenceTokenCapChunking in ingest()/update() #254
galshubeli wants to merge 1 commit into
Conversation
…te()

Changes the default chunker that ``GraphRAG.ingest()`` and ``GraphRAG.update()`` fall back to when the caller doesn't pass an explicit ``chunker=``. Was ``FixedSizeChunking()``; now ``SentenceTokenCapChunking()`` (sentence-aware, max_tokens=512, overlap_sentences=2 — the strategy's own defaults).

Why
---

``FixedSizeChunking`` splits on a hard character window with no awareness of sentence, word, or paragraph boundaries. When the window cuts through an entity name, the per-chunk LLM extractor produces a stub entity for the fragment (``"Wayne Enterprises"`` → ``"Wayne En"`` in chunk N plus unparsable text in chunk N+1). These stubs never merge with their full forms during resolution because their embeddings differ enough that LLMVerifiedResolution scores them below the soft threshold. This silently inflates cypher counts and pollutes "which X" lists.

The strategy that surfaced this — ``CypherFirstAggregationStrategy`` — was hitting a 6/7 ceiling on the internal aggregation benchmark, with one question failing because of these stubs. Switching to ``SentenceTokenCapChunking`` cleared the benchmark to a stable 7/7 across three runs. The post-ingest graph state went from 11-14 organization nodes (including ``Glo`` / ``Initech System`` / ``Wayne En``) to exactly 10 clean orgs, and from 66-80 ``Person`` nodes (with ``Carla`` / ``Carla Okafor`` duplicates) to exactly 56 distinct persons — matching the corpus.

A side benefit: sentence-aware chunks with 2-sentence overlap almost always keep a person's first mention in the same chunk as their later short-form references, so per-chunk FastCoref now binds ``Carla → Carla Okafor`` reliably. That eliminates the short-form-duplicate class too, not just the truncation stubs.

Compatibility
-------------

``FixedSizeChunking`` remains exported and fully supported — callers who explicitly pass ``chunker=FixedSizeChunking()`` get unchanged behavior. Existing tests (748 passed, 24 skipped) pass without modification: no test in the suite asserts on chunk count or content shape from the default chunker, so switching defaults doesn't break the suite. Callers who relied on the previous default and want to keep it should pass ``chunker=FixedSizeChunking()`` explicitly. The docstrings call out the new default and reference ``FixedSizeChunking`` as the opt-in character-window alternative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
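At the call site, the change looks like this (a minimal sketch: the top-level import path and the exact ``ingest()`` signature are assumptions based on the description above, which says ``FixedSizeChunking`` remains exported):

```python
from graphrag_sdk import GraphRAG, FixedSizeChunking, SentenceTokenCapChunking

def ingest_corpus(rag: GraphRAG, text: str) -> None:
    # New default: omitting chunker= now means sentence-aware chunking
    # (max_tokens=512, overlap_sentences=2, the strategy's own defaults).
    rag.ingest(text)

    # The same thing, stated explicitly:
    rag.ingest(text, chunker=SentenceTokenCapChunking())

    # The previous default, now opt-in; behavior is unchanged:
    rag.ingest(text, chunker=FixedSizeChunking())
```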
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info: configuration used: defaults; review profile: CHILL; plan: Pro; files selected for processing: 1
📝 Walkthrough

GraphRAG's default chunking strategy shifts from fixed-size to sentence-aware tokenization. The import is updated, and the API docstrings document the new default.

Changes: Default Chunking Strategy Update
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 5 passed
Pull request overview
Updates GraphRAG’s ingestion defaults so callers who don’t explicitly pass chunker= use sentence-aware chunking, improving extraction quality by reducing mid-entity boundary splits.
Changes:
- Switch default chunker for `GraphRAG.ingest()`/`_ingest_single()` from `FixedSizeChunking()` to `SentenceTokenCapChunking()`.
- Switch default chunker for `GraphRAG.update()` from `FixedSizeChunking()` to `SentenceTokenCapChunking()`.
- Update public docstrings to reflect the new default chunker.
Comments suppressed due to low confidence (1)
graphrag_sdk/src/graphrag_sdk/api/main.py:1019
- Same concern as ingest(): update() now defaults to SentenceTokenCapChunking(), but that chunker can produce over-cap chunks when a single sentence exceeds max_tokens. This can inflate Step 2 verification prompts and cause provider context-limit errors. Consider fixing the chunker to enforce the cap or documenting the over-cap edge case.
```python
pipeline = IngestionPipeline(
    loader=loader or TextLoader(),  # unused (text is provided below)
    chunker=chunker or SentenceTokenCapChunking(),
    extractor=extractor or self._default_extractor(),
```
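If that edge case matters for a given corpus, one workaround on the caller's side is a pre-pass that hard-splits any single sentence longer than the cap before ingestion. This is a sketch only: the helper below is hypothetical, not part of this PR, and its whitespace-based token count is a stand-in for whatever tokenizer ``SentenceTokenCapChunking`` actually uses.

```python
import re

def split_overlong_sentences(text: str, max_tokens: int = 512) -> str:
    """Break any sentence longer than max_tokens into max_tokens-sized
    pieces so no downstream chunk can exceed the cap."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pieces: list[str] = []
    for sentence in sentences:
        words = sentence.split()  # crude whitespace "tokens" (assumption)
        if len(words) <= max_tokens:
            pieces.append(sentence)
        else:
            # Hard-split the overlong sentence on token boundaries.
            for i in range(0, len(words), max_tokens):
                pieces.append(" ".join(words[i : i + max_tokens]))
    return " ".join(pieces)
```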
```text
— sentence-aware, never splits entity names at chunk boundaries.
Override with ``chunker=FixedSizeChunking(...)`` if you need
character-window chunking.
```
```diff
 pipeline = IngestionPipeline(
     loader=loader or TextLoader(),
-    chunker=chunker or FixedSizeChunking(),
+    chunker=chunker or SentenceTokenCapChunking(),
     extractor=extractor or self._default_extractor(),
```
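As a toy illustration of why this one-line diff matters, here is the failure mode from the PR description in miniature (standalone code; both functions are simplified stand-ins for the real strategies, not the SDK's implementation):

```python
import re

def fixed_size_chunks(text: str, window: int = 28) -> list[str]:
    # Hard character window: ignores sentence and word boundaries.
    return [text[i : i + window] for i in range(0, len(text), window)]

def sentence_chunks(text: str) -> list[str]:
    # Sentence-aware: boundaries only ever fall between sentences.
    return re.split(r"(?<=[.!?])\s+", text)

text = "Carla Okafor joined Wayne Enterprises. She leads the platform team."

print(fixed_size_chunks(text))
# ['Carla Okafor joined Wayne En', 'terprises. She leads the pla', 'tform team.']
# The 'Wayne En' fragment is exactly the kind of truncation stub that
# inflates entity counts downstream.

print(sentence_chunks(text))
# ['Carla Okafor joined Wayne Enterprises.', 'She leads the platform team.']
```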