feat(ingestion): default to SentenceTokenCapChunking in ingest()/update() #254
galshubeli wants to merge 1 commit into
Conversation
…te()

Changes the default chunker that ``GraphRAG.ingest()`` and ``GraphRAG.update()`` fall back to when the caller doesn't pass an explicit ``chunker=``. Was ``FixedSizeChunking()``; now ``SentenceTokenCapChunking()`` (sentence-aware, max_tokens=512, overlap_sentences=2 — the strategy's own defaults).

Why
---

``FixedSizeChunking`` splits on a hard character window with no awareness of sentence, word, or paragraph boundaries. When the window cuts through an entity name, the per-chunk LLM extractor produces a stub entity for the fragment (``"Wayne Enterprises"`` → ``"Wayne En"`` in chunk N plus unparsable text in chunk N+1). These stubs never merge with their full forms during resolution because their embeddings differ enough that LLMVerifiedResolution scores them below the soft threshold. This silently inflates cypher counts and pollutes "which X" lists.

The strategy that surfaced this — ``CypherFirstAggregationStrategy`` — was hitting a 6/7 ceiling on the internal aggregation benchmark, with one question failing because of these stubs. Switching to ``SentenceTokenCapChunking`` cleared the benchmark to a stable 7/7 across three runs. The post-ingest graph state went from 11-14 organization nodes (including ``Glo`` / ``Initech System`` / ``Wayne En``) to exactly 10 clean orgs, and from 66-80 ``Person`` nodes (with ``Carla`` / ``Carla Okafor`` duplicates) to exactly 56 distinct persons — matching the corpus.

A side benefit: sentence-aware chunks with 2-sentence overlap almost always keep a person's first mention in the same chunk as their later short-form references, so per-chunk FastCoref now binds ``Carla → Carla Okafor`` reliably. That eliminates the short-form-duplicate class too, not just the truncation stubs.

Compatibility
-------------

``FixedSizeChunking`` remains exported and fully supported — callers who explicitly pass ``chunker=FixedSizeChunking()`` get unchanged behavior. Existing tests (748 passed, 24 skipped) pass without modification: no test in the suite asserts on chunk count or content shape from the default chunker, so switching defaults doesn't break the suite. Callers who relied on the previous default and want to keep it should pass ``chunker=FixedSizeChunking()`` explicitly. The docstrings call out the new default and reference ``FixedSizeChunking`` as the opt-in character-window alternative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
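At the call site, the change looks like this (a minimal sketch: the top-level import path and the exact ``ingest()`` signature are assumptions based on the description above, which says ``FixedSizeChunking`` remains exported):

```python
from graphrag_sdk import GraphRAG, FixedSizeChunking, SentenceTokenCapChunking

def ingest_corpus(rag: GraphRAG, text: str) -> None:
    # New default: omitting chunker= now means sentence-aware chunking
    # (max_tokens=512, overlap_sentences=2, the strategy's own defaults).
    rag.ingest(text)

    # The same thing, stated explicitly:
    rag.ingest(text, chunker=SentenceTokenCapChunking())

    # The previous default, now opt-in; behavior is unchanged:
    rag.ingest(text, chunker=FixedSizeChunking())
```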
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info: configuration used: defaults; review profile: CHILL; plan: Pro; files selected for processing: 1
📝 Walkthrough

GraphRAG's default chunking strategy shifts from fixed-size to sentence-aware tokenization. The import is updated, and the API docstrings document the new default.

Changes: Default Chunking Strategy Update
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 5 passed
Pull request overview
Updates GraphRAG’s ingestion defaults so callers who don’t explicitly pass chunker= use sentence-aware chunking, improving extraction quality by reducing mid-entity boundary splits.
Changes:
- Switch default chunker for `GraphRAG.ingest()`/`_ingest_single()` from `FixedSizeChunking()` to `SentenceTokenCapChunking()`.
- Switch default chunker for `GraphRAG.update()` from `FixedSizeChunking()` to `SentenceTokenCapChunking()`.
- Update public docstrings to reflect the new default chunker.
Comments suppressed due to low confidence (1)
graphrag_sdk/src/graphrag_sdk/api/main.py:1019
- Same concern as ingest(): update() now defaults to SentenceTokenCapChunking(), but that chunker can produce over-cap chunks when a single sentence exceeds max_tokens. This can inflate Step 2 verification prompts and cause provider context-limit errors. Consider fixing the chunker to enforce the cap or documenting the over-cap edge case.
```python
pipeline = IngestionPipeline(
    loader=loader or TextLoader(),  # unused (text is provided below)
    chunker=chunker or SentenceTokenCapChunking(),
    extractor=extractor or self._default_extractor(),
```
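If that edge case matters for a given corpus, one workaround on the caller's side is a pre-pass that hard-splits any single sentence longer than the cap before ingestion. This is a sketch only: the helper below is hypothetical, not part of this PR, and its whitespace-based token count is a stand-in for whatever tokenizer ``SentenceTokenCapChunking`` actually uses.

```python
import re

def split_overlong_sentences(text: str, max_tokens: int = 512) -> str:
    """Break any sentence longer than max_tokens into max_tokens-sized
    pieces so no downstream chunk can exceed the cap."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pieces: list[str] = []
    for sentence in sentences:
        words = sentence.split()  # crude whitespace "tokens" (assumption)
        if len(words) <= max_tokens:
            pieces.append(sentence)
        else:
            # Hard-split the overlong sentence on token boundaries.
            for i in range(0, len(words), max_tokens):
                pieces.append(" ".join(words[i : i + max_tokens]))
    return " ".join(pieces)
```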
```text
— sentence-aware, never splits entity names at chunk boundaries.
Override with ``chunker=FixedSizeChunking(...)`` if you need
character-window chunking.
```
```diff
 pipeline = IngestionPipeline(
     loader=loader or TextLoader(),
-    chunker=chunker or FixedSizeChunking(),
+    chunker=chunker or SentenceTokenCapChunking(),
     extractor=extractor or self._default_extractor(),
```
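As a toy illustration of why this one-line diff matters, here is the failure mode from the PR description in miniature (standalone code; both functions are simplified stand-ins for the real strategies, not the SDK's implementation):

```python
import re

def fixed_size_chunks(text: str, window: int = 28) -> list[str]:
    # Hard character window: ignores sentence and word boundaries.
    return [text[i : i + window] for i in range(0, len(text), window)]

def sentence_chunks(text: str) -> list[str]:
    # Sentence-aware: boundaries only ever fall between sentences.
    return re.split(r"(?<=[.!?])\s+", text)

text = "Carla Okafor joined Wayne Enterprises. She leads the platform team."

print(fixed_size_chunks(text))
# ['Carla Okafor joined Wayne En', 'terprises. She leads the pla', 'tform team.']
# The 'Wayne En' fragment is exactly the kind of truncation stub that
# inflates entity counts downstream.

print(sentence_chunks(text))
# ['Carla Okafor joined Wayne Enterprises.', 'She leads the platform team.']
```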