
feat(ingestion): default to SentenceTokenCapChunking in ingest()/upda…#254

Open
galshubeli wants to merge 1 commit into main from feat/default-sentence-aware-chunker

Conversation

@galshubeli
Collaborator

@galshubeli galshubeli commented May 14, 2026

…te()

Changes the default chunker that GraphRAG.ingest() and GraphRAG.update() fall back to when the caller doesn't pass an explicit chunker=. Was FixedSizeChunking(); now SentenceTokenCapChunking() (sentence-aware, max_tokens=512, overlap_sentences=2 — the strategy's own defaults).
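The fallback is the usual `chunker or Default()` pattern; a minimal sketch of what changed (class bodies are simplified stand-ins for the real implementations in graphrag_sdk, and only the right-hand side of the `or` is new):

```python
# Sketch of the default-chunker fallback described in this PR.
# The real classes live in graphrag_sdk; these are simplified stand-ins.
class FixedSizeChunking:
    """Old default: hard character-window chunking."""

class SentenceTokenCapChunking:
    """New default: sentence-aware, token-capped chunking."""
    def __init__(self, max_tokens=512, overlap_sentences=2):
        self.max_tokens = max_tokens
        self.overlap_sentences = overlap_sentences

def ingest(text, chunker=None):
    # An explicitly passed chunker always wins; only the fallback
    # on the right of `or` changed in this PR.
    chunker = chunker or SentenceTokenCapChunking()
    return type(chunker).__name__

print(ingest("some document"))                        # SentenceTokenCapChunking
print(ingest("some document", FixedSizeChunking()))   # FixedSizeChunking
```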

Why

FixedSizeChunking splits on a hard character window with no awareness of sentence, word, or paragraph boundaries. When the window cuts through an entity name, the per-chunk LLM extractor produces a stub entity for the fragment ("Wayne Enterprises" → "Wayne En" in chunk N plus unparsable text in chunk N+1). These stubs never merge with their full forms during resolution because their embeddings differ enough that LLMVerifiedResolution scores them below the soft threshold.
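The failure mode is easy to reproduce with a toy character-window chunker (a sketch, not the library's implementation):

```python
def fixed_size_chunks(text, size):
    # Hard character window: slices every `size` characters with no
    # awareness of sentence, word, or entity boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

text = "The deal was closed by Wayne Enterprises last quarter."
chunks = fixed_size_chunks(text, 31)
print(chunks[0])  # 'The deal was closed by Wayne En'
print(chunks[1])  # 'terprises last quarter.'
```

A per-chunk extractor sees "Wayne En" as a complete mention in chunk 0 and an unparsable leftover in chunk 1, exactly the stub pattern described above.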

This silently inflates Cypher aggregation counts and pollutes "which X" lists. The strategy that surfaced this — CypherFirstAggregationStrategy — was hitting a 6/7 ceiling on the internal aggregation benchmark, with one question failing because of these stubs. Switching to SentenceTokenCapChunking cleared the benchmark to a stable 7/7 across three runs. The post-ingest graph went from 11-14 organization nodes (including Glo / Initech System / Wayne En) to exactly 10 clean orgs, and from 66-80 Person nodes (with Carla / Carla Okafor duplicates) to exactly 56 distinct persons — matching the corpus.

A side benefit: sentence-aware chunks with 2-sentence overlap almost always keep a person's first mention in the same chunk as their later short-form references, so per-chunk FastCoref now binds Carla → Carla Okafor reliably. That eliminates the short-form-duplicate class too, not just the truncation stubs.
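A toy sentence packer shows why the overlap helps coreference (greedy sentence counting stands in for the real token-capped packing; names are from the PR's example, not real data):

```python
def sentence_chunks(sentences, max_per_chunk, overlap):
    # Greedy sentence packing with a trailing-sentence overlap, a toy
    # stand-in for SentenceTokenCapChunking's token-capped packing.
    chunks, i = [], 0
    while i < len(sentences):
        chunks.append(sentences[i:i + max_per_chunk])
        if i + max_per_chunk >= len(sentences):
            break
        i += max_per_chunk - overlap
    return chunks

sents = [
    "Carla Okafor joined in March.",
    "She leads the data team.",
    "Carla also owns the ingest roadmap.",
    "The roadmap ships in Q3.",
]
chunks = sentence_chunks(sents, 3, 2)
# The overlap keeps the full name "Carla Okafor" in the same chunk as
# the later short-form "Carla", so per-chunk coref can bind the two.
```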

Compatibility

FixedSizeChunking remains exported and fully supported — callers who explicitly pass chunker=FixedSizeChunking() get unchanged behavior. Existing tests (748 passed, 24 skipped) pass without modification: no test in the suite asserts on chunk count or content shape from the default chunker, so switching defaults doesn't break the suite.

Callers who relied on the previous default and want to keep it should pass chunker=FixedSizeChunking() explicitly. The docstrings call out the new default and reference FixedSizeChunking as the opt-in character-window alternative.


Summary by CodeRabbit

  • New Features
    • Updated default text chunking strategy to use sentence-aware processing with improved entity name handling for data ingestion, updates, and change application.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af7f5953-ecf8-44fe-b9b3-beb737752bad

📥 Commits

Reviewing files that changed from the base of the PR and between a90ad18 and 2d220aa.

📒 Files selected for processing (1)
  • graphrag_sdk/src/graphrag_sdk/api/main.py

📝 Walkthrough

Walkthrough

GraphRAG's default chunking strategy shifts from fixed-size to sentence-aware tokenization. The import is updated, API docstrings document the new SentenceTokenCapChunking() default and its entity-boundary preservation, and both ingest pipeline constructions apply the new default chunker.

Changes

Default Chunking Strategy Update

  • Import and API documentation updates (graphrag_sdk/src/graphrag_sdk/api/main.py): SentenceTokenCapChunking is imported; ingest() and apply_changes() docstrings updated to document sentence-aware chunking with max_tokens=512 and overlap_sentences=2, preserving entity boundaries.
  • Default chunker in ingest pipeline (graphrag_sdk/src/graphrag_sdk/api/main.py): IngestionPipeline in _ingest_single() and update() defaults to SentenceTokenCapChunking() instead of FixedSizeChunking() when no chunker is explicitly provided.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A sentence-wise chunker hops into place,
No more fixed boxes—entity grace!
Tokens align where the boundaries are,
Docs and code dance beneath one star.
The ingest pipeline learns to see—
One small change, one new decree. 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed. The title accurately describes the main change: switching the default chunking strategy from FixedSizeChunking to SentenceTokenCapChunking in the ingest() and update() methods.
  • Docstring Coverage ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Linked Issues Check ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed. Check skipped because no linked issues were found for this pull request.




Copilot AI left a comment


Pull request overview

Updates GraphRAG’s ingestion defaults so callers who don’t explicitly pass chunker= use sentence-aware chunking, improving extraction quality by reducing mid-entity boundary splits.

Changes:

  • Switch default chunker for GraphRAG.ingest()/_ingest_single() from FixedSizeChunking() to SentenceTokenCapChunking().
  • Switch default chunker for GraphRAG.update() from FixedSizeChunking() to SentenceTokenCapChunking().
  • Update public docstrings to reflect the new default chunker.
Comments suppressed due to low confidence (1)

graphrag_sdk/src/graphrag_sdk/api/main.py:1019

  • Same concern as ingest(): update() now defaults to SentenceTokenCapChunking(), but that chunker can produce over-cap chunks when a single sentence exceeds max_tokens. This can inflate Step 2 verification prompts and cause provider context-limit errors. Consider fixing the chunker to enforce the cap or documenting the over-cap edge case.
        pipeline = IngestionPipeline(
            loader=loader or TextLoader(),  # unused (text is provided below)
            chunker=chunker or SentenceTokenCapChunking(),
            extractor=extractor or self._default_extractor(),


Comment on lines +326 to +328
— sentence-aware, never splits entity names at chunk boundaries.
Override with ``chunker=FixedSizeChunking(...)`` if you need
character-window chunking.
Comment on lines 535 to 538
        pipeline = IngestionPipeline(
            loader=loader or TextLoader(),
-           chunker=chunker or FixedSizeChunking(),
+           chunker=chunker or SentenceTokenCapChunking(),
            extractor=extractor or self._default_extractor(),


2 participants