fix(chunking): preserve sentence order in NlpSentenceChunking #1910

Closed
kuishou68 wants to merge 1 commit into unclecode:main from kuishou68:fix/nlp-sentence-chunking-order

Conversation

@kuishou68

Summary

Fixes #1909

Bug

NlpSentenceChunking.chunk() returned list(set(sens)), which discards the document order of the sentences (Python sets are unordered) and silently drops sentences that legitimately appear more than once.
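
A minimal sketch of the failure mode, using a hard-coded list in place of the tokenizer output. The sample sentences and the name `via_set` are illustrative and not taken from the repository:

```python
# Stand-in for the list the sentence tokenizer would return, in document order.
sens = [
    "The crawler starts.",
    "It fetches a page.",
    "It fetches a page.",  # a legitimately repeated sentence
    "The crawler stops.",
]

# Old behavior: round-trip through a set before returning.
via_set = list(set(sens))

# The repeated sentence collapses into one entry, so content is lost.
print(len(sens), len(via_set))

# A set of strings has no guaranteed iteration order, and string hashing is
# randomized per process by default, so the order may differ between runs.
print(via_set)
```

Because the order depends on string hashing, the second line of output may differ from one run to the next.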

Fix

Return sens directly: nltk.sent_tokenize() already returns sentences in document order.

Using list(set(sens)) destroys sentence order and incorrectly deduplicates.
Fix: return the sentences list directly.
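
The fix can be sketched as follows. This is not the repository's code: `chunk_sentences` and `naive_tokenize` are hypothetical names, and the tokenizer is a simple stand-in for nltk.sent_tokenize() so the example runs without NLTK data installed.

```python
from typing import Callable, List


def naive_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for nltk.sent_tokenize(): split on periods only.
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def chunk_sentences(text: str, tokenize: Callable[[str], List[str]]) -> List[str]:
    """Return the sentences exactly as the tokenizer produced them."""
    sens = list(tokenize(text))
    # No set() round-trip: document order and repeated sentences are kept.
    return sens


text = "First point. Second point. First point."
result = chunk_sentences(text, naive_tokenize)
print(result)
```

With this shape, the repeated sentence "First point." appears twice in the result, in the same positions it occupies in the input.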
@ntohidi
Collaborator

ntohidi commented Apr 11, 2026

Thanks for your contribution. This is a clean fix.
While fixing this, I also found that NlpSentenceChunking.__init__() had a broken re-import (from crawl4ai.le.legacy.model_loader import ...) that shadows the working top-level import, so I've opened a new PR, #1913.
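
For readers unfamiliar with this kind of shadowing, here is a rough sketch of how a function-local re-import can break code even when the module-level import works. The package name `nonexistent_pkg_for_demo` and both function names are made up for illustration; the actual failure in the repository may look different.

```python
from json import dumps  # working module-level import


def init_with_broken_reimport() -> str:
    # Hypothetical broken path. The import statement runs on every call and
    # raises before `dumps` is used, despite the working import above.
    from nonexistent_pkg_for_demo.legacy.model_loader import dumps
    return dumps({"ok": True})


def init_with_top_level_import() -> str:
    # Relies only on the module-level name.
    return dumps({"ok": True})


try:
    init_with_broken_reimport()
    outcome = "no error"
except ModuleNotFoundError:
    outcome = "failed"

print(outcome)
print(init_with_top_level_import())
```

Inside the first function, the local import makes `dumps` a local name, so the module-level binding is never consulted there.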

@ntohidi ntohidi closed this Apr 11, 2026
Development

Successfully merging this pull request may close these issues.

bug: NlpSentenceChunking.chunk() uses list(set(sens)) which destroys sentence order