test: add 41 edge-case tests for chunking module by voidborne-d · Pull Request #436 · google/langextract

voidborne-d · 2026-04-06T14:51:27Z

Closes #430

Summary

Adds edge-case tests for langextract/chunking.py that exercise the untested code paths reported in #430.

New tests (41 total, organized into 11 groups covering 12 focused test classes)

`SentenceIteratorGuardTest` (4 tests)

Negative curr_token_pos → IndexError
curr_token_pos past document end → IndexError
curr_token_pos at exact end → StopIteration
Mid-sentence start yields partial sentence first
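
The guard behaviour listed above can be illustrated with a small stand-in. This is only a sketch: `SentenceCursor` is a hypothetical class written for this example, it is not langextract's `SentenceIterator`, and it assumes a sentence simply ends at a `"."` token.

```python
from typing import List, Sequence


class SentenceCursor:
    """Hypothetical stand-in for a sentence iterator (not the library class)."""

    def __init__(self, tokens: Sequence[str], curr_token_pos: int = 0) -> None:
        # Guard clauses: reject positions before the start or past the end.
        if curr_token_pos < 0:
            raise IndexError(f"curr_token_pos {curr_token_pos} is negative")
        if curr_token_pos > len(tokens):
            raise IndexError(f"curr_token_pos {curr_token_pos} is past the end")
        self._tokens = tokens
        self._pos = curr_token_pos

    def __iter__(self) -> "SentenceCursor":
        return self

    def __next__(self) -> List[str]:
        # A position exactly at the end is valid but yields nothing.
        if self._pos == len(self._tokens):
            raise StopIteration
        end = self._pos
        # Assumption: a "." token closes a sentence.
        while end < len(self._tokens) and self._tokens[end] != ".":
            end += 1
        end = min(end + 1, len(self._tokens))
        sentence = list(self._tokens[self._pos:end])
        self._pos = end
        return sentence
```

Starting at position 1 of `["A", "b", ".", "C", "d", "."]` yields the partial sentence `["b", "."]` first, then the full sentence `["C", "d", "."]`.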

`CreateTokenIntervalTest` (4 tests)

Negative start index → ValueError
Start equals end → ValueError
Start greater than end → ValueError
Valid interval construction
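
As a rough illustration of the validation rules these four tests pin down, here is a self-contained sketch. The `TokenInterval` dataclass and this `create_token_interval` are stand-ins written for the example and are assumptions about the shape of the real code; only the error conditions come from the list above.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class TokenInterval:
    """Stand-in type: half-open token range [start_index, end_index)."""

    start_index: int
    end_index: int


def create_token_interval(start_index: int, end_index: int) -> TokenInterval:
    """Stand-in validator mirroring the rules listed above."""
    if start_index < 0:
        raise ValueError(f"start_index must be non-negative, got {start_index}")
    if start_index >= end_index:
        raise ValueError(
            f"start_index ({start_index}) must be < end_index ({end_index})"
        )
    return TokenInterval(start_index, end_index)


def raises_value_error(start_index: int, end_index: int) -> bool:
    """Helper for the example: True if construction is rejected."""
    try:
        create_token_interval(start_index, end_index)
    except ValueError:
        return True
    return False
```

With this sketch, `raises_value_error(-1, 3)`, `raises_value_error(2, 2)` and `raises_value_error(5, 2)` are all true, while `(0, 3)` constructs a valid interval.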

`GetTokenIntervalTextTest` (3 tests)

start_index >= end_index → ValueError
Equal start/end → ValueError
TokenUtilError when tokenizer returns empty string for non-empty text

`GetCharIntervalTest` (2 tests)

start_index >= end_index → ValueError
Valid char interval with correct bounds

`ChunkIteratorConstructorTest` (6 tests)

Both text and document None → ValueError
text=None falls back to document.text
Empty TokenizedText (no tokens) triggers re-tokenization
Empty TokenizedText with document falls back to document.text
String input is tokenized automatically
No document → creates default Document

`TextChunkErrorPathTest` (2 tests)

chunk_text raises ValueError without document
char_interval raises ValueError without document

`SanitizedChunkTextTest` (3 tests)

Whitespace normalization (newlines, consecutive spaces)
Result is cached (same object on repeated access)
_sanitize raises ValueError for whitespace-only input
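
One plausible way to get the normalization and empty-input behaviour described above is a single regular-expression pass. The `sanitize` function below is a hypothetical stand-in for this example, not the module's `_sanitize`.

```python
import re


def sanitize(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    sanitized = re.sub(r"\s+", " ", text.strip())
    if not sanitized:
        # Whitespace-only input leaves nothing to return.
        raise ValueError("Sanitized text is empty.")
    return sanitized
```

For instance, `sanitize("a\n\nb   c")` returns `"a b c"`, and `sanitize(" \n ")` raises `ValueError`.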

`LazyCachingTest` (6 tests)

_chunk_text starts as None, populated after first access, same object on repeat
_char_interval starts as None, populated after first access, same object on repeat
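
The lazy-caching contract checked here (private field starts as `None`, is filled on first access, and the same object is returned afterwards) can be sketched as follows. `LazyChunk` is a hypothetical class for illustration only; the attribute name `_chunk_text` is borrowed from the list above.

```python
from typing import Optional


class LazyChunk:
    """Hypothetical chunk that computes its text once, on first access."""

    def __init__(self, source: str, start: int, end: int) -> None:
        self._source = source
        self._start = start
        self._end = end
        self._chunk_text: Optional[str] = None  # not computed yet

    @property
    def chunk_text(self) -> str:
        if self._chunk_text is None:
            # First access: compute and store the slice.
            self._chunk_text = self._source[self._start:self._end]
        # Later accesses return the stored object unchanged.
        return self._chunk_text
```

A test in this style checks identity (`is`) on the second access rather than equality, since equality alone would also pass for a property that recomputes the value every time.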

`MakeBatchesTest` (4 tests)

batch_size=1 → single-element batches
batch_size=2 → two-element batches
batch_size larger than total chunks → single batch
Empty iterator → no batches
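
The four batching cases can be reproduced with a generic helper. `make_batches` below is a hypothetical stand-in built on `itertools.islice`; it is not the library's `make_batches_of_textchunk`, and it works on any iterable rather than on text chunks.

```python
import itertools
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def make_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items."""
    iterator = iter(items)
    # islice returns an empty list once the iterator is exhausted,
    # which ends the loop without yielding an empty batch.
    while batch := list(itertools.islice(iterator, batch_size)):
        yield batch
```

Under this sketch, an empty input produces no batches at all, and a `batch_size` larger than the input produces exactly one batch holding every item.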

`BrokenSentenceFlagTest` (3 tests)

Flag set to True when a sentence is split
Flag reset to False after broken sentence fully consumed
Flag stays False when all sentences fit

`TextChunkStrTest` + `TextChunkPropertyCoverageTest` (4 tests)

__str__ shows "unavailable" when document is missing
__str__ includes document ID and text when present
document_id / document_text return None without document

Verification

$ python -m pytest tests/chunking_test.py tests/chunking_edge_cases_test.py -q
60 passed in 0.55s

All 41 new tests pass alongside the existing 19 tests.

github-actions · 2026-04-06T14:51:39Z

No linked issues found. Please link an issue in your pull request description or title.

Per our Contributing Guidelines, all PRs must:

Reference an issue with one of:
- Closing keywords: Fixes #123, Closes #123, Resolves #123 (auto-closes on merge in the same repository)
- Reference keywords: Related to #123, Refs #123, Part of #123, See #123 (links without closing)
The linked issue should have 5+ 👍 reactions from unique users (excluding bots and the PR author)
Include discussion demonstrating the importance of the change

You can also use cross-repo references like owner/repo#123 or full URLs.

Cover all untested code paths reported in google#430:

- SentenceIterator.__init__ guard clauses (negative / out-of-range curr_token_pos, position at end of tokens, mid-sentence start)
- create_token_interval / get_token_interval_text / get_char_interval ValueError and TokenUtilError error paths
- ChunkIterator constructor edge cases (both text and document None, text=None falling back to document.text, empty TokenizedText triggering re-tokenization, string input tokenization, default document creation)
- TextChunk.chunk_text / char_interval raising ValueError when document is None
- TextChunk.sanitized_chunk_text including whitespace normalization, caching, and empty-input ValueError
- Lazy caching verification for _chunk_text and _char_interval
- make_batches_of_textchunk with batch_size=1, 2, larger-than-total, and empty iterator
- broken_sentence flag lifecycle (set on split, reset after completion, stays False for small text)
- TextChunk.__str__ with and without document
- document_id / document_text returning None without document

All 41 new tests pass alongside existing 19 tests (60 total).

github-actions bot added the size/M label (pull request with 150-600 lines changed) Apr 6, 2026

voidborne-d marked this pull request as draft April 6, 2026 19:12

voidborne-d force-pushed the test/chunking-edge-cases branch from 8fa6ce3 to eba27d0 April 15, 2026 03:53

voidborne-d changed the title from "test: cover chunking edge cases" to "test: add 41 edge-case tests for chunking module" Apr 15, 2026

voidborne-d wants to merge 1 commit into google:main from voidborne-d:test/chunking-edge-cases
