test: add 41 edge-case tests for chunking module #436

Draft
voidborne-d wants to merge 1 commit into google:main from voidborne-d:test/chunking-edge-cases

voidborne-d (Contributor) commented Apr 6, 2026

Closes #430

Summary

Adds comprehensive edge-case test coverage for langextract/chunking.py, covering all untested code paths reported in #430.

New tests (41 total, organized into 12 focused test classes)

SentenceIteratorGuardTest (4 tests)

  • Negative curr_token_pos → IndexError
  • curr_token_pos past document end → IndexError
  • curr_token_pos at exact end → StopIteration
  • Mid-sentence start yields partial sentence first

CreateTokenIntervalTest (4 tests)

  • Negative start index → ValueError
  • Start equals end → ValueError
  • Start greater than end → ValueError
  • Valid interval construction
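
For illustration only, the four cases above can be sketched against a stand-in. The `TokenInterval` dataclass and the `create_token_interval` body below are assumptions written for this example (they are not copied from langextract/chunking.py), and the error messages are invented.

```python
import dataclasses
import unittest


@dataclasses.dataclass(frozen=True)
class TokenInterval:
    """Hypothetical stand-in for the library's token interval type."""
    start_index: int
    end_index: int


def create_token_interval(start_index: int, end_index: int) -> TokenInterval:
    """Stand-in with the guard behavior listed above (assumed, not the real code)."""
    if start_index < 0:
        raise ValueError(f"start_index must be non-negative, got {start_index}")
    if start_index >= end_index:
        raise ValueError(
            f"start_index ({start_index}) must be less than end_index ({end_index})"
        )
    return TokenInterval(start_index, end_index)


class CreateTokenIntervalSketchTest(unittest.TestCase):

    def test_negative_start_raises(self):
        with self.assertRaises(ValueError):
            create_token_interval(-1, 3)

    def test_start_equals_end_raises(self):
        with self.assertRaises(ValueError):
            create_token_interval(2, 2)

    def test_start_greater_than_end_raises(self):
        with self.assertRaises(ValueError):
            create_token_interval(5, 2)

    def test_valid_interval(self):
        interval = create_token_interval(0, 3)
        self.assertEqual((interval.start_index, interval.end_index), (0, 3))
```

Keeping one assertion per test method means a failure points at exactly one guard clause.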

GetTokenIntervalTextTest (3 tests)

  • start_index >= end_index → ValueError
  • Equal start/end → ValueError
  • TokenUtilError when tokenizer returns empty string for non-empty text

GetCharIntervalTest (2 tests)

  • start_index >= end_index → ValueError
  • Valid char interval with correct bounds

ChunkIteratorConstructorTest (6 tests)

  • Both text and document None → ValueError
  • text=None falls back to document.text
  • Empty TokenizedText (no tokens) triggers re-tokenization
  • Empty TokenizedText with document falls back to document.text
  • String input is tokenized automatically
  • No document → creates default Document

TextChunkErrorPathTest (2 tests)

  • chunk_text raises ValueError without document
  • char_interval raises ValueError without document

SanitizedChunkTextTest (3 tests)

  • Whitespace normalization (newlines, consecutive spaces)
  • Result is cached (same object on repeated access)
  • _sanitize raises ValueError for whitespace-only input
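
As a rough sketch of the behavior these tests pin down, whitespace normalization with an empty-result guard might look like the following. The `sanitize` function is a hypothetical stand-in for `_sanitize`; the regex and the error message are assumptions, not the module's actual code.

```python
import re


def sanitize(text: str) -> str:
    """Hypothetical stand-in: collapse every whitespace run to a single space."""
    sanitized = re.sub(r"\s+", " ", text.strip())
    if not sanitized:
        # Whitespace-only (or empty) input leaves nothing to return.
        raise ValueError("Sanitized text is empty.")
    return sanitized


# Newlines and consecutive spaces both collapse to one space.
assert sanitize("first line\n\nsecond   line") == "first line second line"
```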

LazyCachingTest (6 tests)

  • _chunk_text starts as None, populated after first access, same object on repeat
  • _char_interval starts as None, populated after first access, same object on repeat
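
The three checks per field (starts as None, populated after first access, same object on repeat) follow a common lazy-property pattern. A minimal sketch, assuming a simplified `LazyChunk` class invented for this example rather than the real `TextChunk`:

```python
from typing import Optional


class LazyChunk:
    """Hypothetical stand-in for a chunk with a lazily computed text field."""

    def __init__(self, source: str, start: int, end: int):
        self._source = source
        self._start = start
        self._end = end
        self._chunk_text: Optional[str] = None  # not computed yet

    @property
    def chunk_text(self) -> str:
        # Compute on first access, then reuse the cached value.
        if self._chunk_text is None:
            self._chunk_text = self._source[self._start:self._end]
        return self._chunk_text


chunk = LazyChunk("alpha beta gamma", 6, 10)
assert chunk._chunk_text is None       # nothing cached before first access
first = chunk.chunk_text
assert chunk._chunk_text is not None   # populated by the first access
assert chunk.chunk_text is first       # repeat access returns the cached object
```

Checking identity with `is` rather than equality is what distinguishes a cached value from one that is recomputed on every access.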

MakeBatchesTest (4 tests)

  • batch_size=1 → single-element batches
  • batch_size=2 → two-element batches
  • batch_size larger than total chunks → single batch
  • Empty iterator → no batches
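
The batch-size cases can be illustrated with a generic batching helper. `make_batches` below is a hypothetical stand-in built on `itertools.islice`; it is not the signature or implementation of `make_batches_of_textchunk`, which operates on text chunks rather than arbitrary items.

```python
import itertools
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def make_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Hypothetical stand-in: yield lists of up to batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # source exhausted, including the empty-input case
        yield batch


chunks = ["a", "b", "c"]
assert list(make_batches(chunks, 1)) == [["a"], ["b"], ["c"]]
assert list(make_batches(chunks, 10)) == [["a", "b", "c"]]
assert list(make_batches([], 2)) == []
```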

BrokenSentenceFlagTest (3 tests)

  • Flag set to True when a sentence is split
  • Flag reset to False after broken sentence fully consumed
  • Flag stays False when all sentences fit

TextChunkStrTest + TextChunkPropertyCoverageTest (4 tests)

  • __str__ shows "unavailable" when document is missing
  • __str__ includes document ID and text when present
  • document_id / document_text return None without document

Verification

$ python -m pytest tests/chunking_test.py tests/chunking_edge_cases_test.py -q
60 passed in 0.55s

All 41 new tests pass alongside the existing 19 tests.

github-actions bot added the size/M label (Pull request with 150-600 lines changed) Apr 6, 2026

github-actions bot commented Apr 6, 2026

No linked issues found. Please link an issue in your pull request description or title.

Per our Contributing Guidelines, all PRs must:

  • Reference an issue with one of:
    • Closing keywords: Fixes #123, Closes #123, Resolves #123 (auto-closes on merge in the same repository)
    • Reference keywords: Related to #123, Refs #123, Part of #123, See #123 (links without closing)
  • The linked issue should have 5+ 👍 reactions from unique users (excluding bots and the PR author)
  • Include discussion demonstrating the importance of the change

You can also use cross-repo references like owner/repo#123 or full URLs.

voidborne-d marked this pull request as draft April 6, 2026 19:12
Cover all untested code paths reported in google#430:

- SentenceIterator.__init__ guard clauses (negative / out-of-range
  curr_token_pos, position at end of tokens, mid-sentence start)
- create_token_interval / get_token_interval_text / get_char_interval
  ValueError and TokenUtilError error paths
- ChunkIterator constructor edge cases (both text and document None,
  text=None falling back to document.text, empty TokenizedText
  triggering re-tokenization, string input tokenization, default
  document creation)
- TextChunk.chunk_text / char_interval raising ValueError when
  document is None
- TextChunk.sanitized_chunk_text including whitespace normalization,
  caching, and empty-input ValueError
- Lazy caching verification for _chunk_text and _char_interval
- make_batches_of_textchunk with batch_size=1, 2, larger-than-total,
  and empty iterator
- broken_sentence flag lifecycle (set on split, reset after
  completion, stays False for small text)
- TextChunk.__str__ with and without document
- document_id / document_text returning None without document

All 41 new tests pass alongside existing 19 tests (60 total).
voidborne-d force-pushed the test/chunking-edge-cases branch from 8fa6ce3 to eba27d0 April 15, 2026 03:53
voidborne-d changed the title from "test: cover chunking edge cases" to "test: add 41 edge-case tests for chunking module" Apr 15, 2026

Development

Successfully merging this pull request may close these issues.

test: Add missing test coverage for chunking module edge cases