test: add 41 edge-case tests for chunking module #436
Draft
voidborne-d wants to merge 1 commit into
Cover all untested code paths reported in google#430:

- SentenceIterator.__init__ guard clauses (negative / out-of-range curr_token_pos, position at end of tokens, mid-sentence start)
- create_token_interval / get_token_interval_text / get_char_interval ValueError and TokenUtilError error paths
- ChunkIterator constructor edge cases (both text and document None, text=None falling back to document.text, empty TokenizedText triggering re-tokenization, string input tokenization, default document creation)
- TextChunk.chunk_text / char_interval raising ValueError when document is None
- TextChunk.sanitized_chunk_text including whitespace normalization, caching, and empty-input ValueError
- Lazy caching verification for _chunk_text and _char_interval
- make_batches_of_textchunk with batch_size=1, 2, larger-than-total, and empty iterator
- broken_sentence flag lifecycle (set on split, reset after completion, stays False for small text)
- TextChunk.__str__ with and without document
- document_id / document_text returning None without document

All 41 new tests pass alongside existing 19 tests (60 total).
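For reviewers who want a quick picture of the batching cases listed above, here is a minimal sketch. It does not call langextract: `make_batches` is a hypothetical stand-in written for illustration, and the real `make_batches_of_textchunk` may differ in signature and behavior.

```python
import itertools
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def make_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Hypothetical stand-in: yield lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        # islice pulls up to batch_size items without consuming more.
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # source exhausted, so no trailing empty batch is yielded
        yield batch


chunks = ["a", "b", "c"]
assert list(make_batches(chunks, 1)) == [["a"], ["b"], ["c"]]   # batch_size=1
assert list(make_batches(chunks, 2)) == [["a", "b"], ["c"]]     # batch_size=2
assert list(make_batches(chunks, 10)) == [["a", "b", "c"]]      # larger than total
assert list(make_batches(iter([]), 2)) == []                    # empty iterator
```

The four assertions mirror the four batching cases named in the commit message: one element per batch, two per batch with a short final batch, a single batch when batch_size exceeds the input, and no batches at all for an empty iterator.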
8fa6ce3 to
eba27d0
Closes #430
Summary
Adds comprehensive edge-case test coverage for langextract/chunking.py, covering all untested code paths reported in #430.

New tests (41 total, organized into 11 focused test classes)
SentenceIteratorGuardTest (4 tests)
- Negative curr_token_pos → IndexError
- curr_token_pos past document end → IndexError
- curr_token_pos at exact end → StopIteration
- Mid-sentence start

CreateTokenIntervalTest (4 tests)
- ValueError error paths (three cases)

GetTokenIntervalTextTest (3 tests)
- start_index >= end_index → ValueError
- One further ValueError error path
- TokenUtilError when tokenizer returns empty string for non-empty text

GetCharIntervalTest (2 tests)
- start_index >= end_index → ValueError

ChunkIteratorConstructorTest (6 tests)
- Both text and document None → ValueError
- text=None falls back to document.text
- Empty TokenizedText (no tokens) triggers re-tokenization
- TokenizedText with document falls back to document.text
- String input tokenization
- Default Document creation

TextChunkErrorPathTest (2 tests)
- chunk_text raises ValueError without document
- char_interval raises ValueError without document

SanitizedChunkTextTest (3 tests)
- Whitespace normalization
- Caching
- _sanitize raises ValueError for whitespace-only input

LazyCachingTest (6 tests)
- _chunk_text starts as None, populated after first access, same object on repeat
- _char_interval starts as None, populated after first access, same object on repeat

MakeBatchesTest (4 tests)
- batch_size=1 → single-element batches
- batch_size=2 → two-element batches
- batch_size larger than total chunks → single batch
- Empty iterator

BrokenSentenceFlagTest (3 tests)
- True when a sentence is split
- False after broken sentence fully consumed
- False when all sentences fit

TextChunkStrTest + TextChunkPropertyCoverageTest (4 tests)
- __str__ shows "unavailable" when document is missing
- __str__ includes document ID and text when present
- document_id / document_text return None without document

Verification
All 41 new tests pass alongside the existing 19 tests.
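
As a rough illustration of what the lazy-caching checks assert (cache starts as None, is populated after first access, and the same object comes back on repeat), here is a self-contained sketch. `LazyChunk` and its constructor arguments are hypothetical and stand in for TextChunk only loosely; the sketch does not reproduce langextract's actual class or the tests in this PR.

```python
from typing import Optional


class LazyChunk:
    """Hypothetical stand-in for a chunk that computes its text on demand."""

    def __init__(self, source: Optional[str], start: int, end: int):
        self._source = source
        self._start = start
        self._end = end
        self._chunk_text: Optional[str] = None  # cache, empty until first access

    @property
    def chunk_text(self) -> str:
        if self._source is None:
            # Mirrors the "raises ValueError without document" error path.
            raise ValueError("source text is required to compute chunk_text")
        if self._chunk_text is None:
            self._chunk_text = self._source[self._start:self._end]
        return self._chunk_text


chunk = LazyChunk("hello world", 0, 5)
assert chunk._chunk_text is None          # starts as None
first = chunk.chunk_text
assert first == "hello"
assert chunk._chunk_text is not None      # populated after first access
assert chunk.chunk_text is first          # same object on repeat

try:
    LazyChunk(None, 0, 5).chunk_text
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```

Checking identity with `is` rather than equality is what distinguishes "cached" from "recomputed to an equal value", which is presumably why the description says "same object on repeat".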