Skip to content

test: add pytest suite + CI; fix empty-shard crash in JsonLineDataset#10

Open
ajinkyajawale14499 wants to merge 1 commit into
deepseek-ai:mainfrom
ajinkyajawale14499:test/pytest-suite-and-ci
Open

test: add pytest suite + CI; fix empty-shard crash in JsonLineDataset#10
ajinkyajawale14499 wants to merge 1 commit into
deepseek-ai:mainfrom
ajinkyajawale14499:test/pytest-suite-and-ci

Conversation

@ajinkyajawale14499

Copy link
Copy Markdown

What

Adds the repository's first automated test suite and CI, plus a small fix that the tests surfaced.

Tests (CPU-only, no GPU/model required)

  • tests/test_config.pyConfigNode attribute access, the recursive to_config_node / config_to_plain_dict converters, jsonable, CustomJSONEncoder (function/type/torch.dtype/Path/Namespace/ConfigNode), the --opts override parser (including the KeyError/TypeError paths), and load_config.
  • tests/test_jsonl_dataset.pyJsonLineDataset mmap line indexing: length with and without a trailing newline, per-record decoding, multi-file global indexing, out-of-range bounds, and the on-disk line-start cache (write + reuse). CACHE_DIR is redirected to a temp dir so tests are hermetic.
  • tests/test_sampling.py — deterministic surfaces of the sampling helpers: greedy/temperature-0 and softmax logits_to_probs, sample_tokens, gather_token_probs, sample_from_probs, and sample_residual (including the matched-distribution fallback).

CI

  • .github/workflows/ci.yml runs pytest on Python 3.10 and 3.12 with CPU-only torch, so it stays fast and needs no CUDA/triton. A root conftest.py keeps the project importable without an editable install.

Fix

  • While adding the dataset tests, JsonLineDataset was found to crash when a shard is 0 bytes: ValueError: cannot mmap an empty file. Empty shards are now treated as 0 records (with a matching regression test, including an empty shard alongside populated files).

Verification

All 42 tests pass locally (python -m pytest tests). Happy to adjust scope, naming, or the Python matrix to match your preferences.

Add a CPU-only pytest suite for the core pure-Python utilities and wire up
GitHub Actions CI:

- tests/test_config.py: ConfigNode, (to|from)-config-node converters, jsonable,
  CustomJSONEncoder, and the --opts override parser (incl. KeyError/TypeError
  paths) and load_config.
- tests/test_jsonl_dataset.py: mmap line indexing (trailing/no-trailing newline),
  per-record decoding, multi-file global indexing, bounds checks, and the
  on-disk line-start cache.
- tests/test_sampling.py: greedy/softmax logits_to_probs, sample_tokens,
  gather_token_probs, and sample_residual (incl. matched-distribution fallback).
- .github/workflows/ci.yml: runs pytest on Python 3.10/3.12 with CPU-only torch.

While adding the dataset tests, found and fixed a crash: JsonLineDataset raised
ValueError ('cannot mmap an empty file') when a shard was 0 bytes. Empty shards
are now treated as 0 records.

@rajpratham1 rajpratham1 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition! The new pytest suite significantly improves confidence in the core utilities, and the regression test for the empty JSONL shard fixes a real edge case that could otherwise crash dataset initialization. The CI workflow is lightweight and appropriate for CPU-only validation. Nice work improving both reliability and maintainability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants