Skip to content

fix(synthetic): align _SyntheticTextExamplesIterable with HuggingFace…#645

Closed
arturofredes wants to merge 1 commit intovllm-project:mainfrom
arturofredes:fix/synthetic-iterable-hf-datasets-api
Closed

fix(synthetic): align _SyntheticTextExamplesIterable with HuggingFace…#645
arturofredes wants to merge 1 commit intovllm-project:mainfrom
arturofredes:fix/synthetic-iterable-hf-datasets-api

Conversation

@arturofredes
Copy link
Copy Markdown

@arturofredes arturofredes commented Mar 19, 2026

… datasets API

  • Add n_shards property (alias of num_shards). HuggingFace IterableDataset expects n_shards; without it, loading synthetic data raises NotImplementedError.
  • Change shard_data_sources(worker_id, num_workers) to match the base _BaseExamplesIterable API; the previous signature (num_shards, index, contiguous) caused 'unexpected keyword argument worker_id' when using multiple DataLoader workers. Fixes synthetic data loader when used with datasets.IterableDataset.

Summary

Details

  • [ ]

Test Plan

Related Issues

  • Resolves #

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

… datasets API

- Add n_shards property (alias of num_shards). HuggingFace IterableDataset
  expects n_shards; without it, loading synthetic data raises
  NotImplementedError.
- Change shard_data_sources(worker_id, num_workers) to match the base
  _BaseExamplesIterable API; the previous signature (num_shards, index,
  contiguous) caused 'unexpected keyword argument worker_id' when
  using multiple DataLoader workers.
Fixes synthetic data loader when used with datasets.IterableDataset.

Signed-off-by: Arturo Fredes <arturofredesc@gmail.com>
@dbutenhof
Copy link
Copy Markdown
Collaborator

dbutenhof commented Mar 19, 2026

Can you tell us what version of the datasets package you're using?

The latest, 4.8.2, aligns with the code in the current GuideLLM. The change you're proposing seems to revert a datasets change made in October 2024, which renamed n_shards to num_shards (although it retained n_shards as an alias for backwards compatibility) and changed the signature of shard_data_sources from what you're proposing to what GuideLLM now uses.

GuideLLM currently depends on the datasets package without any minimum version, but apparently actually depends on a version somewhat later than you're using. (It appears that PR first appeared in datasets 3.1.0.)

Is there another constraint requiring you to use an old version of datasets? And, if not, can you try upgrading?

Copy link
Copy Markdown
Collaborator

@sjmonson sjmonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above

@dbutenhof
Copy link
Copy Markdown
Collaborator

We've chosen to resolve this with #650, which explicitly requires a minimum datasets version of 4.1.0. This will avoid the possibility of accidentally running GuideLLM with the extremely old datasets version that caused your problem.

@dbutenhof dbutenhof closed this Mar 24, 2026
@dbutenhof dbutenhof added wontfix This will not be worked on community contribution An opportunity for contribution from the GuideLLM community already invested in this area. labels Apr 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community contribution An opportunity for contribution from the GuideLLM community already invested in this area. wontfix This will not be worked on

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants