[BUG] load_dataset() cannot use the generated Arrow cache correctly #8034

@Nexround

Describe the bug

The datasets library cannot use the generated Arrow cache correctly, seemingly because the internal dataset module hash is not computed deterministically.
The commands below demonstrate this: the same call prints a different hash on each run.
I am still trying to locate the responsible code and will update this issue with any findings.

Steps to reproduce the bug

python -c "
from datasets.load import LocalDatasetModuleFactory
mod = LocalDatasetModuleFactory('/cache/datasets/imagenet-1k')
dm = mod.get_module()
print('Hash:', dm.hash)
"
Resolving data files: 100%|█████████| 294/294 [00:00<00:00, 23895.46it/s]
Resolving data files: 100%|██████████| 28/28 [00:00<00:00, 221168.57it/s]
Hash: 9e9925e0f7d48775
python -c "
from datasets.load import LocalDatasetModuleFactory
mod = LocalDatasetModuleFactory('/cache/datasets/imagenet-1k')
dm = mod.get_module()
print('Hash:', dm.hash)
"
Resolving data files: 100%|█████████| 294/294 [00:00<00:00, 26252.91it/s]
Resolving data files: 100%|██████████| 28/28 [00:00<00:00, 188205.95it/s]
Hash: 9af23f3d5d488660

Expected behavior

Both runs of the code above should print the same hash value.
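
To make the check repeatable, here is a minimal sketch that runs a snippet in several fresh interpreters and collects the distinct values printed. The helper names `distinct_outputs` and `check_dataset_hash` are hypothetical and not part of datasets; the snippet itself is the one from the reproduction steps, and it assumes the hash is the last line written to stdout.

```python
import subprocess
import sys

# Snippet from the reproduction steps; the path is the reporter's local dataset.
SNIPPET = """
from datasets.load import LocalDatasetModuleFactory
mod = LocalDatasetModuleFactory('/cache/datasets/imagenet-1k')
print(mod.get_module().hash)
"""


def distinct_outputs(code: str, runs: int = 3) -> set[str]:
    """Run `code` in `runs` fresh interpreters; return the distinct last stdout lines."""
    results = set()
    for _ in range(runs):
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            check=True,  # raise if the child interpreter exits with an error
        )
        lines = proc.stdout.strip().splitlines()
        results.add(lines[-1] if lines else "")
    return results


def check_dataset_hash(runs: int = 3) -> bool:
    """Return True if every run printed the same hash (requires datasets and the local path)."""
    return len(distinct_outputs(SNIPPET, runs)) == 1
```

With the behavior reported here, `check_dataset_hash()` would be expected to return `False`; a separate process per run is used so that nothing cached in memory can hide the difference.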

Environment info

  • datasets version: 4.6.1
  • Platform: Linux-6.8.0-100-generic-x86_64-with-glibc2.39
  • Python version: 3.12.12
  • huggingface_hub version: 1.5.0
  • PyArrow version: 23.0.1
  • Pandas version: 3.0.1
  • fsspec version: 2026.2.0
