Describe the bug
The datasets library cannot correctly use the Arrow cache it has already generated, apparently because the internal hash computation is not deterministic: the same local dataset yields different hashes across runs.
The commands below demonstrate this.
I am still trying to locate the responsible code and will update this issue if I find anything further.
Steps to reproduce the bug
python -c "
from datasets.load import LocalDatasetModuleFactory
mod = LocalDatasetModuleFactory('/cache/datasets/imagenet-1k')
dm = mod.get_module()
print('Hash:', dm.hash)
"
Resolving data files: 100%|█████████| 294/294 [00:00<00:00, 23895.46it/s]
Resolving data files: 100%|██████████| 28/28 [00:00<00:00, 221168.57it/s]
Hash: 9e9925e0f7d48775
python -c "
from datasets.load import LocalDatasetModuleFactory
mod = LocalDatasetModuleFactory('/cache/datasets/imagenet-1k')
dm = mod.get_module()
print('Hash:', dm.hash)
"
Resolving data files: 100%|█████████| 294/294 [00:00<00:00, 26252.91it/s]
Resolving data files: 100%|██████████| 28/28 [00:00<00:00, 188205.95it/s]
Hash: 9af23f3d5d488660
Expected behavior
Both runs above should print the same hash value.
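To make the check repeatable, here is a minimal sketch that runs a snippet in several fresh interpreters and collects the distinct last lines printed to stdout. `distinct_outputs` and `REPRO` are names made up for illustration, not part of datasets; the dataset path is the one from this report and will differ on other machines. It assumes the progress bars go to stderr, so the `Hash:` line is the last line on stdout.

```python
import subprocess
import sys


def distinct_outputs(snippet: str, runs: int = 3) -> set:
    """Run `snippet` in `runs` fresh interpreters; return the distinct last stdout lines."""
    results = set()
    for _ in range(runs):
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            check=True,  # raise if the snippet itself fails
        )
        lines = proc.stdout.strip().splitlines()
        results.add(lines[-1] if lines else "")
    return results


# Same code as in the reproduction steps; the path is specific to this report.
REPRO = """
from datasets.load import LocalDatasetModuleFactory
mod = LocalDatasetModuleFactory('/cache/datasets/imagenet-1k')
print('Hash:', mod.get_module().hash)
"""
```

With a deterministic hash, `distinct_outputs(REPRO)` should return a set with exactly one element; in my environment I would expect it to contain one entry per run.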
Environment info
- datasets version: 4.6.1
- Platform: Linux-6.8.0-100-generic-x86_64-with-glibc2.39
- Python version: 3.12.12
- huggingface_hub version: 1.5.0
- PyArrow version: 23.0.1
- Pandas version: 3.0.1
- fsspec version: 2026.2.0