
feat: add hf buckets import command for S3-to-HF bucket ingestion #3993

Draft
Wauplin wants to merge 1 commit into main from
cursor/s3-to-hf-bucket-ingestion-f144

Conversation


Wauplin (Contributor) commented Mar 27, 2026

Summary

Adds a new hf buckets import <s3://...> <hf://...> CLI command and corresponding import_from_s3() Python API to ingest data from AWS S3 buckets into Hugging Face buckets.

Data is streamed from S3 through the local machine (download + re-upload) using s3fs (fsspec-based S3 filesystem). AWS credentials are resolved via the standard boto chain (env vars, ~/.aws/credentials, instance profiles, etc.).
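To make the credential resolution order concrete, here is a minimal pure-Python sketch of the first two links of the boto-style chain (environment variables, then the shared credentials file). The helper name `resolve_aws_access_key` is hypothetical and only illustrative; in the actual implementation s3fs delegates all of this to botocore, which also handles later links such as instance profiles.

```python
import configparser
import os
from pathlib import Path


def resolve_aws_access_key(credentials_path: str = "~/.aws/credentials",
                           profile: str = "default"):
    """Illustrative only: the first two links of the boto credential chain."""
    # Link 1: environment variable.
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    if key:
        return key
    # Link 2: the shared credentials file (INI format).
    config = configparser.ConfigParser()
    config.read(Path(credentials_path).expanduser())
    if config.has_option(profile, "aws_access_key_id"):
        return config.get(profile, "aws_access_key_id")
    # Later links (instance profiles, SSO, ...) are handled by botocore.
    return None
```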

Usage

CLI

# Basic import
hf buckets import s3://my-data-bucket hf://buckets/user/my-bucket

# Import with prefix on both sides
hf buckets import s3://my-data-bucket/raw-data/ hf://buckets/user/my-bucket/imported/

# Dry run - preview what would be transferred
hf buckets import s3://my-data-bucket hf://buckets/user/my-bucket --dry-run

# Filter files
hf buckets import s3://my-data-bucket hf://buckets/user/my-bucket --include "*.parquet" --exclude "*.tmp"

# Tune parallelism
hf buckets import s3://my-data-bucket hf://buckets/user/my-bucket --workers 8 --batch-size 100

Python API

from huggingface_hub import HfApi
from huggingface_hub._buckets import import_from_s3

api = HfApi()
stats = import_from_s3(
    s3_source="s3://my-data-bucket",
    bucket_dest="hf://buckets/user/my-bucket",
    api=api,
    include=["*.parquet"],
    workers=8,
)
print(stats.summary_str())
# "Transferred 42 file(s), (1.2 GB), in 45.3s, (27.1 MB/s)"
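The shape of `ImportStats` can be sketched as a small dataclass that reproduces the summary line above. The field names (`files_transferred`, `bytes_transferred`, `elapsed_s`) are assumptions for illustration, not necessarily the names used in the PR:

```python
from dataclasses import dataclass


@dataclass
class ImportStats:
    files_transferred: int
    bytes_transferred: int
    elapsed_s: float

    @property
    def throughput_mb_s(self) -> float:
        # MB/s over the whole transfer; 0 if nothing was timed.
        if not self.elapsed_s:
            return 0.0
        return self.bytes_transferred / (1024 * 1024) / self.elapsed_s

    def summary_str(self) -> str:
        gb = self.bytes_transferred / (1024 ** 3)
        return (f"Transferred {self.files_transferred} file(s), ({gb:.1f} GB), "
                f"in {self.elapsed_s:.1f}s, ({self.throughput_mb_s:.1f} MB/s)")
```

With the numbers from the example above, `ImportStats(42, int(1.2 * 1024**3), 45.3).summary_str()` reproduces the printed summary.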

Key features

  • S3 → HF one-way import: Downloads from S3, uploads to HF bucket (no reverse direction)
  • Parallel S3 downloads: Configurable --workers (default 4) for concurrent downloads
  • Batched HF uploads: Configurable --batch-size (default 50) to group files into batch_bucket_files calls
  • Pattern filtering: --include / --exclude with fnmatch patterns
  • Dry run: --dry-run to preview what would be transferred
  • Progress reporting: Real-time status with throughput benchmarks (MB/s)
  • ImportStats dataclass: Returns transfer statistics (files, bytes, elapsed time, throughput)
  • Optional dependency: pip install huggingface_hub[s3] installs s3fs
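The `--include` / `--exclude` semantics above can be sketched with stdlib `fnmatch`: a key must match at least one include pattern (if any are given) and no exclude pattern. The helper name `filter_keys` is hypothetical; it only illustrates the filtering rule, not the PR's exact code.

```python
from fnmatch import fnmatch


def filter_keys(keys, include=None, exclude=None):
    """Keep keys matching any include pattern and no exclude pattern."""
    result = []
    for key in keys:
        if include and not any(fnmatch(key, p) for p in include):
            continue  # include list given, but no pattern matched
        if exclude and any(fnmatch(key, p) for p in exclude):
            continue  # an exclude pattern matched
        result.append(key)
    return result
```

Note that `fnmatch`'s `*` also matches `/`, so `*.parquet` matches keys at any depth.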

Files changed

| File | Change |
| --- | --- |
| src/huggingface_hub/_buckets.py | Core import_from_s3() logic, ImportStats dataclass, S3 file listing |
| src/huggingface_hub/cli/buckets.py | hf buckets import CLI command |
| src/huggingface_hub/__init__.py | Public API exports (ImportStats, import_from_s3) |
| setup.py | New s3 extra (pip install huggingface_hub[s3]) |
| tests/test_buckets_import.py | 22 unit tests (mocked S3, no real AWS needed) |
| docs/source/en/package_reference/cli.md | Auto-generated CLI reference update |

Design decisions

  1. Separate import subcommand rather than extending sync/cp: S3 sources are fundamentally different from local paths or HF bucket paths (different auth, different filesystem). A dedicated command avoids overloading existing bucket commands with S3-specific concerns.

  2. Download-then-upload via temp files: Each batch of files is downloaded from S3 to a temp directory, then batch-uploaded to HF via batch_bucket_files. Temp files are cleaned up after each batch to bound disk usage.
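The download-then-upload batching described above can be sketched as follows. This is a simplified stand-in, with `download_one` and `upload_batch` as injected callables rather than real S3/HF clients; the key point is that each batch gets its own temp directory, which is removed before the next batch starts, so disk usage is bounded by one batch.

```python
import shutil
import tempfile
from pathlib import Path


def import_in_batches(keys, download_one, upload_batch, batch_size=50):
    """Download each batch of keys to a temp dir, upload it in one call,
    then delete the temp dir so disk usage stays bounded per batch."""
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        tmp_dir = Path(tempfile.mkdtemp(prefix="hf-import-"))
        try:
            local_paths = [download_one(key, tmp_dir) for key in batch]
            upload_batch(local_paths)  # e.g. one batch_bucket_files call
        finally:
            shutil.rmtree(tmp_dir, ignore_errors=True)  # bound disk usage
```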

  3. s3fs as optional dep: Added as extras["s3"] so it's not pulled in unless users need it. Clear error message with install instructions if missing.
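The "clear error with install instructions" pattern is a common optional-dependency guard; a generic sketch (the helper name `require_optional` is hypothetical, not the PR's actual function):

```python
import importlib


def require_optional(module_name: str, install_hint: str):
    """Import an optional dependency, or fail with a clear message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"Missing optional dependency `{module_name}`. {install_hint}"
        ) from e
```

For example, `require_optional("s3fs", "Run `pip install huggingface_hub[s3]`.")` either returns the module or raises an ImportError that tells the user exactly what to install.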


Commit message

Add `hf buckets import <s3://...> <hf://...>` CLI command and
`import_from_s3()` Python API to ingest data from S3 buckets into
HF buckets.

Key features:
- Uses s3fs/fsspec to stream files from S3 through local machine
- Parallel S3 downloads with configurable --workers
- Batched uploads to HF via batch_bucket_files with --batch-size
- --include/--exclude fnmatch pattern filtering
- --dry-run to preview transfers
- Real-time progress with throughput benchmarks (MB/s)
- ImportStats dataclass with transfer statistics
- Optional s3 extra: pip install huggingface_hub[s3]

Files changed:
- src/huggingface_hub/_buckets.py: core import_from_s3 logic
- src/huggingface_hub/cli/buckets.py: CLI command registration
- src/huggingface_hub/__init__.py: public API exports
- setup.py: s3 extra dependency
- tests/test_buckets_import.py: 22 unit tests
- docs/source/en/package_reference/cli.md: auto-generated CLI ref

Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
@bot-ci-comment

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
