
feat: add hf buckets import command for S3-to-HF bucket ingestion #3993

Draft
Wauplin wants to merge 1 commit into main from
cursor/s3-to-hf-bucket-ingestion-f144

Conversation


Wauplin (Contributor) commented Mar 27, 2026

Summary

Adds a new hf buckets import <s3://...> <hf://...> CLI command and corresponding import_from_s3() Python API to ingest data from AWS S3 buckets into Hugging Face buckets.

Data is streamed from S3 through the local machine (download + re-upload) using s3fs (fsspec-based S3 filesystem). AWS credentials are resolved via the standard boto chain (env vars, ~/.aws/credentials, instance profiles, etc.).
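To make the credential resolution order concrete, here is a minimal pure-Python sketch of the first two links of the boto-style chain (environment variables, then the shared credentials file). The helper name `resolve_aws_access_key` is hypothetical and only illustrative; in the actual implementation s3fs delegates all of this to botocore, which also handles later links such as instance profiles.

```python
import configparser
import os
from pathlib import Path


def resolve_aws_access_key(credentials_path: str = "~/.aws/credentials",
                           profile: str = "default"):
    """Illustrative only: the first two links of the boto credential chain."""
    # Link 1: environment variable.
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    if key:
        return key
    # Link 2: the shared credentials file (INI format).
    config = configparser.ConfigParser()
    config.read(Path(credentials_path).expanduser())
    if config.has_option(profile, "aws_access_key_id"):
        return config.get(profile, "aws_access_key_id")
    # Later links (instance profiles, SSO, ...) are handled by botocore.
    return None
```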

Usage

CLI

# Basic import
hf buckets import s3://my-data-bucket hf://buckets/user/my-bucket

# Import with prefix on both sides
hf buckets import s3://my-data-bucket/raw-data/ hf://buckets/user/my-bucket/imported/

# Dry run - preview what would be transferred
hf buckets import s3://my-data-bucket hf://buckets/user/my-bucket --dry-run

# Filter files
hf buckets import s3://my-data-bucket hf://buckets/user/my-bucket --include "*.parquet" --exclude "*.tmp"

# Tune parallelism
hf buckets import s3://my-data-bucket hf://buckets/user/my-bucket --workers 8 --batch-size 100

Python API

from huggingface_hub import HfApi
from huggingface_hub._buckets import import_from_s3

api = HfApi()
stats = import_from_s3(
    s3_source="s3://my-data-bucket",
    bucket_dest="hf://buckets/user/my-bucket",
    api=api,
    include=["*.parquet"],
    workers=8,
)
print(stats.summary_str())
# "Transferred 42 file(s), (1.2 GB), in 45.3s, (27.1 MB/s)"
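The shape of `ImportStats` can be sketched as a small dataclass that reproduces the summary line above. The field names (`files_transferred`, `bytes_transferred`, `elapsed_s`) are assumptions for illustration, not necessarily the names used in the PR:

```python
from dataclasses import dataclass


@dataclass
class ImportStats:
    files_transferred: int
    bytes_transferred: int
    elapsed_s: float

    @property
    def throughput_mb_s(self) -> float:
        # MB/s over the whole transfer; 0 if nothing was timed.
        if not self.elapsed_s:
            return 0.0
        return self.bytes_transferred / (1024 * 1024) / self.elapsed_s

    def summary_str(self) -> str:
        gb = self.bytes_transferred / (1024 ** 3)
        return (f"Transferred {self.files_transferred} file(s), ({gb:.1f} GB), "
                f"in {self.elapsed_s:.1f}s, ({self.throughput_mb_s:.1f} MB/s)")
```

With the numbers from the example above, `ImportStats(42, int(1.2 * 1024**3), 45.3).summary_str()` reproduces the printed summary.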

Key features

  • S3 → HF one-way import: Downloads from S3, uploads to HF bucket (no reverse direction)
  • Parallel S3 downloads: Configurable --workers (default 4) for concurrent downloads
  • Batched HF uploads: Configurable --batch-size (default 50) to group files into batch_bucket_files calls
  • Pattern filtering: --include / --exclude with fnmatch patterns
  • Dry run: --dry-run to preview what would be transferred
  • Progress reporting: Real-time status with throughput benchmarks (MB/s)
  • ImportStats dataclass: Returns transfer statistics (files, bytes, elapsed time, throughput)
  • Optional dependency: pip install huggingface_hub[s3] installs s3fs
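The `--include` / `--exclude` semantics above can be sketched with stdlib `fnmatch`: a key must match at least one include pattern (if any are given) and no exclude pattern. The helper name `filter_keys` is hypothetical; it only illustrates the filtering rule, not the PR's exact code.

```python
from fnmatch import fnmatch


def filter_keys(keys, include=None, exclude=None):
    """Keep keys matching any include pattern and no exclude pattern."""
    result = []
    for key in keys:
        if include and not any(fnmatch(key, p) for p in include):
            continue  # include list given, but no pattern matched
        if exclude and any(fnmatch(key, p) for p in exclude):
            continue  # an exclude pattern matched
        result.append(key)
    return result
```

Note that `fnmatch`'s `*` also matches `/`, so `*.parquet` matches keys at any depth.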

Files changed

| File | Change |
| --- | --- |
| src/huggingface_hub/_buckets.py | Core import_from_s3() logic, ImportStats dataclass, S3 file listing |
| src/huggingface_hub/cli/buckets.py | hf buckets import CLI command |
| src/huggingface_hub/__init__.py | Public API exports (ImportStats, import_from_s3) |
| setup.py | New s3 extra (pip install huggingface_hub[s3]) |
| tests/test_buckets_import.py | 22 unit tests (mocked S3, no real AWS needed) |
| docs/source/en/package_reference/cli.md | Auto-generated CLI reference update |

Design decisions

  1. Separate import subcommand rather than extending sync/cp: S3 sources are fundamentally different from local paths or HF bucket paths (different auth, different filesystem). A dedicated command avoids overloading existing bucket commands with S3-specific concerns.

  2. Download-then-upload via temp files: Each batch of files is downloaded from S3 to a temp directory, then batch-uploaded to HF via batch_bucket_files. Temp files are cleaned up after each batch to bound disk usage.
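The download-then-upload batching described above can be sketched as follows. This is a simplified stand-in, with `download_one` and `upload_batch` as injected callables rather than real S3/HF clients; the key point is that each batch gets its own temp directory, which is removed before the next batch starts, so disk usage is bounded by one batch.

```python
import shutil
import tempfile
from pathlib import Path


def import_in_batches(keys, download_one, upload_batch, batch_size=50):
    """Download each batch of keys to a temp dir, upload it in one call,
    then delete the temp dir so disk usage stays bounded per batch."""
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        tmp_dir = Path(tempfile.mkdtemp(prefix="hf-import-"))
        try:
            local_paths = [download_one(key, tmp_dir) for key in batch]
            upload_batch(local_paths)  # e.g. one batch_bucket_files call
        finally:
            shutil.rmtree(tmp_dir, ignore_errors=True)  # bound disk usage
```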

  3. s3fs as optional dep: Added as extras["s3"] so it's not pulled in unless users need it. Clear error message with install instructions if missing.
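The "clear error with install instructions" pattern is a common optional-dependency guard; a generic sketch (the helper name `require_optional` is hypothetical, not the PR's actual function):

```python
import importlib


def require_optional(module_name: str, install_hint: str):
    """Import an optional dependency, or fail with a clear message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"Missing optional dependency `{module_name}`. {install_hint}"
        ) from e
```

For example, `require_optional("s3fs", "Run `pip install huggingface_hub[s3]`.")` either returns the module or raises an ImportError that tells the user exactly what to install.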


Commit message

Add `hf buckets import <s3://...> <hf://...>` CLI command and
`import_from_s3()` Python API to ingest data from S3 buckets into
HF buckets.

Key features:
- Uses s3fs/fsspec to stream files from S3 through local machine
- Parallel S3 downloads with configurable --workers
- Batched uploads to HF via batch_bucket_files with --batch-size
- --include/--exclude fnmatch pattern filtering
- --dry-run to preview transfers
- Real-time progress with throughput benchmarks (MB/s)
- ImportStats dataclass with transfer statistics
- Optional s3 extra: pip install huggingface_hub[s3]

Files changed:
- src/huggingface_hub/_buckets.py: core import_from_s3 logic
- src/huggingface_hub/cli/buckets.py: CLI command registration
- src/huggingface_hub/__init__.py: public API exports
- setup.py: s3 extra dependency
- tests/test_buckets_import.py: 22 unit tests
- docs/source/en/package_reference/cli.md: auto-generated CLI ref

Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
@bot-ci-comment

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
