feat: add hf buckets import command for S3-to-HF bucket ingestion (#3993)
Add `hf buckets import <s3://...> <hf://...>` CLI command and `import_from_s3()` Python API to ingest data from S3 buckets into HF buckets.

Key features:
- Uses s3fs/fsspec to stream files from S3 through the local machine
- Parallel S3 downloads with configurable `--workers`
- Batched uploads to HF via `batch_bucket_files` with `--batch-size`
- `--include`/`--exclude` fnmatch pattern filtering
- `--dry-run` to preview transfers
- Real-time progress with throughput benchmarks (MB/s)
- `ImportStats` dataclass with transfer statistics
- Optional `s3` extra: `pip install huggingface_hub[s3]`

Files changed:
- `src/huggingface_hub/_buckets.py`: core `import_from_s3` logic
- `src/huggingface_hub/cli/buckets.py`: CLI command registration
- `src/huggingface_hub/__init__.py`: public API exports
- `setup.py`: `s3` extra dependency
- `tests/test_buckets_import.py`: 22 unit tests
- `docs/source/en/package_reference/cli.md`: auto-generated CLI ref

Co-authored-by: Lucain <Wauplin@users.noreply.github.com>
Summary
Adds a new `hf buckets import <s3://...> <hf://...>` CLI command and corresponding `import_from_s3()` Python API to ingest data from AWS S3 buckets into Hugging Face buckets.

Data is streamed from S3 through the local machine (download + re-upload) using `s3fs` (an fsspec-based S3 filesystem). AWS credentials are resolved via the standard boto chain (env vars, `~/.aws/credentials`, instance profiles, etc.).

Usage
CLI
Python API
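As a hedged illustration of the statistics contract described below, here is a sketch of the shape the documented `ImportStats` return value could take. The PR text only says it carries files, bytes, elapsed time, and throughput; the exact field and property names here are assumptions, not the PR's actual code.

```python
# Sketch of an ImportStats-style result object (field names are hypothetical;
# the real dataclass lives in src/huggingface_hub/_buckets.py).
from dataclasses import dataclass


@dataclass
class ImportStats:
    files_transferred: int
    bytes_transferred: int
    elapsed_seconds: float

    @property
    def throughput_mb_s(self) -> float:
        # MB/s figure like the one shown in the real-time progress output.
        if self.elapsed_seconds == 0:
            return 0.0
        return self.bytes_transferred / 1e6 / self.elapsed_seconds


stats = ImportStats(files_transferred=120, bytes_transferred=500_000_000, elapsed_seconds=25.0)
print(f"{stats.files_transferred} files at {stats.throughput_mb_s:.1f} MB/s")  # 120 files at 20.0 MB/s
```

The actual implementation may expose different names or additional fields; this only illustrates what the returned statistics let callers compute.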
Key features
- `--workers` (default 4) for concurrent downloads
- `--batch-size` (default 50) to batch `batch_bucket_files` calls
- `--include`/`--exclude` with fnmatch patterns
- `--dry-run` to preview what would be transferred
- `ImportStats` dataclass: returns transfer statistics (files, bytes, elapsed time, throughput)
- `pip install huggingface_hub[s3]` installs `s3fs`

Files changed
- `src/huggingface_hub/_buckets.py`: `import_from_s3()` logic, `ImportStats` dataclass, S3 file listing
- `src/huggingface_hub/cli/buckets.py`: `hf buckets import` CLI command
- `src/huggingface_hub/__init__.py`: public API exports (`ImportStats`, `import_from_s3`)
- `setup.py`: `s3` extras (`pip install huggingface_hub[s3]`)
- `tests/test_buckets_import.py`: unit tests
- `docs/source/en/package_reference/cli.md`: auto-generated CLI reference

Design decisions
- Separate `import` subcommand rather than extending `sync`/`cp`: S3 sources are fundamentally different from local paths or HF bucket paths (different auth, different filesystem). A dedicated command avoids overloading existing bucket commands with S3-specific concerns.
- Download-then-upload via temp files: each batch of files is downloaded from S3 to a temp directory, then batch-uploaded to HF via `batch_bucket_files`. Temp files are cleaned up after each batch to bound disk usage.
- `s3fs` as an optional dep: added as `extras["s3"]` so it's not pulled in unless users need it. A clear error message with install instructions is raised if it's missing.

Slack Thread