Skip to content

Centralize hf:// URI parsing into utils/_hf_uri.py #3994

Open
omkar-334 wants to merge 8 commits intohuggingface:mainfrom
omkar-334:parse
Open

Centralize hf:// URI parsing into utils/_hf_uri.py #3994
omkar-334 wants to merge 8 commits intohuggingface:mainfrom
omkar-334:parse

Conversation

@omkar-334
Copy link
Copy Markdown
Contributor

@omkar-334 omkar-334 commented Mar 27, 2026

Related issue - #3971


Note

Medium Risk
Touches core path/URI parsing used by HfFileSystem, buckets, CLI jobs volumes, and repo_type_and_id_from_hf_id, so subtle parsing/regression risks exist around revisions, bucket IDs, and ambiguous repo paths.

Overview
Centralizes hf:// and HF identifier parsing by adding utils/_hf_uri.py with a shared parse_hf_url implementation (including bucket support and special refs/... revision handling).

Updates buckets helpers, HfFileSystem.resolve_path, CLI hf jobs volume parsing, and hf_api.repo_type_and_id_from_hf_id to delegate to the new parser, removing duplicated split/regex logic and adjusting a few edge cases (explicit-type detection, bare bucket name handling, and mount-spec parsing around :/).

Adds tests/test_hf_uri.py to cover repo, bucket, revision, and error cases.

Written by Cursor Bugbot for commit 4c080e6. This will update automatically on new commits. Configure here.

@omkar-334
Copy link
Copy Markdown
Contributor Author

hey @Wauplin i've tried to look into #3971 and implement this... what do you think? would love your thoughts on this

else:
# First segment isn't a known type -> model type
vol_type_str = constants.REPO_TYPE_MODEL
remaining = source_part
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Volume @revision silently dropped during parsing

High Severity

The Volume dataclass has a revision field, but the refactored _parse_volumes never populates it. The old code preserved @revision as part of the source string (e.g., source="user/repo@main"). Now parse_hf_url extracts the revision into parsed.revision and strips it from repo_id, but parsed.revision is never passed to the Volume constructor. For any volume spec using @revision syntax, the revision is silently lost.

Additional Locations (1)
Fix in Cursor Fix in Web

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

if not is_hf_url and len(url_segments) == 3:
# Passed <repo_type>/<user>/<model_id> — accept singular type names
# (e.g. "dataset/user/id") which parse_hf_url doesn't handle.
repo_type, namespace, repo_id = url_segments
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bucket paths with 3 segments fail validation

High Severity

When repo_type_and_id_from_hf_id receives a 3-segment non-URL bucket path like "hf://buckets/ns/name", the code at the len(url_segments) == 3 branch does a simple repo_type, namespace, repo_id = url_segments, setting repo_type to "buckets" (plural). The downstream validation checks repo_type != "bucket" (singular), so "buckets" fails and raises a ValueError. The old code had explicit elif url_segments[0] == "buckets" handling that correctly set repo_type = "bucket".

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant