[FSDP2] Auto-exclude non-floating frozen Params4bit from fully_shard to prevent QLoRA crash#3987

Merged
SunMarc merged 4 commits into huggingface:main from roycho96:fix/fsdp2-auto-ignore on Apr 10, 2026

Conversation

@roycho96
Contributor

What does this PR do?

Auto-excludes incompatible BnB Params4bit from FSDP2 sharding via ignored_params to prevent QLoRA crash.

Fixes #3983
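The exclusion mechanism can be sketched as follows. This is a minimal illustration, not the PR's exact code: the frozen uint8 parameter stands in for bitsandbytes' `Params4bit`, and the commented-out `fully_shard` call shows where the collected set would be passed.

```python
import torch
from torch import nn


def collect_ignored_params(model: nn.Module) -> set:
    """Frozen non-floating-point params (e.g. bnb Params4bit stored as
    uint8) cannot be wrapped by fully_shard, so collect them for the
    `ignored_params` argument."""
    return {
        p
        for p in model.parameters()
        if not p.is_floating_point() and not p.requires_grad
    }


# Toy stand-in: a float layer plus a frozen uint8 parameter that mimics
# a quantized base weight.
model = nn.Linear(4, 4)
model.quant_weight = nn.Parameter(
    torch.zeros(8, dtype=torch.uint8), requires_grad=False
)

ignored = collect_ignored_params(model)
# fully_shard(model, ignored_params=ignored)  # float params still shard
```

Only the non-floating frozen parameter is excluded; the trainable float weights remain eligible for sharding.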

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SunMarc

Member

@SunMarc SunMarc left a comment


Do we get the same results as with bnb_4bit_quant_storage being a float? Did you test on an e2e example to compare both? Thanks a lot!

@roycho96
Contributor Author

Do we get the same results as with bnb_4bit_quant_storage being a float? Did you test on an e2e example to compare both? Thanks a lot!

@SunMarc Thanks for the review!

I ran an e2e comparison on single GPU (Qwen3-0.6B + QLoRA + SFT, same seed/hyperparams):

| Step | uint8 (this PR) | bf16 quant_storage (official) |
|-----:|----------------:|------------------------------:|
| 1    | 1.2914          | 1.2914                        |
| 2    | 1.2571          | 1.2516                        |
| 3    | 1.2310          | 1.2281                        |
| 4    | 1.1131          | 1.1202                        |
| 5    | 1.0541          | 1.0563                        |
| 6    | 0.9757          | 0.9793                        |
| 7    | 0.9656          | 0.9687                        |
| 8    | 0.9125          | 0.9066                        |
| avg  | 1.1001          | 1.1003                        |

Loss curves are nearly identical (< 0.6% difference at any step). The minor variations are expected from floating-point ordering differences in the two quant_storage representations. The dequantized values are equivalent.

Member

@SunMarc SunMarc left a comment


Another question I have: if we put those into ignored params, will FSDP still manage to shard the model correctly across multiple GPUs, since most of the params are Params4bit?

@roycho96
Contributor Author

roycho96 commented Mar 27, 2026

Another question I have: if we put those into ignored params, will FSDP still manage to shard the model correctly across multiple GPUs, since most of the params are Params4bit?

You're right. With most parameters being Params4bit and excluded from sharding, the base weights are replicated on each GPU. FSDP still shards the trainable LoRA parameters and gradients, but the memory benefit from weight sharding is limited in this path.
The primary goal of this PR is crash prevention. Currently, users with default QLoRA config (bnb_4bit_quant_storage unset) get a crash on FSDP2. I think this is a common QLoRA setup.
For users who need full base weight sharding (e.g. 70B+ models), the path with bnb_4bit_quant_storage=bf16 remains the right choice and is unaffected by this PR.
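For reference, that alternative path is selected via the quantization config. This is a sketch assuming the transformers `BitsAndBytesConfig` API; the model name is taken from the e2e test above and is purely illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Storing the packed 4-bit weights in a float dtype keeps them
# compatible with FSDP2 weight sharding.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # shard-friendly storage
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B", quantization_config=bnb_config
)
```

With `bnb_4bit_quant_storage` left unset (uint8 storage), this PR's ignored_params path applies instead.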

Member

@SunMarc SunMarc left a comment


Just a nit! Thanks

Comment on lines +665 to +668

```diff
         if param.__class__.__name__ == "Params4bit":
-            model_has_params4bit = True
-            break
+            params4bit.append(param)

+    model_has_params4bit = len(params4bit) > 0
```
Member


Let's not append every param to params4bit; let's keep model_has_params4bit as we did before.

Contributor Author


Let's not append every param to params4bit; let's keep model_has_params4bit as we did before.

Done. Reverted to model_has_params4bit flag with early detection, and moved the incompatible filtering into the same loop to avoid a second pass over parameters.
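The resolved single-pass structure looks roughly like this. It is a sketch, not the merged code: the `Params4bit` stub only mimics bitsandbytes' class name and frozen uint8 storage.

```python
import torch
from torch import nn


class Params4bit(nn.Parameter):
    """Stub mimicking bitsandbytes' Params4bit: frozen uint8 storage."""

    def __new__(cls, data):
        return super().__new__(cls, data, requires_grad=False)


model = nn.Linear(2, 2)
model.base_weight = Params4bit(torch.zeros(6, dtype=torch.uint8))

# One pass over parameters: set the early-detection flag and collect
# incompatible params for `ignored_params` in the same loop.
model_has_params4bit = False
ignored_params = set()
for param in model.parameters():
    if param.__class__.__name__ == "Params4bit":
        model_has_params4bit = True
        if not param.is_floating_point() and not param.requires_grad:
            ignored_params.add(param)
```

This keeps the `model_has_params4bit` flag the reviewer asked for while avoiding a second iteration over the model's parameters.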

Member


Thanks!

Member

@SunMarc SunMarc left a comment


Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc merged commit 5adf855 into huggingface:main Apr 10, 2026
24 of 25 checks passed
@roycho96 roycho96 deleted the fix/fsdp2-auto-ignore branch April 10, 2026 15:46


Development

Successfully merging this pull request may close these issues.

[FSDP2] auto-exclude incompatible Params4bit from fully_shard to prevent silent QLoRA corruption

3 participants