
[rl] Remove call to get_model_state_dict in push_model_state_dict #3066

Merged
daniellepintz merged 1 commit into main from fix-weight-sync-regression
Apr 23, 2026

Conversation

@daniellepintz
Contributor

The trainer refactor (#2985) changed push_model_state_dict to call get_model_state_dict() instead of model.state_dict(). While the end result of these calls is the same (both return a sharded state_dict with DTensors), calling get_model_state_dict adds some overhead.

Average push time over 10 steps for Qwen3 0.6B:

- with get_model_state_dict(): 0.0170s
- with model.state_dict(): 0.008s

Revert to the previous approach of calling self.model.state_dict() directly.
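
A minimal before/after sketch of the change. The `push` callable stands in for the TorchStore put and is illustrative only, not the actual Forge helper:

```python
import torch.nn as nn
from torch.distributed.checkpoint.state_dict import get_model_state_dict


def push_model_state_dict_before(model: nn.Module, push) -> None:
    # Pre-PR (#2985 behavior): the DCP helper also returns the sharded
    # (DTensor) state dict by default, but adds per-call overhead.
    push(get_model_state_dict(model))


def push_model_state_dict_after(model: nn.Module, push) -> None:
    # This PR: read the sharded state dict directly off the module.
    push(model.state_dict())
```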

…del_state_dict

The trainer refactor (#2985) changed push_model_state_dict to call
get_model_state_dict() which unshards FSDP state (all-gather) before
pushing to TorchStore. This added an expensive all-gather and increased
the size of tensors transferred. Revert to passing self.model.state_dict()
directly, which keeps the sharded representation.
@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 23, 2026
Member

@joecummings left a comment


Great catch, thank you! Would you mind opening a GitHub issue as well to add timing tests for our RL work? That would've caught this immediately.
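
For reference, a minimal sketch of such a timing check, assuming a `trainer` object with an argument-free `push_model_state_dict()`; the real Forge API (and whether the call is async) may differ:

```python
import statistics
import time


def average_push_time(trainer, steps: int = 10) -> float:
    """Average wall-clock seconds per weight push over `steps` pushes."""
    durations = []
    for _ in range(steps):
        start = time.perf_counter()
        trainer.push_model_state_dict()  # placeholder call; adapt to the real signature
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


# A CI regression guard could then assert against a budget, e.g.:
# assert average_push_time(trainer) < 0.01
```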

@daniellepintz merged commit b21bf09 into main on Apr 23, 2026
10 of 14 checks passed
Review thread on the push_model_state_dict docstring shown in the diff:

> means "skip StorageVolumes and let the destination read directly from the source's GPU memory".
>
> Uses get_model_state_dict() to unshard FSDP state before pushing.
Contributor


Why don't we need to unshard the FSDP state now?

Contributor Author

@daniellepintz May 5, 2026


Actually, even before this PR we weren't unsharding the FSDP state; the comment was wrong. By default get_model_state_dict() returns the sharded state.

We don't need to unshard the FSDP state because TorchStore handles sharded DTensors. Each trainer rank can put its own shard of a tensor to TorchStore, and when the generator pulls, it will handle the resharding if it needs to.
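
A minimal sketch of that idea using plain DTensor primitives (this is not the TorchStore API, and in Forge the sharding comes from FSDP rather than `distribute_tensor`): each trainer rank holds only its local shard, and any unsharding happens on the reader side.

```python
# Run with: torchrun --nproc_per_node=2 dtensor_reshard_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

dist.init_process_group("gloo")
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

# "Trainer" side: each rank holds only its row-shard of the weight.
torch.manual_seed(0)
weight = distribute_tensor(torch.randn(8, 4), mesh, placements=[Shard(0)])
print(f"rank {dist.get_rank()} holds local shard of shape {weight.to_local().shape}")

# "Generator" side: unsharding (an all-gather) happens on the reader,
# so the writer never needs to materialize the full tensor.
full = weight.full_tensor()
print(f"rank {dist.get_rank()} reconstructed full tensor of shape {full.shape}")

dist.destroy_process_group()
```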


Labels

ciflow/8gpu, CLA Signed
