
Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP.#4133

Merged
cspades merged 3 commits into NVIDIA:main from cspades:cye/decgrad-argfix on Apr 10, 2026
Conversation

@cspades
Member

@cspades cspades commented Apr 3, 2026

What does this PR do ?

  • Update DistributedOptimizer to correctly use decoupled_grad with Megatron-FSDP. (Megatron-FSDP always uses decoupled gradients with FusedAdam, which greatly simplifies the logic.)

TL;DR

use_decoupled_grad=self.config.use_precision_aware_optimizer_no_fp8_or_ds_fp8
or (
    # Megatron-FSDP always uses decoupled_grad with FusedAdam.
    self.config.use_precision_aware_optimizer
    and getattr(params[0], "__fsdp_param__", False)  # == ddp_config.use_megatron_fsdp
),
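
For reference, the expression above can be lifted into a small standalone helper. This is a sketch: `should_use_decoupled_grad` is a hypothetical name (not part of the PR), `config` mirrors Megatron-LM's OptimizerConfig field names, and `__fsdp_param__` is the attribute Megatron-FSDP sets on its sharded parameters.

```python
def should_use_decoupled_grad(config, params) -> bool:
    """Sketch of the `use_decoupled_grad` expression from this PR."""
    return config.use_precision_aware_optimizer_no_fp8_or_ds_fp8 or (
        # Megatron-FSDP always uses decoupled_grad with FusedAdam.
        config.use_precision_aware_optimizer
        and getattr(params[0], "__fsdp_param__", False)
    )
```

The `getattr` default of `False` means the flag quietly stays off for ordinary (non-FSDP) parameters, matching the `== ddp_config.use_megatron_fsdp` comment in the snippet above.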

Details

  • Megatron-FSDP does not use FusedAdam's master weights, but Megatron-LM hard-codes master_weights=True whenever OptimizerConfig.use_precision_aware_optimizer_no_fp8_or_ds_fp8 / use_decoupled_grad are True. This PR turns off FusedAdam master weights when using Megatron-FSDP, since FusedAdam should only provide an optimizer.step() for Megatron-FSDP's DTensor (FP32/BF16) main weights.
  • Improved Megatron-FSDP argument validation and added a NotImplementedError for completely unexpected paths in the distributed optimizer.
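
The shape of that validation can be sketched as a guard like the following. This is a hypothetical illustration, not the PR's actual function or signature; the argument names are invented for clarity.

```python
def validate_mfsdp_distopt_args(use_megatron_fsdp: bool,
                                use_decoupled_grad: bool,
                                master_weights: bool) -> None:
    """Hypothetical sketch of stricter distributed-optimizer validation.

    Megatron-FSDP maintains its own DTensor main weights, so enabling
    FusedAdam master weights alongside it is an unexpected path.
    """
    if use_megatron_fsdp and master_weights:
        raise NotImplementedError(
            "FusedAdam master weights are redundant with Megatron-FSDP's "
            "DTensor main weights; this path is not supported."
        )
    if use_megatron_fsdp and not use_decoupled_grad:
        raise NotImplementedError(
            "Megatron-FSDP expects decoupled gradients with FusedAdam."
        )
```

Failing loudly on such combinations is preferable to silently allocating a redundant FP32 copy of every parameter.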

Testing

# No FusedAdam Master Weights (HFSDP + FP8 Delayed Scaling)
[Rank 0] (after 2 iterations) memory (MB) | allocated: 19595.55 | max allocated: 27022.17 | reserved: 22504.00 | max reserved: 30774.00
[2026-04-03 12:37:38.220805] iteration      100/15258789 | consumed samples:        12800 | elapsed time per iteration (ms): 3263.1 | throughput per GPU (TFLOP/s/GPU): 230.1 | learning rate: 4.915198E-07 | global batch size:   128 | lm loss: 5.473558E+00 | loss scale: 1.0 | grad norm: 9.413 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |

# FusedAdam Master Weights (HFSDP + FP8 Delayed Scaling)
[Rank 0] (after 2 iterations) memory (MB) | allocated: 23425.69 | max allocated: 30852.31 | reserved: 26344.00 | max reserved: 34788.00
[2026-04-03 12:44:12.848377] iteration      100/15258789 | consumed samples:        12800 | elapsed time per iteration (ms): 3149.4 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 4.915198E-07 | global batch size:   128 | lm loss: 5.476564E+00 | loss scale: 1.0 | grad norm: 9.424 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
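
From the Rank-0 allocated-memory lines above, the saving from disabling FusedAdam master weights comes out to roughly 3.8 GB per GPU, at essentially identical loss (5.4736 vs. 5.4766):

```python
# Allocated memory (MB) after 2 iterations, from the two runs above.
with_master_weights = 23425.69     # FusedAdam master weights enabled
without_master_weights = 19595.55  # master weights disabled (this PR)

savings_mb = with_master_weights - without_master_weights
print(f"Saved {savings_mb:.2f} MB (~{savings_mb / 1024:.2f} GiB) per GPU")
```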

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch, the proposed review process is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@cspades cspades self-assigned this Apr 3, 2026
@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@cspades cspades force-pushed the cye/decgrad-argfix branch 3 times, most recently from 3860daf to a230105 Compare April 3, 2026 19:35
@cspades cspades marked this pull request as ready for review April 3, 2026 19:52
@cspades cspades requested review from a team as code owners April 3, 2026 19:52
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 3, 2026 19:52
@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 3, 2026
@cspades cspades force-pushed the cye/decgrad-argfix branch from a230105 to 349b8ff Compare April 3, 2026 20:00
@cspades cspades force-pushed the cye/decgrad-argfix branch from 349b8ff to d562ada Compare April 3, 2026 20:02
@cspades cspades requested a review from a team April 3, 2026 20:02
Contributor

@shjwudp shjwudp left a comment

Overall LGTM, just have a little concern about megatron_fsdp_use_decoupled_grad.

This PR prevents FusedAdam from redundantly creating master weights when M-FSDP already maintains FP32 main weights, which is the correct way to use FusedAdam under M-FSDP (in terms of memory usage).

@cspades cspades changed the title from "Fix incorrectly set decoupled_grad in training.py for MFSDP." to "Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP." Apr 8, 2026
Signed-off-by: Cory Ye <cye@nvidia.com>
@cspades cspades force-pushed the cye/decgrad-argfix branch from 9ae1737 to 8263d17 Compare April 8, 2026 18:14
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Apr 8, 2026
@cspades cspades force-pushed the cye/decgrad-argfix branch from 8263d17 to 29b5e58 Compare April 8, 2026 18:18
@svcnvidia-nemo-ci svcnvidia-nemo-ci added Approved All necessary approvals have been made and removed Final Review PR is in the "final review" stage labels Apr 9, 2026
@cspades cspades enabled auto-merge April 9, 2026 23:11
@cspades cspades added this pull request to the merge queue Apr 10, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24220276331

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 10, 2026
@cspades cspades added this pull request to the merge queue Apr 10, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24222278675

@cspades
Member Author

cspades commented Apr 10, 2026

Random unit test failure in merge queue: https://github.com/NVIDIA/Megatron-LM/actions/runs/24220276331/job/70712019031

🤷🏻 Sending it back in.

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 10, 2026
@cspades
Member Author

cspades commented Apr 10, 2026

@cspades cspades added this pull request to the merge queue Apr 10, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24249202447

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24249361651

@cspades cspades removed this pull request from the merge queue due to a manual request Apr 10, 2026
@cspades cspades added this pull request to the merge queue Apr 10, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24251678492

Merged via the queue into NVIDIA:main with commit ab43d43 Apr 10, 2026
63 checks passed
@cspades cspades deleted the cye/decgrad-argfix branch April 10, 2026 17:23

Labels

Approved (All necessary approvals have been made), complexity: low, module: megatron-fsdp

Projects

None yet


5 participants