Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP. #4133
cspades merged 3 commits into NVIDIA:main
Conversation
shjwudp left a comment:
Overall LGTM, just have a little concern about megatron_fsdp_use_decoupled_grad.
This PR avoids FusedAdam from redundantly creating master weights when M-FSDP already maintains FP32 main weights, which is the correct way to use FusedAdam under M-FSDP (in terms of memory usage).
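The memory-saving idea the reviewer describes can be sketched as follows. This is a hedged, minimal illustration with hypothetical names (`OptimizerConfig` fields mirror the flags mentioned in this PR; the helper function is not the actual Megatron-LM code): when Megatron-FSDP already maintains FP32 main weights, FusedAdam should not allocate a second FP32 copy of its own.

```python
# Illustrative sketch only -- not the real Megatron-LM implementation.
from dataclasses import dataclass

@dataclass
class OptimizerConfig:
    # Fields named after the flags discussed in this PR (illustrative).
    use_precision_aware_optimizer_no_fp8_or_ds_fp8: bool = False
    use_decoupled_grad: bool = False

def fused_adam_master_weights(config: OptimizerConfig,
                              using_megatron_fsdp: bool) -> bool:
    """Decide whether FusedAdam should allocate its own FP32 master weights.

    If Megatron-FSDP already owns FP32 main weights, letting FusedAdam
    build another master copy would duplicate that memory, so we skip it
    and let FusedAdam only provide optimizer.step() on M-FSDP's weights.
    """
    if using_megatron_fsdp:
        return False  # M-FSDP maintains the main weights itself.
    return (config.use_precision_aware_optimizer_no_fp8_or_ds_fp8
            or config.use_decoupled_grad)
```

Under this sketch, the redundant allocation only happens on the non-FSDP path, matching the reviewer's point about correct memory usage.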
Signed-off-by: Cory Ye <cye@nvidia.com>
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24220276331
Random unit test failure in merge queue: https://github.com/NVIDIA/Megatron-LM/actions/runs/24220276331/job/70712019031 🤷🏻 Sending it back in.
Previous unit test passed: https://github.com/NVIDIA/Megatron-LM/actions/runs/24222278675/job/70717988646. Then something else randomly failed: https://github.com/NVIDIA/Megatron-LM/actions/runs/24222278675/job/70717988616. Rerunning this test to be sure: https://github.com/NVIDIA/Megatron-LM/actions/runs/24222278675/job/70801031886 (✅ 🤷🏻‍♂️)
What does this PR do?
TL;DR: Fix `DistributedOptimizer` to correctly use `decoupled_grad` with Megatron-FSDP. (FSDP will always use decoupled gradients when using FusedAdam, which simplifies the logic a lot.)
Details
- FusedAdam sets `master_weights=True` if `OptimizerConfig.use_precision_aware_optimizer_no_fp8_or_ds_fp8` / `use_decoupled_grad` are True. This PR turns off FusedAdam master weights when using Megatron-FSDP, as FusedAdam should only provide an `optimizer.step()` to Megatron-FSDP's `DTensor` (FP32/BF16) main weights.
- Raise `NotImplementedError` for completely unexpected paths in the distributed optimizer.

Testing
Updated the `--use-precision-aware-optimizer` unit test to temporarily guarantee functionality. Cross-tested with this TE E2E test: Add Megatron-FSDP E2E integration test to TE CI/CD (L1). TransformerEngine#2845

Memory comparison with and without `master_weights`: (screenshot not preserved)

Contribution process
Pre-checks
Code review
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
Expert reviewers are defined in `.github/CODEOWNERS`. Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned. For PRs outside `megatron/core`, this step is skipped.

Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.

Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.