Add SM90 FP8 MegaMoE support for DeepSeek-V4 #29016
Code Review
This pull request introduces the SM90 (Hopper) variant of the mega_moe_pre_dispatch kernel for DeepSeek v4, enabling all-FP8 Mega MoE support on SM90 GPUs. It includes the CUDA kernel implementation, JIT loading, integration into the mega_moe layer, and comprehensive unit tests. Feedback on the changes suggests optimizing _interleave_l1_weight_only by removing a redundant memory allocation and copy operation when stacking and reshaping tensors.
Could we call the DeepGEMM MegaMoE API directly instead of porting the whole MegaMoE kernel code? Something like this:
In build_mega_moe_experts_weights, add the SM90 DeepGEMM weight processing.
@qiushixiaoyu Please test the full GPQA set with 32 turns, thanks. The model should be launched with
Is SM90 FP8 MegaMoE also applicable to other models, like dsv3 or glm5?
I think it's feasible, but we'll need to make some code adaptations.
sgl-eval run gpqa == gpqa ==
@qiushixiaoyu Can you please update the cookbook configuration tips for sm90+megamoe?
Also, please add a test for megamoe on sm90; it can be a new subtest of test/registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py
@Fridge003 Thanks for the review! Both points are addressed in the latest commit.
/rerun-test test/registered/models_e2e/test_deepseek_v4_flash_fp4_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_h200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_megamoe_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py
@Fridge003 |
Motivation
This PR adds SM90 DeepGEMM MegaMoE adaptation for DeepSeek-V4 FP8 serving. It enables the MegaMoE A2A path with the DeepGEMM MoE runner on SM90, including the pre-dispatch JIT kernel, FP8 expert weight preparation, and DeepSeek-V4 integration.
The goal is to improve long-context / large decode serving throughput for DeepSeek-V4-Flash/Pro-FP8 workloads while keeping the path guarded by environment variables.
Modifications
Environment variables used to enable the path:
SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 enables the DeepGEMM MegaMoE path.
SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1 enables the fixed MegaMoE memory path.
SGLANG_OPT_USE_JIT_EP_ACTIVATION=1 is required by the fixed MegaMoE memory path.
SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096 configures the max token buffer size per rank used by the MegaMoE path. In PD colocated serving, it should be set to the same value as --chunked-prefill-size; in decode-only serving, it should be larger than the maximum number of tokens per rank. The experiment below used 4096, matching --chunked-prefill-size=4096.
Serving command used in the experiment:
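The constraints between the environment variables described above can be checked before launch. The following is only an illustrative sketch, not the serving command used in the experiment: the helper name check_mega_moe_env and its parameters are hypothetical and are not part of SGLang; only the variable names and the value 4096 come from this description.

```python
import os

# Values taken from the PR description; the dict itself is only an illustration.
MEGA_MOE_ENV = {
    "SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE": "1",
    "SGLANG_OPT_FIX_MEGA_MOE_MEMORY": "1",
    "SGLANG_OPT_USE_JIT_EP_ACTIVATION": "1",
    "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK": "4096",
}

TOKENS_KEY = "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK"


def check_mega_moe_env(env, chunked_prefill_size=None, max_decode_tokens_per_rank=None):
    """Hypothetical helper: return a list of problems found in `env`.

    Pass chunked_prefill_size for PD colocated serving, or
    max_decode_tokens_per_rank for decode-only serving.
    """
    problems = []
    # The fixed memory path requires the JIT EP activation flag.
    if env.get("SGLANG_OPT_FIX_MEGA_MOE_MEMORY") == "1" and \
            env.get("SGLANG_OPT_USE_JIT_EP_ACTIVATION") != "1":
        problems.append("SGLANG_OPT_USE_JIT_EP_ACTIVATION=1 is required")
    budget = int(env.get(TOKENS_KEY, "0"))
    # PD colocated serving: buffer size should equal --chunked-prefill-size.
    if chunked_prefill_size is not None and budget != chunked_prefill_size:
        problems.append(f"{TOKENS_KEY} should equal {chunked_prefill_size}")
    # Decode-only serving: buffer size should exceed the per-rank token maximum.
    if max_decode_tokens_per_rank is not None and budget <= max_decode_tokens_per_rank:
        problems.append(f"{TOKENS_KEY} should exceed {max_decode_tokens_per_rank}")
    return problems


if __name__ == "__main__":
    # Merge the settings over the current environment and validate them.
    env = {**os.environ, **MEGA_MOE_ENV}
    print(check_mega_moe_env(env, chunked_prefill_size=4096))
```

An empty list means the settings are mutually consistent under these assumptions; it does not verify that the server itself accepts them.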
Accuracy Tests
sgl-eval run gpqa \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1
Unit test:
Result:
Speed Tests and Profiling
DeepGEMM dependency note:
The FP8 MegaMoE operator has already been merged into DeepGEMM in sgl-project/DeepGEMM#36. This benchmark additionally includes the small-batch performance optimization from sgl-project/DeepGEMM#48.
Benchmark script:
The serving benchmark was run with sglang_auto_bench.py, an internal helper script that wraps python -m sglang.bench_serving and automatically searches for the highest request rate / max concurrency for a given input-output length. It reports both:
For the result below, the script was run with random prompts, input/output length 3500/1500, and SLO targets TTFT <= 2000 ms and TPOT <= 20 ms.
MegaMoE ON vs OFF:
SLO total throughput: +22.24%
SLO output throughput: +22.24%
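The two gains are identical, which is what one would expect if every request uses the fixed 3500/1500 input/output lengths: total token throughput is then a constant multiple of output token throughput, so the relative change is the same. A minimal sketch of that arithmetic follows; the baseline value of 1000 tokens/s is purely hypothetical, and the assumption that "total" means input plus output tokens is not confirmed by the benchmark script.

```python
def total_throughput(output_tps, input_len, output_len):
    """Total (input + output) tokens/s implied by an output tokens/s figure,
    assuming every request has exactly these lengths."""
    return output_tps * (input_len + output_len) / output_len


def relative_gain(on, off):
    """Relative improvement of `on` over `off`, in percent."""
    return (on - off) / off * 100.0


# Hypothetical baseline; only the 22.24% gain and 3500/1500 come from the text.
off_out = 1000.0
on_out = off_out * 1.2224

gain_out = relative_gain(on_out, off_out)
gain_total = relative_gain(
    total_throughput(on_out, 3500, 1500),
    total_throughput(off_out, 3500, 1500),
)
print(f"output gain: {gain_out:.2f}%, total gain: {gain_total:.2f}%")
```

If request lengths varied between the ON and OFF runs, the two percentages could differ slightly, so matching values are a quick consistency check on the reported numbers.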
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
CI States
Latest PR Test (Base): ❌ Run #28430198105
Latest PR Test (Extra): ❌ Run #28430197972