Add SM90 FP8 MegaMoE support for DeepSeek-V4 by qiushixiaoyu · Pull Request #29016 · sgl-project/sglang

qiushixiaoyu · 2026-06-23T08:14:13Z

Motivation

This PR adds SM90 DeepGEMM MegaMoE adaptation for DeepSeek-V4 FP8 serving. It enables the MegaMoE A2A path with the DeepGEMM MoE runner on SM90, including the pre-dispatch JIT kernel, FP8 expert weight preparation, and DeepSeek-V4 integration.

The goal is to improve long-context / large decode serving throughput for DeepSeek-V4-Flash/Pro-FP8 workloads while keeping the path guarded by environment variables.

Modifications

Add SM90 MegaMoE pre-dispatch JIT kernel for preparing activations, scales, top-k indices, and top-k weights.
Add unit tests for the SM90 MegaMoE pre-dispatch kernel against a PyTorch reference implementation.
Integrate the MegaMoE path into DeepSeek-V4 / FP8 MoE execution.
Add FP8 MegaMoE expert weight preparation support.
Wire DeepGEMM MegaMoE execution through the existing MoE backend selection.
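
As a rough illustration of what the pre-dispatch step has to produce for the activations, here is a dependency-free sketch of per-token, per-128-channel scale computation. It is not the kernel from this PR and not the PyTorch reference used by the unit test. The group size of 128 mirrors the `gran_k=128` argument quoted later in this thread, 448 is the largest finite float8_e4m3fn value, and the function name and the `1e-4` floor are illustrative assumptions. Rounding to real FP8 is not simulated.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
GROUP = 128           # assumed scale granularity along the hidden dimension


def per_token_group_scales(row, group=GROUP):
    """Return (scaled_values, scales) for one token's hidden vector.

    Each `group`-wide slice gets one scale so that slice / scale fits the
    FP8 E4M3 range. Rounding to actual FP8 is NOT simulated here.
    """
    if len(row) % group != 0:
        raise ValueError("hidden size must be a multiple of the group size")
    scaled, scales = [], []
    for start in range(0, len(row), group):
        chunk = row[start:start + group]
        # Floor the absolute maximum so an all-zero slice does not divide by zero.
        amax = max(max(abs(v) for v in chunk), 1e-4)
        scale = amax / FP8_E4M3_MAX
        scales.append(scale)
        scaled.extend(v / scale for v in chunk)
    return scaled, scales


# One token with hidden size 256 yields two scale groups.
row = [0.5] * 128 + [-2.0] * 128
values, scales = per_token_group_scales(row)
print(len(values), len(scales))
```

The top-k indices and top-k weights are only copied into the dispatch buffer in the host-side variant quoted later in the thread, so they are left out of this sketch.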

Environment variables used to enable the path:

SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 enables the DeepGEMM MegaMoE path.
SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1 enables the fixed MegaMoE memory path.
SGLANG_OPT_USE_JIT_EP_ACTIVATION=1 is required by the fixed MegaMoE memory path.
SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096 configures the max token buffer size per rank used by the MegaMoE path. In PD colocated serving, it should be set to the same value as --chunked-prefill-size; in decode-only serving, it should be larger than the maximum number of tokens per rank. The experiment below used 4096, matching --chunked-prefill-size=4096.
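
The constraints between these variables can be checked before launch. The helper below is a hypothetical sketch, not part of SGLang: `check_mega_moe_env` is an invented name, and it assumes the flags are enabled with the literal value `1`, as in the commands in this PR.

```python
def check_mega_moe_env(env, chunked_prefill_size=None):
    """Return a list of problems with the MegaMoE settings; empty means consistent.

    `env` is any mapping of variable names to strings (for example os.environ).
    `chunked_prefill_size` is the value passed to --chunked-prefill-size in
    PD colocated serving; leave it as None to skip that comparison.
    """
    problems = []
    if env.get("SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE") != "1":
        return problems  # MegaMoE path disabled, nothing to validate

    fixed_memory = env.get("SGLANG_OPT_FIX_MEGA_MOE_MEMORY") == "1"
    jit_activation = env.get("SGLANG_OPT_USE_JIT_EP_ACTIVATION") == "1"
    if fixed_memory and not jit_activation:
        problems.append(
            "SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1 requires SGLANG_OPT_USE_JIT_EP_ACTIVATION=1"
        )

    raw = env.get("SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK")
    if raw is not None and chunked_prefill_size is not None:
        try:
            max_tokens = int(raw)
        except ValueError:
            problems.append(f"max tokens per rank is not an integer: {raw!r}")
        else:
            if max_tokens != chunked_prefill_size:
                problems.append(
                    f"max tokens per rank ({max_tokens}) should equal "
                    f"--chunked-prefill-size ({chunked_prefill_size})"
                )
    return problems


# Values used in the experiment described in this PR.
experiment_env = {
    "SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE": "1",
    "SGLANG_OPT_FIX_MEGA_MOE_MEMORY": "1",
    "SGLANG_OPT_USE_JIT_EP_ACTIVATION": "1",
    "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK": "4096",
}
print(check_mega_moe_env(experiment_env, chunked_prefill_size=4096))
```

For decode-only serving the equality check does not apply; there the buffer only needs to exceed the maximum number of tokens per rank, so `chunked_prefill_size` can be left as `None`.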

Serving command used in the experiment:

export MODEL_PATH=/data00/models/DeepSeek-V4-Flash-FP8
export SERVER_PORT=30000
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8
export FLASHINFER_DISABLE_VERSION_CHECK=1
export NVSHMEM_HCA_PE_MAPPING=mlx5_1:1:1,mlx5_2:1:1,mlx5_3:1:1,mlx5_4:1:1,mlx5_5:1:1,mlx5_6:1:1,mlx5_7:1:1,mlx5_8:1:1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_THINKING=1
export SGLANG_DSV4_FP4_EXPERTS=0
export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
export GLOO_SOCKET_IFNAME=eth0
export NCCL_MIN_NCHANNELS=24
export NCCL_IB_QPS_PER_CONNECTION=8

export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1
export SGLANG_OPT_USE_JIT_EP_ACTIVATION=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096

python -m sglang.launch_server \
  --trust-remote-code \
  --model-path "${MODEL_PATH}" \
  --tp 8 \
  --ep-size 8 \
  --chunked-prefill-size 4096 \
  --moe-a2a-backend deepep \
  --moe-runner-backend deep_gemm \
  --deepep-mode auto \
  --cuda-graph-max-bs 32 \
  --max-running-requests 32 \
  --speculative-algo EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-metrics \
  --host 0.0.0.0 \
  --port "${SERVER_PORT}" \
  --mem-fraction-static 0.75 \
  --tool-call-parser deepseekv4 \
  --reasoning-parser deepseek-v4

Accuracy Tests

sgl-eval run gpqa \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1

== gpqa ==
198 examples x 16 repeats | 13017.8s | 2620 tok/s | 34.1M tokens

pass@1[avg-of-16] = 88.64% +/- 1.28% (SEM 0.32%)
pass@16 = 96.46%
majority@16 = 90.15%
no_answer = 0.03%
stop_rate = 100.00%
truncated_rate = 0.00%
error_rate = 0.00%
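
The summary figures above are internally consistent, which can be checked with a few lines of arithmetic. This sketch assumes the reported "+/- 1.28%" is the standard deviation across the 16 repeats, so that the standard error is that value divided by the square root of the repeat count; sgl-eval's exact definition is not shown in this thread.

```python
import math

n_repeats = 16
examples = 198
std_pct = 1.28  # reported "+/-" value, assumed to be the std across repeats

# Standard error of the mean over the repeats.
sem_pct = std_pct / math.sqrt(n_repeats)

elapsed_s = 13017.8
tok_per_s = 2620
total_tokens = elapsed_s * tok_per_s

print(f"SEM = {sem_pct:.2f}%")                               # SEM = 0.32%
print(f"total tokens = {total_tokens / 1e6:.1f}M")           # total tokens = 34.1M
print(f"samples = {examples * n_repeats}")                   # samples = 3168
```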

Benchmark	Result
GSM8K	Accuracy 0.956, Invalid 0.000
GPQA	Score 0.90625, num_examples=32

Unit test:

python -m pytest -q python/sglang/jit_kernel/tests/test_mega_moe_pre_dispatch_sm90.py

Result:

10 passed, 20 warnings in 7.32s

Speed Tests and Profiling

DeepGEMM dependency note:

The FP8 MegaMoE operator has already been merged into DeepGEMM in sgl-project/DeepGEMM#36. This benchmark additionally includes the small-batch performance optimization from sgl-project/DeepGEMM#48.

Benchmark script:
The serving benchmark was run with sglang_auto_bench.py, an internal helper script that wraps python -m sglang.bench_serving and automatically searches for the highest request rate / max concurrency for a given input-output length. It reports both:

SLO-compliant throughput, constrained by the configured TTFT / TPOT targets.
Max throughput, which may violate the SLO and is used only to understand the saturation point.

For the result below, the script was run with random prompts, input/output length 3500/1500, and SLO targets TTFT <= 2000 ms and TPOT <= 20 ms.

Mode	SLO total throughput	SLO output throughput	TTFT	TPOT
MegaMoE ON	7001.55 tok/s	2100.46 tok/s	683.94 ms	13.83 ms
MegaMoE OFF	5727.56 tok/s	1718.27 tok/s	825.53 ms	17.05 ms

MegaMoE ON vs OFF:
SLO total throughput: +22.24%
SLO output throughput: +22.24%
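
The relative gains can be reproduced from the table, together with a check that both runs stay inside the stated SLO. This is a small sketch: the dictionary layout and helper names are illustrative, and the TTFT / TPOT deltas it prints are derived from the table rather than reported by the benchmark script.

```python
SLO = {"ttft_ms": 2000.0, "tpot_ms": 20.0}

# Rows copied from the results table above.
runs = {
    "on": {"total_tps": 7001.55, "output_tps": 2100.46, "ttft_ms": 683.94, "tpot_ms": 13.83},
    "off": {"total_tps": 5727.56, "output_tps": 1718.27, "ttft_ms": 825.53, "tpot_ms": 17.05},
}


def meets_slo(run):
    """True if the run satisfies both latency targets."""
    return run["ttft_ms"] <= SLO["ttft_ms"] and run["tpot_ms"] <= SLO["tpot_ms"]


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new / old - 1.0) * 100.0


for name, run in runs.items():
    print(f"MegaMoE {name.upper()}: SLO met = {meets_slo(run)}")

for key in ("total_tps", "output_tps", "ttft_ms", "tpot_ms"):
    print(f"{key}: {pct_change(runs['on'][key], runs['off'][key]):+.2f}%")
```

Negative values for the latency rows mean MegaMoE ON is faster.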

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #28430198105
Latest PR Test (Extra): ❌ Run #28430197972

gemini-code-assist

Code Review

This pull request introduces the SM90 (Hopper) variant of the mega_moe_pre_dispatch kernel for DeepSeek v4, enabling all-FP8 Mega MoE support on SM90 GPUs. It includes the CUDA kernel implementation, JIT loading, integration into the mega_moe layer, and comprehensive unit tests. Feedback on the changes suggests optimizing _interleave_l1_weight_only by removing a redundant memory allocation and copy operation when stacking and reshaping tensors.

yuan-luo · 2026-06-24T06:04:20Z

Could we call the DeepGEMM MegaMoE API directly instead of porting the entire MegaMoE kernel code?
sgl-project/DeepGEMM#36

Something like this:

    if _is_sm90():
        # SM90 (Hopper) fp8 mega MoE (DeepGEMM dev). Input is quantized on the
        # host and copied into the symm buffer (no mega_moe_pre_dispatch kernel,
        # which is the SM100 path).
        from deep_gemm.utils import per_token_cast_to_fp8

        if num_tokens > 0:
            x_fp8 = per_token_cast_to_fp8(
                hidden_states, use_ue8m0=False, gran_k=128, use_packed_ue8m0=False
            )
            buf.x[:num_tokens].copy_(x_fp8[0])
            buf.x_sf[:num_tokens].copy_(x_fp8[1])
            buf.topk_idx[:num_tokens].copy_(topk_ids_in)
            buf.topk_weights[:num_tokens].copy_(topk_weights_in)

        num_local_experts = num_experts // ep_group.size()
        cum_stats = torch.zeros(
            (num_local_experts,), dtype=torch.int32, device=hidden_states.device
        )
        y = torch.empty(
            (max(num_tokens, 1), hidden_size),
            dtype=torch.bfloat16,
            device=hidden_states.device,
        )
        swiglu_limit = getattr(moe.config, "swiglu_limit", None)
        deep_gemm.fp8_mega_moe(
            y,
            moe.experts.mega_l1_weights,
            moe.experts.mega_l2_weights,
            buf,
            cumulative_local_expert_recv_stats=cum_stats,
            recipe=(128, 128, 128),
            activation="swiglu",
            activation_clamp=swiglu_limit,
            fast_math=True,
        )
        y = y[:num_tokens]
        if not moe.experts.should_fuse_routed_scaling_factor_in_topk:
            y.mul_(moe.routed_scaling_factor)
        return y

yuan-luo · 2026-06-24T06:52:25Z

Then add the SM90 DeepGEMM weight processing in build_mega_moe_experts_weights:

def build_mega_moe_experts_weights(experts) -> None:
    if getattr(experts, "_mega_moe_weights_built", False):
        return
    if _is_sm90():
        # SM90 (Hopper) mega weight prep: gate/up interleave only (no UTCCP /
        # ue8m0 transpose, which are SM100-only). sglang stores fp8 weights in
        # an int8 container; view back to fp8_e4m3fn. Scales stay in the
        # loaded 128x128 fp8 block layout.
        import deep_gemm

        w13 = experts.w13_weight.data
        w2 = experts.w2_weight.data
        if w13.dtype == torch.int8:
            w13 = w13.view(torch.float8_e4m3fn)
        if w2.dtype == torch.int8:
            w2 = w2.view(torch.float8_e4m3fn)
        l1_pair, l2_pair = deep_gemm.transform_weights_for_mega_moe_sm90(
            (w13, experts.w13_weight_scale_inv.data),
            (w2, experts.w2_weight_scale_inv.data),
        )
        experts.mega_l1_weights = l1_pair
        experts.mega_l2_weights = l2_pair
        experts._mega_moe_weights_built = True
        return
    from deep_gemm import (
        transform_sf_into_required_layout,
        transform_weights_for_mega_moe,
    )
    from deep_gemm.mega import _interleave_l1_weights, _transpose_sf_for_utccp

Fridge003 · 2026-06-25T03:16:45Z

@qiushixiaoyu Please test the full set of GPQA with 32 turns, thanks
Instructions:

sgl-eval run gpqa \
  --model fp8_flash --api-key <api-key> \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1

The model should be launched with SGLANG_DEFAULT_THINKING=1

Fridge003 · 2026-06-25T03:31:00Z

Is sm90 fp8 megamoe also applicable on other models, like dsv3 or glm5?

qiushixiaoyu · 2026-06-25T07:06:20Z

Is sm90 fp8 megamoe also applicable on other models, like dsv3 or glm5?

I think it's feasible, but we'll need to make some code adaptations.

qiushixiaoyu · 2026-06-26T06:59:32Z

sgl-eval run gpqa \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1

== gpqa ==
198 examples x 16 repeats | 13017.8s | 2620 tok/s | 34.1M tokens

pass@1[avg-of-16] = 88.64% +/- 1.28% (SEM 0.32%)
pass@16 = 96.46%
majority@16 = 90.15%
no_answer = 0.03%
stop_rate = 100.00%
truncated_rate = 0.00%
error_rate = 0.00%

Fridge003 · 2026-06-29T09:07:34Z

@qiushixiaoyu Can you please update the cookbook configuration tips for sm90+megamoe?
https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4#2-configuration-tips

Fridge003 · 2026-06-29T09:11:29Z

Also please add a test for megamoe on sm90, can be a new subtest of test/registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py

Fridge003

see above

qiushixiaoyu · 2026-06-30T08:20:01Z

see above

@Fridge003 Thanks for the review! Both points are addressed in the latest commit.

Fridge003 · 2026-07-02T07:42:42Z

/rerun-test test/registered/models_e2e/test_deepseek_v4_flash_fp4_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_h200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_megamoe_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py

github-actions · 2026-07-02T07:43:08Z

Results for /rerun-test test/registered/models_e2e/test_deepseek_v4_flash_fp4_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_h200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_megamoe_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py:

🚀 4-gpu-b200 (2 tests): ✅ View workflow run

cd test/ && python3 registered/models_e2e/test_deepseek_v4_flash_fp4_b200.py
cd test/ && python3 registered/models_e2e/test_deepseek_v4_flash_fp4_megamoe_b200.py

🚀 8-gpu-h200 (2 tests): ❌ View workflow run

cd test/ && python3 registered/models_e2e/test_deepseek_v4_flash_fp4_h200.py
cd test/ && python3 registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py

qiushixiaoyu · 2026-07-03T05:00:45Z

@Fridge003
I've root-caused the h200 failure.
The 8-gpu-h200 TestDSV4FlashFP8H200MegaMoE failure is caused by a missing DeepGEMM dependency, not by a bug in this PR's logic. The CI image's deep_gemm predates sgl-project/DeepGEMM#48, so the SM90 FP8 MegaMoE path is never actually activated; execution silently falls back to the generic masked GEMM, which aborts on the first idle (empty) batch.
The PR description already lists #48 as a small-batch performance optimization, but the h200 failure shows that it is actually a hard runtime dependency for the SM90 path: #48 also adds the mega_moe_pre_dispatch_sm90 entrypoint that is_sm90_fp8_mega_moe_available() gates on.
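
One way to turn this kind of silent fallback into an explicit error is to check for the required entrypoints up front. The sketch below is hypothetical and is not the implementation of is_sm90_fp8_mega_moe_available(): the helper names are invented, and it assumes the entrypoints are exposed as top-level attributes of the module, which this thread does not confirm. A `SimpleNamespace` stands in for an older deep_gemm build.

```python
import types


def missing_entrypoints(module, names):
    """Return the names that `module` does not expose as callables."""
    return [name for name in names if not callable(getattr(module, name, None))]


def require_entrypoints(module, names):
    """Raise a descriptive error instead of letting callers fall back silently."""
    missing = missing_entrypoints(module, names)
    if missing:
        raise RuntimeError(
            "SM90 FP8 MegaMoE requested but the installed build lacks: "
            + ", ".join(missing)
        )


# Stand-in for a deep_gemm build that predates the SM90 pre-dispatch entrypoint.
old_build = types.SimpleNamespace(fp8_mega_moe=lambda *args, **kwargs: None)

try:
    require_entrypoints(old_build, ["fp8_mega_moe", "mega_moe_pre_dispatch_sm90"])
except RuntimeError as exc:
    print(exc)
```

Failing at startup when the MegaMoE environment variable is set would surface the dependency mismatch before the first idle batch reaches the masked GEMM path.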

qiushixiaoyu requested review from AniZpZ, BBuf, DarkSharpness, Edwardf0t1, FlamingoPg, Fridge003, HaiShaw, HydraQYH, Ying1123, b8zhong, celve, ch-wan, ispobock, merrymercy and yuan-luo as code owners June 23, 2026 08:14

github-actions Bot added deepseek jit-kernel labels Jun 23, 2026

gemini-code-assist Bot reviewed Jun 23, 2026

Comment thread python/sglang/srt/layers/moe/mega_moe.py Outdated

qiushixiaoyu force-pushed the mega_moe_adapt branch from 29360d0 to 2d5188b Compare June 23, 2026 08:30

Fridge003 added the release-highlight Candidate PR for release note highlight label Jun 23, 2026

Fridge003 self-assigned this Jun 23, 2026

Fridge003 requested changes Jun 25, 2026

Add SM90 FP8 MegaMoE adaptation

f1ba22e

qiushixiaoyu force-pushed the mega_moe_adapt branch from b3c6135 to f1ba22e Compare June 25, 2026 11:04

Fridge003 approved these changes Jun 29, 2026

Fridge003 mentioned this pull request Jun 29, 2026 in DeepSeek V4 Roadmap #23602 (Open, 36 tasks)

Fridge003 added 2 commits June 29, 2026 02:08

Merge remote-tracking branch 'origin/main' into mega_moe_adapt

e0add98

fix lint

4c86636

Fridge003 requested changes Jun 29, 2026

Add SM90 FP8 MegaMoE cookbook configuration tips and H200 e2e subtest

b6c3549

qiushixiaoyu requested review from JustinTong0323, sogalin, wisclmy0611 and zijiexia as code owners June 30, 2026 08:14

github-actions Bot added the documentation Improvements or additions to documentation label Jun 30, 2026

qiushixiaoyu requested a review from Fridge003 June 30, 2026 08:24

Fridge003 approved these changes Jul 2, 2026
