Add SM90 FP8 MegaMoE support for DeepSeek-V4 #29016

Open
qiushixiaoyu wants to merge 4 commits into sgl-project:main from qiushixiaoyu:mega_moe_adapt

Conversation

@qiushixiaoyu

@qiushixiaoyu qiushixiaoyu commented Jun 23, 2026

Motivation

This PR adds SM90 DeepGEMM MegaMoE adaptation for DeepSeek-V4 FP8 serving. It enables the MegaMoE A2A path with the DeepGEMM MoE runner on SM90, including the pre-dispatch JIT kernel, FP8 expert weight preparation, and DeepSeek-V4 integration.

The goal is to improve long-context / large decode serving throughput for DeepSeek-V4-Flash/Pro-FP8 workloads while keeping the path guarded by environment variables.

Modifications

  • Add SM90 MegaMoE pre-dispatch JIT kernel for preparing activations, scales, top-k indices, and top-k weights.
  • Add unit tests for the SM90 MegaMoE pre-dispatch kernel against a PyTorch reference implementation.
  • Integrate the MegaMoE path into DeepSeek-V4 / FP8 MoE execution.
  • Add FP8 MegaMoE expert weight preparation support.
  • Wire DeepGEMM MegaMoE execution through the existing MoE backend selection.

Environment variables used to enable the path:

  • SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 enables the DeepGEMM MegaMoE path.
  • SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1 enables the fixed MegaMoE memory path.
  • SGLANG_OPT_USE_JIT_EP_ACTIVATION=1 is required by the fixed MegaMoE memory path.
  • SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096 configures the max token buffer size per rank used by the MegaMoE path. In PD colocated serving, it should be set to the same value as --chunked-prefill-size; in decode-only serving, it should be larger than the maximum number of tokens per rank. The experiment below used 4096, matching --chunked-prefill-size=4096.
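
The relationships between these variables can be checked before launching the server. The following is a minimal sketch and is not code from this PR: `check_mega_moe_env` is a hypothetical helper, and it encodes only the rules stated in the list above (the fixed-memory path needs the JIT EP activation flag; the token budget should equal `--chunked-prefill-size` in PD colocated serving and exceed the per-rank token maximum in decode-only serving).

```python
def check_mega_moe_env(env, chunked_prefill_size=None, max_tokens_per_rank=None):
    """Return a list of problems; an empty list means the settings look consistent.

    Hypothetical helper. Pass chunked_prefill_size for PD colocated serving,
    or max_tokens_per_rank for decode-only serving.
    """
    problems = []
    if env.get("SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE") != "1":
        return problems  # MegaMoE path is disabled, nothing to validate

    # The fixed-memory path requires the JIT EP activation flag.
    if (env.get("SGLANG_OPT_FIX_MEGA_MOE_MEMORY") == "1"
            and env.get("SGLANG_OPT_USE_JIT_EP_ACTIVATION") != "1"):
        problems.append("FIX_MEGA_MOE_MEMORY=1 requires USE_JIT_EP_ACTIVATION=1")

    raw = env.get("SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK")
    if raw is None:
        return problems  # unset: this sketch does not guess the default
    budget = int(raw)

    # PD colocated serving: budget should match --chunked-prefill-size.
    if chunked_prefill_size is not None and budget != chunked_prefill_size:
        problems.append(
            f"token budget {budget} != chunked prefill size {chunked_prefill_size}"
        )
    # Decode-only serving: budget should be larger than the per-rank maximum.
    if max_tokens_per_rank is not None and budget <= max_tokens_per_rank:
        problems.append(
            f"token budget {budget} is not larger than {max_tokens_per_rank}"
        )
    return problems


experiment_env = {
    "SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE": "1",
    "SGLANG_OPT_FIX_MEGA_MOE_MEMORY": "1",
    "SGLANG_OPT_USE_JIT_EP_ACTIVATION": "1",
    "SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK": "4096",
}
print(check_mega_moe_env(experiment_env, chunked_prefill_size=4096))
```

In practice the helper could be called with `dict(os.environ)` from a launch wrapper; the settings used in the experiment produce no problems.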

Serving command used in the experiment:

export MODEL_PATH=/data00/models/DeepSeek-V4-Flash-FP8
export SERVER_PORT=30000
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8
export FLASHINFER_DISABLE_VERSION_CHECK=1
export NVSHMEM_HCA_PE_MAPPING=mlx5_1:1:1,mlx5_2:1:1,mlx5_3:1:1,mlx5_4:1:1,mlx5_5:1:1,mlx5_6:1:1,mlx5_7:1:1,mlx5_8:1:1
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_THINKING=1
export SGLANG_DSV4_FP4_EXPERTS=0
export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
export GLOO_SOCKET_IFNAME=eth0
export NCCL_MIN_NCHANNELS=24
export NCCL_IB_QPS_PER_CONNECTION=8

export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1
export SGLANG_OPT_USE_JIT_EP_ACTIVATION=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096

python -m sglang.launch_server \
  --trust-remote-code \
  --model-path "${MODEL_PATH}" \
  --tp 8 \
  --ep-size 8 \
  --chunked-prefill-size 4096 \
  --moe-a2a-backend deepep \
  --moe-runner-backend deep_gemm \
  --deepep-mode auto \
  --cuda-graph-max-bs 32 \
  --max-running-requests 32 \
  --speculative-algo EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-metrics \
  --host 0.0.0.0 \
  --port "${SERVER_PORT}" \
  --mem-fraction-static 0.75 \
  --tool-call-parser deepseekv4 \
  --reasoning-parser deepseek-v4

Accuracy Tests

sgl-eval run gpqa \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1

== gpqa ==
198 examples x 16 repeats | 13017.8s | 2620 tok/s | 34.1M tokens

pass@1[avg-of-16] = 88.64% +/- 1.28% (SEM 0.32%)
pass@16 = 96.46%
majority@16 = 90.15%
no_answer = 0.03%
stop_rate = 100.00%
truncated_rate = 0.00%
error_rate = 0.00%
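
The reported SEM is consistent with the usual standard-error formula, assuming the "+/- 1.28%" figure is the standard deviation of pass@1 across the 16 repeats (the output itself does not say so). A short check under that assumption:

```python
import math


def standard_error(std_percent, n_repeats):
    """Standard error of the mean: std / sqrt(n)."""
    return std_percent / math.sqrt(n_repeats)


# Values copied from the gpqa summary above.
sem = standard_error(1.28, 16)
print(f"SEM = {sem:.2f}%")
```

With 16 repeats the divisor is 4, so a spread of 1.28% gives the 0.32% shown in the summary.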

Benchmark Result
GSM8K Accuracy 0.956, Invalid 0.000
GPQA Score 0.90625, num_examples=32

Unit test:

python -m pytest -q python/sglang/jit_kernel/tests/test_mega_moe_pre_dispatch_sm90.py

Result:

10 passed, 20 warnings in 7.32s

Speed Tests and Profiling

DeepGEMM dependency note:

The FP8 MegaMoE operator has already been merged into DeepGEMM in sgl-project/DeepGEMM#36. This benchmark additionally includes the small-batch performance optimization from sgl-project/DeepGEMM#48.

Benchmark script:
The serving benchmark was run with sglang_auto_bench.py, an internal helper script that wraps python -m sglang.bench_serving and automatically searches for the highest request rate / max concurrency for a given input-output length. It reports both:

  • SLO-compliant throughput, constrained by the configured TTFT / TPOT targets.
  • Max throughput, which may violate the SLO and is used only to understand the saturation point.

For the result below, the script was run with random prompts, input/output length 3500/1500, and SLO targets TTFT <= 2000 ms and TPOT <= 20 ms.
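
The difference between the two reported numbers can be illustrated as follows. This is a sketch only: `sglang_auto_bench.py` is internal and its search logic is not shown in this PR, so the `Run` record, `meets_slo`, and `summarize` are hypothetical names, and the sweep values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    request_rate: float  # offered load, requests per second
    throughput: float    # total tokens per second
    ttft_ms: float       # time to first token
    tpot_ms: float       # time per output token


def meets_slo(run, ttft_target_ms=2000.0, tpot_target_ms=20.0):
    """True when both latency targets from the PR description are met."""
    return run.ttft_ms <= ttft_target_ms and run.tpot_ms <= tpot_target_ms


def summarize(runs):
    """Return (best SLO-compliant throughput, max throughput regardless of SLO)."""
    compliant = [r.throughput for r in runs if meets_slo(r)]
    slo_throughput = max(compliant) if compliant else None
    max_throughput = max(r.throughput for r in runs)
    return slo_throughput, max_throughput


# Made-up sweep values for illustration; these are NOT measurements from this PR.
runs = [
    Run(1.0, 3000.0, 400.0, 10.0),
    Run(2.0, 5500.0, 900.0, 18.0),
    Run(3.0, 6000.0, 2600.0, 27.0),  # saturated: violates both targets
]
slo_tp, max_tp = summarize(runs)
print(slo_tp, max_tp)
```

The last run has the highest throughput but misses both targets, so it counts only toward the saturation figure.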

| Mode | SLO total throughput | SLO output throughput | TTFT | TPOT |
|---|---|---|---|---|
| MegaMoE ON | 7001.55 tok/s | 2100.46 tok/s | 683.94 ms | 13.83 ms |
| MegaMoE OFF | 5727.56 tok/s | 1718.27 tok/s | 825.53 ms | 17.05 ms |

MegaMoE ON vs OFF:
SLO total throughput: +22.24%
SLO output throughput: +22.24%
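
The percentages follow directly from the throughput figures reported for the two modes. A small check, using only those figures:

```python
def relative_gain_percent(on, off):
    """Relative improvement of `on` over `off`, in percent."""
    return (on / off - 1.0) * 100.0


total_gain = relative_gain_percent(7001.55, 5727.56)   # SLO total throughput, tok/s
output_gain = relative_gain_percent(2100.46, 1718.27)  # SLO output throughput, tok/s
print(f"total: +{total_gain:.2f}%  output: +{output_gain:.2f}%")
```

Both ratios round to the same value because the benchmark used a fixed input/output length split.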

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #28430198105
Latest PR Test (Extra): ❌ Run #28430197972

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the SM90 (Hopper) variant of the mega_moe_pre_dispatch kernel for DeepSeek v4, enabling all-FP8 Mega MoE support on SM90 GPUs. It includes the CUDA kernel implementation, JIT loading, integration into the mega_moe layer, and comprehensive unit tests. Feedback on the changes suggests optimizing _interleave_l1_weight_only by removing a redundant memory allocation and copy operation when stacking and reshaping tensors.

Comment thread python/sglang/srt/layers/moe/mega_moe.py Outdated
Fridge003 added the release-highlight label (Candidate PR for release note highlight) on Jun 23, 2026
Fridge003 self-assigned this on Jun 23, 2026
@yuan-luo

Collaborator

Could we call the DeepGEMM MegaMoE API directly instead of porting the whole MegaMoE kernel code?
sgl-project/DeepGEMM#36

Something like this:

    if _is_sm90():
        # SM90 (Hopper) fp8 mega MoE (DeepGEMM dev). Input is quantized on the
        # host and copied into the symm buffer (no mega_moe_pre_dispatch kernel,
        # which is the SM100 path).
        from deep_gemm.utils import per_token_cast_to_fp8

        if num_tokens > 0:
            x_fp8 = per_token_cast_to_fp8(
                hidden_states, use_ue8m0=False, gran_k=128, use_packed_ue8m0=False
            )
            buf.x[:num_tokens].copy_(x_fp8[0])
            buf.x_sf[:num_tokens].copy_(x_fp8[1])
            buf.topk_idx[:num_tokens].copy_(topk_ids_in)
            buf.topk_weights[:num_tokens].copy_(topk_weights_in)

        num_local_experts = num_experts // ep_group.size()
        cum_stats = torch.zeros(
            (num_local_experts,), dtype=torch.int32, device=hidden_states.device
        )
        y = torch.empty(
            (max(num_tokens, 1), hidden_size),
            dtype=torch.bfloat16,
            device=hidden_states.device,
        )
        swiglu_limit = getattr(moe.config, "swiglu_limit", None)
        deep_gemm.fp8_mega_moe(
            y,
            moe.experts.mega_l1_weights,
            moe.experts.mega_l2_weights,
            buf,
            cumulative_local_expert_recv_stats=cum_stats,
            recipe=(128, 128, 128),
            activation="swiglu",
            activation_clamp=swiglu_limit,
            fast_math=True,
        )
        y = y[:num_tokens]
        if not moe.experts.should_fuse_routed_scaling_factor_in_topk:
            y.mul_(moe.routed_scaling_factor)
        return y
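
The `gran_k=128` argument in the snippet above suggests per-token scaling over 128-wide groups along the hidden dimension. The pure-Python sketch below illustrates that idea only; it is not DeepGEMM's implementation and performs no real FP8 rounding. `group_scales` and `scale_row` are hypothetical names, 448.0 is the largest finite `float8_e4m3fn` value, and the `1e-4` floor on each group maximum is an assumption made here to avoid a zero scale.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn magnitude


def group_scales(row, group=128, floor=1e-4):
    """One scale per `group` consecutive values of a single token's row."""
    scales = []
    for start in range(0, len(row), group):
        amax = max(abs(v) for v in row[start:start + group])
        # Assumed floor so an all-zero group does not yield a zero scale.
        scales.append(max(amax, floor) / FP8_E4M3_MAX)
    return scales


def scale_row(row, group=128):
    """Divide each value by its group scale so it fits the FP8 range."""
    scales = group_scales(row, group)
    scaled = [v / scales[i // group] for i, v in enumerate(row)]
    return scaled, scales


# One token with hidden size 256, i.e. two 128-wide groups.
token = [1.0] * 128 + [-448.0] * 128
scaled, scales = scale_row(token)
print(scales[1], max(abs(v) for v in scaled))
```

Each group is mapped so that its largest magnitude lands on the FP8 maximum, which is why the scales have to travel alongside the quantized activations.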

@yuan-luo

yuan-luo commented Jun 24, 2026

Collaborator

In build_mega_moe_experts_weights, add the SM90 DeepGEMM weight processing:

def build_mega_moe_experts_weights(experts) -> None:
    if getattr(experts, "_mega_moe_weights_built", False):
        return
    if _is_sm90():
        # SM90 (Hopper) mega weight prep: gate/up interleave only (no UTCCP /
        # ue8m0 transpose, which are SM100-only). sglang stores fp8 weights in
        # an int8 container; view back to fp8_e4m3fn. Scales stay in the
        # loaded 128x128 fp8 block layout.
        import deep_gemm

        w13 = experts.w13_weight.data
        w2 = experts.w2_weight.data
        if w13.dtype == torch.int8:
            w13 = w13.view(torch.float8_e4m3fn)
        if w2.dtype == torch.int8:
            w2 = w2.view(torch.float8_e4m3fn)
        l1_pair, l2_pair = deep_gemm.transform_weights_for_mega_moe_sm90(
            (w13, experts.w13_weight_scale_inv.data),
            (w2, experts.w2_weight_scale_inv.data),
        )
        experts.mega_l1_weights = l1_pair
        experts.mega_l2_weights = l2_pair
        experts._mega_moe_weights_built = True
        return
    from deep_gemm import (
        transform_sf_into_required_layout,
        transform_weights_for_mega_moe,
    )
    from deep_gemm.mega import _interleave_l1_weights, _transpose_sf_for_utccp

@Fridge003

Collaborator

@qiushixiaoyu Please test the full set of GPQA with 32 turns, thanks
Instructions:

sgl-eval run gpqa \
  --model fp8_flash --api-key <api-key> \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1

The model should be launched with SGLANG_DEFAULT_THINKING=1

Comment thread python/sglang/jit_kernel/tests/test_mega_moe_pre_dispatch_sm90.py Outdated
Comment thread python/sglang/jit_kernel/csrc/deepseek_v4/mega_moe_pre_dispatch_sm90.cuh Outdated
Comment thread python/sglang/srt/layers/quantization/fp8.py Outdated
Comment thread python/sglang/srt/layers/quantization/fp8.py Outdated
Comment thread python/sglang/srt/layers/moe/mega_moe.py
@Fridge003

Fridge003 commented Jun 25, 2026

Collaborator

Is sm90 fp8 megamoe also applicable on other models, like dsv3 or glm5?

@qiushixiaoyu

Author

> Is sm90 fp8 megamoe also applicable on other models, like dsv3 or glm5?

I think it's feasible, but we'll need to make some code adaptations.

@qiushixiaoyu

Author

> @qiushixiaoyu Please test the full set of GPQA with 32 turns, thanks
> Instructions:
>
> sgl-eval run gpqa \
>   --model fp8_flash --api-key <api-key> \
>   --n-repeats 16 --max-tokens 200000 \
>   --temperature 1.0 --top-p 1.0 --thinking \
>   --out-dir /sgl-workspace/logs \
>   --base-url http://localhost:30000/v1
>
> The model should be launched with SGLANG_DEFAULT_THINKING=1

sgl-eval run gpqa \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1

== gpqa ==
198 examples x 16 repeats | 13017.8s | 2620 tok/s | 34.1M tokens

pass@1[avg-of-16] = 88.64% +/- 1.28% (SEM 0.32%)
pass@16 = 96.46%
majority@16 = 90.15%
no_answer = 0.03%
stop_rate = 100.00%
truncated_rate = 0.00%
error_rate = 0.00%

@Fridge003 Fridge003 mentioned this pull request Jun 29, 2026
36 tasks
@Fridge003

Fridge003 commented Jun 29, 2026

Collaborator

@qiushixiaoyu Can you please update the cookbook configuration tips for sm90+megamoe?
https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4#2-configuration-tips

@Fridge003

Collaborator

Also please add a test for megamoe on sm90, can be a new subtest of test/registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py

@Fridge003 Fridge003 left a comment

Collaborator

see above

github-actions Bot added the documentation label (Improvements or additions to documentation) on Jun 30, 2026
@qiushixiaoyu

qiushixiaoyu commented Jun 30, 2026

Author

> see above

@Fridge003 Thanks for the review! Both points are addressed in the latest commit.

@qiushixiaoyu qiushixiaoyu requested a review from Fridge003 June 30, 2026 08:24
@Fridge003

Collaborator

/rerun-test test/registered/models_e2e/test_deepseek_v4_flash_fp4_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_h200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_megamoe_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

Results for /rerun-test test/registered/models_e2e/test_deepseek_v4_flash_fp4_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_h200.py test/registered/models_e2e/test_deepseek_v4_flash_fp4_megamoe_b200.py test/registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py:

🚀 4-gpu-b200 (2 tests): ✅ View workflow run

cd test/ && python3 registered/models_e2e/test_deepseek_v4_flash_fp4_b200.py
cd test/ && python3 registered/models_e2e/test_deepseek_v4_flash_fp4_megamoe_b200.py

🚀 8-gpu-h200 (2 tests): ❌ View workflow run

cd test/ && python3 registered/models_e2e/test_deepseek_v4_flash_fp4_h200.py
cd test/ && python3 registered/models_e2e/test_deepseek_v4_flash_fp8_h200.py

@qiushixiaoyu

Author

@Fridge003
I've root-caused the h200 failure.
The 8-gpu-h200 TestDSV4FlashFP8H200MegaMoE failure is caused by a missing DeepGEMM dependency, not by a bug in this PR's logic. The CI image's deep_gemm predates sgl-project/DeepGEMM#48, so the SM90 FP8 MegaMoE path is never activated; execution silently falls back to the generic masked GEMM, which aborts on the first idle (empty) batch.
The PR description lists #48 only as a small-batch performance optimization, but the H200 failure shows it is a hard runtime dependency for the SM90 path: #48 also adds the mega_moe_pre_dispatch_sm90 entrypoint that is_sm90_fp8_mega_moe_available() gates on.
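
One way to surface this kind of missing dependency earlier is to fail loudly when the path was explicitly requested but the entrypoint is absent, rather than falling back. The sketch below is hypothetical and is not the logic of is_sm90_fp8_mega_moe_available(): `require_sm90_mega_moe` is an invented name, and looking the entrypoint up as a module attribute is an assumption. Stand-in objects replace the real deep_gemm module so the example runs anywhere.

```python
from types import SimpleNamespace


def require_sm90_mega_moe(deep_gemm_module, requested,
                          entrypoint="mega_moe_pre_dispatch_sm90"):
    """Return True if the SM90 MegaMoE path may be used.

    Raises instead of silently falling back when the user asked for the
    path but the installed module lacks the (assumed) entrypoint attribute.
    """
    if not requested:
        return False  # path not requested: the normal backend is expected
    if not hasattr(deep_gemm_module, entrypoint):
        raise RuntimeError(
            f"MegaMoE was requested but deep_gemm has no '{entrypoint}'; "
            "the installed DeepGEMM build is probably too old"
        )
    return True


# Stand-ins for an old and a new DeepGEMM build (illustrative only).
old_build = SimpleNamespace()
new_build = SimpleNamespace(mega_moe_pre_dispatch_sm90=lambda *args: None)

print(require_sm90_mega_moe(new_build, requested=True))
try:
    require_sm90_mega_moe(old_build, requested=True)
except RuntimeError as err:
    print("startup error:", err)
```

With a check like this, an outdated CI image would fail at startup with a message naming the missing entrypoint, instead of aborting later inside the fallback GEMM.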

Labels

deepseek, documentation, jit-kernel, release-highlight

3 participants