[SM90] Optimize FP8 MegaMoE small-batch decode path with swapAB by qiushixiaoyu · Pull Request #48 · sgl-project/DeepGEMM

qiushixiaoyu · 2026-06-22T08:59:55Z

Summary

This PR optimizes the SM90 FP8 MegaMoE decode split-N path for small-batch workloads by adding a heuristic-controlled swapAB kernel path.
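
The description does not spell out what swapAB does. In GEMM kernels the term usually means exchanging the roles of the two operands so the kernel produces the transposed output. On SM90 the usual motivation is that the WGMMA tile has a fixed M of 64 while N can be as small as 8, so a small token count wastes less work when it is mapped to N. The sketch below only checks the underlying identity in plain Python; the shapes and names are illustrative assumptions, not taken from the kernel.

```python
def matmul(a, b):
    """Plain-Python product of a (m x k) and b (k x n), given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(r) for r in zip(*m)]


# Illustrative shapes only: 2 tokens, hidden size 3, 4 output features.
tokens = [[1, 2, 3], [4, 5, 6]]                          # (tokens x hidden)
weight = [[1, 0, 2], [0, 1, 1], [3, 1, 0], [2, 2, 2]]    # (out x hidden)

# Normal orientation: tokens play the role of A, output is (tokens x out).
normal = matmul(tokens, transpose(weight))

# Swapped orientation: weights play the role of A, output is (out x tokens).
swapped = matmul(weight, transpose(tokens))

# The swapped result is the transpose of the normal one.
assert transpose(swapped) == normal
```

Under this reading, the swapped kernel still has to write its result back in the original layout, which is one reason such a path is typically gated to small batches.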

Main changes:

- Add a conservative swapAB heuristic for SM90 FP8 MegaMoE:
  - decode split-N path only
  - num_tokens <= 128
  - expected_tokens_per_expert > 0
- Keep the normal no-swap kernel path as the default fallback when the heuristic does not match.
- Use a single compile-time use_swap_ab template flag for the optimized path.
- Keep the change local to the SM90 FP8 MegaMoE implementation path.
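
The selection logic listed above can be sketched as a small predicate. This is a minimal illustration, assuming the three conditions are combined with a logical AND. `MoEProblem`, `should_use_swap_ab`, and `select_kernel` are hypothetical names, not DeepGEMM APIs, and the real choice is made through the compile-time `use_swap_ab` template flag rather than a runtime string.

```python
from dataclasses import dataclass

SWAP_AB_MAX_TOKENS = 128  # threshold stated in the PR summary


@dataclass(frozen=True)
class MoEProblem:
    """Hypothetical description of one MegaMoE launch (not a DeepGEMM type)."""
    is_decode_split_n: bool            # True when the decode split-N path is selected
    num_tokens: int                    # tokens in the batch
    expected_tokens_per_expert: float  # estimate consulted by the heuristic


def should_use_swap_ab(p: MoEProblem) -> bool:
    """Return True only when all three listed conditions hold (assumed AND)."""
    return (
        p.is_decode_split_n
        and p.num_tokens <= SWAP_AB_MAX_TOKENS
        and p.expected_tokens_per_expert > 0
    )


def select_kernel(p: MoEProblem) -> str:
    # Models the compile-time template flag as a label for illustration.
    return "use_swap_ab=true" if should_use_swap_ab(p) else "use_swap_ab=false"


# Batch 128 qualifies, batch 256 falls back to the normal no-swap path.
print(select_kernel(MoEProblem(True, 128, 3.0)))
print(select_kernel(MoEProblem(True, 256, 6.0)))
```

Keeping the predicate this narrow means any shape outside the measured range takes the existing kernel, which matches the "conservative" wording in the summary.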

Performance

Single-operator performance was measured against the low-latency baseline with --run-low-latency-baseline.

Notes:

- Each batch was run as one 8-rank process group.
- fused_mean_us and ll_mean_us are averages across ranks.
- swapAB is enabled by the heuristic for batch sizes 1 to 128; batch size 256 falls back to the normal no-swap path.

Flash shape

hidden=4096, intermediate=2048, experts=256, topk=6

Batch	swapAB	MegaMoE (us)	LL baseline (us)	Speedup
1	true	122.2	223.4	1.830x
2	true	193.4	309.2	1.606x
4	true	274.1	420.8	1.539x
8	true	287.6	489.5	1.702x
16	true	323.9	529.8	1.635x
32	true	341.0	532.8	1.562x
64	true	393.4	556.5	1.414x
128	true	476.6	572.9	1.202x
256	false	519.2	632.0	1.217x

Pro shape

hidden=7168, intermediate=3072, experts=384, topk=6

Batch	swapAB	MegaMoE (us)	LL baseline (us)	Speedup
1	true	309.4	489.6	1.582x
2	true	436.9	663.0	1.518x
4	true	674.5	973.2	1.443x
8	true	967.9	1367.0	1.412x
16	true	1154.1	1614.5	1.399x
32	true	1206.2	1701.8	1.411x
64	true	1298.8	1718.4	1.323x
128	true	1462.5	1749.8	1.196x
256	false	1619.0	1809.2	1.118x
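
The Speedup column in both tables appears to be the LL baseline latency divided by the MegaMoE latency. The sketch below recomputes it for a few rows using only values from the tables; the function name `speedup` is an illustrative assumption. A few rows (for example Flash batch 2) differ from this ratio in the second or third decimal, possibly because the reported figure was averaged per rank before rounding, which the PR does not state.

```python
def speedup(ll_baseline_us: float, megamoe_us: float) -> float:
    """Ratio of baseline latency to MegaMoE latency; values above 1 mean MegaMoE is faster."""
    return ll_baseline_us / megamoe_us


# (MegaMoE us, LL baseline us) copied from the tables above.
rows = {
    ("flash", 8): (287.6, 489.5),
    ("flash", 256): (519.2, 632.0),
    ("pro", 1): (309.4, 489.6),
}

for (shape, batch), (megamoe_us, ll_us) in rows.items():
    print(shape, batch, round(speedup(ll_us, megamoe_us), 3))
```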

Accuracy

Validation was run with the updated DeepGEMM code.

Single-operator accuracy:

PASSED all 28 scenarios
Max diff: 0.0006
Tolerance: 0.07

SGLang model-level accuracy:

Eval	Result
GSM8K full	0.953
GPQA, 32 examples	0.90625

qiushixiaoyu mentioned this pull request Jun 23, 2026

Add SM90 FP8 MegaMoE support for DeepSeek-V4 sgl-project/sglang#29016

Open

5 tasks

Fridge003 force-pushed the dev branch from 77c9522 to 731e7c7 on June 30, 2026 00:43

yinding added 4 commits June 30, 2026 16:43

Enable SM90 FP8 MegaMoE swapAB

6be80da

Add swapAB code paths for small-batch SM90 FP8 MegaMoE and remove ptxas C7510 sources from hot device code. (cherry picked from commit 0074938)

Fix SM90 FP8 MegaMoE swapAB synchronization

c36c666

(cherry picked from commit b230085)

Add SM90 FP8 MegaMoE pre-dispatch kernel

8d5d398

(cherry picked from commit 34fe473)

Add SM90 mega-MoE pre-dispatch test

6d2e9d5

(cherry picked from commit 15a6f42)

qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch from 15a6f42 to 6d2e9d5 on June 30, 2026 08:48

qiushixiaoyu wants to merge 4 commits into sgl-project:dev from qiushixiaoyu:sm90-mega-moe-on-sgl-dev
