
[SM90] Optimize FP8 MegaMoE small-batch decode path with swapAB #48

Open
qiushixiaoyu wants to merge 4 commits into sgl-project:dev from qiushixiaoyu:sm90-mega-moe-on-sgl-dev


@qiushixiaoyu


Summary

This PR optimizes the SM90 FP8 MegaMoE decode split-N path for small-batch workloads by adding a heuristic-controlled swapAB kernel path.

Main changes:

  • Add a conservative swapAB heuristic for SM90 FP8 MegaMoE:
    • decode split-N path only
    • num_tokens <= 128
    • expected_tokens_per_expert > 0
  • Keep the normal no-swap kernel path as the default fallback when the heuristic does not match.
  • Use a single compile-time use_swap_ab template flag for the optimized path.
  • Keep the change local to the SM90 FP8 MegaMoE implementation path.
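The selection rule listed above can be sketched as a small predicate. This is only an illustration of the three stated conditions, not the PR's code: the names `MoEDispatchInfo`, `should_use_swap_ab`, `select_kernel_variant`, and `SWAP_AB_MAX_TOKENS` are hypothetical, and the real implementation selects a compile-time `use_swap_ab` template flag rather than returning a string.

```python
from dataclasses import dataclass

# Threshold stated in the PR summary (num_tokens <= 128).
SWAP_AB_MAX_TOKENS = 128


@dataclass(frozen=True)
class MoEDispatchInfo:
    """Hypothetical container for the inputs the heuristic looks at."""
    is_decode_split_n: bool          # True only on the decode split-N path
    num_tokens: int                  # tokens in the current batch
    expected_tokens_per_expert: int  # must be positive for swapAB


def should_use_swap_ab(info: MoEDispatchInfo) -> bool:
    """Return True only when all three conservative conditions hold."""
    return (
        info.is_decode_split_n
        and info.num_tokens <= SWAP_AB_MAX_TOKENS
        and info.expected_tokens_per_expert > 0
    )


def select_kernel_variant(info: MoEDispatchInfo) -> str:
    """Fall back to the normal no-swap path whenever the heuristic misses."""
    return "swap_ab" if should_use_swap_ab(info) else "no_swap"


if __name__ == "__main__":
    for batch in (1, 128, 256):
        info = MoEDispatchInfo(True, batch, 1)
        print(batch, select_kernel_variant(info))
```

Keeping the predicate as a conjunction of independent checks means any input outside the tested range falls through to the existing kernel, which matches the "default fallback" behavior described in the summary.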

Performance

Single-operator performance was measured against the low-latency baseline with --run-low-latency-baseline.

Notes:

  • Each batch was run as one 8-rank process group.
  • fused_mean_us and ll_mean_us are averages across ranks.
  • swapAB is enabled by heuristic for batch 1-128; batch 256 falls back to the normal no-swap path.

Flash shape

hidden=4096, intermediate=2048, experts=256, topk=6

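| Batch | swapAB | MegaMoE (us) | LL baseline (us) | Speedup |
|------:|:------:|-------------:|-----------------:|--------:|
| 1     | true   | 122.2        | 223.4            | 1.830x  |
| 2     | true   | 193.4        | 309.2            | 1.606x  |
| 4     | true   | 274.1        | 420.8            | 1.539x  |
| 8     | true   | 287.6        | 489.5            | 1.702x  |
| 16    | true   | 323.9        | 529.8            | 1.635x  |
| 32    | true   | 341.0        | 532.8            | 1.562x  |
| 64    | true   | 393.4        | 556.5            | 1.414x  |
| 128   | true   | 476.6        | 572.9            | 1.202x  |
| 256   | false  | 519.2        | 632.0            | 1.217x  |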
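For readers who want to re-derive the columns, here is a minimal sketch. It assumes Speedup is the low-latency baseline mean divided by the MegaMoE mean, and that each mean is a plain arithmetic average over the 8 ranks. Both helper names are hypothetical. Recomputing from the rounded means reproduces the batch-256 Flash row, but it can differ in the last digit for some other rows, presumably because the reported speedups were computed from unrounded timings.

```python
from statistics import fmean


def mean_across_ranks(per_rank_us):
    """Average one latency measurement (in microseconds) over all ranks."""
    return fmean(per_rank_us)


def speedup(fused_mean_us, ll_mean_us):
    """Assumed definition: baseline latency divided by MegaMoE latency."""
    if fused_mean_us <= 0:
        raise ValueError("fused_mean_us must be positive")
    return ll_mean_us / fused_mean_us


if __name__ == "__main__":
    # Values taken from the batch-256 Flash-shape row of this description.
    flash_batch_256 = speedup(fused_mean_us=519.2, ll_mean_us=632.0)
    print(f"{flash_batch_256:.3f}x")  # prints 1.217x
```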

Pro shape

hidden=7168, intermediate=3072, experts=384, topk=6

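| Batch | swapAB | MegaMoE (us) | LL baseline (us) | Speedup |
|------:|:------:|-------------:|-----------------:|--------:|
| 1     | true   | 309.4        | 489.6            | 1.582x  |
| 2     | true   | 436.9        | 663.0            | 1.518x  |
| 4     | true   | 674.5        | 973.2            | 1.443x  |
| 8     | true   | 967.9        | 1367.0           | 1.412x  |
| 16    | true   | 1154.1       | 1614.5           | 1.399x  |
| 32    | true   | 1206.2       | 1701.8           | 1.411x  |
| 64    | true   | 1298.8       | 1718.4           | 1.323x  |
| 128   | true   | 1462.5       | 1749.8           | 1.196x  |
| 256   | false  | 1619.0       | 1809.2           | 1.118x  |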

Accuracy

Validation was run with the updated DeepGEMM code.

Single-operator accuracy:

  • PASSED all 28 scenarios
  • Max diff: 0.0006
  • Tolerance: 0.07
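The pass criterion above amounts to comparing the largest per-scenario diff against the tolerance. The sketch below shows that check together with one similarity-based diff that is often used for FP8 kernel validation. Whether the validation here uses this exact metric is not stated in the PR, so treat `similarity_diff` and `all_within_tolerance` as hypothetical helpers written in pure Python for illustration.

```python
def similarity_diff(actual, reference):
    """Assumed metric: 1 - 2*sum(x*y) / sum(x*x + y*y); 0 means identical."""
    denominator = sum(x * x + y * y for x, y in zip(actual, reference))
    if denominator == 0:
        return 0.0  # both inputs are all zeros, so they match exactly
    numerator = 2.0 * sum(x * y for x, y in zip(actual, reference))
    return 1.0 - numerator / denominator


def all_within_tolerance(scenario_diffs, tolerance):
    """A run passes only if the worst scenario stays within the tolerance."""
    return max(scenario_diffs) <= tolerance


if __name__ == "__main__":
    # The reported worst case (0.0006) against the stated tolerance (0.07).
    print(all_within_tolerance([0.0006], tolerance=0.07))
```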

SGLang model-level accuracy:

| Eval              | Result  |
|:------------------|--------:|
| GSM8K full        | 0.953   |
| GPQA, 32 examples | 0.90625 |

yinding added 4 commits June 30, 2026 16:43
Add swapAB code paths for small-batch SM90 FP8 MegaMoE and remove ptxas C7510 sources from hot device code.

(cherry picked from commit 0074938)
(cherry picked from commit 34fe473)
(cherry picked from commit 15a6f42)
qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch from 15a6f42 to 6d2e9d5 on June 30, 2026 08:48