Feat: sm90 support mxfp8 fp8 groupgemm by zhangxiaolei123456 · Pull Request #362 · deepseek-ai/DeepGEMM

zhangxiaolei123456 · 2026-06-16T03:24:40Z

Accuracy test:

python -m pytest tests/test_sm90_mxfp8_fp8.py -q OK

Performance test:

python -m pytest tests/test_sm90_mxfp8_fp8.py::test_m_grouped_mxfp8_vs_fp8_perf_contiguous_and_masked -q -s

kernel	active M	MXFP8 (µs)	FP8 (µs)	MXFP8 TFLOPS	FP8 TFLOPS	speedup	diff
contiguous	512	23	31	45.9	35.0	1.31x	0.0462
masked	320	22	32	30.4	21.0	1.45x	0.0229

The MXFP8 consumer was issuing a fresh global load for every (kk, accum) pair, even though the producer warp had already staged SFA/SFB into shared memory. This change replaces the per-iteration GMEM lookups with two LDS loads (one 32-bit pack per SFA row, one 64-bit pack per pair of SFB rows) issued outside the wgmma kk loop, and decodes the per-kk byte with a register-side shift. The result stays bit-identical while the GMEM dependency is removed from the inner wgmma pipeline.
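The register-side decode described above can be illustrated with a small host-side sketch in Python. It is not the kernel code from this PR: the helper names `pack_ue8m0`, `decode_scale_byte`, and `ue8m0_to_float` are hypothetical, and the byte order (the scale for kk = 0 in the lowest byte) is an assumption made for illustration. It only shows the idea of loading one 32-bit pack once and extracting each per-kk UE8M0 byte with a shift and mask.

```python
def pack_ue8m0(exponents):
    """Pack four UE8M0 exponent bytes into one 32-bit word.

    Assumption: the byte for kk = 0 sits in the lowest 8 bits.
    """
    assert len(exponents) == 4
    assert all(0 <= e <= 255 for e in exponents)
    word = 0
    for kk, e in enumerate(exponents):
        word |= e << (8 * kk)
    return word


def decode_scale_byte(word, kk):
    """Extract the per-kk exponent byte with a shift and mask (no extra load)."""
    return (word >> (8 * kk)) & 0xFF


def ue8m0_to_float(e):
    """UE8M0 is an unsigned 8-bit exponent with bias 127: value = 2^(e - 127).

    The encoding 0xFF is reserved for NaN and is not handled here.
    """
    return 2.0 ** (e - 127)


# One pack is "loaded" once, outside the inner loop ...
pack = pack_ue8m0([127, 128, 126, 130])

# ... and each kk step only does a shift and mask on that value.
scales = [ue8m0_to_float(decode_scale_byte(pack, kk)) for kk in range(4)]
print(scales)
```

Because the decode only reorders where the byte comes from, the scale applied at each kk step is the same value as before, which is consistent with the bit-identical result stated above.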

NAP-GHJ · 2026-07-02T09:06:31Z

When m increases, performance is very poor: at m = 8192, the runtime is 765 µs versus 66 µs, only 0.09x the speed.

zhangxiaolei123456 · 2026-07-02T09:12:35Z

Thanks for the reminder. I'll fix this issue.

zhangxiaolei123456 added 21 commits June 16, 2026 11:16

Add SM90 MXFP8 FP8 grouped kernels

bd42ee2

Fix MXFP8 FP8 per-column B scaling

5e8cd3d

Fix MXFP8 contiguous accuracy test constraints

4365d1d

Add MXFP8 FP8 performance comparison test

5608f23

Stage MXFP8 B scales in shared memory

aaecde6

Fix main-based MXFP8 include list

cb8e933

Fence staged MXFP8 B scales before consume

dcb1017

Support MXFP8 A scales on SM90 grouped kernels

c1125fc

Load SM90 MXFP8 A scales from global memory

405be71

Load SM90 MXFP8 B scales from global memory

c2b8716

Avoid SM90 masked MXFP8 cross-group stores

ffe17ba

Support packed UE8M0 scales in SM90 MXFP8 GEMM

6de214d

Pass explicit recipes to SM90 MXFP8 GEMM

06b2ac5

Fix SM90 masked packed scale group stride

a5ef613

Allow non-contiguous SM90 MXFP8 scales

2b5b958

Test UE8M0 int32 packing byte order

9862891

Fix SM90 MXFP8 contiguous RHS scale group

65871a5

Add SM90 MXFP8 DeepEP scale layout test

203e3b3

Tighten SM90 MXFP8 DeepEP scale layout test

d2032ea

Add SM90 MXFP8 dense raw scale test

664fa78


zhangxiaolei123456 wants to merge 21 commits into deepseek-ai:main from zhangxiaolei123456:feat/sm90-mxfp8-fp8-main
