Feat: sm90 support mxfp8 fp8 groupgemm #362

Open: zhangxiaolei123456 wants to merge 21 commits into deepseek-ai:main from zhangxiaolei123456:feat/sm90-mxfp8-fp8-main

zhangxiaolei123456 commented Jun 16, 2026

Accuracy test:

`python -m pytest tests/test_sm90_mxfp8_fp8.py -q` (OK)

Performance test:

`python -m pytest tests/test_sm90_mxfp8_fp8.py::test_m_grouped_mxfp8_vs_fp8_perf_contiguous_and_masked -q -s`

| kernel | active M | MXFP8 µs | FP8 µs | MXFP8 TFLOPS | FP8 TFLOPS | speedup | diff |
|---|---|---|---|---|---|---|---|
| contiguous | 512 | 23 | 31 | 45.9 | 35.0 | 1.31x | 0.0462 |
| masked | 320 | 22 | 32 | 30.4 | 21.0 | 1.45x | 0.0229 |
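
As a quick sanity check on the numbers above, the reported speedups match the ratio of the two TFLOPS columns. This is an assumption about how the test computes the speedup column, not something stated in the PR. The timings are printed as whole microseconds, so dividing them directly (31/23 ≈ 1.35) gives a coarser figure. A minimal sketch, with `throughput_ratio` as a hypothetical helper name:

```python
# Values copied from the benchmark output above:
# (kernel, MXFP8 TFLOPS, FP8 TFLOPS, reported speedup)
rows = [
    ("contiguous", 45.9, 35.0, 1.31),
    ("masked", 30.4, 21.0, 1.45),
]


def throughput_ratio(mxfp8_tflops: float, fp8_tflops: float) -> float:
    """Speedup of MXFP8 over FP8, assuming it is defined as a throughput ratio."""
    return mxfp8_tflops / fp8_tflops


for kernel, mxfp8, fp8, reported in rows:
    ratio = throughput_ratio(mxfp8, fp8)
    print(f"{kernel}: computed {ratio:.2f}x, reported {reported:.2f}x")
```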

The MXFP8 consumer was issuing a fresh global load for every (kk, accum)
pair, even though the producer warp had already staged SFA/SFB into
shared memory. Replace the per-iteration GMEM lookups with two LDS
loads (one 32-bit pack per SFA row, one 64-bit pack per pair of SFB
rows) outside the wgmma kk loop, and decode the per-kk byte with a
register-side shift. Keeps the result bit-identical while removing the
GMEM dependency from the inner wgmma pipeline.
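
The register-side decode described above can be illustrated on the host. The following is a minimal Python sketch, not the kernel code from this PR. It assumes each scale factor is a single E8M0 exponent byte (value 2^(e - 127), as in the MX formats), that byte `kk` sits at bit offset `8 * kk` of the pack, and that the 64-bit SFB pack holds the first row in its low 32 bits. The names `unpack_scale_byte` and `e8m0_to_float` are hypothetical, and the 0xFF (NaN) encoding is not handled.

```python
def unpack_scale_byte(pack: int, kk: int) -> int:
    """Extract byte `kk` from a packed word with a shift and a mask."""
    return (pack >> (8 * kk)) & 0xFF


def e8m0_to_float(e: int) -> float:
    """Decode an exponent-only scale byte as 2**(e - 127) (assumed encoding)."""
    return 2.0 ** (e - 127)


# One 32-bit pack per SFA row: four scale bytes, one per kk step (assumed layout).
sfa_bytes = [127, 128, 126, 130]
sfa_pack = int.from_bytes(bytes(sfa_bytes), "little")

# One 64-bit pack per pair of SFB rows: low half = row 0, high half = row 1 (assumed).
sfb_row0 = [127, 127, 129, 125]
sfb_row1 = [126, 128, 127, 131]
sfb_pack = int.from_bytes(bytes(sfb_row0 + sfb_row1), "little")

# The packs are read once, before the kk loop; inside the loop only shifts remain.
sfb_lo = sfb_pack & 0xFFFFFFFF
sfb_hi = sfb_pack >> 32
decoded = []
for kk in range(4):
    a = unpack_scale_byte(sfa_pack, kk)
    b0 = unpack_scale_byte(sfb_lo, kk)
    b1 = unpack_scale_byte(sfb_hi, kk)
    decoded.append((a, b0, b1))
    # Combined scales applied to the two accumulators at this kk step.
    scale0 = e8m0_to_float(a) * e8m0_to_float(b0)
    scale1 = e8m0_to_float(a) * e8m0_to_float(b1)
    print(kk, scale0, scale1)
```

Because the shift only selects bytes that were already loaded, the decoded values are the same ones a per-iteration lookup would return, which is consistent with the bit-identical claim.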

NAP-GHJ commented Jul 2, 2026

Performance degrades badly as m grows: at m = 8192 the kernel takes 765 µs versus 66 µs, only 0.09x the speed.

zhangxiaolei123456 (Author) commented

Thanks for the reminder. I'll fix this issue.
