Summary
Commit a56b5206aea55c5464c29bcd25eb4e5a6e4be273 (PR #2434) introduced a severe accuracy regression for the DeepSeek-R1-MXFP4 model with KV cache FP8 on MI35x (GSM8K few-shot completion benchmark).
Details
Test: sglang/test/registered/amd/accuracy/mi35x/test_deepseek_r1_mxfp4_kv_fp8_eval_mi35x.py (GSM8K 200-question, 5-shot, --kv-cache-dtype fp8_e4m3, TP=8)
Threshold: 0.93
| Condition | Accuracy | Notes |
| --- | --- | --- |
| Before the commit | 0.94 – 0.96 | Consistently above threshold |
| At the commit | 0.03 | Catastrophic drop |
| After reverting gfx950-GEMM-AFP4WFP4-N=7168-K=2304.json | 0.905 – 0.93 | Partial recovery (2/3 runs at 0.915, 1/3 at 0.93) |
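To put the accuracy figures in terms of the 200-question test set, the sketch below converts an accuracy into a count of correct answers and estimates the run-to-run noise of a single evaluation. It assumes each question is an independent pass/fail trial (a simplification), and the helper names `correct_count` and `standard_error` are hypothetical, not part of the test suite.

```python
import math

N_QUESTIONS = 200  # GSM8K subset size stated in the test description


def correct_count(accuracy: float, n: int = N_QUESTIONS) -> int:
    """Number of correct answers implied by an accuracy on n questions."""
    return round(accuracy * n)


def standard_error(accuracy: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy measured on n questions.

    Assumption: questions are independent trials with equal success rate.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


if __name__ == "__main__":
    # Accuracies taken from the table above.
    for label, acc in [
        ("threshold", 0.93),
        ("at the commit", 0.03),
        ("after partial revert", 0.915),
    ]:
        print(f"{label}: {correct_count(acc)}/{N_QUESTIONS} correct, "
              f"SE ~ {standard_error(acc):.4f}")
```

Under this assumption the 0.93 threshold corresponds to 186 of 200 correct answers, and a single run near 0.95 carries a standard error of roughly 1.5 percentage points, which is why comparing several runs per condition is more informative than comparing one.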
Reproduction
Pull rocm/sgl-dev:v0.5.10rc0-rocm700-mi35x-20260406, reinstall AITER in /sgl-workspace/aiter, then run:
export SGLANG_AMD_CI=1
export SGLANG_IS_IN_CI=1
export SGLANG_IS_IN_CI_AMD=1
export SGLANG_USE_AITER=1
python3 -m pytest \
test/registered/amd/accuracy/mi35x/test_deepseek_r1_mxfp4_kv_fp8_eval_mi35x.py \
-v -s
Root Cause Analysis
Thanks to @1am9trash for pointing out this clue and to @yctseng0211 for bisecting the aiter commits.
Reverting the Triton GEMM config for shape N=7168, K=2304 recovers most of the accuracy:
git checkout a56b520^ -- "aiter/ops/triton/configs/gemm/gfx950-GEMM-AFP4WFP4-N=7168-K=2304.json"
However, accuracy still falls short of the expected 0.94–0.96 range even after this revert, suggesting that other config changes in this commit may also be contributing to the regression.
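One way to narrow down the remaining gap is to list every GEMM config file the commit touched and revert them one at a time. The following is a minimal sketch, assuming the AITER checkout lives at /sgl-workspace/aiter and that the relevant configs sit under the same directory as the file reverted above. The function names are hypothetical helpers, not part of AITER.

```python
import subprocess

# Directory and commit taken from the revert command above.
CONFIG_DIR = "aiter/ops/triton/configs/gemm"
COMMIT = "a56b520"


def changed_configs(diff_output: str, prefix: str = CONFIG_DIR) -> list[str]:
    """Filter `git diff --name-only` output down to JSON configs under prefix."""
    paths = [line.strip() for line in diff_output.splitlines() if line.strip()]
    return sorted(
        p for p in paths if p.startswith(prefix + "/") and p.endswith(".json")
    )


def revert_command(path: str, commit: str = COMMIT) -> str:
    """Build the shell command that restores one file to its pre-commit state."""
    return f'git checkout {commit}^ -- "{path}"'


def list_changed_configs(
    repo: str = "/sgl-workspace/aiter",  # assumed checkout location
    commit: str = COMMIT,
) -> list[str]:
    """Ask git which config files differ between the commit and its parent."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", f"{commit}^", commit],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return changed_configs(out)
```

Calling `list_changed_configs()` inside the container and printing `revert_command(p)` for each returned path gives one candidate revert per file. Rerunning the accuracy test after each revert would show which configs, if any, account for the remaining gap.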
Expected Behavior
Accuracy on GSM8K should remain in the 0.94–0.96 range, consistently above the 0.93 threshold.
Environment
- GPU: AMD MI35x (8x)
- Model: amd/DeepSeek-R1-MXFP4-Preview
- Attention backend: aiter
- KV cache dtype: fp8_e4m3