Accuracy regression in DeepSeek-R1-MXFP4 (KV FP8) on MI35x after commit a56b520 in SGLang #2656

## Summary

Commit `a56b5206aea55c5464c29bcd25eb4e5a6e4be273` (PR https://github.com/ROCm/aiter/pull/2434) introduced a severe accuracy regression for the DeepSeek-R1-MXFP4 model with an FP8 KV cache on MI35x. On the GSM8K few-shot completion benchmark, accuracy drops from 0.94–0.96 to 0.03.

## Details

- **Test:** `sglang/test/registered/amd/accuracy/mi35x/test_deepseek_r1_mxfp4_kv_fp8_eval_mi35x.py` (GSM8K, 200 questions, 5-shot, `--kv-cache-dtype fp8_e4m3`, TP=8)
- **Threshold:** 0.93

| Condition | Accuracy | Notes |
|---|---|---|
| Before the commit | 0.94 – 0.96 | Consistently above threshold |
| At the commit | **0.03** | Catastrophic drop |
| After reverting `gfx950-GEMM-AFP4WFP4-N=7168-K=2304.json` | 0.905 – 0.93 | Partial recovery (2/3 runs at 0.915, 1/3 at 0.93) |
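
To put the table in terms of individual questions, the sketch below converts each reported accuracy into a count of correct answers. It assumes accuracy is simply correct answers divided by the 200 questions in the test, so each question is worth 0.005; `correct_count` is a hypothetical helper, not part of the test suite.

```bash
#!/usr/bin/env bash
# Convert a reported accuracy into a count of correct answers.
# Assumption: accuracy = correct / TOTAL, with TOTAL = 200 questions.
TOTAL=200

correct_count() {
  # Add 0.5 before truncating so floating-point noise cannot round down.
  awk -v acc="$1" -v total="$TOTAL" 'BEGIN { printf "%d\n", acc * total + 0.5 }'
}

for acc in 0.03 0.915 0.93 0.94; do
  echo "accuracy $acc -> $(correct_count "$acc") / $TOTAL correct"
done
```

Under that assumption, 0.03 corresponds to 6 correct answers, 0.915 to 183, and the 0.93 threshold to 186, so the partially recovered runs at 0.915 sit three questions below the threshold.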

## Reproduction

Pull the `rocm/sgl-dev:v0.5.10rc0-rocm700-mi35x-20260406` image, reinstall AITER in `/sgl-workspace/aiter`, then run:

```bash
# Run from the SGLang repository root so the relative test path resolves.
export SGLANG_AMD_CI=1
export SGLANG_IS_IN_CI=1
export SGLANG_IS_IN_CI_AMD=1
export SGLANG_USE_AITER=1
python3 -m pytest \
  test/registered/amd/accuracy/mi35x/test_deepseek_r1_mxfp4_kv_fp8_eval_mi35x.py \
  -v -s
```

## Root Cause Analysis

Thanks to @1am9trash for pointing out this clue and to @yctseng0211 for bisecting the AITER commits.

Reverting the Triton GEMM config for shape N=7168, K=2304 recovers most of the accuracy:

```bash
git checkout a56b520^ -- "aiter/ops/triton/configs/gemm/gfx950-GEMM-AFP4WFP4-N=7168-K=2304.json"
```

However, accuracy still falls short of the expected 0.94–0.96 range even after this revert, suggesting that other config changes in this commit may also be contributing to the regression.
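
One way to narrow down the remaining contributors is to revert the other config files touched by the commit one at a time. The helpers below are a hypothetical sketch of that sweep, not an existing AITER script: the function names are illustrative, and the default `CONFIG_DIR` is taken from the path in the revert command above. They are meant to be run from the AITER checkout.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a per-file revert sweep; run from the AITER checkout.
COMMIT="${COMMIT:-a56b520}"
CONFIG_DIR="${CONFIG_DIR:-aiter/ops/triton/configs/gemm}"

# List config files that COMMIT modified. Files the commit added are skipped
# (--diff-filter=M) because they have no parent version to restore.
changed_configs() {
  git diff --name-only --diff-filter=M "${COMMIT}^" "${COMMIT}" -- "${CONFIG_DIR}"
}

# Restore one file to its pre-commit content, leaving everything else untouched.
revert_one_config() {
  git checkout "${COMMIT}^" -- "$1"
}

# Put the file back to its state at COMMIT so the next candidate is tested alone.
restore_config() {
  git checkout "${COMMIT}" -- "$1"
}
```

A sweep would call `revert_one_config` on each path printed by `changed_configs`, rerun the accuracy test, and then call `restore_config` before moving to the next file. Since the single revert already recovers most of the accuracy, it may be more informative to keep that file reverted and sweep the others on top of it.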

## Expected Behavior

Accuracy on GSM8K should remain in the 0.94–0.96 range, consistently above the 0.93 threshold.

## Environment

- **GPU:** AMD MI35x (8x)
- **Model:** `amd/DeepSeek-R1-MXFP4-Preview`
- **Attention backend:** aiter
- **KV cache dtype:** fp8_e4m3
