[gfx1250][gemm] Add PTPC FP8/A8W4, non-tile-aligned M, and strided A/C support by aoli26 · Pull Request #649 · ROCm/FlyDSL

aoli26 · 2026-06-03T17:00:43Z

Motivation

Add per-token per-channel (PTPC) scaling to the gfx1250 GEMM kernel: per-token sa[M] and per-channel sb[N] scales, constant along K, stored as fp32 and applied once in the epilogue instead of per K-block. Also add non-tile-aligned M and strided A/C so the host can pass an unpadded runtime M and arbitrary A/C leading-dim strides directly, eliminating per-call padding allocation and memcpy.
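Because sa and sb are constant along K, they factor out of the K reduction, which is why a single epilogue multiply is enough. The following is a minimal reference sketch in plain Python, not the kernel code. The helper names and the N x K layout of B are assumptions made for illustration.

```python
# Reference model only: plain Python lists stand in for device tensors.

def gemm_unscaled(a, b):
    """C[m][n] = sum_k a[m][k] * b[n][k]; B is stored as N x K (an assumption)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def ptpc_epilogue(acc, sa, sb):
    """Apply per-token sa[m] and per-channel sb[n] once, after the K reduction."""
    return [[acc[m][n] * (sa[m] * sb[n]) for n in range(len(sb))]
            for m in range(len(sa))]


def gemm_prescaled(a, b, sa, sb):
    """Scale every element first, then multiply (the more expensive reference)."""
    a_deq = [[x * sa[m] for x in row] for m, row in enumerate(a)]
    b_deq = [[y * sb[n] for y in row] for n, row in enumerate(b)]
    return gemm_unscaled(a_deq, b_deq)


a = [[1.0, 2.0, -1.0], [0.5, -2.0, 4.0]]   # M=2, K=3
b = [[2.0, 0.0, 1.0], [-1.0, 3.0, 0.5]]    # N=2, K=3
sa = [0.5, 2.0]                            # one scale per row (token) of A
sb = [4.0, 0.25]                           # one scale per output channel

epilogue = ptpc_epilogue(gemm_unscaled(a, b), sa, sb)
reference = gemm_prescaled(a, b, sa, sb)
```

Both paths produce the same matrix here, while the epilogue path performs M*N scale multiplies instead of one per K-block.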

Technical Details

PTPC FP8 runs the unscaled WMMA in the K-loop; A8W4 uses the scaled f8f6f4 op with an identity scale. sa*sb is applied in fp32 in the epilogue (split-K supported via per-chunk scale + atomic add).
All changes are compile-time gated to PTPC, so the mxscale path is untouched. PTPC also skips scale TDM/LDS and prefetches epilogue scale loads behind the last WMMAs.
M-OOB: A/A-scale loads skip rows ≥ M via the TDM descriptor and the output clip is auto-selected; make_tensor_descriptor_2d gains an oob_outer_bound parameter.
Strided A/C: M is no longer a compile-time arg; launch_fn now takes runtime lda/ldc leading-dim strides (dense callers pass lda == K, ldc == N). make_tensor_descriptor_2d accepts a runtime i32/index outer stride. Aligned/dense callers are byte-identical to before.
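The combined effect of runtime lda/ldc strides and the M clip can be modeled on flat row-major buffers. This is a host-side sketch only: `load_a_tile` and `store_c_tile` are hypothetical helpers, not FlyDSL APIs, and zero-filling the skipped rows is an assumption of the sketch rather than a statement about the hardware.

```python
def load_a_tile(a_flat, m, k, lda, row0, tile_m):
    """Read a tile_m x k tile from a row-major buffer whose row stride is lda.

    Rows >= m are not read; the sketch fills them with zeros (assumption).
    """
    tile = []
    for r in range(row0, row0 + tile_m):
        if r < m:
            start = r * lda
            tile.append(a_flat[start:start + k])
        else:
            tile.append([0.0] * k)
    return tile


def store_c_tile(c_flat, tile, m, n, ldc, row0):
    """Write a tile into C with row stride ldc, dropping rows >= m."""
    for i, row in enumerate(tile):
        r = row0 + i
        if r < m:  # output clip: the partial last M-tile writes only valid rows
            start = r * ldc
            c_flat[start:start + n] = row[:n]


M, K, N = 3, 2, 2                # M is not a multiple of TILE_M
LDA, LDC, TILE_M = 4, 3, 2       # strided: lda > K and ldc > N
a_flat = [1.0, 2.0, 9.0, 9.0,  3.0, 4.0, 9.0, 9.0,  5.0, 6.0, 9.0, 9.0]
c_flat = [-1.0] * (M * LDC)      # -1.0 marks bytes the store must not touch

tiles = []
for row0 in range(0, M, TILE_M):
    tile = load_a_tile(a_flat, M, K, LDA, row0, TILE_M)
    tiles.append(tile)
    store_c_tile(c_flat, tile, M, N, LDC, row0)  # identity "kernel" for brevity
```

A dense caller corresponds to `LDA == K` and `LDC == N`, in which case the stride padding (the 9.0 and -1.0 entries) disappears and the addressing reduces to the usual contiguous layout.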

Test Plan

pytest tests/kernels/test_gemm_fp8fp4_gfx1250.py -k 'ptpc or mpad or strided', plus ISA inspection of the PTPC kernels.

Test Result

All PTPC, M-pad, and strided A/C tests pass. ISA confirms scale TDM removal and epilogue prefetch with lower VGPR count and 0 spill.

Submission Checklist

I have reviewed the contributing guidelines.

- kernel: m_oob_clip + m_oob_store {buffer, tdm_tail}. A/A-scale load clip via TDM tensor_dim1, C-store clips via buffer num_records, split-K via per-lane (row < M) predicate on the atomic path.
- tdm_ops: make_tensor_descriptor_2d gains oob_outer_bound. It sets only tensor_dim1 (HW OOB field); tile_dim1 stays the full per-warp tile. Accepts int|index|i32, raises otherwise. None keeps the original (byte-identical) path.
- tests: M-pad coverage (M=16..1000 x buffer/tdm_tail x bf16/f32 + split-K).
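The split-K path with the per-lane row < M predicate can be sketched as a sequential reference model. The function name is hypothetical, `+=` stands in for the atomic add, and the sketch assumes split_k divides K and that B is stored as N x K.

```python
def split_k_accumulate(a, b, sa, sb, m_valid, split_k):
    """Accumulate per-chunk scaled partial sums into an m_valid x N output.

    a holds a full padded tile (len(a) >= m_valid); sa has only m_valid entries.
    """
    k_total = len(b[0])
    chunk = k_total // split_k          # assumes split_k divides K
    n = len(b)
    c = [[0.0] * n for _ in range(m_valid)]
    for s in range(split_k):            # each K chunk runs independently
        k0, k1 = s * chunk, (s + 1) * chunk
        for row in range(len(a)):       # lanes cover the whole padded tile
            if row >= m_valid:          # predicate: padded rows skip the add
                continue
            for col in range(n):
                partial = sum(a[row][k] * b[col][k] for k in range(k0, k1))
                # Scale this chunk's partial sum, then accumulate (atomic add).
                c[row][col] += partial * (sa[row] * sb[col])
    return c


a = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0],
     [2.0, 2.0, 2.0, 2.0],
     [7.0, 7.0, 7.0, 7.0]]             # last row is padding beyond M=3
b = [[1.0, 0.0, 1.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]             # N=2, K=4
sa = [0.5, 1.0, 2.0]
sb = [2.0, 4.0]

c_split = split_k_accumulate(a, b, sa, sb, m_valid=3, split_k=2)
c_single = split_k_accumulate(a, b, sa, sb, m_valid=3, split_k=1)
```

Scaling each chunk before accumulation gives the same result as scaling the full sum because the scale is constant along K; the predicate also keeps the padded row from indexing past the end of sa.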

Remove the m_oob_store parameter from compile_fp8fp4_gemm / compile_ptpc_gemm and pick the non-aligned-M output clip internally:

- tdm_tail when use_tdm_store and split_k == 1 (full tiles keep the fast TDM store; the at most one partial last M-tile falls back to buffer num_records)
- buffer otherwise (whole-output num_records clip; split_k > 1 uses the per-lane row < M atomic predicate)

A whole-output buffer clip regressed aligned production prefill by +15%..+82%, while tdm_tail stays within ~2% of the no-clip path, so a static buffer default was wrong. The choice is fully derivable from use_tdm_store/split_k, so cache_tag drops m_oob_store too (no collision).

Tests: the mxscale mpad test now parametrizes use_tdm_store to cover both auto branches (tdm_tail / buffer); the atomic branch stays covered by the split-k mpad test.
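The selection rule described in this commit can be written as a small pure function. This is a sketch of the stated rule only: `select_m_oob_store` is a hypothetical name, and the split_k validation is an added assumption, not something the commit message specifies.

```python
def select_m_oob_store(use_tdm_store: bool, split_k: int) -> str:
    """Pick the output clip mode for non-tile-aligned M.

    Returns "tdm_tail" when the TDM store is enabled and there is no split-K,
    and "buffer" in every other case.
    """
    if split_k < 1:
        raise ValueError("split_k must be >= 1")  # assumption: guard for the sketch
    if use_tdm_store and split_k == 1:
        return "tdm_tail"
    return "buffer"


# Every combination maps to exactly one mode, so the mode adds no information
# beyond (use_tdm_store, split_k) and need not appear in a cache key.
modes = {
    (use_tdm, sk): select_m_oob_store(use_tdm, sk)
    for use_tdm in (False, True)
    for sk in (1, 2, 4)
}
```

Since the result is a function of the two existing compile options, two kernels that differ in clip mode already differ in one of those options, which matches the commit's note that dropping m_oob_store from cache_tag causes no collision.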

Copilot

Pull request overview

This PR extends the gfx1250 FP8/FP4 GEMM implementation to support (1) a new PTPC scaling mode (per-token sa[M] and per-channel sb[N], applied in the epilogue) and (2) non-tile-aligned runtime M handling (avoiding host-side padding/copies). It also adds a new ROCDL WMMA wrapper and substantially expands correctness coverage for PTPC and M-tail behavior.

Changes:

Add scale_mode="ptpc" support to the unified gfx1250 GEMM kernel, with fp32 sa/sb loads and epilogue scaling (including split-K atomic path).
Add non-tile-aligned M (“M-OOB”) support via TDM descriptor bounds and output clipping (TDM store for full tiles, buffer/atomic for tails).
Expand tests/benchmark CLI to cover PTPC (FP8 + A8W4), split-K, and a wide range of M-tail cases; add a no-scale WMMA wrapper op.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| `kernels/gemm_fp8fp4_gfx1250.py` | Implements PTPC scaling mode and M-OOB handling in the kernel (TDM descriptors, epilogue scaling, store/atomic clipping). |
| `python/flydsl/expr/rocdl/tdm_ops.py` | Extends `make_tensor_descriptor_2d` with `oob_outer_bound` to support runtime outer-dim OOB-safe TDM loads. |
| `python/flydsl/expr/rocdl/__init__.py` | Adds ROCDL builder wrapper for `wmma_f32_16x16x128_fp8_fp8`. |
| `python/flydsl/expr/rocdl.py` | Adds the same ROCDL wrapper and exports it via `__all__`. |
| `tests/kernels/test_gemm_fp8fp4_gfx1250.py` | Adds PTPC correctness tests, M-tail test matrix, and benchmark/CLI support for PTPC scale mode. |



aoli26 force-pushed the gfx1250/gemm_ptpc branch 2 times, most recently from 5fea303 to 49a21d4, June 6, 2026 03:05

aoli26 changed the title ~~[gfx1250][gemm] Add PTPC FP8/A8W4 support~~ [gfx1250][gemm] Add PTPC FP8/A8W4 and non-tile-aligned M support Jun 6, 2026

Base automatically changed from gfx1250/gemm_fp8_opt to main June 6, 2026 14:15

aoli26 added 7 commits June 6, 2026 15:32

feat: add ptpc fp8, a8w4 gemm

74e38a8

optimize ptpc epilogue vgpr prefetch

9399040

ptpc use no-scale wmma for compatibility

8928d59

mxscale/ptpc a8w4 use latest fp8 scheduler

ef8a9ef

Remove m_oob_clip flag: non-tile-aligned M is now the default GEMM path

542641d

aoli26 force-pushed the gfx1250/gemm_ptpc branch from 1b5e5a5 to 542641d, June 6, 2026 15:35

aoli26 marked this pull request as ready for review June 7, 2026 03:21

Copilot AI review requested due to automatic review settings June 7, 2026 03:21

Copilot started reviewing on behalf of aoli26 June 7, 2026 03:21

aoli26 requested a review from coderfeli June 7, 2026 03:22

Copilot AI reviewed Jun 7, 2026


Comment thread kernels/gemm_fp8fp4_gfx1250.py Outdated

Comment thread kernels/gemm_fp8fp4_gfx1250.py Outdated

aoli26 added 2 commits June 7, 2026 04:29

ptpc: set scale buffer num_records from runtime M/N to keep OOB clipping

be59f65

gemm_fp8fp4_gfx1250: add runtime lda/ldc strides for strided A/C; drop compile-time M

ad297fb

aoli26 changed the title ~~[gfx1250][gemm] Add PTPC FP8/A8W4 and non-tile-aligned M support~~ [gfx1250][gemm] Add PTPC FP8/A8W4, non-tile-aligned M, and strided A/C support Jun 8, 2026

aoli26 wants to merge 9 commits into main from gfx1250/gemm_ptpc
