fix(run_benchmark): report base norm bandwidth, not the last variant (layernorm 1.69 → 5.6 TB/s) by jhinpan · Pull Request #654 · ROCm/FlyDSL

jhinpan · 2026-06-04T15:58:05Z

Fixes #655

What

scripts/run_benchmark.sh has reported layernorm at ~1.69 TB/s on the MI355 runner since #549, while softmax/rmsnorm sit at ~5.8 TB/s for the same 32768x8192 bf16 shape. This is a benchmark-parser bug, not a kernel regression — the real base layernorm runs at ~5.6 TB/s.

Root cause

The norm-style branch of _py_parse_and_emit keeps the last Bandwidth: match:

for m_bw in re.finditer(r"Bandwidth:\s*([0-9.]+)\s*GB/s", txt):
    pass

#549 reworked test_layernorm.py so its __main__ runs six benchmarks in sequence — base layernorm, then fused_add / dynamicquant / smoothquant / fused_add_dynamicquant / fused_add_smoothquant — each printing its own Bandwidth: line. The last is the fully-scalar fused_add_smoothquant path, so the parser reported that as "layernorm". softmax/rmsnorm were spared only because their __main__ runs a single base test (first == last).

Running the exact CI command on MI350X / gfx950 (32768x8192 bf16) prints all six:

LayerNorm (base, 128b vectorized) ........ 5617 GB/s   <- the real number
FusedAdd LayerNorm ....................... 2982 GB/s
LayerNorm DynamicQuant ................... 1852 GB/s
LayerNorm SmoothQuant .................... 1608 GB/s
FusedAdd DynamicQuant .................... 1768 GB/s
FusedAdd SmoothQuant ..................... 1660 GB/s   <- parser kept this one
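To make the failure mode concrete, here is a minimal sketch of the loop-and-`pass` idiom applied to six bandwidth lines. Only the regex comes from the script. The `SAMPLE` text, its label lines, and the `last_bandwidth` helper are hypothetical stand-ins for the real `test_layernorm.py` output and parser code.

```python
import re

# Regex quoted from the norm-style branch of _py_parse_and_emit.
PATTERN = r"Bandwidth:\s*([0-9.]+)\s*GB/s"

# Hypothetical stdout: the surrounding label lines are assumed; only the
# "Bandwidth: <number> GB/s" shape is implied by the regex above.
SAMPLE = """\
layernorm base
Bandwidth: 5617 GB/s
fused_add
Bandwidth: 2982 GB/s
dynamicquant
Bandwidth: 1852 GB/s
smoothquant
Bandwidth: 1608 GB/s
fused_add_dynamicquant
Bandwidth: 1768 GB/s
fused_add_smoothquant
Bandwidth: 1660 GB/s
"""


def last_bandwidth(txt):
    """Mimic the old behavior: iterate over every match and keep the last."""
    m_bw = None
    for m_bw in re.finditer(PATTERN, txt):
        pass  # the loop variable ends up bound to the final match
    return float(m_bw.group(1)) if m_bw else None


print(last_bandwidth(SAMPLE))  # 1660.0, the fused_add_smoothquant figure
```

With a single `Bandwidth:` line in the output, the same loop returns the only match, which is why softmax and rmsnorm were never affected.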

The per-commit CI benchmark history shows the step lands exactly on #549 (5.5 → 1.69 TB/s), with softmax/rmsnorm flat across the same range. The kernel itself never changed: git diff on kernels/layernorm_kernel.py at #549 is @@ -314,3 +314,607 @@ — purely appended quant/fused builders; build_layernorm_module is byte-identical.

Fix

Take the first Bandwidth: match. The base op is always benchmarked first; any later lines are fused/quant variants.
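A first-match parser could look roughly like the sketch below. It is not the actual diff: the function name `first_bandwidth_tbps` is hypothetical, and the GB/s to TB/s conversion (divide by 1000) is an assumption inferred from the figures in this PR (5617 GB/s reported as about 5.6 TB/s).

```python
import re
from typing import Optional

# Same regex as the norm-style branch of _py_parse_and_emit.
PATTERN = re.compile(r"Bandwidth:\s*([0-9.]+)\s*GB/s")


def first_bandwidth_tbps(txt: str) -> Optional[float]:
    """Return the FIRST reported bandwidth in TB/s, or None if absent.

    Assumes the base op is benchmarked first, so any later
    "Bandwidth:" lines belong to fused/quant variants.
    """
    m = PATTERN.search(txt)  # re.search stops at the first match
    if m is None:
        return None
    return float(m.group(1)) / 1000.0  # assumed GB/s -> TB/s conversion


if __name__ == "__main__":
    out = "Bandwidth: 5617 GB/s\nBandwidth: 1660 GB/s\n"
    print(first_bandwidth_tbps(out))
```

The trade-off is that this approach relies on ordering: if a test ever printed a variant before the base op, the first match would be wrong in the same way the last match was.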

Verification — MI350X / gfx950, `32768x8192` bf16

| op | before (TB/s) | after (TB/s) |
|---|---|---|
| softmax | 5.69 | 5.69 |
| layernorm | 1.69 | 5.56 |
| rmsnorm | 5.85 | 5.85 |

Failed: 0. layernorm now reports its base fast-path bandwidth; softmax/rmsnorm unchanged.

Note: the current-vs-main benchmark gate could not catch this — main was already mislabeled, so it only ever compared 1.69 against 1.69. The next main baseline self-corrects after merge (one-time current≫main "improvement", not a regression).
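The blind spot can be illustrated with a toy relative gate. This is a sketch only: the `gate_fails` helper and its 5% tolerance are hypothetical and are not taken from the real CI configuration.

```python
def gate_fails(current_tbps: float, main_tbps: float, tolerance: float = 0.05) -> bool:
    """Hypothetical gate: fail when current is more than `tolerance` slower than main."""
    return current_tbps < main_tbps * (1.0 - tolerance)


# Both sides parsed with the buggy last-match logic: nothing to flag.
print(gate_fails(1.69, 1.69))  # False

# First run after the fix, compared against the stale main baseline:
# reads as a large improvement, so the gate still passes.
print(gate_fails(5.56, 1.69))  # False

# Only a drop relative to a correct baseline would trip the gate.
print(gate_fails(1.69, 5.56))  # True
```

A gate of this shape only detects changes relative to the baseline, so a value that is wrong on both sides is invisible to it.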

🤖 Generated with Claude Code

The norm-style parser in _py_parse_and_emit kept the LAST "Bandwidth:" line (`for m_bw in re.finditer(...): pass`). Since ROCm#549 made test_layernorm.py run six variants from __main__ (base, fused_add, dynamicquant, smoothquant, fused_add_dynamicquant, fused_add_smoothquant), each printing its own "Bandwidth:" line, the parser reported the slow scalar fused_add_smoothquant path (~1.69 TB/s) as "layernorm" instead of the base fast 128b path (~5.6 TB/s). softmax/rmsnorm were unaffected only because their __main__ runs a single base test (first == last).

Take the FIRST match instead: the base op is always benchmarked first; any later "Bandwidth:" lines are fused/quant variants.

Verified on MI350X / gfx950 (32768x8192 bf16): layernorm 1.69 -> 5.56 TB/s; softmax 5.69 and rmsnorm 5.85 unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the benchmark output parser to correctly attribute bandwidth/TB/s for softmax/norm-style tests by selecting the first reported Bandwidth: match (the base op), avoiding later matches from fused/quant variants.

Changes:

- Adjusted bandwidth parsing to take the first Bandwidth: regex match instead of the last.
- Added clarifying comments explaining why the first match is the correct one for these tests.


Copilot AI review requested due to automatic review settings June 4, 2026 15:58

Copilot AI reviewed Jun 4, 2026


This was referenced Jun 4, 2026

run_benchmark mislabels layernorm at ~1.69 TB/s (parser keeps last Bandwidth = scalar smoothquant variant); base layernorm is ~5.6 TB/s #655

Open

📋 FlyDSL upstream tracker — jhinpan issues & PRs jhinpan/flydsl-kernel-profiling#7

Open

jhinpan wants to merge 1 commit into ROCm:main from jhinpan:fix/run_benchmark-norm-bandwidth-first-match
