fix(run_benchmark): report base norm bandwidth, not the last variant (layernorm 1.69 → 5.6 TB/s)#654

Open
jhinpan wants to merge 1 commit into ROCm:main from jhinpan:fix/run_benchmark-norm-bandwidth-first-match

@jhinpan jhinpan commented Jun 4, 2026

Fixes #655

What

scripts/run_benchmark.sh has reported layernorm at ~1.69 TB/s on the MI355 runner since #549, while softmax/rmsnorm sit at ~5.8 TB/s for the same 32768x8192 bf16 shape. This is a benchmark-parser bug, not a kernel regression — the real base layernorm runs at ~5.6 TB/s.

Root cause

The norm-style branch of _py_parse_and_emit keeps the last Bandwidth: match:

for m_bw in re.finditer(r"Bandwidth:\s*([0-9.]+)\s*GB/s", txt):
    pass

#549 reworked test_layernorm.py so its __main__ runs six benchmarks in sequence — base layernorm, then fused_add / dynamicquant / smoothquant / fused_add_dynamicquant / fused_add_smoothquant — each printing its own Bandwidth: line. The last is the fully-scalar fused_add_smoothquant path, so the parser reported that as "layernorm". softmax/rmsnorm were spared only because their __main__ runs a single base test (first == last).
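The effect is easy to reproduce in isolation. The following is a minimal sketch, not the actual parser: only the regex and the `for ... pass` idiom come from `_py_parse_and_emit`, while `SAMPLE`, its label lines, and the helper name `last_bandwidth` are hypothetical stand-ins for the real benchmark output.

```python
import re

# Hypothetical stand-in for the benchmark stdout; the real layout may differ.
SAMPLE = """\
layernorm base
Bandwidth: 5617.0 GB/s
layernorm fused_add_smoothquant
Bandwidth: 1660.0 GB/s
"""

PATTERN = r"Bandwidth:\s*([0-9.]+)\s*GB/s"


def last_bandwidth(txt):
    """Mimic the old behaviour: walk every match and keep whichever came last."""
    m_bw = None
    for m_bw in re.finditer(PATTERN, txt):
        pass  # the loop variable survives the loop, holding the final match
    return float(m_bw.group(1)) if m_bw else None


print(last_bandwidth(SAMPLE))  # 1660.0, the fused variant rather than the base op
```

With a single `Bandwidth:` line the first and last match coincide, which is why the same idiom was harmless for softmax and rmsnorm.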

Running the exact CI command on MI350X / gfx950 (32768x8192 bf16) prints all six:

LayerNorm (base, 128b vectorized) ........ 5617 GB/s   <- the real number
FusedAdd LayerNorm ....................... 2982 GB/s
LayerNorm DynamicQuant ................... 1852 GB/s
LayerNorm SmoothQuant .................... 1608 GB/s
FusedAdd DynamicQuant .................... 1768 GB/s
FusedAdd SmoothQuant ..................... 1660 GB/s   <- parser kept this one

The per-commit CI benchmark history shows the step lands exactly on #549 (5.5 → 1.69 TB/s), with softmax/rmsnorm flat across the same range. The kernel itself never changed: git diff on kernels/layernorm_kernel.py at #549 is @@ -314,3 +314,607 @@ — purely appended quant/fused builders; build_layernorm_module is byte-identical.

Fix

Take the first Bandwidth: match. The base op is always benchmarked first; any later lines are fused/quant variants.
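One way to express "take the first match" is `re.search`, which stops at the first hit. This is a sketch under assumptions, not the patched code itself: the helper name `first_bandwidth_gbs` and the sample text are hypothetical, and the actual change in `scripts/run_benchmark.sh` may be written differently.

```python
import re

PATTERN = re.compile(r"Bandwidth:\s*([0-9.]+)\s*GB/s")


def first_bandwidth_gbs(txt):
    """Return the first reported bandwidth in GB/s, or None if no line matches."""
    m = PATTERN.search(txt)  # search() returns the earliest match only
    return float(m.group(1)) if m else None


# Hypothetical output: base op first, fused/quant variant afterwards.
sample = "Bandwidth: 5617.0 GB/s\nBandwidth: 1660.0 GB/s\n"
print(first_bandwidth_gbs(sample))  # 5617.0
```

This relies on the ordering stated above: if a test ever printed a variant before the base op, the first match would be wrong in the same way the last match was.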

Verification — MI350X / gfx950, 32768x8192 bf16

| op | before (TB/s) | after (TB/s) |
|---|---|---|
| softmax | 5.69 | 5.69 |
| layernorm | 1.69 | 5.56 |
| rmsnorm | 5.85 | 5.85 |

Failed: 0. layernorm now reports its base fast-path bandwidth; softmax/rmsnorm unchanged.

Note: the current-vs-main benchmark gate could not catch this — main was already mislabeled, so it only ever compared 1.69 against 1.69. The next main baseline self-corrects after merge (one-time current≫main "improvement", not a regression).
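To illustrate why a relative gate is blind to this kind of bug, here is a minimal sketch. The function `is_regression` and its 5% tolerance are hypothetical, since the real gate logic is not shown in this PR; only the bandwidth figures come from the table above.

```python
def is_regression(current_tbs, main_tbs, tolerance=0.05):
    """Flag a regression when current is more than `tolerance` below main."""
    return current_tbs < main_tbs * (1.0 - tolerance)


# Before the fix: both sides carry the same mislabeled number, so nothing trips.
print(is_regression(1.69, 1.69))  # False

# First run after the fix: current is far above the stale main baseline.
print(is_regression(5.56, 1.69))  # False
```

A gate of this shape only detects changes relative to the baseline, so a value that is wrong on both sides passes silently.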

🤖 Generated with Claude Code

The norm-style parser in _py_parse_and_emit kept the LAST "Bandwidth:"
line (`for m_bw in re.finditer(...): pass`). Since ROCm#549 made
test_layernorm.py run six variants from __main__ (base, fused_add,
dynamicquant, smoothquant, fused_add_dynamicquant, fused_add_smoothquant),
each printing its own "Bandwidth:" line, the parser reported the slow
scalar fused_add_smoothquant path (~1.69 TB/s) as "layernorm" instead of
the base fast 128b path (~5.6 TB/s). softmax/rmsnorm were unaffected only
because their __main__ runs a single base test (first == last).

Take the FIRST match instead: the base op is always benchmarked first;
any later "Bandwidth:" lines are fused/quant variants.

Verified on MI350X / gfx950 (32768x8192 bf16): layernorm 1.69 -> 5.56 TB/s;
softmax 5.69 and rmsnorm 5.85 unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 4, 2026 15:58
Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the benchmark output parser to correctly attribute bandwidth/TB/s for softmax/norm-style tests by selecting the first reported Bandwidth: match (the base op), avoiding later matches from fused/quant variants.

Changes:

  • Adjusted bandwidth parsing to take the first Bandwidth: regex match instead of the last.
  • Added clarifying comments explaining why the first match is the correct one for these tests.


Development

Successfully merging this pull request may close these issues.

run_benchmark mislabels layernorm at ~1.69 TB/s (parser keeps last Bandwidth = scalar smoothquant variant); base layernorm is ~5.6 TB/s