fix(run_benchmark): report base norm bandwidth, not the last variant (layernorm 1.69 → 5.6 TB/s)#654
Open
jhinpan wants to merge 1 commit into
Open
Conversation
The norm-style parser in _py_parse_and_emit kept the LAST "Bandwidth:" line (`for m_bw in re.finditer(...): pass`). Since ROCm#549 made test_layernorm.py run six variants from __main__ (base, fused_add, dynamicquant, smoothquant, fused_add_dynamicquant, fused_add_smoothquant), each printing its own "Bandwidth:" line, the parser reported the slow scalar fused_add_smoothquant path (~1.69 TB/s) as "layernorm" instead of the base fast 128b path (~5.6 TB/s). softmax/rmsnorm were unaffected only because their __main__ runs a single base test (first == last). Take the FIRST match instead: the base op is always benchmarked first; any later "Bandwidth:" lines are fused/quant variants. Verified on MI350X / gfx950 (32768x8192 bf16): layernorm 1.69 -> 5.56 TB/s; softmax 5.69 and rmsnorm 5.85 unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates the benchmark output parser to correctly attribute bandwidth/TB/s for softmax/norm-style tests by selecting the first reported Bandwidth: match (the base op), avoiding later matches from fused/quant variants.
Changes:
- Adjusted bandwidth parsing to take the first
Bandwidth:regex match instead of the last. - Added clarifying comments explaining why the first match is the correct one for these tests.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #655
What
scripts/run_benchmark.shhas reported layernorm at ~1.69 TB/s on the MI355 runner since #549, while softmax/rmsnorm sit at ~5.8 TB/s for the same32768x8192bf16 shape. This is a benchmark-parser bug, not a kernel regression — the real base layernorm runs at ~5.6 TB/s.Root cause
The norm-style branch of
_py_parse_and_emitkeeps the lastBandwidth:match:#549 reworked
test_layernorm.pyso its__main__runs six benchmarks in sequence — base layernorm, thenfused_add/dynamicquant/smoothquant/fused_add_dynamicquant/fused_add_smoothquant— each printing its ownBandwidth:line. The last is the fully-scalarfused_add_smoothquantpath, so the parser reported that as "layernorm".softmax/rmsnormwere spared only because their__main__runs a single base test (first == last).Running the exact CI command on MI350X / gfx950 (
32768x8192bf16) prints all six:The per-commit CI benchmark history shows the step lands exactly on #549 (5.5 → 1.69 TB/s), with softmax/rmsnorm flat across the same range. The kernel itself never changed:
git diffonkernels/layernorm_kernel.pyat #549 is@@ -314,3 +314,607 @@— purely appended quant/fused builders;build_layernorm_moduleis byte-identical.Fix
Take the first
Bandwidth:match. The base op is always benchmarked first; any later lines are fused/quant variants.Verification — MI350X / gfx950,
32768x8192bf16Failed: 0. layernorm now reports its base fast-path bandwidth; softmax/rmsnorm unchanged.🤖 Generated with Claude Code