Raise layer_norm scalar threshold from 256 to 1024#18954
Conversation
**Summary:** Raise `kSmallNThreshold` from 256 to 1024 in the optimized `native_layer_norm` kernel. The previous threshold of 256 (D98795281) was tuned for trackpad models with N=26-144, but newer EMG models use larger normalized dimensions where SIMD vectorization setup/tail overhead still exceeds the benefit.

#### Why 1024

Inspecting two production EMG models reveals that all layer_norm calls have N <= 512, well below where SIMD (`RowwiseMoments` + `vec::map3`) becomes beneficial:

**Laserpointer model** (laserpointer-e985727590-FP32.pte) — 24 layer_norm calls:

- N=324: 20x, N=512: 4x
- All calls have weight=None, bias=None, so the `vec::map3` SIMD normalization path is never used — all fall into the scalar gamma/beta loop anyway
- With threshold=256: all 24 calls take the SIMD path (regression)
- With threshold=1024: all 24 calls take the scalar path (fix)

**Trackpad V4 model** (trackpad-V4-FP32.pte) — 37 layer_norm calls:

- N=32: 6x, N=64: 2x, N=128: 8x, N=256: 16x, N=512: 5x
- With threshold=256: 16 calls at N=32-128 use scalar, but 21 calls at N=256-512 use SIMD (regression)
- With threshold=1024: all 37 calls use the scalar path (fix)

Large-tensor models (LLM, ASR, TTS) with N >= 1024 are unaffected and continue to use the SIMD path.
#### Benchmark results (x86_64 devserver, median latency in µs)

Laserpointer model:

| Threads | Portable | Optimized (N<256) | Optimized (N<1024) | Change |
|---------|----------|-------------------|--------------------|--------------|
| 1       | 3,916    | 4,801             | 3,390              | 1.42x faster |
| 2       | 3,231    | 4,959             | 3,278              | 1.51x faster |
| 4       | 4,127    | 5,375             | 4,085              | 1.32x faster |

Trackpad V4 model:

| Threads | Portable | Optimized (N<256) | Optimized (N<1024) | Change |
|---------|----------|-------------------|--------------------|--------------|
| 1       | 5,742    | 6,146             | 5,866              | 1.05x faster |
| 2       | 6,010    | 6,379             | 6,137              | 1.04x faster |
| 4       | 7,428    | 8,010             | 7,699              | 1.04x faster |

With threshold=256, the optimized kernel was 1.07-1.53x slower than portable for these models. With threshold=1024, the regression is eliminated — optimized is now on par with or faster than portable.

This diff was authored with Claude Code.

Differential Revision: D101050373
@xiaodong705 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101050373.
@claude review
Claude finished @kirklandsign's task in 52s — Review of PR #18954: Raise layer_norm scalar threshold from 256 to 1024
LGTM. This is a well-motivated, minimal, and low-risk change.

**Summary**

The change is a single constant bump:

```diff
- constexpr size_t kSmallNThreshold = 256;
+ constexpr size_t kSmallNThreshold = 1024;
```

This widens the range of normalized dimensions that fall back to the scalar path.

**Analysis**

The code paths are correct. When N is below `kSmallNThreshold` the kernel takes the scalar path; otherwise it takes the SIMD path, exactly as before — only the cutover point moves.

The threshold value is well-justified. The PR description provides concrete benchmark data showing that the SIMD path was a regression (1.07-1.53x slower) for models with N in [32, 512]. The new threshold of 1024 eliminates these regressions while preserving SIMD benefits for large-N models (LLM/ASR/TTS).

No functional risk for large N. Models with N >= 1024 are unaffected — they continue to use the SIMD path exactly as before.

One observation (non-blocking): the SIMD path at lines 105-110 falls back to a scalar gamma/beta loop when weight and bias are absent, so the Laserpointer model's calls (all with weight=None, bias=None) would not have used SIMD normalization in any case.

No concerns with the change.