Raise layer_norm scalar threshold from 256 to 1024 #18954

Open

xiaodong705 wants to merge 1 commit into pytorch:main from xiaodong705:export-D101050373

Conversation

@xiaodong705
Contributor


Summary:
Raise kSmallNThreshold from 256 to 1024 in the optimized native_layer_norm kernel. The previous threshold of 256 (D98795281) was tuned for trackpad models with N=26-144, but newer EMG models use larger normalized dimensions where SIMD vectorization setup/tail overhead still exceeds the benefit.
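
As a rough sketch of the dispatch this threshold controls (illustrative only: the helper names and raw-pointer interface below are made up, and the real kernel in kernels/optimized/cpu/op_native_layer_norm.cpp operates on ExecuTorch tensors):

```cpp
// Illustrative sketch, not the actual ExecuTorch kernel.
#include <cmath>
#include <cstddef>

constexpr size_t kSmallNThreshold = 1024;  // raised from 256 by this PR

// Plain scalar layer_norm over one row of length N (no weight/bias).
void layer_norm_scalar_row(const float* x, float* out, size_t N, float eps) {
  float mean = 0.0f;
  for (size_t i = 0; i < N; ++i) mean += x[i];
  mean /= static_cast<float>(N);
  float var = 0.0f;
  for (size_t i = 0; i < N; ++i) var += (x[i] - mean) * (x[i] - mean);
  var /= static_cast<float>(N);
  const float inv_std = 1.0f / std::sqrt(var + eps);
  for (size_t i = 0; i < N; ++i) out[i] = (x[i] - mean) * inv_std;
}

void layer_norm_dispatch(const float* x, float* out, size_t M, size_t N, float eps) {
  if (N < kSmallNThreshold) {
    // Small rows: SIMD setup/tail overhead outweighs the benefit, so stay scalar.
    for (size_t row = 0; row < M; ++row) {
      layer_norm_scalar_row(x + row * N, out + row * N, N, eps);
    }
    return;
  }
  // Large rows (N >= 1024, e.g. LLM/ASR/TTS): vectorized path; in the real
  // kernel this is RowwiseMoments + vec::map3.
}
```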

#### Why 1024

Inspecting two production EMG models reveals that all layer_norm calls have N <= 512, well below where SIMD (RowwiseMoments + vec::map3) becomes beneficial:

**Laserpointer model** (laserpointer-e985727590-FP32.pte) — 24 layer_norm calls:
- N=324: 20x, N=512: 4x
- All calls have weight=None, bias=None, so the vec::map3 SIMD normalization path is never used — all fall into the scalar gamma/beta loop anyway
- With threshold=256: all 24 calls take the SIMD path (regression)
- With threshold=1024: all 24 calls take the scalar path (fix)

**Trackpad V4 model** (trackpad-V4-FP32.pte) — 37 layer_norm calls:
- N=32: 6x, N=64: 2x, N=128: 8x, N=256: 16x, N=512: 5x
- With threshold=256: 16 calls at N=32-128 use scalar, but 21 calls at N=256-512 use SIMD (regression)
- With threshold=1024: all 37 calls use scalar path (fix)

Large-tensor models (LLM, ASR, TTS) with N >= 1024 are unaffected and continue to use the SIMD path.

#### Benchmark results (x86_64 devserver, median latency in us)

Laserpointer model:

| Threads | Portable | Optimized (N<256) | Optimized (N<1024) | Speedup vs N<256 |
|---------|----------|-------------------|--------------------|--------|
| 1       | 3,916    | 4,801             | 3,390              | 1.42x faster |
| 2       | 3,231    | 4,959             | 3,278              | 1.51x faster |
| 4       | 4,127    | 5,375             | 4,085              | 1.32x faster |

Trackpad V4 model:

| Threads | Portable | Optimized (N<256) | Optimized (N<1024) | Speedup vs N<256 |
|---------|----------|-------------------|--------------------|--------|
| 1       | 5,742    | 6,146             | 5,866              | 1.05x faster |
| 2       | 6,010    | 6,379             | 6,137              | 1.04x faster |
| 4       | 7,428    | 8,010             | 7,699              | 1.04x faster |
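
The speedup column compares the optimized kernel at the two thresholds rather than against portable; as a worked example, the single-threaded Laserpointer row gives

$$\text{speedup} = \frac{4801\ \mu\text{s}}{3390\ \mu\text{s}} \approx 1.42$$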

With threshold=256, the optimized kernel was 1.07-1.53x slower than portable for these models. With threshold=1024, the regression is eliminated — optimized is now on par with or faster than portable.

This diff was authored with Claude Code.

Differential Revision: D101050373
@pytorch-bot

pytorch-bot Bot commented Apr 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18954

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⚠️ 1 Awaiting Approval, 2 Unrelated Failures

As of commit 8604e2d with merge base a489707:

AWAITING APPROVAL - The following workflow needs approval before CI can run:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label Apr 16, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Apr 16, 2026

@xiaodong705 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101050373.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@kirklandsign
Contributor

@claude review

@claude

claude Bot commented Apr 16, 2026

Claude finished @kirklandsign's task in 52s


Review of PR #18954: Raise layer_norm scalar threshold from 256 to 1024


LGTM. This is a well-motivated, minimal, and low-risk change.

Summary

The change is a single constant bump at kernels/optimized/cpu/op_native_layer_norm.cpp:78:

-  constexpr size_t kSmallNThreshold = 256;
+  constexpr size_t kSmallNThreshold = 1024;

This widens the range of normalized dimensions that fall back to the scalar layer_norm_scalar path (from kernels/portable/cpu/util/normalization_ops_util.h) instead of using the SIMD RowwiseMoments + vec::map3 path.

Analysis

The code paths are correct. When N < kSmallNThreshold, the function delegates to layer_norm_scalar (lines 80-90) and returns early. Otherwise, it proceeds to the SIMD path using RowwiseMoments (line 99) and either a scalar gamma/beta loop (lines 105-110, when weight or bias is null) or vec::map3 (lines 112-121, when both are present).
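
As a rough illustration of that branch structure (a schematic following the reviewer's description, not the actual kernel; the function name and the scalar loops are placeholders, and in the real code the moments come from RowwiseMoments and the fused case uses vec::map3):

```cpp
// Schematic of the large-N path described above; names are placeholders.
#include <cstddef>

void large_n_normalize(const float* x, const float* gamma, const float* beta,
                       float* out, size_t N, float mean, float inv_std) {
  // In the real kernel, mean/inv_std come from RowwiseMoments (vectorized).
  const bool gamma_null = (gamma == nullptr);
  const bool beta_null = (beta == nullptr);
  if (gamma_null || beta_null) {
    // Scalar gamma/beta loop: the only SIMD win here is the moments
    // computation, which is why weight=None/bias=None models such as
    // Laserpointer gained nothing from the vectorized path.
    for (size_t i = 0; i < N; ++i) {
      const float g = gamma_null ? 1.0f : gamma[i];
      const float b = beta_null ? 0.0f : beta[i];
      out[i] = (x[i] - mean) * inv_std * g + b;
    }
  } else {
    // Both weight and bias present: the real kernel fuses this with
    // vec::map3 over SIMD lanes; a scalar equivalent is shown here.
    for (size_t i = 0; i < N; ++i) {
      out[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
    }
  }
}
```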

The threshold value is well-justified. The PR description provides concrete benchmark data showing that the SIMD path was a regression (1.07-1.53x slower) for models with N in [32, 512]. The new threshold of 1024 eliminates these regressions while preserving SIMD benefits for large-N models (LLM/ASR/TTS).

No functional risk for large N. Models with N >= 1024 are unaffected — they continue to use the SIMD path exactly as before.

One observation (non-blocking): The SIMD path at lines 105-110 falls back to a scalar gamma/beta loop when gamma_null || beta_null, meaning the only SIMD benefit in that case comes from RowwiseMoments for the mean/variance computation. For the Laserpointer model (where all calls have weight=None, bias=None), this means the SIMD overhead was being paid for moments computation alone, with normalization always being scalar — further justifying the threshold increase.

No concerns with the change.


Labels

CLA Signed, fb-exported, meta-exported
