Raise layer_norm scalar threshold from 256 to 1024 #18954

Open

xiaodong705 wants to merge 1 commit into pytorch:main from xiaodong705:export-D101050373

Conversation

@xiaodong705
Contributor


Summary:
Raise kSmallNThreshold from 256 to 1024 in the optimized native_layer_norm kernel. The previous threshold of 256 (D98795281) was tuned for trackpad models with N=26-144, but newer EMG models use larger normalized dimensions where SIMD vectorization setup/tail overhead still exceeds the benefit.
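
As a rough sketch of the dispatch this threshold controls (illustrative only: the helper names and raw-pointer interface below are made up, and the real kernel in kernels/optimized/cpu/op_native_layer_norm.cpp operates on ExecuTorch tensors):

```cpp
// Illustrative sketch, not the actual ExecuTorch kernel.
#include <cmath>
#include <cstddef>

constexpr size_t kSmallNThreshold = 1024;  // raised from 256 by this PR

// Plain scalar layer_norm over one row of length N (no weight/bias).
void layer_norm_scalar_row(const float* x, float* out, size_t N, float eps) {
  float mean = 0.0f;
  for (size_t i = 0; i < N; ++i) mean += x[i];
  mean /= static_cast<float>(N);
  float var = 0.0f;
  for (size_t i = 0; i < N; ++i) var += (x[i] - mean) * (x[i] - mean);
  var /= static_cast<float>(N);
  const float inv_std = 1.0f / std::sqrt(var + eps);
  for (size_t i = 0; i < N; ++i) out[i] = (x[i] - mean) * inv_std;
}

void layer_norm_dispatch(const float* x, float* out, size_t M, size_t N, float eps) {
  if (N < kSmallNThreshold) {
    // Small rows: SIMD setup/tail overhead outweighs the benefit, so stay scalar.
    for (size_t row = 0; row < M; ++row) {
      layer_norm_scalar_row(x + row * N, out + row * N, N, eps);
    }
    return;
  }
  // Large rows (N >= 1024, e.g. LLM/ASR/TTS): vectorized path; in the real
  // kernel this is RowwiseMoments + vec::map3.
}
```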

#### Why 1024

Inspecting two production EMG models reveals that all layer_norm calls have N <= 512, well below where SIMD (RowwiseMoments + vec::map3) becomes beneficial:

**Laserpointer model** (laserpointer-e985727590-FP32.pte) — 24 layer_norm calls:
- N=324: 20x, N=512: 4x
- All calls have weight=None, bias=None, so the vec::map3 SIMD normalization path is never used — all fall into the scalar gamma/beta loop anyway
- With threshold=256: all 24 calls take the SIMD path (regression)
- With threshold=1024: all 24 calls take the scalar path (fix)

**Trackpad V4 model** (trackpad-V4-FP32.pte) — 37 layer_norm calls:
- N=32: 6x, N=64: 2x, N=128: 8x, N=256: 16x, N=512: 5x
- With threshold=256: 16 calls at N=32-128 use scalar, but 21 calls at N=256-512 use SIMD (regression)
- With threshold=1024: all 37 calls use scalar path (fix)

Large-tensor models (LLM, ASR, TTS) with N >= 1024 are unaffected and continue to use the SIMD path.

#### Benchmark results (x86_64 devserver, median latency in us)

Laserpointer model:

| Threads | Portable | Optimized (N<256) | Optimized (N<1024) | Speedup vs N<256 |
|---------|----------|-------------------|--------------------|--------|
| 1       | 3,916    | 4,801             | 3,390              | 1.42x faster |
| 2       | 3,231    | 4,959             | 3,278              | 1.51x faster |
| 4       | 4,127    | 5,375             | 4,085              | 1.32x faster |

Trackpad V4 model:

| Threads | Portable | Optimized (N<256) | Optimized (N<1024) | Speedup vs N<256 |
|---------|----------|-------------------|--------------------|--------|
| 1       | 5,742    | 6,146             | 5,866              | 1.05x faster |
| 2       | 6,010    | 6,379             | 6,137              | 1.04x faster |
| 4       | 7,428    | 8,010             | 7,699              | 1.04x faster |
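
The speedup column compares the optimized kernel at the two thresholds rather than against portable; as a worked example, the single-threaded Laserpointer row gives

$$\text{speedup} = \frac{4801\ \mu\text{s}}{3390\ \mu\text{s}} \approx 1.42$$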

With threshold=256, the optimized kernel was 1.07-1.53x slower than portable for these models. With threshold=1024, the regression is eliminated — optimized is now on par with or faster than portable.

This diff was authored with Claude Code.

Differential Revision: D101050373
@pytorch-bot

pytorch-bot Bot commented Apr 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18954

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

⚠️ 1 Awaiting Approval, 2 Unrelated Failures

As of commit 8604e2d with merge base a489707:

AWAITING APPROVAL - The following workflow needs approval before CI can run:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label Apr 16, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Apr 16, 2026

@xiaodong705 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101050373.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@kirklandsign
Contributor

@claude review

@claude

claude Bot commented Apr 16, 2026

Claude finished @kirklandsign's task in 52s


Review of PR #18954: Raise layer_norm scalar threshold from 256 to 1024


LGTM. This is a well-motivated, minimal, and low-risk change.

Summary

The change is a single constant bump at kernels/optimized/cpu/op_native_layer_norm.cpp:78:

-  constexpr size_t kSmallNThreshold = 256;
+  constexpr size_t kSmallNThreshold = 1024;

This widens the range of normalized dimensions that fall back to the scalar layer_norm_scalar path (from kernels/portable/cpu/util/normalization_ops_util.h) instead of using the SIMD RowwiseMoments + vec::map3 path.

Analysis

The code paths are correct. When N < kSmallNThreshold, the function delegates to layer_norm_scalar (lines 80-90) and returns early. Otherwise, it proceeds to the SIMD path using RowwiseMoments (line 99) and either a scalar gamma/beta loop (lines 105-110, when weight or bias is null) or vec::map3 (lines 112-121, when both are present).
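
As a rough illustration of that branch structure (a schematic following the reviewer's description, not the actual kernel; the function name and the scalar loops are placeholders, and in the real code the moments come from RowwiseMoments and the fused case uses vec::map3):

```cpp
// Schematic of the large-N path described above; names are placeholders.
#include <cstddef>

void large_n_normalize(const float* x, const float* gamma, const float* beta,
                       float* out, size_t N, float mean, float inv_std) {
  // In the real kernel, mean/inv_std come from RowwiseMoments (vectorized).
  const bool gamma_null = (gamma == nullptr);
  const bool beta_null = (beta == nullptr);
  if (gamma_null || beta_null) {
    // Scalar gamma/beta loop: the only SIMD win here is the moments
    // computation, which is why weight=None/bias=None models such as
    // Laserpointer gained nothing from the vectorized path.
    for (size_t i = 0; i < N; ++i) {
      const float g = gamma_null ? 1.0f : gamma[i];
      const float b = beta_null ? 0.0f : beta[i];
      out[i] = (x[i] - mean) * inv_std * g + b;
    }
  } else {
    // Both weight and bias present: the real kernel fuses this with
    // vec::map3 over SIMD lanes; a scalar equivalent is shown here.
    for (size_t i = 0; i < N; ++i) {
      out[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
    }
  }
}
```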

The threshold value is well-justified. The PR description provides concrete benchmark data showing that the SIMD path was a regression (1.07-1.53x slower) for models with N in [32, 512]. The new threshold of 1024 eliminates these regressions while preserving SIMD benefits for large-N models (LLM/ASR/TTS).

No functional risk for large N. Models with N >= 1024 are unaffected — they continue to use the SIMD path exactly as before.

One observation (non-blocking): The SIMD path at lines 105-110 falls back to a scalar gamma/beta loop when gamma_null || beta_null, meaning the only SIMD benefit in that case comes from RowwiseMoments for the mean/variance computation. For the Laserpointer model (where all calls have weight=None, bias=None), this means the SIMD overhead was being paid for moments computation alone, with normalization always being scalar — further justifying the threshold increase.

No concerns with the change.


Labels

CLA Signed, fb-exported, meta-exported
