
Qualcomm AI Engine Direct - PTQ Mixed-precision guidance for LLMs #18969

Merged
abhinaykukkadapu merged 1 commit into pytorch:main from CodeLinaro:dev1/danny/mix_precision
Apr 20, 2026
Conversation

@DannyYuyang-quic (Contributor) commented Apr 17, 2026

Summary:

  • Add a quantization guidance tutorial README for LLMs
    • Uses Qwen3-1.7B as an example
  • Add PerLayerSqnrAnalyzer for quantization sensitivity analysis
  • Add unit tests for PerLayerSqnrAnalyzer

Test

python backends/qualcomm/tests/test_qnn_delegate.py TestUtilsScript.test_analyzer_to_file_generation -s ${device_id} -H ${host_id} -m ${soc} -b build-android

Motivation

When applying post-training quantization (PTQ) to LLMs under aggressive precision constraints, such as 4-bit weight (LPBQ) quantization, users often rely on mixed-precision strategies to recover accuracy. In practice, this means selectively keeping certain layers at higher precision (e.g. 8-bit weights) while quantizing the rest of the model more aggressively.
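To make the idea concrete, a mixed-precision recipe can be viewed as a set of per-layer overrides on top of an aggressive default. A minimal sketch follows; the names and structure here are hypothetical illustrations, not the backend's actual API:

```python
# Hypothetical sketch: mixed precision as per-layer weight bit-width
# overrides on top of an aggressive 4-bit default.
DEFAULT_WEIGHT_BITS = 4  # aggressive LPBQ-style baseline

# Layer-name substrings that should stay at higher precision (illustrative).
HIGH_PRECISION_OVERRIDES = {
    "lm_head": 8,  # e.g. keep the output projection at 8-bit weights
}

def weight_bits_for(layer_name: str) -> int:
    """Return the weight bit-width to use for a given layer."""
    for pattern, bits in HIGH_PRECISION_OVERRIDES.items():
        if pattern in layer_name:
            return bits
    return DEFAULT_WEIGHT_BITS
```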

However, there is currently a lack of quantization recipe guidance to indicate which layers are most critical for preserving accuracy under these low-precision regimes. As a result, users are often forced to perform extensive grid searches over mixed-precision combinations, which is both time-consuming and difficult to reason about.

To address this gap, this PR introduces analysis tools and documentation that help identify quantization-sensitive layers and provide a directional starting point for constructing mixed-precision recipes. This allows users to move away from blind trial-and-error and instead iterate from an informed baseline when tuning mixed precision under extreme quantization settings.

It is highly recommended to combine this analyzer with existing PTQ algorithms (e.g. SeqMSE, SpinQuant R3) to further enhance accuracy and help guide mixed-precision selection.

Example usage: Qwen3-1.7B Mixed-Precision

The following demonstrates the mixed-precision workflow applied to Qwen3-1.7B to illustrate expected outcomes. The evaluation uses Wikitext (word perplexity, where lower is better).
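For reference, word perplexity is the exponentiated average negative log-likelihood per word of the evaluation text, so lower values mean the quantized model tracks the data more closely:

$$\mathrm{PPL}_{\text{word}} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(w_i \mid w_{<i}\right)\right)$$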

  1. For users unsure of which precision to start with, we recommend running an initial experiment with an aggressive baseline (e.g. LPBQ). In this example, we start with LPBQ block size 64 across all layers.

  2. After this initial run, the analyzer flags the quantization-sensitive layers and automatically generates suggested mixed-precision quantization recipes (a hypothetical sketch of such a recipe follows this list). We picked one of these recipes and compared it against the current mainline recipe, which was previously obtained through a tedious manual finetuning process.
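Conceptually, a suggested recipe pairs the aggressive default with higher-precision overrides for the flagged layers. A hypothetical sketch of what such a recipe could look like (illustrative only; the analyzer's actual output format may differ):

```python
# Hypothetical shape of an auto-generated mixed-precision recipe;
# illustrative only, not the analyzer's actual output format.
suggested_recipe = {
    "default": {"scheme": "16a4w_block", "block_size": 64},  # LPBQ baseline
    "overrides": {  # sensitive layers promoted to 8-bit per-channel weights
        "feed_forward.w2_conv": {"scheme": "16a8w", "granularity": "per_channel"},
        "feed_forward.w3_conv": {"scheme": "16a8w", "granularity": "per_channel"},
        "attention.wv_conv": {"scheme": "16a8w", "granularity": "per_channel"},
    },
}
```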

| Stage | Recipe | Word PPL | Tokens/second (SM8750) |
| --- | --- | --- | --- |
| FP32 (CPU) | - | ~14.04 | - |
| Initial Experiment (Default Baseline) | 16a4w_block blk64 base | 17.33 | ~54.0 |
| Current Mainline (Manual finetuning) | 16a4w_block blk16 base + 16a8w per-channel on LM head | ~14.75 | ~47.1 |
| Suggested Mixed-Precision (Auto-generated) | 16a4w_block blk64 base + 16a8w per-channel on sensitive layers | ~15.05 | ~48.8 |

As shown in the results above, the auto-generated mixed-precision recipe achieves perplexity very close to that of the manually fine-tuned mainline recipe.

Insight findings (SQNR Threshold <= 10 dB):

During the initial experiment, the SQNR analyzer flagged the following layers as highly sensitive and generated the corresponding recipe classes (a sketch of the flagging logic follows this list):

  1. feed_forward.w2_conv
  2. feed_forward.w3_conv
  3. attention.wv_conv
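For reference, SQNR here is the standard signal-to-quantization-noise ratio in dB between a layer's float and quantized outputs. A minimal sketch of the flagging logic, illustrative only and not the PerLayerSqnrAnalyzer implementation:

```python
import torch

def sqnr_db(fp_out: torch.Tensor, q_out: torch.Tensor) -> float:
    """SQNR in dB between a layer's float output and its quantized output."""
    noise_power = (fp_out - q_out).pow(2).mean().clamp_min(1e-12)  # avoid log(0)
    signal_power = fp_out.pow(2).mean()
    return float(10.0 * torch.log10(signal_power / noise_power))

SQNR_THRESHOLD_DB = 10.0  # layers at or below this are flagged as sensitive

def flag_sensitive_layers(per_layer: dict[str, tuple[torch.Tensor, torch.Tensor]]) -> list[str]:
    """per_layer maps layer name -> (float output, quantized output)."""
    return [name for name, (fp, q) in per_layer.items()
            if sqnr_db(fp, q) <= SQNR_THRESHOLD_DB]
```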

Note: The suggested mixed-precision recipe may not be the optimal combination for every model, but it serves as a highly effective starting point for further tuning.

@pytorch-bot (Bot) commented Apr 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18969

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 23b53a1 with merge base 9d72936:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla (Bot) added the CLA Signed label on Apr 17, 2026
@DannyYuyang-quic (Contributor, Author) commented:

@pytorchbot label "release notes: qualcomm"

@pytorch-bot (Bot) added the release notes: qualcomm label on Apr 17, 2026
@DannyYuyang-quic (Contributor, Author) commented Apr 17, 2026

Hi @abhinaykukkadapu,

This PR introduces mixed-precision guidance and a quantization sensitivity analysis tool for PTQ on LLMs.
Could you please take a look?
Thanks!

cc: @cccclai @haowhsu-quic @shewu-quic

@abhinaykukkadapu (Contributor) left a comment


@DannyYuyang-quic this is awesome. This would really help users iterate faster to arrive at the target accuracy. Do you think we can expand the search space to cover bmm nodes and annotate_kv_8bit? Also, do we need to distinguish LM heads and output layers, as they are the most sensitive layers?

@abhinaykukkadapu merged commit 1941d07 into pytorch:main on Apr 20, 2026
160 of 166 checks passed
@DannyYuyang-quic (Contributor, Author) replied:

> @DannyYuyang-quic this is awesome. This would really help users iterate faster to arrive at the target accuracy. Do you think we can expand the search space to cover bmm nodes and annotate_kv_8bit?

Hi @abhinaykukkadapu. Yes, we can definitely expand the search space to include KV cache precision.

> Also, do we need to distinguish LM heads and output layers, as they are the most sensitive layers?

I agree with that. I've observed that even when both the LM head and the transformer blocks contribute the same amount of quantization error, they impact model quality in different ways. We could potentially weight the SQNR errors differently for these components to arrive at a better overall configuration; a hypothetical sketch follows.
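A purely illustrative sketch of such component-weighted ranking; the weights and naming here are assumptions, not part of the analyzer:

```python
# Hypothetical: scale per-layer SQNR by component importance before ranking,
# so equal raw error on the LM head counts as more sensitive than the same
# error inside a transformer block.
COMPONENT_WEIGHTS = {"lm_head": 2.0, "transformer_block": 1.0}

def weighted_sqnr(layer_name: str, raw_sqnr_db: float) -> float:
    kind = "lm_head" if "lm_head" in layer_name else "transformer_block"
    # Lower SQNR means more sensitive; dividing by the weight pushes
    # heavily-weighted components further down the sensitivity ranking.
    return raw_sqnr_db / COMPONENT_WEIGHTS[kind]
```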

