
Qualcomm AI Engine Direct - PTQ Mixed-precision guidance for LLMs #18969

Merged
abhinaykukkadapu merged 1 commit into pytorch:main from CodeLinaro:dev1/danny/mix_precision
Apr 20, 2026
Conversation

@DannyYuyang-quic (Contributor) commented Apr 17, 2026

Summary:

  • Add a quantization guidance tutorial README for LLMs
    • Uses Qwen3-1.7B as an example
  • Add PerLayerSqnrAnalyzer for quantization sensitivity analysis
  • Add unit tests for PerLayerSqnrAnalyzer

Test

python backends/qualcomm/tests/test_qnn_delegate.py TestUtilsScript.test_analyzer_to_file_generation -s ${device_id} -H ${host_id} -m ${soc} -b build-android

Motivation

When applying post-training quantization (PTQ) to LLMs under aggressive precision constraints, such as 4-bit weight (LPBQ) quantization, users often rely on mixed-precision strategies to recover accuracy. In practice, this means selectively keeping certain layers at higher precision (e.g. 8-bit weights) while quantizing the rest of the model more aggressively.
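To make the idea concrete, a mixed-precision recipe can be viewed as a set of per-layer overrides on top of an aggressive default. A minimal sketch follows; the names and structure here are hypothetical illustrations, not the backend's actual API:

```python
# Hypothetical sketch: mixed precision as per-layer weight bit-width
# overrides on top of an aggressive 4-bit default.
DEFAULT_WEIGHT_BITS = 4  # aggressive LPBQ-style baseline

# Layer-name substrings that should stay at higher precision (illustrative).
HIGH_PRECISION_OVERRIDES = {
    "lm_head": 8,  # e.g. keep the output projection at 8-bit weights
}

def weight_bits_for(layer_name: str) -> int:
    """Return the weight bit-width to use for a given layer."""
    for pattern, bits in HIGH_PRECISION_OVERRIDES.items():
        if pattern in layer_name:
            return bits
    return DEFAULT_WEIGHT_BITS
```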

However, there is currently a lack of quantization recipe guidance to indicate which layers are most critical for preserving accuracy under these low-precision regimes. As a result, users are often forced to perform extensive grid searches over mixed-precision combinations, which is both time-consuming and difficult to reason about.

To address this gap, this PR introduces analysis tools and documentation that help identify quantization-sensitive layers and provide a directional starting point for constructing mixed-precision recipes. This allows users to move away from blind trial-and-error and instead iterate from an informed baseline when tuning mixed precision under extreme quantization settings.

It is highly recommended to combine this analyzer with existing PTQ algorithms (e.g. SeqMSE, SpinQuant R3) to further enhance accuracy and help guide mixed-precision selection.

Example usage: Qwen3-1.7B Mixed-Precision

The following demonstrates the mixed-precision workflow applied to Qwen3-1.7B to illustrate expected outcomes. The evaluation uses Wikitext (word perplexity, where lower is better).
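For reference, word perplexity is the exponentiated average negative log-likelihood per word of the evaluation text, so lower values mean the quantized model tracks the data more closely:

$$\mathrm{PPL}_{\text{word}} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(w_i \mid w_{<i}\right)\right)$$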

  1. For users unsure of which precision to start with, we recommend running an initial experiment with an aggressive baseline (e.g. LPBQ). In this example, we start with LPBQ block size 64 across all layers.

  2. After this initial run, the analyzer flags the quantization-sensitive layers and automatically generates suggested mixed-precision quantization recipes (a hypothetical sketch of such a recipe follows this list). We picked one of these recipes and compared it against the current mainline recipe, which was previously obtained through a tedious manual finetuning process.
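Conceptually, a suggested recipe pairs the aggressive default with higher-precision overrides for the flagged layers. A hypothetical sketch of what such a recipe could look like (illustrative only; the analyzer's actual output format may differ):

```python
# Hypothetical shape of an auto-generated mixed-precision recipe;
# illustrative only, not the analyzer's actual output format.
suggested_recipe = {
    "default": {"scheme": "16a4w_block", "block_size": 64},  # LPBQ baseline
    "overrides": {  # sensitive layers promoted to 8-bit per-channel weights
        "feed_forward.w2_conv": {"scheme": "16a8w", "granularity": "per_channel"},
        "feed_forward.w3_conv": {"scheme": "16a8w", "granularity": "per_channel"},
        "attention.wv_conv": {"scheme": "16a8w", "granularity": "per_channel"},
    },
}
```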

| Stage | Recipe | Word PPL | Tokens/second (SM8750) |
| --- | --- | --- | --- |
| FP32 (CPU) | - | ~14.04 | - |
| Initial Experiment (Default Baseline) | 16a4w_block blk64 base | 17.33 | ~54.0 |
| Current Mainline (Manual finetuning) | 16a4w_block blk16 base + 16a8w per-channel on LM head | ~14.75 | ~47.1 |
| Suggested Mixed-Precision (Auto-generated) | 16a4w_block blk64 base + 16a8w per-channel on sensitive layers | ~15.05 | ~48.8 |

As shown in the results above, the auto-generated mixed-precision recipe achieves perplexity very close to that of the manually fine-tuned mainline recipe.

Insight findings (SQNR Threshold <= 10 dB):

During the initial experiment, the SQNR analyzer flagged the following layers as highly sensitive and generated the corresponding recipe classes (a sketch of the flagging logic follows this list):

  1. feed_forward.w2_conv
  2. feed_forward.w3_conv
  3. attention.wv_conv
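For reference, SQNR here is the standard signal-to-quantization-noise ratio in dB between a layer's float and quantized outputs. A minimal sketch of the flagging logic, illustrative only and not the PerLayerSqnrAnalyzer implementation:

```python
import torch

def sqnr_db(fp_out: torch.Tensor, q_out: torch.Tensor) -> float:
    """SQNR in dB between a layer's float output and its quantized output."""
    noise_power = (fp_out - q_out).pow(2).mean().clamp_min(1e-12)  # avoid log(0)
    signal_power = fp_out.pow(2).mean()
    return float(10.0 * torch.log10(signal_power / noise_power))

SQNR_THRESHOLD_DB = 10.0  # layers at or below this are flagged as sensitive

def flag_sensitive_layers(per_layer: dict[str, tuple[torch.Tensor, torch.Tensor]]) -> list[str]:
    """per_layer maps layer name -> (float output, quantized output)."""
    return [name for name, (fp, q) in per_layer.items()
            if sqnr_db(fp, q) <= SQNR_THRESHOLD_DB]
```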

Note: The suggested mixed-precision recipe may not be the optimal combination for every model, but it serves as a highly effective starting point for further tuning.

@pytorch-bot (Bot) commented Apr 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18969

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 23b53a1 with merge base 9d72936:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla (Bot) added the CLA Signed label on Apr 17, 2026
@DannyYuyang-quic (Contributor, Author) commented:

@pytorchbot label "release notes: qualcomm"

@pytorch-bot (Bot) added the release notes: qualcomm label on Apr 17, 2026
@DannyYuyang-quic (Contributor, Author) commented Apr 17, 2026

Hi @abhinaykukkadapu,

This PR introduces mixed-precision guidance and a quantization sensitivity analysis tool for PTQ on LLMs.
Could you please take a look?
Thanks!

cc: @cccclai @haowhsu-quic @shewu-quic

@abhinaykukkadapu (Contributor) left a comment


@DannyYuyang-quic this is awesome. This would really help users iterate faster to arrive at the target accuracy. Do you think we can expand the search space to cover bmm nodes and annotate_kv_8bit? Also, do we need to distinguish LM heads and output layers, as they are the most sensitive layers?

@abhinaykukkadapu merged commit 1941d07 into pytorch:main on Apr 20, 2026
160 of 166 checks passed
@DannyYuyang-quic (Contributor, Author) replied:

> @DannyYuyang-quic this is awesome. This would really help users iterate faster to arrive at the target accuracy. Do you think we can expand the search space to cover bmm nodes and annotate_kv_8bit?

Hi @abhinaykukkadapu. Yes, we can definitely expand the search space to include KV cache precision.

> Also, do we need to distinguish LM heads and output layers, as they are the most sensitive layers?

I agree with that. I've observed that even when both the LM head and the transformer blocks contribute the same amount of quantization error, they impact model quality in different ways. We could potentially weight the SQNR errors differently for these components to arrive at a better overall configuration; a hypothetical sketch follows.
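A purely illustrative sketch of such component-weighted ranking; the weights and naming here are assumptions, not part of the analyzer:

```python
# Hypothetical: scale per-layer SQNR by component importance before ranking,
# so equal raw error on the LM head counts as more sensitive than the same
# error inside a transformer block.
COMPONENT_WEIGHTS = {"lm_head": 2.0, "transformer_block": 1.0}

def weighted_sqnr(layer_name: str, raw_sqnr_db: float) -> float:
    kind = "lm_head" if "lm_head" in layer_name else "transformer_block"
    # Lower SQNR means more sensitive; dividing by the weight pushes
    # heavily-weighted components further down the sensitivity ranking.
    return raw_sqnr_db / COMPONENT_WEIGHTS[kind]
```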

