Qualcomm AI Engine Direct - PTQ Mixed-precision guidance for LLMs #18969
abhinaykukkadapu merged 1 commit into pytorch:main
Conversation
Summary:
- Add a quantization guidance tutorial README for LLMs, using Qwen3-1.7B as an example
- Add PerLayerSqnrAnalyzer for quantization sensitivity analysis
- Add unit tests for PerLayerSqnrAnalyzer
@pytorchbot label "release notes: qualcomm"
This PR introduces mixed-precision guidance and a quantization sensitivity analysis tool for LLMs in PTQ.
abhinaykukkadapu left a comment
@DannyYuyang-quic this is awesome. This would really help users iterate faster to arrive at the target accuracy. Do you think we can expand the search-space combinations to cover bmm nodes and annotate_kv_8bit? Also, do we need to distinguish LM heads and output layers, since they are the most sensitive layers?
Hi @abhinaykukkadapu. Yes, we can definitely expand the search space to include kv cache precision.
I agree with that. I've observed that even when the LM head and the transformer blocks contribute the same quantization error, they impact model quality in different ways. We could potentially weight the SQNR errors differently for these components to arrive at a better overall configuration.
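A minimal sketch of how such component-weighted ranking might look; the weight values, layer names, and SQNR figures below are made-up placeholders, not part of this PR:

```python
# Illustrative sketch: rank layers by their SQNR deficit below a target,
# weighting LM-head error more heavily than transformer-block error.
# All names, weights, and SQNR values here are placeholder assumptions.
from typing import Dict

per_layer_sqnr_db: Dict[str, float] = {
    "lm_head_conv": 9.2,
    "layers.0.feed_forward.w2_conv": 8.5,
    "layers.0.attention.wv_conv": 11.3,
}

TARGET_DB = 10.0
COMPONENT_WEIGHTS = {"lm_head": 2.0, "default": 1.0}  # assumed weighting

def weighted_sensitivity(name: str, sqnr_db: float) -> float:
    weight = COMPONENT_WEIGHTS["lm_head"] if "lm_head" in name else COMPONENT_WEIGHTS["default"]
    # Lower SQNR means more quantization error; score the deficit below target.
    return weight * max(0.0, TARGET_DB - sqnr_db)

ranked = sorted(per_layer_sqnr_db, key=lambda n: weighted_sensitivity(n, per_layer_sqnr_db[n]), reverse=True)
print(ranked)  # most sensitive layers first
```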
Summary:
Test
Motivation
When applying post-training quantization (PTQ) to LLMs under aggressive precision constraints, such as 4-bit weight (LPBQ) quantization, users often rely on mixed-precision strategies to recover accuracy. In practice, this means selectively keeping certain layers at higher precision (e.g. 8-bit weights) while quantizing the rest of the model more aggressively.
However, there is currently a lack of quantization recipe guidance to indicate which layers are most critical for preserving accuracy under these low-precision regimes. As a result, users are often forced to perform extensive grid searches over mixed-precision combinations, which is both time-consuming and difficult to reason about.
To address this gap, this PR introduces analysis tools and documentation that help identify quantization-sensitive layers and provide a directional starting point for constructing mixed-precision recipes. This allows users to move away from blind trial-and-error and instead iterate from an informed baseline when tuning mixed precision under extreme quantization settings.
It is highly recommended to combine this analyzer with existing PTQ algorithms (e.g. SeqMSE, SpinQuant R3) to further enhance accuracy and help guide mixed-precision selection.
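For context, the core sensitivity metric behind such analysis is per-layer SQNR: compare each layer's float output against its quantize-dequantize counterpart. Below is a minimal, generic sketch of the metric itself (a plain PyTorch illustration, not the PerLayerSqnrAnalyzer API):

```python
import torch

def sqnr_db(fp_out: torch.Tensor, q_out: torch.Tensor) -> float:
    """SQNR in dB between a layer's float output and its quantized output."""
    noise_power = (fp_out - q_out).pow(2).mean().clamp_min(1e-12)  # avoid log(0)
    signal_power = fp_out.pow(2).mean()
    return (10.0 * torch.log10(signal_power / noise_power)).item()

# Example: simulate quantization error with fake-quantize; more bits or a
# tighter scale tracks the float signal more closely and yields higher SQNR.
x = torch.randn(4, 64)
x_q = torch.fake_quantize_per_tensor_affine(
    x, scale=0.05, zero_point=0, quant_min=-128, quant_max=127
)
print(f"{sqnr_db(x, x_q):.1f} dB")
```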
Example usage: Qwen3-1.7B Mixed-Precision
The following demonstrates the mixed-precision workflow applied to Qwen3-1.7B to illustrate expected outcomes. The evaluation uses Wikitext (word perplexity, where lower is better). The initial run applies the 16a4w block-quantization base recipe with a block size of 64 across all layers. After this initial run, the analyzer flags the quantization-sensitive layers and automatically generates suggested mixed-precision quantization recipes. We picked one of these recipes and compared it against the current mainline recipe, which was previously obtained through a tedious, manual fine-tuning process.
The recipes compared were:
- `16a4w_block` blk64 base
- `16a4w_block` blk16 base + `16a8w` per-channel on LM head (mainline recipe)
- `16a4w_block` blk64 base + `16a8w` per-channel on sensitive layers (auto-generated)

In these experiments, the auto-generated mixed-precision recipe achieved word perplexity very close to the manually fine-tuned mainline recipe.
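To make the per-layer comparison mechanics concrete, here is a generic sketch using plain PyTorch forward hooks; the helper names and the simulated 4-bit weight quantization are illustrative assumptions, not the interface added in this PR:

```python
# Generic sketch: capture each Linear layer's float output with forward hooks,
# re-run a copy with simulated 4-bit weights, and report per-layer SQNR so
# low-scoring layers can be promoted to 8-bit. Helper names are illustrative.
import copy
import torch
import torch.nn as nn

def sqnr_db(fp, q):
    return (10 * torch.log10(fp.pow(2).mean() / (fp - q).pow(2).mean().clamp_min(1e-12))).item()

def capture_outputs(model: nn.Module, x: torch.Tensor) -> dict:
    outputs, hooks = {}, []
    for name, m in model.named_modules():
        if isinstance(m, nn.Linear):
            hooks.append(m.register_forward_hook(
                lambda mod, inp, out, name=name: outputs.__setitem__(name, out.detach())))
    model(x)
    for h in hooks:
        h.remove()
    return outputs

def fake_quantize_weights(model: nn.Module, bits: int) -> nn.Module:
    q = copy.deepcopy(model)
    qmax = 2 ** (bits - 1) - 1
    for m in q.modules():
        if isinstance(m, nn.Linear):  # symmetric per-tensor round-to-nearest
            scale = m.weight.abs().max() / qmax
            m.weight.data = (m.weight / scale).round().clamp(-qmax - 1, qmax) * scale
    return q

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(8, 64)
fp_outs = capture_outputs(model, x)
q_outs = capture_outputs(fake_quantize_weights(model, bits=4), x)
for name in fp_outs:
    print(name, f"{sqnr_db(fp_outs[name], q_outs[name]):.1f} dB")
```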
Insights (SQNR threshold <= 10 dB):
During the initial experiment, the SQNR analyzer successfully flagged the following layers as highly sensitive and generated the corresponding recipe classes:
- `feed_forward.w2_conv`
- `feed_forward.w3_conv`
- `attention.wv_conv`

Note: The suggested mixed-precision recipe may not be the absolute optimal combination for every model, but it serves as a highly effective starting point to guide further tuning.
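As an illustration of this thresholding step, a sketch that flags layers scoring at or below 10 dB and promotes them to 8-bit weights in a recipe map; the SQNR values and the recipe-map format are placeholders, not measured results:

```python
# Illustrative thresholding step: flag layers whose SQNR is <= 10 dB and
# promote them to 16a8w while the rest stay on the 16a4w block base.
# SQNR values and the recipe-dict format are placeholders, not real results.
SQNR_THRESHOLD_DB = 10.0

per_layer_sqnr_db = {
    "layers.0.feed_forward.w2_conv": 7.8,
    "layers.0.feed_forward.w3_conv": 9.1,
    "layers.0.attention.wv_conv": 9.6,
    "layers.0.attention.wq_conv": 18.4,
}

sensitive = {n for n, s in per_layer_sqnr_db.items() if s <= SQNR_THRESHOLD_DB}
recipe = {
    name: "16a8w per-channel" if name in sensitive else "16a4w_block blk64"
    for name in per_layer_sqnr_db
}
for name, precision in recipe.items():
    print(f"{name}: {precision}")
```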