
Arm backend: add VGF PT2E linear quantization modes for LLM export#19029

Open
xingguo01 wants to merge 3 commits into pytorch:main from xingguo01:arm-backend-llm-export

Conversation

@xingguo01
Collaborator

@xingguo01 xingguo01 commented Apr 21, 2026

  • add vgf_16a8w/8a8w PT2E quantization modes
  • add backend.vgf.quantize_scope for full vs linear VGF quantization
  • wire the VGF config through the LLM export and quantizer selection path
  • add coverage in export_llama_lib tests for the new VGF PT2E modes

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

- add vgf_16a8w PT2E quantization modes
- add backend.vgf.quantize_scope for full
  vs linear VGF quantization
- wire the VGF config through the LLM export
  and quantizer selection path
- add coverage in export_llama_lib tests
  for the new VGF PT2E modes

Signed-off-by: Xingguo Li <xingguo.li@arm.com>
Change-Id: Ie8fe849b4856321308d6d526248a7a4760ddc573
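The "full vs linear" quantize-scope dispatch described in the bullets above can be sketched as follows. This is an illustrative stand-in, not the actual ExecuTorch API: the real logic lives in extension/llm/export/quantizer_lib.py, and `FakeVgfQuantizer`/`apply_scope` are hypothetical names chosen for the example.

```python
# Hypothetical sketch of scope-based quantizer configuration: "full" sets a
# global config, "linear" restricts quantization to torch.nn.Linear modules.
from dataclasses import dataclass, field


@dataclass
class FakeVgfQuantizer:
    """Stand-in for the VGF quantizer: tracks global vs per-module configs."""

    global_config: object = None
    module_type_config: dict = field(default_factory=dict)

    def set_global(self, cfg):
        self.global_config = cfg
        return self

    def set_module_type(self, module_type, cfg):
        self.module_type_config[module_type] = cfg
        return self


def apply_scope(quantizer, scope, cfg):
    """Apply cfg to the whole model ('full') or only Linear layers ('linear')."""
    if scope == "full":
        return quantizer.set_global(cfg)
    if scope == "linear":
        return quantizer.set_module_type("torch.nn.Linear", cfg)
    raise ValueError(f"Unknown quantize_scope: {scope!r}")
```

With the "linear" scope, only the module-type config is populated and the global config stays unset, which is what the new tests assert below.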
Copilot AI review requested due to automatic review settings April 21, 2026 17:02
@pytorch-bot

pytorch-bot Bot commented Apr 21, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19029

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 2 Cancelled Jobs, 10 Unrelated Failures

As of commit 7d65ac8 with merge base d6f1625:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 21, 2026
@xingguo01 xingguo01 added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm ciflow/trunk release notes: arm Changes to the ARM backend delegate labels Apr 21, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds Arm VGF backend PT2E quantization support for LLM export, including a new 16a8w mode gated on INT16 TOSA extension support and a configurable quantization scope (full-model vs Linear-only), plus test coverage for the new behavior.

Changes:

  • Add vgf_16a8w PT2E quantization mode and enforce INT16 compile spec extension when selected.
  • Introduce backend.vgf.quantize_scope (full vs linear) and apply it when constructing the VGF quantizer.
  • Wire new VGF settings through llama export CLI/config and add unit tests for scope + INT16 gating.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
extension/llm/export/quantizer_lib.py Extends VGF quantizer selection for vgf_16a8w and adds scope-based application (global vs Linear-only).
extension/llm/export/config/llm_config.py Adds vgf_16a8w enum value and introduces VgfQuantizeScope + config wiring from CLI args.
examples/models/llama/export_llama_lib.py Exposes VGF PT2E modes and VGF scope/compile-spec CLI flags; passes scope into VGF quantizer creation.
examples/models/llama/tests/test_export_llama_lib.py Adds coverage for VGF linear-only scope and INT16 compile spec enforcement for vgf_16a8w.


"vgf_8a8w",
"vgf_16a8w",
],
help="Use PT2E quantization. Comma separated options. e.g. xnnpack_dynamic (for per channel 8 bit weight), xnnpack_dynamic_qc4 (for per channel 4 bit weight), embedding.",

Copilot AI Apr 21, 2026


The --pt2e_quantize argparse option is defined with a fixed set of choices, so it only accepts a single value, but the help text says it supports "Comma separated options" (and even mentions embedding, which is not a valid choice). This is user-facing and likely to confuse; either update the help text to reflect single-choice behavior, or switch the argument parsing to accept a comma-separated list (and adjust LlmConfig/Pt2eQuantize parsing accordingly).

Suggested change
help="Use PT2E quantization. Comma separated options. e.g. xnnpack_dynamic (for per channel 8 bit weight), xnnpack_dynamic_qc4 (for per channel 4 bit weight), embedding.",
help="Use a single PT2E quantization mode, e.g. xnnpack_dynamic (per-channel 8-bit weight) or xnnpack_dynamic_qc4 (per-channel 4-bit weight).",

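The two argparse behaviors Copilot contrasts above can be demonstrated side by side. This is a minimal sketch, not the export_llama_lib code: `choices=` accepts exactly one value, while a custom `type=` callable implements the comma-separated usage the current help text implies. The mode names below are taken from the quoted diff; the helper name is illustrative.

```python
# Sketch: parsing --pt2e_quantize as a comma-separated list of known modes,
# instead of the single-value `choices=` behavior.
import argparse

VALID_MODES = {"xnnpack_dynamic", "xnnpack_dynamic_qc4", "vgf_8a8w", "vgf_16a8w"}


def comma_separated_modes(value):
    """Split a comma-separated string and validate each mode name."""
    modes = [m.strip() for m in value.split(",") if m.strip()]
    bad = [m for m in modes if m not in VALID_MODES]
    if bad:
        raise argparse.ArgumentTypeError(f"invalid mode(s): {bad}")
    return modes


parser = argparse.ArgumentParser()
parser.add_argument("--pt2e_quantize", type=comma_separated_modes)
args = parser.parse_args(["--pt2e_quantize", "xnnpack_dynamic,vgf_8a8w"])
```

Switching to this style would also require LlmConfig/Pt2eQuantize to accept a list, which is why fixing the help text is the smaller change.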
@xingguo01
Collaborator Author

@digantdesai @SS-JIA it would be highly appreciated if you could have a look

Collaborator

@zingo zingo left a comment


Mostly touches Arm backend stuff, and no need for Buck2 fixes

@zingo zingo added the help wanted Extra attention is needed label May 12, 2026
)
parser.add_argument("-V", "--vulkan", action="store_true")
parser.add_argument("--vulkan-force-fp16", action="store_true")
parser.add_argument("--vgf", action="store_true")
Contributor


nit

```suggestion
parser.add_argument("--arm-vgf", action="store_true")
```

rationale: not everyone knows what VGF is

@digantdesai
Contributor

@claude review this PR

@claude

claude Bot commented May 13, 2026

Claude finished @digantdesai's task in 1m 59s


Review of PR #19029: Arm backend VGF PT2E linear quantization modes for LLM export

  • Read PR diff and changed files
  • Analyze code for correctness, style, and potential issues
  • Post review feedback

Overall: Clean, well-structured PR. The new quantization modes are properly gated, the scope abstraction is sensible, and the tests cover the key paths (full vs linear scope, INT16 enforcement). A few observations below:


1. Reviewer feedback: rename --vgf to --arm-vgf (digantdesai's comment)

I agree with @digantdesai's suggestion at export_llama_lib.py:461. "VGF" is Arm-specific terminology and not widely recognizable. Prefixing with --arm-vgf (and correspondingly --arm-vgf-compile-spec, --arm-vgf-quantize-scope) would be more discoverable for users scanning --help. The other Arm backends (TOSA, Ethos-U) don't have this prefix either, but those are more established names — vgf is new enough that the disambiguation helps.

2. from_args silently accepts vgf_compile_spec even when --vgf is not set

In llm_config.py:828-829:

if hasattr(args, "vgf_compile_spec"):
    llm_config.backend.vgf.compile_spec = args.vgf_compile_spec

Because the argparse default is "TOSA-1.0+INT", this unconditionally overwrites the config's compile spec whether or not --vgf is passed. It's harmless today (both defaults match), but fragile if the defaults diverge. Consider gating on args.vgf or at least args.vgf_compile_spec != parser.get_default(...). Low priority since it's consistent with how other backends handle this pattern in this file.
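The gating suggested above can be sketched with a minimal, self-contained example. The parser/config objects here are illustrative stand-ins for the ones in export_llama_lib.py and llm_config.py, not the real code.

```python
# Sketch: only override the config's compile spec when --vgf was passed or
# the spec flag was explicitly changed from its argparse default.
import argparse
from types import SimpleNamespace

parser = argparse.ArgumentParser()
parser.add_argument("--vgf", action="store_true")
parser.add_argument("--vgf-compile-spec", default="TOSA-1.0+INT")

config = SimpleNamespace(compile_spec="TOSA-1.0+INT")

args = parser.parse_args([])  # user passed neither flag
if args.vgf or args.vgf_compile_spec != parser.get_default("vgf_compile_spec"):
    config.compile_spec = args.vgf_compile_spec
```

With this guard the config is left untouched when neither flag is given, so the two defaults can diverge safely later.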

3. Copilot's note on --pt2e_quantize help text is valid but pre-existing

The help text says "Comma separated options" but the argument uses choices= which only accepts a single value. This is a pre-existing issue unrelated to this PR — mentioning for awareness but not blocking.

4. Test coverage is good

The three new tests cover:

  • vgf_8a8w + linear scope → verifies module_type_config is set
  • vgf_16a8w without INT16 → verifies ValueError raised
  • vgf_16a8w with INT16 → verifies success

The existing test for vgf_8a8w + full scope is also correctly extended with assertions on global_config and empty module_type_config. This gives solid boundary coverage.
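The INT16 gating those tests exercise can be reduced to a small sketch. The function name below is hypothetical; the error message is quoted from quantizer_lib.py as shown later in this review.

```python
# Minimal sketch of the INT16 gate: vgf_16a8w is rejected unless the compile
# spec enables the int16 TOSA extension.
def check_vgf_mode(mode, compile_spec):
    if mode == "vgf_16a8w" and "int16" not in compile_spec:
        raise ValueError(
            "vgf_16a8w requires a VGF compile spec with INT16 support, "
            "for example TOSA-1.0+INT+int16."
        )
    return True


check_vgf_mode("vgf_16a8w", "TOSA-1.0+INT+int16")  # accepted

rejected = False
try:
    check_vgf_mode("vgf_16a8w", "TOSA-1.0+INT")
except ValueError:
    rejected = True  # missing int16 extension is rejected
```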

5. Minor: error message could be more actionable

At quantizer_lib.py:388-389:

"vgf_16a8w requires a VGF compile spec with INT16 support, "
"for example TOSA-1.0+INT+int16."

This is clear and actionable — good.

6. Code quality

  • The VgfQuantizeScope enum is a clean str, Enum which serializes nicely via .value
  • The scope dispatch in get_vgf_quantizer handles invalid values with a clear error
  • Import of get_symmetric_a16w8_quantization_config is properly deferred inside the function (matching existing pattern)
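The `str, Enum` pattern praised above is worth a two-line illustration: members compare equal to plain strings and round-trip cleanly through `.value`. The member names here are assumed for the example, not taken from llm_config.py.

```python
# Sketch of a str-mixin enum: members are also strings, so they serialize
# and compare naturally with config/CLI string values.
from enum import Enum


class VgfQuantizeScope(str, Enum):
    FULL = "full"
    LINEAR = "linear"
```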

Summary: This is a solid, focused change. The only actionable feedback is the --arm-vgf naming suggestion from digantdesai. The code is correct and well-tested. LGTM once the naming nit is addressed.


@zingo zingo removed the help wanted Extra attention is needed label May 13, 2026
