[Fix]: Handle None scales in generate_zero_point for mixed-format layers#4505

Open
lingyezhixing wants to merge 1 commit into InternLM:main from lingyezhixing:lingyezhixing/fix-none-scales-zero-point

Conversation


@lingyezhixing lingyezhixing commented Apr 7, 2026

Motivation

Fix crash when loading compressed-tensors quantized Qwen3.5 models (e.g., from llm-compressor) in TurboMind backend.

Qwen3.5 mixes linear_attention (24 layers) and full_attention (8 layers). For linear_attention layers, which lack self_attn weights, the reader returns None for scales. When compressed_tensors=True and has_zero_point=False (symmetric quantization), generate_zero_point(scales) is called unconditionally and crashes on the None input.

Models with standard AWQ format (quant_method="awq") are unaffected because they take a different code path that never calls generate_zero_point.
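The failure mode can be illustrated with a minimal, self-contained sketch. The `generate_zero_point` below is a hypothetical stand-in (the real helper lives in lmdeploy/turbomind/deploy/parameter.py); it only mimics the reported crash path, where accessing an attribute such as `.shape` on a None scale raises AttributeError:

```python
import numpy as np

def generate_zero_point(scales):
    # Hypothetical stand-in: for symmetric int4, zero points are a constant
    # (8 here) shaped like the scales tensor. Touching `scales.shape` when
    # scales is None reproduces the reported crash.
    return np.full(scales.shape, 8, dtype=np.int32)

compressed_tensors, has_zero_point = True, False

# full_attention layer: scales are present, so the call succeeds.
scales = np.array([[0.1, 0.2]], dtype=np.float32)
if compressed_tensors and not has_zero_point:
    zeros = generate_zero_point(scales)

# linear_attention layer: the reader returned None for scales.
missing_scales = None
try:
    generate_zero_point(missing_scales)
except AttributeError as exc:
    # 'NoneType' object has no attribute 'shape'
    print("crash:", exc)
```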

Modification

Guard generate_zero_point(scales) with a None check in lmdeploy/turbomind/deploy/parameter.py:

if self.compressed_tensors and not self.has_zero_point:
-    zeros = generate_zero_point(scales)
+    if scales is not None and all(s is not None for s in scales):
+        zeros = generate_zero_point(scales)
+    else:
+        zeros = scales

BC-breaking (Optional)

No.

Use cases (Optional)

Crash reproduction (before fix):

lmdeploy chat cyankiwi/Qwen3.5-4B-AWQ-4bit --backend turbomind

Works correctly (standard AWQ, unaffected):

lmdeploy chat QuantTrio/Qwen3.5-4B-AWQ --backend turbomind

Qwen3.5-AWQ has mixed-format attention layers (fp16 QKV + AWQ O projection).
The reader returns (None, None, None, None) for quant params to signal a skip,
but QuantWeightOnly.__call__ passed these Nones directly to generate_zero_point(),
which crashed on None.shape. Guard the call so the Nones propagate to _export's
existing all-None skip logic instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Fixes a TurboMind export-time crash when converting/loading compressed-tensors symmetric-int4 weights for Qwen3.5 models that mix linear_attention and full_attention, where some layers may not have self_attn weights and thus produce None scale entries.

Changes:

  • Add a None-aware guard around generate_zero_point(scales) for compressed-tensors symmetric quantization.
  • Fall back to passing through scales as zeros when scales are missing (None) to avoid crashing.


Comment on lines +100 to +103
if scales is not None and all(s is not None for s in scales):
zeros = generate_zero_point(scales)
else:
zeros = scales

Copilot AI Apr 8, 2026


The new branch that skips generate_zero_point when scales (or any element within it) is None isn't covered by the existing compressed-tensors tests. Please add a unit test that exercises QuantWeightOnly with compressed-tensors keys where weight_scale is a tuple containing None entries (e.g., all None for a missing self_attn layer), and assert that the call does not crash and that zeros is passed through unchanged.
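A sketch of such a test, with the guard factored into a standalone helper so it can be exercised without the rest of the exporter. The helper names here are illustrative, not the actual lmdeploy API; only the guard condition is taken verbatim from the PR diff:

```python
import numpy as np

def generate_zero_point(scales):
    # Hypothetical stand-in for the real lmdeploy helper: symmetric int4
    # zero points as a constant tensor shaped like the scales.
    return np.full(scales.shape, 8, dtype=np.int32)

def maybe_generate_zero_point(scales):
    # Mirrors the guard added in this PR: pass Nones through untouched so
    # the downstream all-None skip logic in _export can fire.
    if scales is not None and all(s is not None for s in scales):
        return generate_zero_point(np.asarray(scales))
    return scales

def test_none_scales_pass_through():
    # Fully missing quant params (e.g. a linear_attention layer).
    assert maybe_generate_zero_point(None) is None
    # Tuple-of-Nones, as returned by the reader to signal a skip.
    sentinel = (None, None, None, None)
    assert maybe_generate_zero_point(sentinel) is sentinel

def test_present_scales_get_zero_points():
    zeros = maybe_generate_zero_point((0.1, 0.2))
    assert zeros.shape == (2,)
    assert (zeros == 8).all()
```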

