[Fix]: Handle None scales in generate_zero_point for mixed-format layers #4505
lingyezhixing wants to merge 1 commit into InternLM:main
Conversation
Qwen3.5-AWQ has mixed-format attention layers (fp16 QKV + AWQ O projection). The reader returns (None, None, None, None) for quant params to signal skip, but QuantWeightOnly.__call__ passed these Nones directly to generate_zero_point() which crashed on None.shape. Guard the call so Nones propagate to _export's existing all-None skip logic instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Fixes a TurboMind export-time crash when converting/loading compressed-tensors symmetric-int4 weights for Qwen3.5 models that mix linear_attention and full_attention, where some layers may not have self_attn weights and thus produce None scale entries.
Changes:
- Add a None-aware guard around `generate_zero_point(scales)` for compressed-tensors symmetric quantization.
- Fall back to passing `scales` through as `zeros` when scales are missing (`None`) to avoid crashing.
```python
if scales is not None and all(s is not None for s in scales):
    zeros = generate_zero_point(scales)
else:
    zeros = scales
```
The new branch that skips generate_zero_point when scales (or any element within it) is None isn’t covered by the existing compressed-tensors tests. Please add a unit test that exercises QuantWeightOnly with compressed-tensors keys where weight_scale is a tuple containing None entries (e.g., all None for a missing self_attn layer) and asserts the call does not crash and that zeros is passed through consistently.
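A test along the lines the reviewer asks for could look roughly like this. It is a self-contained sketch: `compute_zeros` re-states the guard logic locally and `generate_zero_point` is a stub, since the real code lives in `QuantWeightOnly.__call__` in `lmdeploy/turbomind/deploy/parameter.py`.

```python
# Standalone sketch of the guarded branch; generate_zero_point is a stub
# standing in for the real lmdeploy helper.
def generate_zero_point(scales):
    # Stub: the real implementation derives zero points from the scales tensor.
    return [0 for _ in scales]


def compute_zeros(scales):
    # The guard under test: skip generate_zero_point when scales is None or
    # contains None entries, so _export's existing all-None skip logic applies.
    if scales is not None and all(s is not None for s in scales):
        return generate_zero_point(scales)
    return scales


# Missing self_attn layer: all-None quant params must pass through untouched.
assert compute_zeros(None) is None
assert compute_zeros((None, None, None, None)) == (None, None, None, None)
# Valid scales still go through generate_zero_point.
assert compute_zeros([1.0, 2.0]) == [0, 0]
```

The real test would construct `QuantWeightOnly` with compressed-tensors keys and a `weight_scale` tuple of `None` entries rather than calling a local helper.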
Motivation
Fix crash when loading compressed-tensors quantized Qwen3.5 models (e.g., from llm-compressor) in TurboMind backend.
Qwen3.5 mixes `linear_attention` (24 layers) and `full_attention` (8 layers). For `linear_attention` layers that lack `self_attn` weights, the reader returns `None` for scales. When `compressed_tensors=True` and `has_zero_point=False` (symmetric quantization), `generate_zero_point(scales)` is called unconditionally, crashing on `None`.

Models with standard AWQ format (`quant_method="awq"`) are unaffected because they take a different code path that never calls `generate_zero_point`.

Modification
Guard `generate_zero_point(scales)` with a `None` check in `lmdeploy/turbomind/deploy/parameter.py`:

BC-breaking (Optional)
No.
Use cases (Optional)
Crash reproduction (before fix):
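The pre-fix failure mode can be shown in isolation. The stub below is illustrative, not the actual lmdeploy implementation, but it demonstrates why passing the reader's `None` into a function that reads `scales.shape` raises `AttributeError`.

```python
# Illustrative stub: the real generate_zero_point reads scales.shape,
# which fails when the reader returned None for a linear_attention
# layer that has no self_attn weights.
def generate_zero_point(scales):
    shape = scales.shape  # AttributeError when scales is None
    return shape


scales = None  # what the reader returns for a missing self_attn layer
try:
    generate_zero_point(scales)
except AttributeError as e:
    print(f"crash before fix: {e}")
    # → crash before fix: 'NoneType' object has no attribute 'shape'
```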
Works correctly (standard AWQ, unaffected):
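The AWQ path's immunity can be sketched as follows. `export_zeros` and its parameters are hypothetical names for illustration only: the point, per the description above, is that AWQ checkpoints already ship zero points (`qzeros`), so that path never reaches `generate_zero_point`, while the compressed-tensors symmetric path now guards against missing scales.

```python
# Hypothetical sketch of the two export paths; names are illustrative.
def generate_zero_point(scales):
    # Stub: assume a symmetric-int4 midpoint of 8 per scale entry.
    return [8 for _ in scales]


def export_zeros(quant_method, scales, qzeros=None):
    if quant_method == 'awq':
        # AWQ checkpoints carry qzeros; generate_zero_point is never called.
        return qzeros
    # compressed-tensors symmetric path: guard against missing scales.
    if scales is not None and all(s is not None for s in scales):
        return generate_zero_point(scales)
    return scales


assert export_zeros('awq', None, qzeros=[8, 8]) == [8, 8]       # unaffected
assert export_zeros('compressed-tensors', None) is None          # skipped
assert export_zeros('compressed-tensors', [0.1, 0.2]) == [8, 8]  # normal path
```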