fix lite module for transformers>=5.0 #4488
Conversation
Pull request overview
Fixes `lmdeploy.lite` quantization/calibration regressions when used with `transformers>=5.0`, focusing on newer nested config wrappers and numerical stability in AWQ smoothing.
Changes:
- Add fallback logic in calibration to unwrap nested HF config objects before reading head-count fields.
- Prevent potential overflow in AWQ scale normalization by computing extrema in float32.
- Switch `auto_awq` to absolute imports for `calibrate`/`LAYER_TYPE_MAP`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `lmdeploy/lite/quantization/calibration.py` | Unwrap nested config objects in `_guess_num_heads`; also includes new commented debug prints in the wrapped forward. |
| `lmdeploy/lite/quantization/awq.py` | Adjusts AWQ `smooth_fc_fcs` normalization to avoid float16/bfloat16 overflow. |
| `lmdeploy/lite/apis/auto_awq.py` | Changes relative import of `calibrate`/`LAYER_TYPE_MAP` to an absolute import. |
```python
if hasattr(model.config, 'text_config'):
    model.config = model.config.text_config
if hasattr(model.config, 'llm_config'):
    model.config = model.config.llm_config
```
`_guess_num_heads()` mutates `model.config` by reassigning it to `text_config` / `llm_config`. This has side effects for the rest of calibration (e.g., later code uses `model.config.hidden_size`, `use_cache`, and config updates/saving) and can break models whose wrapper config contains fields not present on the nested config. Use a local variable (e.g., `cfg = model.config` and unwrap `cfg`), and leave `model.config` unchanged.
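A minimal sketch of the suggested fix, using a `SimpleNamespace` stand-in for the wrapper config (`unwrap_text_config` is a hypothetical helper, not lmdeploy code):

```python
from types import SimpleNamespace

def unwrap_text_config(config):
    """Return the nested LLM config if present, leaving the input untouched."""
    cfg = config  # work on a local variable instead of reassigning model.config
    if hasattr(cfg, 'text_config'):
        cfg = cfg.text_config
    elif hasattr(cfg, 'llm_config'):
        cfg = cfg.llm_config
    return cfg

# Wrapper config shaped like those produced by multimodal models on transformers>=5.0
wrapper = SimpleNamespace(text_config=SimpleNamespace(num_attention_heads=32))
cfg = unwrap_text_config(wrapper)
print(cfg.num_attention_heads)          # → 32
print(hasattr(wrapper, 'text_config'))  # → True (wrapper config is unchanged)
```

The head-count fields can then be read from `cfg` while every later use of `model.config` still sees the original wrapper object.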
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
```diff
 """
 hidden_size = mod.weight.shape[0]
-eps = mod.variance_epsilon
+eps = getattr(mod, 'variance_epsilon', None) or getattr(mod, 'eps', 1e-6)
```
`or` treats `0`/`0.0` as falsy, so an explicit `variance_epsilon=0` would be replaced by `mod.eps` or the default. Prefer an explicit `is None` check (or a nested `getattr`) so that only missing attributes fall back, not valid zero values.
Suggested change:

```diff
-eps = getattr(mod, 'variance_epsilon', None) or getattr(mod, 'eps', 1e-6)
+eps = getattr(mod, 'variance_epsilon', None)
+if eps is None:
+    eps = getattr(mod, 'eps', 1e-6)
```
```python
print('Warning: we cast model to float16 to prevent OOM. '
      'You may enforce it bfloat16 by `--dtype bfloat16`')
torch_dtype = torch.float16
```
With `dtype='auto'`, `torch_dtype` can still resolve to `torch.bfloat16` (from config) even when `torch.cuda.is_bf16_supported()` is false; the current guard only checks `dtype == 'bfloat16'`. This can lead to failures later when moving the model to CUDA. Consider adding an `auto` branch to fall back to float16 (or raise) when the resolved dtype is bf16 but the device can't run bf16.
Suggested change:

```python
if torch_dtype == torch.bfloat16 and not torch.cuda.is_bf16_supported():
    if dtype == 'auto':
        torch_dtype = torch.float16
        if hasattr(hf_config, 'bf16'):
            hf_config.bf16 = False
        if hasattr(hf_config, 'fp16'):
            hf_config.fp16 = True
    else:
        raise RuntimeError('Your device does not support bf16 (bfloat16), '
                           'please change to fp16 (float16)')
```
`lmdeploy/lite/quantization/awq.py` (Outdated)
```python
denom = denom.to(dtype=dtype)
scales = scales / denom
```
Casting `denom` back to `dtype` can overflow again (e.g., float16) and reintroduce inf/0 scaling, undermining the float32 max/min fix. Keep `denom` in float32 (and optionally compute `scales = scales.float() / denom`), then cast the final `scales` back to the target dtype.
Suggested change:

```diff
-denom = denom.to(dtype=dtype)
-scales = scales / denom
+scales = (scales.float() / denom).to(device=device, dtype=dtype)
```
`lmdeploy/lite/apis/calibrate.py` (Outdated)
```diff
+original_torch_dtype = AutoConfig.from_pretrained(model, trust_remote_code=True).torch_dtype
 vl_model = load_vl_model(model, backend=None, with_llm=True).vl_model
 model = vl_model
 if hasattr(vl_model, 'language_model'):  # deepseek-vl, ...
     model = vl_model.language_model
 if hasattr(vl_model, 'llm'):  # MiniCPMV, ...
     model = vl_model.llm
 model.config.use_cache = False
-if dtype == 'float16':
+if hasattr(model.config, 'text_config'):
+    model.config.text_config.use_cache = False
+elif hasattr(model.config, 'llm_config'):
+    model.config.llm_config.use_cache = False
+if dtype == 'float16' or (dtype == 'auto' and original_torch_dtype == torch.float16):
     model.half()
-elif dtype == 'bfloat16':
+elif dtype == 'bfloat16' or (dtype == 'auto' and original_torch_dtype == torch.bfloat16):
     assert torch.cuda.is_bf16_supported(
```
`original_torch_dtype` from `AutoConfig` is not guaranteed to be a `torch.dtype` (it can be `None` or a string like `'float16'`/`'bfloat16'`), so the equality checks against `torch.float16`/`torch.bfloat16` can silently fail and skip the intended casting. Consider normalizing `original_torch_dtype` (e.g., mapping strings to `torch.dtype`) before these comparisons.
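A hedged sketch of that normalization (the helper name and mapping table are illustrative, not lmdeploy's actual code):

```python
import torch

_STR_TO_DTYPE = {
    'float16': torch.float16,
    'bfloat16': torch.bfloat16,
    'float32': torch.float32,
}

def normalize_torch_dtype(value):
    """Map a config's torch_dtype (torch.dtype, str like 'float16', or None)
    to a torch.dtype, or None if missing/unknown."""
    if value is None or isinstance(value, torch.dtype):
        return value
    return _STR_TO_DTYPE.get(str(value).removeprefix('torch.'))

print(normalize_torch_dtype('float16') is torch.float16)        # → True
print(normalize_torch_dtype(torch.bfloat16) is torch.bfloat16)  # → True
print(normalize_torch_dtype(None) is None)                      # → True
```

With the value normalized once, the later `== torch.float16` / `== torch.bfloat16` comparisons behave the same whether the config stored a string or a real dtype.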
`lmdeploy/lite/apis/calibrate.py` (Outdated)
```python
from transformers import AutoConfig
original_torch_dtype = AutoConfig.from_pretrained(model, trust_remote_code=True).torch_dtype
vl_model = load_vl_model(model, backend=None, with_llm=True).vl_model
```
This adds an extra `AutoConfig.from_pretrained()` call for VLM calibration, but `load_vl_model()` already loads the HF config via `get_model_arch()` (which calls `AutoConfig.from_pretrained`). Consider reusing that existing config (or reading `torch_dtype` from the loaded model/config) to avoid duplicate network/cache IO.
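A small sketch of that reuse, with a `SimpleNamespace` standing in for the object `load_vl_model()` returns (assumption: the loaded model's config carries `torch_dtype`, as HF models normally do):

```python
from types import SimpleNamespace

# Stand-in for load_vl_model(...).vl_model; real models attach their HF config here.
vl_model = SimpleNamespace(config=SimpleNamespace(torch_dtype='bfloat16'))

# Read torch_dtype from the already-loaded config instead of calling
# AutoConfig.from_pretrained() a second time.
original_torch_dtype = getattr(vl_model.config, 'torch_dtype', None)
print(original_torch_dtype)  # → bfloat16
```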
Motivation
`lmdeploy.lite` fails to quantize/calibrate some models when running with `transformers >= 5.0`.
Modification
- `lmdeploy/lite/quantization/calibration.py`: Added fallback logic in `_guess_num_heads()` to unwrap nested config objects by checking for `text_config` and `llm_config` attributes before accessing head-count parameters.
- `lmdeploy/lite/quantization/awq.py`: Cast `scales.max()` and `scales.min()` to float32 before multiplication to prevent float16/bfloat16 overflow that produces inf.
- `lmdeploy/lite/apis/auto_awq.py`: Changed the import of `LAYER_TYPE_MAP` and `calibrate` from a relative import to an absolute import to avoid potential circular import issues.
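The overflow fix can be illustrated with a small sketch (this mirrors the idea, not the exact `smooth_fc_fcs` code; the `1e-4` clamp is an assumption):

```python
import torch

def safe_scale_normalize(scales: torch.Tensor) -> torch.Tensor:
    """Compute sqrt(max * -min) in float32 so the product cannot overflow
    float16/bfloat16, then cast the result back to the original dtype."""
    dtype = scales.dtype
    s_max = scales.max().float()
    s_min = scales.min().float()
    denom = (s_max * -s_min).sqrt().clamp(min=1e-4)
    return (scales.float() / denom).to(dtype)

# 300 * 300 = 9e4 overflows float16 (max ~65504) if the product is taken in
# half precision; computed in float32 it is exact and the result stays finite.
x = torch.tensor([300.0, 1.0, -300.0], dtype=torch.float16)
y = safe_scale_normalize(x)
print(torch.isfinite(y).all().item())  # → True
```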