make fp8 model quantized by llm-compressor can be inferenced in turbomind #4509
43758726 wants to merge 2 commits into InternLM:main from
Conversation
Pull request overview
This PR extends Turbomind’s deploy/conversion pipeline to support FP8 models produced by llm-compressor under the compressed-tensors quantization config, enabling successful inference in the Turbomind engine.
Changes:
- Add `compressed-tensors` format branching to map `pack-quantized` → AWQ (int4) and `float-quantized` → FP8 pathways.
- Extend input tensor processing policy selection to handle `compressed-tensors` sub-formats.
- Add a new `WeightScale` parameter handler intended for `llm-compressor` FP8 scale tensors.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| lmdeploy/turbomind/deploy/policy.py | Selects different input tensor processing functions depending on compressed-tensors quantized sub-format. |
| lmdeploy/turbomind/deploy/parameter.py | Adds WeightScale parameter export logic and wires it into get_params(). |
| lmdeploy/turbomind/deploy/converter.py | Adds compressed-tensors config handling, validates formats, and maps to existing AWQ/FP8 output config paths. |
```diff
@@ -42,6 +43,8 @@ def get_output_model_registered_name_and_config(model_path: str, model_format: s
         ['hf', 'awq', 'gptq']
         dtype (str): the data type of the model's weights and activations
         group_size (int): the size of group used by awq model
+        quantized_format (str): the quantized format of compressed-tensors model,
+            which can be one of ['pack-quantized', 'float-quantized']
     """
```
get_output_model_registered_name_and_config now requires quantized_format, but the repo’s tests and existing callers invoke it without that argument (e.g. tests/test_lmdeploy/test_turbomind/test_converter.py::test_torch_dtype_fallback). This will raise a TypeError at runtime. Consider making quantized_format optional with a default (e.g. None) and updating the docstring/type hint accordingly, only validating it when model_format == 'compressed-tensors'.
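A minimal sketch of that suggestion, using a hypothetical helper that mirrors only the format-mapping logic (the real function returns a registered model name and config as well):

```python
def resolve_target_format(model_format, quantized_format=None):
    """Hypothetical helper sketching the reviewer's suggestion: make
    `quantized_format` optional (default None) and validate it only when
    `model_format == 'compressed-tensors'`, so existing callers that omit
    the argument keep working."""
    if model_format != 'compressed-tensors':
        # quantized_format is irrelevant for other formats; ignore it.
        return model_format
    mapping = {'pack-quantized': 'awq', 'float-quantized': 'fp8'}
    if quantized_format not in mapping:
        raise ValueError(f'unsupported quantized_format: {quantized_format!r}')
    return mapping[quantized_format]
```

With this shape, a legacy call such as `resolve_target_format('hf')` keeps working unchanged, while `resolve_target_format('compressed-tensors')` fails fast with a clear error instead of a `TypeError`.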
```python
class WeightScale(Parameter):
    KEYS = '.weight_scale', '.weight'

    # TODO: flag any operations crossing the quant blocks as illegal
    def __call__(self, f, g, i):
        f(i, g('weight_scale'), 'scales', to_float, apply_gs=['w1', 'w3', 'w2'])
        f(i, g('weight'), 'weight', identity)
```
WeightScale.take() triggers on any key ending with .weight_scale, which also matches the existing CompressedWeight path. For compressed-tensors pack-quantized models this can cause WeightScale.__call__ to request g('weight') (a .weight tensor) that doesn’t exist, leading to a KeyError during export. Please tighten the selection logic so WeightScale only applies when both .weight_scale and .weight are present (and/or when .weight_packed is absent), or override take() to enforce that invariant.
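One way to enforce that invariant is a predicate over the available tensor keys; the function below is a sketch of the tightened selection logic, not the repo's actual `take()` API:

```python
def weight_scale_applies(keys):
    """Return True only when an unpacked `.weight` tensor and its
    `.weight_scale` are both present and no `.weight_packed` tensor exists,
    so pack-quantized checkpoints fall through to their own handler.
    (Hypothetical predicate; key suffixes are taken from the diff above.)"""
    has_scale = any(k.endswith('.weight_scale') for k in keys)
    has_weight = any(k.endswith('.weight') for k in keys)
    has_packed = any(k.endswith('.weight_packed') for k in keys)
    return has_scale and has_weight and not has_packed
```

Note that `'x.weight_scale'.endswith('.weight')` is `False`, so the plain `.weight` check does not accidentally match the scale tensor itself.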
```python
        if quantized_format == 'pack-quantized':
            return process_compressed_packed_tensor
        elif quantized_format == 'float-quantized':
            return process_fp8
```
get_input_policy() has a model_format == 'compressed-tensors' branch that doesn’t return anything if quantized_format is missing/unknown, so the function returns None and later code will fail when calling the policy. It would be safer to either return a default policy or raise a clear ValueError for unsupported quantized_format values.
Suggested change:
```diff
             return process_fp8
+        raise ValueError(f'Unsupported quantized_format for compressed-tensors: {quantized_format!r}')
```
```diff
-        assert _format == 'pack-quantized', ('compressed-tennsors only supports pack-quantized format, '
-                                             f'but got {_format}')
+        assert _format in ['pack-quantized', 'float-quantized'
+                           ], ('compressed-tennsors only supports pack-quantized/float-quantized format, '
```
Typo in assertion message: 'compressed-tennsors' → 'compressed-tensors'. Keeping error messages correct helps users debug config issues quickly.
Suggested change:
```diff
-                           ], ('compressed-tennsors only supports pack-quantized/float-quantized format, '
+                           ], ('compressed-tensors only supports pack-quantized/float-quantized format, '
```
```diff
-    if model_format in ['awq', 'gptq', 'compressed-tensors']:
-        weight_type = 'int4'
-        dtype = 'float16'  # force float16 for int4 quantized weights
+    if model_format in ['awq', 'gptq']:
+        weight_type = 'int4'
+        dtype = 'float16'  # force float16 for int4 quantized weights
+    elif model_format == 'compressed-tensors':
+        if quantized_format == 'pack-quantized':
+            weight_type = 'int4'
+            model_format = 'awq'
+            dtype = 'float16'  # force float16 for int4 quantized weights
+        elif quantized_format == 'float-quantized':
+            weight_type = 'fp8'
+            model_format = 'fp8'
+            group_size = 128 if group_size == 0 else group_size
```
New logic for compressed-tensors (pack-quantized vs float-quantized) changes both config mapping (compressed-tensors → awq/fp8) and validation of _num_bits/_type, but there are no unit tests covering these new branches. Since tests/test_lmdeploy/test_turbomind/test_converter.py already exercises this module, please add targeted tests that build a minimal quantization_config dict for each format and assert the selected weight_type/model_format and policy behavior.
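Such a test could start from a minimal `quantization_config` per format. The dict shape below follows the compressed-tensors convention (`quant_method`, `format`, `config_groups`) and is an assumption for illustration, not copied from the repo's fixtures:

```python
def make_quant_config(fmt):
    """Build a minimal compressed-tensors quantization_config for tests.
    Hypothetical fixture helper: 'pack-quantized' carries int4 weights,
    'float-quantized' carries fp8 (8-bit float) weights."""
    packed = fmt == 'pack-quantized'
    return {
        'quant_method': 'compressed-tensors',
        'format': fmt,
        'config_groups': {
            'group_0': {
                'weights': {
                    'num_bits': 4 if packed else 8,
                    'type': 'int' if packed else 'float',
                },
            },
        },
    }
```

A parametrized test could then feed `make_quant_config('pack-quantized')` and `make_quant_config('float-quantized')` through the converter and assert the resulting `weight_type`/`model_format` pair for each branch.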
```python
        ['hf', 'awq', 'gptq']
        dtype (str): the data type of the model's weights and activations
        group_size (int): the size of group used by awq model
        quantized_format (str): the quantized format of compressed-tensors model,
            which can be one of ['pack-quantized', 'float-quantized']
```
Docstring for get_output_model_registered_name_and_config() still says model_format is one of ['hf', 'awq', 'gptq'], but the function now handles fp8, mxfp4, and compressed-tensors as well. Please update the docstring to reflect the supported values (and clarify how compressed-tensors maps to awq/fp8).
Suggested change:
```diff
-        ['hf', 'awq', 'gptq']
-        dtype (str): the data type of the model's weights and activations
-        group_size (int): the size of group used by awq model
-        quantized_format (str): the quantized format of compressed-tensors model,
-            which can be one of ['pack-quantized', 'float-quantized']
+        ['hf', 'awq', 'gptq', 'fp8', 'mxfp4', 'compressed-tensors'].
+        For 'compressed-tensors', `quantized_format` determines the target
+        registration path: 'pack-quantized' maps to the AWQ path and
+        'float-quantized' maps to the FP8 path.
+        dtype (str): the data type of the model's weights and activations
+        group_size (int): the size of group used by awq model
+        quantized_format (str): the quantized format of a
+            compressed-tensors model, which can be one of
+            ['pack-quantized', 'float-quantized']
```
```python
        assert (_num_bits == 4 and _type == 'int') or (_num_bits == 8 and _type == 'float'), (
            'pack-quantized requires 4-bit int, '
            f'but got {_num_bits}-bit {_type}. '
            'or float-quantized requires 8-bit float, '
            f'but got {_num_bits}-bit {_type}')
```
"pack-quantized requires int4 or fp8, but got type {_type} and {_num_bits} bits"
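Folded into the assertion above, the suggested wording could look like this sketch (the `_num_bits`/`_type` names come from the diff; the wrapper function is hypothetical):

```python
def check_weight_spec(_num_bits, _type):
    """Validate the (bits, type) pair for compressed-tensors sub-formats
    with one unambiguous message, instead of two contradictory clauses.
    (Sketch of the reviewer's suggested error message.)"""
    assert (_num_bits, _type) in [(4, 'int'), (8, 'float')], (
        'compressed-tensors requires int4 (pack-quantized) or fp8 '
        f'(float-quantized), but got type {_type} and {_num_bits} bits')
```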
Thanks for your contribution, and we appreciate it a lot. The following instructions will help make your pull request healthier and receive feedback more easily. If you do not understand some items, don't worry; just make the pull request and seek help from the maintainers.
Motivation
Enable FP8 models quantized by llm-compressor to be inferenced successfully in the turbomind engine.
Modification
lmdeploy/lmdeploy/turbomind/deploy/converter.py: Add config handling for llm-compressor FP8 models.
lmdeploy/lmdeploy/turbomind/deploy/policy.py: Add a branch for llm-compressor FP8 models in the get_input_policy function.
lmdeploy/lmdeploy/turbomind/deploy/parameter.py: Add a WeightScale class for llm-compressor FP8 models.
Checklist