
[Frontend] Preserve structured output special tokens in offline LLM.chat#39352

Open
lucianommartins wants to merge 1 commit into vllm-project:main from lucianommartins:lucianommartins/gemma4

Conversation

@lucianommartins
Contributor

Purpose

When using the offline API (LLM.chat) with models that encode structured output syntax as special tokens (such as Gemma 4's thinking delimiters <|channel> and tool call tokens <|tool_call>), the current default skip_special_tokens=True in SamplingParams strips these tokens from output.text. This breaks downstream parsing of both reasoning blocks and tool calls in offline scenarios.
This PR introduces _adjust_params_for_parsing in the LLM class to:

  1. Scan the tokenizer vocabulary for instances of these specific structured tokens.
  2. Automatically enforce skip_special_tokens=False only when enable_thinking=True or tools are provided, and the model actually needs them.

This brings functional parity to the offline API for models requiring raw token preservation, similar to what was merged for the OpenAI server rendering path.
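The override described above can be sketched as a small standalone function. This is an illustrative stand-in, not vLLM's actual implementation: the real `_adjust_params_for_parsing` is a method on the `LLM` class and operates on vLLM's `SamplingParams`, which the minimal `Params` dataclass here only approximates.

```python
from dataclasses import dataclass

# Token strings taken from the PR diff; the Params class is a hypothetical
# stand-in for vLLM's SamplingParams.
STRUCTURED_TOKENS = (
    "<|channel>", "<channel|>",      # thinking delimiters
    "<|tool_call>", "<tool_call|>",  # tool call delimiters
    '<|"|>',                         # string quoting in tool args
)

@dataclass
class Params:
    skip_special_tokens: bool = True

def adjust_params_for_parsing(params, vocab, *, enable_thinking=False, tools=None):
    """Force skip_special_tokens=False only when a structured-output feature
    is requested AND the tokenizer actually registers the structured tokens."""
    if not (enable_thinking or tools):
        return params  # no-op for plain generation
    if any(tok in vocab for tok in STRUCTURED_TOKENS):
        params.skip_special_tokens = False
    return params
```

For models whose vocabulary lacks these tokens, the function leaves the params untouched, which matches the PR's stated "no-op transparency" goal.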

Test Plan

To verify this change, run offline inference using LLM.chat with a model that uses special tokens for reasoning/tools (like Gemma 4):

  1. Call llm.chat with chat_template_kwargs={"enable_thinking": True}.
  2. Verify that the returned output.text contains the raw special tokens (e.g., <|channel|>) instead of having them stripped.
  3. Verify that parsing of thinking blocks and tool calls works correctly on the resulting text.
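Step 3 above can be spot-checked with a small extraction helper. This sketch assumes a `<|channel>`...`<channel|>` pair wraps a thinking block; the actual delimiter semantics are defined by the model's chat template and vLLM's reasoning parser, not by this regex.

```python
import re

# Hypothetical delimiter convention inferred from the tokens in this PR:
# an opening "<|channel>" and a closing "<channel|>" around the thinking text.
THINKING_RE = re.compile(r"<\|channel>(.*?)<channel\|>", re.DOTALL)

def extract_thinking(text: str) -> list[str]:
    """Return the raw thinking blocks found in a model's output text."""
    return THINKING_RE.findall(text)
```

If `skip_special_tokens=True` had stripped the delimiters, this extraction would return nothing, which is exactly the failure mode the PR fixes.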

Test Result

Verified that the reasoning parser could correctly extract thoughts and tool calls from the raw output text, confirming that the delimiters were preserved and processed as intended.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@lucianommartins
Contributor Author

It is a complement to #39081 and #39027 - adding the same seamless enable_thinking=True behavior to llm.chat() interactions too.

@sfeng33 @bbrowning

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a mechanism to automatically disable skip_special_tokens when chat features like thinking or tool calling are used with models that represent these features via special tokens (e.g., Gemma 4). This ensures the output text contains the necessary delimiters for parsing. The review feedback recommends refactoring the hardcoded list of structured tokens into a module-level constant or a dynamic configuration to enhance maintainability.

Comment on lines +1758 to +1764
structured_tokens = (
    "<|channel>",
    "<channel|>",  # thinking delimiters
    "<|tool_call>",
    "<tool_call|>",  # tool call delimiters
    '<|"|>',  # string quoting in tool args
)
Contributor


Severity: high

The list of structured_tokens is hardcoded within the _adjust_params_for_parsing method. This makes it difficult to maintain and extend support for new models that use different special tokens for structured output.

To improve maintainability, consider moving this tuple to a module-level constant. An even better long-term solution would be to derive this list from the model's configuration if possible.
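The suggested refactor could look like the following sketch: hoist the tuple to a module-level constant and filter it against the tokenizer's registered special tokens at call time. The names here are illustrative, not vLLM API.

```python
# Module-level constant, as the review suggests; the token strings come
# from the PR diff, the function name is hypothetical.
GEMMA4_STRUCTURED_TOKENS: tuple[str, ...] = (
    "<|channel>", "<channel|>",
    "<|tool_call>", "<tool_call|>",
    '<|"|>',
)

def active_structured_tokens(all_special_tokens):
    """Return only the structured tokens this tokenizer actually defines."""
    special = set(all_special_tokens)
    return tuple(t for t in GEMMA4_STRUCTURED_TOKENS if t in special)
```

A later step could replace the constant entirely by deriving the list from the model's configuration, as the review proposes for the long term.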

Collaborator

@chaunceyjiang chaunceyjiang left a comment


I’m not entirely sure whether this is necessary.
Users should be able to set skip_special_tokens themselves in SamplingParams.

@lucianommartins
Contributor Author

It is not necessary, but it is a relevant UX improvement, @chaunceyjiang.

Many users run into trouble with enable_thinking=True because they don't realize they also need to change the sampling params.

It was already fixed for the streaming API via #39027; I'm just closing the same gap for llm.chat() with the same kind of approach (i.e., if enable_thinking=True, skip_special_tokens is automatically set to False without any user intervention).

fyi @bbrowning @sfeng33 who were involved in #39027 too.

@chaunceyjiang
Collaborator

It was already fixed for the streaming api via #39027,

I think this is reasonable, because it’s indeed not appropriate to require users to set skip_special_tokens in the HTTP request.

However, in offline usage, SamplingParams is used directly, so it’s quite natural for users to set skip_special_tokens there.

Would love to hear other opinions as well. /cc @bbrowning @sfeng33 @DarkLight1337

@DarkLight1337
Member

DarkLight1337 commented Apr 9, 2026

I am fine with this as long as it's consistent with online API. At least personally, I would expect online and offline APIs to work in the same way. Most users probably don't even know that they should override skip_special_tokens for certain models.

Contributor

@sfeng33 sfeng33 left a comment


I think this change might be unnecessary; sorry if my comment on the previous PR caused confusion.

Generally, tool parsing in offline usage is not supported in vLLM. I think perhaps we can drop the Gemma 4-related code in the tool parser util and recipe to make this clearer.

@DarkLight1337
Member

DarkLight1337 commented Apr 9, 2026

Generally in vllm doing tool parsing in offline usage is not supported, I think perhaps we can drop the Gemma 4 related code in the tool parser util and recipe to make this more clear.

We are actually planning to enable this after the Renderer refactor is complete, which unifies the code paths between offline and online APIs.

@sfeng33
Contributor

sfeng33 commented Apr 9, 2026

We are actually planning to enable this after Renderer refactor is complete which unifies the code paths between offline and online APIs.

I see, thanks, this is good to know.
Then would it be better to gate the code with a Gemma 4 model check, so that when doing the refactoring it's clearer that this is model-specific logic?

@chaunceyjiang
Collaborator

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference

Currently, there are also some models that require skip_special_tokens=False for offline inference, so to be honest, I’m more inclined to add a Gemma 4 example in the offline examples.

- addresses data loss in offline path where default 'skip_special_tokens=True' strips reasoning and tool-calling delimiters
- implements '_adjust_params_for_parsing' to inspect tokenizer vocabulary and detect active special tokens (e.g., <|channel|>, <|tool_call|>, <|"|>)
- dynamically enforces 'skip_special_tokens=False' in SamplingParams when 'enable_thinking=True' or 'tools' are present.
- restricts override to tokenizers that actually register these strings as special tokens, maintaining no-op transparency for other models

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
@lucianommartins force-pushed the lucianommartins/gemma4 branch from 6fc94e2 to 2ca7af8 on April 9, 2026 04:14
@lucianommartins
Contributor Author

I added a Gemma 4 guardrail as you suggested, @sfeng33:

hf_config = getattr(self.model_config, "hf_config", None)
architectures = getattr(hf_config, "architectures", [])

if any("Gemma4" in arch for arch in architectures):
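A self-contained version of that guardrail, with stand-in config objects, could look like this. In vLLM the config would come from self.model_config.hf_config; the "Gemma4ForCausalLM" architecture string here is an illustrative assumption.

```python
# Stand-in for a HF config carrying an architectures list; the real object
# is the model's hf_config in vLLM.
class HfConfig:
    architectures = ["Gemma4ForCausalLM"]  # hypothetical architecture name

def is_gemma4(hf_config) -> bool:
    """Return True only if the config lists a Gemma4 architecture.

    Tolerates a missing config or a missing/None architectures attribute.
    """
    architectures = getattr(hf_config, "architectures", None) or []
    return any("Gemma4" in arch for arch in architectures)
```

The getattr fallback keeps the check a safe no-op when no HF config is attached, matching the defensive style of the snippet above.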

@DarkLight1337
Member

Currently, there are also some models that require skip_special_tokens=False for offline inference, so to be honest, I’m more inclined to add a Gemma 4 example in the offline examples.

I agree with this. But that doesn't remove the need for both online and offline APIs to be consistent with each other. Since we only have a very minimal online API example, I suppose that users would usually look at the offline example to determine the request parameters.

@sfeng33
Copy link
Copy Markdown
Contributor

sfeng33 commented Apr 9, 2026

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference

Currently, there are also some models that require skip_special_tokens=False for offline inference, so to be honest, I’m more inclined to add a Gemma 4 example in the offline examples.

Right, if we add the offline example to the vLLM recipes, it helps the users who come from there. But to my knowledge, some users follow recipes from product stacks/llm-d, etc., which is kind of out of our control. If we merge this PR, it can resolve the usage confusion, but the tradeoff is that it's a bit of tech debt we are swallowing.

