
[Frontend] Preserve structured output special tokens in offline LLM.chat#39352

Open
lucianommartins wants to merge 1 commit into vllm-project:main from lucianommartins:lucianommartins/gemma4

Conversation

@lucianommartins
Contributor

Purpose

When using the offline API (LLM.chat) with models that encode structured output syntax as special tokens (such as Gemma 4's thinking delimiters <|channel> and tool call tokens <|tool_call>), the current default skip_special_tokens=True in SamplingParams strips these tokens from output.text. This breaks downstream parsing of both reasoning blocks and tool calls in offline scenarios.
This PR introduces _adjust_params_for_parsing in the LLM class to:

  1. Scan the tokenizer vocabulary for instances of these specific structured tokens.
  2. Automatically enforce skip_special_tokens=False only when enable_thinking=True or tools are provided, and the model actually needs them.

This brings functional parity to the offline API for models requiring raw token preservation, similar to what was merged for the OpenAI server rendering path.
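The override described above can be sketched as a small standalone function. This is an illustrative stand-in, not vLLM's actual implementation: the real `_adjust_params_for_parsing` is a method on the `LLM` class and operates on vLLM's `SamplingParams`, which the minimal `Params` dataclass here only approximates.

```python
from dataclasses import dataclass

# Token strings taken from the PR diff; the Params class is a hypothetical
# stand-in for vLLM's SamplingParams.
STRUCTURED_TOKENS = (
    "<|channel>", "<channel|>",      # thinking delimiters
    "<|tool_call>", "<tool_call|>",  # tool call delimiters
    '<|"|>',                         # string quoting in tool args
)

@dataclass
class Params:
    skip_special_tokens: bool = True

def adjust_params_for_parsing(params, vocab, *, enable_thinking=False, tools=None):
    """Force skip_special_tokens=False only when a structured-output feature
    is requested AND the tokenizer actually registers the structured tokens."""
    if not (enable_thinking or tools):
        return params  # no-op for plain generation
    if any(tok in vocab for tok in STRUCTURED_TOKENS):
        params.skip_special_tokens = False
    return params
```

For models whose vocabulary lacks these tokens, the function leaves the params untouched, which matches the PR's stated "no-op transparency" goal.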

Test Plan

To verify this change, run offline inference using LLM.chat with a model that uses special tokens for reasoning/tools (like Gemma 4):

  1. Call llm.chat with chat_template_kwargs={"enable_thinking": True}.
  2. Verify that the returned output.text contains the raw special tokens (e.g., <|channel|>) instead of having them stripped.
  3. Verify that parsing of thinking blocks and tool calls works correctly on the resulting text.
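Step 3 above can be spot-checked with a small extraction helper. This sketch assumes a `<|channel>`...`<channel|>` pair wraps a thinking block; the actual delimiter semantics are defined by the model's chat template and vLLM's reasoning parser, not by this regex.

```python
import re

# Hypothetical delimiter convention inferred from the tokens in this PR:
# an opening "<|channel>" and a closing "<channel|>" around the thinking text.
THINKING_RE = re.compile(r"<\|channel>(.*?)<channel\|>", re.DOTALL)

def extract_thinking(text: str) -> list[str]:
    """Return the raw thinking blocks found in a model's output text."""
    return THINKING_RE.findall(text)
```

If `skip_special_tokens=True` had stripped the delimiters, this extraction would return nothing, which is exactly the failure mode the PR fixes.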

Test Result

Verified that the reasoning parser could correctly extract thoughts and tool calls from the raw output text, confirming that the delimiters were preserved and processed as intended.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@lucianommartins
Contributor Author

It is a complement to #39081 and #39027 - adding the same seamless enable_thinking=True behavior to llm.chat() interactions too.

@sfeng33 @bbrowning

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a mechanism to automatically disable skip_special_tokens when chat features like thinking or tool calling are used with models that represent these features via special tokens (e.g., Gemma 4). This ensures the output text contains the necessary delimiters for parsing. The review feedback recommends refactoring the hardcoded list of structured tokens into a module-level constant or a dynamic configuration to enhance maintainability.

Comment on lines +1758 to +1764
structured_tokens = (
    "<|channel>",
    "<channel|>",  # thinking delimiters
    "<|tool_call>",
    "<tool_call|>",  # tool call delimiters
    '<|"|>',  # string quoting in tool args
)
Contributor


Severity: high

The list of structured_tokens is hardcoded within the _adjust_params_for_parsing method. This makes it difficult to maintain and extend support for new models that use different special tokens for structured output.

To improve maintainability, consider moving this tuple to a module-level constant. An even better long-term solution would be to derive this list from the model's configuration if possible.
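The suggested refactor could look like the following sketch: hoist the tuple to a module-level constant and filter it against the tokenizer's registered special tokens at call time. The names here are illustrative, not vLLM API.

```python
# Module-level constant, as the review suggests; the token strings come
# from the PR diff, the function name is hypothetical.
GEMMA4_STRUCTURED_TOKENS: tuple[str, ...] = (
    "<|channel>", "<channel|>",
    "<|tool_call>", "<tool_call|>",
    '<|"|>',
)

def active_structured_tokens(all_special_tokens):
    """Return only the structured tokens this tokenizer actually defines."""
    special = set(all_special_tokens)
    return tuple(t for t in GEMMA4_STRUCTURED_TOKENS if t in special)
```

A later step could replace the constant entirely by deriving the list from the model's configuration, as the review proposes for the long term.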

Collaborator

@chaunceyjiang chaunceyjiang left a comment


I’m not entirely sure whether this is necessary.
Users should be able to set skip_special_tokens themselves in SamplingParams.

@lucianommartins
Contributor Author

It is not necessary, but it is a relevant UX improvement, @chaunceyjiang.

Many users run into trouble with enable_thinking=True because they don't realize they also need to change the sampling params.

It was already fixed for the streaming API via #39027; I'm just closing the same gap for llm.chat() with the same kind of approach (i.e., if enable_thinking=True, skip_special_tokens is automatically set to False without any user intervention).

fyi @bbrowning @sfeng33 who were involved in #39027 too.

@chaunceyjiang
Collaborator

It was already fixed for the streaming api via #39027,

I think this is reasonable, because it’s indeed not appropriate to require users to set skip_special_tokens in the HTTP request.

However, in offline usage, SamplingParams is used directly, so it’s quite natural for users to set skip_special_tokens there.

Would love to hear other opinions as well. /cc @bbrowning @sfeng33 @DarkLight1337

@DarkLight1337
Member

DarkLight1337 commented Apr 9, 2026

I am fine with this as long as it's consistent with online API. At least personally, I would expect online and offline APIs to work in the same way. Most users probably don't even know that they should override skip_special_tokens for certain models.

Contributor

@sfeng33 sfeng33 left a comment


I think this change might be unnecessary; sorry if my comment on the previous PR caused confusion.

Generally, tool parsing in offline usage is not supported in vLLM. I think perhaps we can drop the Gemma 4-related code in the tool parser util and recipe to make this clearer.

@DarkLight1337
Member

DarkLight1337 commented Apr 9, 2026

Generally in vllm doing tool parsing in offline usage is not supported, I think perhaps we can drop the Gemma 4 related code in the tool parser util and recipe to make this more clear.

We are actually planning to enable this after the Renderer refactor is complete, which unifies the code paths between offline and online APIs.

@sfeng33
Contributor

sfeng33 commented Apr 9, 2026

We are actually planning to enable this after Renderer refactor is complete which unifies the code paths between offline and online APIs.

I see, thanks, this is good to know.
Then would it be better to gate the code with a Gemma 4 model check, so that when doing the refactoring it's clearer that this is model-specific logic?

@chaunceyjiang
Collaborator

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference

Currently, there are also some models that require skip_special_tokens=False for offline inference, so to be honest, I’m more inclined to add a Gemma 4 example in the offline examples.

- addresses data loss in offline path where default 'skip_special_tokens=True' strips reasoning and tool-calling delimiters
- implements '_adjust_params_for_parsing' to inspect tokenizer vocabulary and detect active special tokens (e.g., <|channel|>, <|tool_call|>, <|"|>)
- dynamically enforces 'skip_special_tokens=False' in SamplingParams when 'enable_thinking=True' or 'tools' are present.
- restricts override to tokenizers that actually register these strings as special tokens, maintaining no-op transparency for other models

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
@lucianommartins force-pushed the lucianommartins/gemma4 branch from 6fc94e2 to 2ca7af8 on April 9, 2026 04:14
@lucianommartins
Contributor Author

I added a Gemma 4 guardrail as you suggested, @sfeng33:

hf_config = getattr(self.model_config, "hf_config", None)
architectures = getattr(hf_config, "architectures", [])

if any("Gemma4" in arch for arch in architectures):
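A self-contained version of that guardrail, with stand-in config objects, could look like this. In vLLM the config would come from self.model_config.hf_config; the "Gemma4ForCausalLM" architecture string here is an illustrative assumption.

```python
# Stand-in for a HF config carrying an architectures list; the real object
# is the model's hf_config in vLLM.
class HfConfig:
    architectures = ["Gemma4ForCausalLM"]  # hypothetical architecture name

def is_gemma4(hf_config) -> bool:
    """Return True only if the config lists a Gemma4 architecture.

    Tolerates a missing config or a missing/None architectures attribute.
    """
    architectures = getattr(hf_config, "architectures", None) or []
    return any("Gemma4" in arch for arch in architectures)
```

The getattr fallback keeps the check a safe no-op when no HF config is attached, matching the defensive style of the snippet above.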

@DarkLight1337
Member

Currently, there are also some models that require skip_special_tokens=False for offline inference, so to be honest, I’m more inclined to add a Gemma 4 example in the offline examples.

I agree with this. But that doesn't remove the need for both online and offline APIs to be consistent with each other. Since we only have a very minimal online API example, I suppose that users would usually look at the offline example to determine the request parameters.

@sfeng33
Copy link
Copy Markdown
Contributor

sfeng33 commented Apr 9, 2026

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference

Currently, there are also some models that require skip_special_tokens=False for offline inference, so to be honest, I’m more inclined to add a Gemma 4 example in the offline examples.

Right, if we add the offline example to the vLLM recipes, it helps the users who come from there. But to my knowledge, some users follow recipes from product stacks/llm-d, etc., which is kind of out of our control. If we merge this PR, it can resolve the usage confusion, but the tradeoff is that it's a bit of tech debt we are swallowing.

