[Frontend] Preserve structured output special tokens in offline LLM.chat#39352
lucianommartins wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request adds a mechanism to automatically disable skip_special_tokens when chat features like thinking or tool calling are used with models that represent these features via special tokens (e.g., Gemma 4). This ensures the output text contains the necessary delimiters for parsing. The review feedback recommends refactoring the hardcoded list of structured tokens into a module-level constant or a dynamic configuration to enhance maintainability.
vllm/entrypoints/llm.py
Outdated
```python
structured_tokens = (
    "<|channel>",
    "<channel|>",  # thinking delimiters
    "<|tool_call>",
    "<tool_call|>",  # tool call delimiters
    '<|"|>',  # string quoting in tool args
)
```
The list of `structured_tokens` is hardcoded within the `_adjust_params_for_parsing` method. This makes it difficult to maintain and extend support for new models that use different special tokens for structured output.
To improve maintainability, consider moving this tuple to a module-level constant. An even better long-term solution would be to derive this list from the model's configuration if possible.
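The suggested refactor might look roughly like the following sketch. The constant name `_STRUCTURED_OUTPUT_TOKENS` and the helper function are hypothetical illustrations, not the PR's actual identifiers:

```python
# Hypothetical module-level constant; the name and helper below are
# assumptions for illustration, not vLLM's actual code.
_STRUCTURED_OUTPUT_TOKENS: tuple[str, ...] = (
    "<|channel>",
    "<channel|>",  # thinking delimiters
    "<|tool_call>",
    "<tool_call|>",  # tool call delimiters
    '<|"|>',  # string quoting in tool args
)


def active_structured_tokens(vocab: dict[str, int]) -> list[str]:
    """Return only the structured tokens the tokenizer actually registers."""
    return [tok for tok in _STRUCTURED_OUTPUT_TOKENS if tok in vocab]
```

A per-model list derived from the tokenizer config could later replace the constant without changing call sites.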
chaunceyjiang
left a comment
I'm not entirely sure whether this is necessary.
Users should be able to set `skip_special_tokens` themselves in `SamplingParams`.
It is not necessary, but it is a relevant UX improvement, @chaunceyjiang. Many users are running into this. It was already fixed for the streaming API via #39027; I'm just extending it to cover the gap in the offline path. fyi @bbrowning @sfeng33, who were involved in #39027 too.
I think this is reasonable, because it's indeed not appropriate to require users to set `skip_special_tokens` themselves. However, in offline usage, … Would love to hear other opinions as well. /cc @bbrowning @sfeng33 @DarkLight1337
I am fine with this as long as it's consistent with the online API. At least personally, I would expect the online and offline APIs to work in the same way. Most users probably don't even know that they should override `skip_special_tokens`.
sfeng33
left a comment
I think this change might be unnecessary; sorry if my earlier comment caused confusion.
Generally, tool parsing in offline usage is not supported in vLLM. I think perhaps we can drop the Gemma 4 related code in the tool parser util and recipe to make this clearer.
We are actually planning to enable this after the Renderer refactor is complete, which unifies the code paths between the offline and online APIs.
I see, thanks, this is good to know.
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference
Currently, there are also some models that require `skip_special_tokens=False`.
- addresses data loss in the offline path where the default 'skip_special_tokens=True' strips reasoning and tool-calling delimiters
- implements '_adjust_params_for_parsing' to inspect the tokenizer vocabulary and detect active special tokens (e.g., <|channel|>, <|tool_call|>, <|"|>)
- dynamically enforces 'skip_special_tokens=False' in SamplingParams when 'enable_thinking=True' or 'tools' are present
- restricts the override to tokenizers that actually register these strings as special tokens, maintaining no-op transparency for other models

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
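The logic described in the commit message can be sketched as a standalone simplification. This is an illustration of the override rule, not the PR's actual code; the function signature and parameter names are assumptions:

```python
# Simplified sketch of the skip_special_tokens override described above.
# All names here are illustrative assumptions, not vLLM's actual API.

STRUCTURED_TOKENS = ("<|channel|>", "<|tool_call|>", '<|"|>')


def adjust_params_for_parsing(
    special_vocab: set[str],    # tokenizer's registered special tokens
    skip_special_tokens: bool,  # current SamplingParams setting
    enable_thinking: bool = False,
    has_tools: bool = False,
) -> bool:
    """Return the skip_special_tokens value to actually use.

    Forces False only when (a) thinking or tool calling is requested and
    (b) the tokenizer really registers the structured-output delimiters,
    so all other models see a no-op.
    """
    needs_raw_output = enable_thinking or has_tools
    model_uses_tokens = any(t in special_vocab for t in STRUCTURED_TOKENS)
    if needs_raw_output and model_uses_tokens:
        return False
    return skip_special_tokens
```

The two-condition gate is what keeps the change transparent for models that do not register these strings as special tokens.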
Force-pushed from 6fc94e2 to 2ca7af8.
I added one Gemma 4 guardrail as you suggested, @sfeng33.
I agree with this. But that doesn't remove the need for the online and offline APIs to be consistent with each other. Since we only have a very minimal online API example, I suppose users would usually look at the offline example to determine the request parameters.
Right, if we add the offline example to the vLLM recipe, it helps the users who come from there. But to my knowledge, some users are following recipes from product stacks, llm-d, etc., and that's kind of out of our control. If we merge this PR, it can resolve the usage confusion, but the tradeoff is a bit of tech debt that we are swallowing.
Purpose
When using the offline API (`LLM.chat`) with models that encode structured output syntax as special tokens (such as Gemma 4's thinking delimiters `<|channel>` and tool call tokens `<|tool_call>`), the current default `skip_special_tokens=True` in `SamplingParams` strips these tokens from `output.text`. This breaks downstream parsing of both reasoning blocks and tool calls in offline scenarios.

This PR introduces `_adjust_params_for_parsing` in the `LLM` class to set `skip_special_tokens=False` only when `enable_thinking=True` or `tools` are provided, and the model actually needs them.

Test Plan
To verify this change, run offline inference using `LLM.chat` with a model that uses special tokens for reasoning/tools (like Gemma 4): call `llm.chat` with `chat_template_kwargs={"enable_thinking": True}` and confirm that `output.text` contains the raw special tokens (e.g., `<|channel|>`) instead of having them stripped.

Test Result
Verified that the reasoning parser could correctly extract thoughts and tool calls from the raw output text, confirming that the delimiters were preserved and processed as intended.