[Frontend] Gemma4: preserve structured tokens for reasoning and tool calling#39081
lucianommartins wants to merge 8 commits into `vllm-project:main` from
Conversation
Code Review
This pull request ensures that special tokens used for structured outputs, such as Gemma4's thinking and tool call delimiters, are preserved by automatically setting skip_special_tokens=False. It introduces an adjust_request interface for reasoning parsers and updates the Gemma4 implementation and utility functions to handle these delimiters and truncated outputs. Reviewers suggested caching vocabulary lookups to improve performance and extending request adjustments to include ResponsesRequest for better consistency across API endpoints.
@sfeng33 Gemma 4 needs a way to toggle
sfeng33 left a comment
Thanks for working on this! I know it's still WIP, so just sharing some early thoughts that might be helpful as you iterate.
1. **`adjust_request` for reasoning parser: consider wiring it through the shared render path.**
   Currently the new `reasoning_parser.adjust_request()` is added in `chat_completion/serving.py`, but the Responses API and batch serving paths also instantiate a reasoning parser without calling it. Rather than adding the call to each endpoint individually, it might be worth wiring it into `preprocess_chat` in `vllm/entrypoints/serve/render/serving.py`; that way all API paths get it for free.

2. **Special token leakage with `skip_special_tokens=False`.**
   One thing to be mindful of: setting `skip_special_tokens=False` globally means all special tokens survive detokenization, not just the structural ones needed for parsing (`<|channel>`, `<|tool_call>`, etc.). Gemma4's tokenizer has other special tokens (e.g. `<turn|>`) that could leak into user-visible content. It might be worth post-processing the parsed content to strip unwanted tokens, or taking a more targeted detokenization approach. Just something to keep in mind; curious if you've seen this come up in testing.

3. **Offline `LLM.chat` path.**
   For the offline path changes in `llm.py`: since the offline LLM path doesn't currently have a parser pipeline (reasoning/tool parsing), setting `skip_special_tokens=False` would surface raw special tokens in `output.text` without any parsing. I'd lean toward leaving the offline path out of this PR, as it's not something we currently support for other models either. If we do want to include it, tying the detection to a model name check would be more robust than probing the tokenizer vocab, which could false-positive on other models that happen to register similar token strings.
Happy to discuss any of these — nice work getting the core mechanism in place!
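Points 2 and 3 above can be sketched together in a few lines. This is an illustrative sketch only; the function names are hypothetical and not vLLM's actual API, and the delimiter strings come from the PR description:

```python
# Illustrative sketch of review points 2 and 3; names are hypothetical,
# not vLLM's actual API. Delimiters come from the PR description.
STRUCTURAL_TOKENS = ("<|channel>", "<channel|>", "<|tool_call>", "<tool_call|>")

def vocab_has_structural_tokens(vocab: dict) -> bool:
    """Point 3's vocab probe: any model that registers these exact strings
    matches, which is the false-positive risk the review calls out."""
    return any(tok in vocab for tok in STRUCTURAL_TOKENS)

def strip_nonstructural_tokens(text: str, special_tokens: list[str]) -> str:
    """Point 2's mitigation: keep the delimiters the parsers need, strip
    every other special token that survived skip_special_tokens=False."""
    for tok in special_tokens:
        if tok not in STRUCTURAL_TOKENS:
            text = text.replace(tok, "")
    return text
```

A model-name check, as suggested in point 3, would avoid the probe entirely at the cost of a hardcoded allowlist.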
…e via adjust_request hook

Replace the token-ID fallback approach with the cleaner fix from vllm-project#39081: add an `adjust_request()` hook to `ReasoningParser` that lets `Gemma4ReasoningParser` force `skip_special_tokens=False` before inference. This preserves `<|channel>` and `<channel|>` delimiters in the detokenized text so the existing string-matching parser works for both streaming and non-streaming.

Changes:
- Add `adjust_request()` no-op hook to the `ReasoningParser` base class
- Override it in `Gemma4ReasoningParser` to set `skip_special_tokens=False`
- Call `adjust_request()` in `serving.py` before rendering the request
- Revert the token-ID fallback in `extract_reasoning()` (no longer needed)
- Add a truncated-thinking handler in `gemma4_utils.py`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
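The hook pattern this commit describes might look like the following minimal sketch; the class and request shapes are hypothetical, not vLLM's exact code:

```python
# Minimal sketch of the adjust_request() hook pattern; hypothetical shapes,
# not vLLM's exact classes.
class ReasoningParser:
    def adjust_request(self, request):
        """No-op by default; model-specific parsers may mutate the request."""
        return request

class Gemma4ReasoningParser(ReasoningParser):
    def adjust_request(self, request):
        # Keep <|channel>/<|tool_call> delimiters in the detokenized text
        # so the string-matching parsers can find them.
        request.skip_special_tokens = False
        return request
```

The serving layer then calls `parser.adjust_request(request)` once before rendering, so the override runs for any model whose parser needs it.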
Add `--chat-template-kwargs enable_thinking=false` to suppress `<|channel>` thought tokens leaking into content.

Tracked by: vllm-project/vllm#38855, vllm-project/vllm#39081, aaif-goose/goose#6192

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
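The same workaround can be applied per request from an OpenAI-compatible client: vLLM forwards a `chat_template_kwargs` object from the request body to the chat template. A sketch of the request payload, where the model id is a placeholder and `chat_template_kwargs` support on the deployment is an assumption:

```python
# Hypothetical request body for an OpenAI-compatible vLLM server; the model
# id is a placeholder, and chat_template_kwargs forwarding is assumed to be
# available on the deployment.
payload = {
    "model": "google/gemma-4",  # placeholder model id
    "messages": [{"role": "user", "content": "Hello"}],
    # Mirrors the CLI flag above: suppress <|channel> thought tokens.
    "chat_template_kwargs": {"enable_thinking": False},
}
```

This keeps the server default untouched while disabling thinking for individual requests.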
This pull request has merge conflicts that must be resolved before it can be merged.
…Chat API and offline paths

- Automatic `skip_special_tokens` enforcement in the offline path: added `_adjust_params_for_parsing` in `LLM` to scan the vocabulary for registered special tokens used in structured syntax (`<|channel>`, `<|tool_call>`, etc.) and force `skip_special_tokens=False` in `LLM.chat()` when reasoning or tools are active.
- Reasoning parser request adjustment hook: added `adjust_request` to the base `ReasoningParser` class and hooked it into `OpenAIServingChat` to allow specialized parsers to mutate `ChatCompletionRequest` parameters before execution.
- Gemma4 reasoning parser implementation: overrode `adjust_request` in `Gemma4ReasoningParser` to force `skip_special_tokens=False`, ensuring delimiters survive detokenization for proper non-streaming parsing.
- Robustness in streaming reasoning extraction: updated documentation to clarify that while `skip_special_tokens` is now forced to `False` (making `<|channel>` visible), the instance-state approach in `extract_reasoning_streaming` is retained to prevent pre-reasoning content interference.
- Truncated thinking fallback: updated `parse_thinking_output` in `gemma4_utils.py` to gracefully handle incomplete thinking segments when the model hits `max_tokens` before emitting the closing `<channel|>` delimiter.

Important:
- Offline generate path limitation: offline/batch requests using `LLM.generate()` still require manually setting `skip_special_tokens=False` in `SamplingParams` if using `enable_thinking=True`.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…ng sessions

* Constrained `Gemma4ReasoningParser.adjust_request` to only set `skip_special_tokens=False` when `enable_thinking` is explicitly true in `chat_template_kwargs`.
* Prevented the previous unconditional override from bypassing the tool-specific guards defined in `Gemma4ToolParser`.
* Ensures special tokens are only preserved when reasoning is active or tools are engaged, preventing token leakage in standard chat turns.
* Confirmed correctness by running `test_gemma4_reasoning_parser.py`.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
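The constrained override this commit describes could be sketched as follows; the class and request shapes are hypothetical mirrors of the commit's description, not vLLM's exact code:

```python
# Sketch of the constrained adjust_request override; hypothetical shapes
# mirroring the commit description, not vLLM's exact code.
class Gemma4ReasoningParser:
    def adjust_request(self, request):
        kwargs = getattr(request, "chat_template_kwargs", None) or {}
        # Only preserve special tokens when thinking is explicitly enabled
        # or tools are engaged, so plain chat turns stay leak-free.
        if kwargs.get("enable_thinking") is True or getattr(request, "tools", None):
            request.skip_special_tokens = False
        return request
```

This is what closes the leakage gap from the earlier unconditional override: a standard chat turn with no `enable_thinking` and no tools keeps the default detokenization.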
…scope

* Reverted the addition of `_adjust_params_for_parsing` and its usage in `vllm/entrypoints/llm.py` to restrict PR focus to the Chat API path.
* Confirmed zero regression in online reasoning parser logic by executing `test_gemma4_reasoning_parser.py`.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…ared render path

* Moved `reasoning_parser.adjust_request` from `OpenAIServingChat` to `OpenAIServingRender.preprocess_chat` to ensure all chat-based endpoints benefit from it.
* Updated `OpenAIServingRender.__init__` to resolve the `reasoning_parser` class.
* Updated `api_server.py` to pass `reasoning_parser` from CLI args to `OpenAIServingRender`.
* Confirmed all 17 tests in `test_gemma4_reasoning_parser.py` pass.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…t_request

* Updated `adjust_request` to handle `ResponsesRequest` for consistency across OpenAI-compatible endpoints, as suggested by the reviewer bot.
* Confirmed all 17 tests in `test_gemma4_reasoning_parser.py` pass.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…er path

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
thanks @bbrowning @sfeng33 @chaunceyjiang for the feedback; everything has been addressed. @sfeng33 - I'm creating a new PR right away to make this fix for the offline chat API separately, as you suggested.
Thank you for looking into this. If I'm not mistaken, the changes in this PR are covered by #39027 as well.
thanks @sfeng33 - I hadn't followed @bbrowning's changes, and it looks like they already address everything here. I will drop this PR.
Motivation and Context
Gemma4 utilizes special tokens as structural delimiters in thinking mode (`<|channel>`, `<channel|>`) and tool calling (`<|tool_call>`, `<tool_call|>`). By default, vLLM detokenization sets `skip_special_tokens=True`, which strips these tokens from the output text. This breaks both streaming and non-streaming parsers that rely on these delimiters to extract reasoning content and tool calls. This PR ensures these tokens are preserved when needed in both the OpenAI Chat API and the offline `LLM.chat` paths.

Proposed Changes
- Add an `adjust_request` hook to `ReasoningParser` to allow model-specific parsers to mutate `ChatCompletionRequest` before inference.
- Override `adjust_request` in `Gemma4ReasoningParser` to force `skip_special_tokens=False`.
- Update `parse_thinking_output` to handle truncated thinking blocks when the model hits `max_tokens` before emitting the closer.

Testing
Limitations

- Offline/batch requests using `LLM.generate()` still require manually setting `skip_special_tokens=False` in `SamplingParams` if `enable_thinking=True` is in use.