[Frontend] Gemma4: preserve structured tokens for reasoning and tool calling#39081
lucianommartins wants to merge 8 commits into `vllm-project:main` from
Conversation
Code Review
This pull request ensures that special tokens used for structured outputs, such as Gemma4's thinking and tool call delimiters, are preserved by automatically setting skip_special_tokens=False. It introduces an adjust_request interface for reasoning parsers and updates the Gemma4 implementation and utility functions to handle these delimiters and truncated outputs. Reviewers suggested caching vocabulary lookups to improve performance and extending request adjustments to include ResponsesRequest for better consistency across API endpoints.
@sfeng33 Gemma 4 needs a way to toggle
sfeng33 left a comment
Thanks for working on this! I know it's still WIP, so just sharing some early thoughts that might be helpful as you iterate.
1. **`adjust_request` for reasoning parser: consider wiring it through the shared render path.**
   Currently the new `reasoning_parser.adjust_request()` is added in `chat_completion/serving.py`, but the Responses API and batch serving paths also instantiate a reasoning parser without calling it. Rather than adding the call to each endpoint individually, it might be worth wiring it into `preprocess_chat` in `vllm/entrypoints/serve/render/serving.py`; that way all API paths get it for free.

2. **Special token leakage with `skip_special_tokens=False`.**
   One thing to be mindful of: setting `skip_special_tokens=False` globally means all special tokens survive detokenization, not just the structural ones needed for parsing (`<|channel>`, `<|tool_call>`, etc.). Gemma4's tokenizer has other special tokens (e.g. `<turn|>`) that could leak into user-visible content. It might be worth post-processing the parsed content to strip unwanted tokens, or taking a more targeted detokenization approach. Just something to keep in mind; curious if you've seen this come up in testing.

3. **Offline `LLM.chat` path.**
   For the offline path changes in `llm.py`: since the offline LLM path doesn't currently have a parser pipeline (reasoning/tool parsing), setting `skip_special_tokens=False` would surface raw special tokens in `output.text` without any parsing. I'd lean toward leaving the offline path out of this PR, as it's not something we currently support for other models either. If we do want to include it, tying the detection to a model name check would be more robust than probing the tokenizer vocab, which could false-positive on other models that happen to register similar token strings.
Happy to discuss any of these — nice work getting the core mechanism in place!
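Points 2 and 3 above can be sketched together in a few lines. This is an illustrative sketch only; the function names are hypothetical and not vLLM's actual API, and the delimiter strings come from the PR description:

```python
# Illustrative sketch of review points 2 and 3; names are hypothetical,
# not vLLM's actual API. Delimiters come from the PR description.
STRUCTURAL_TOKENS = ("<|channel>", "<channel|>", "<|tool_call>", "<tool_call|>")

def vocab_has_structural_tokens(vocab: dict) -> bool:
    """Point 3's vocab probe: any model that registers these exact strings
    matches, which is the false-positive risk the review calls out."""
    return any(tok in vocab for tok in STRUCTURAL_TOKENS)

def strip_nonstructural_tokens(text: str, special_tokens: list[str]) -> str:
    """Point 2's mitigation: keep the delimiters the parsers need, strip
    every other special token that survived skip_special_tokens=False."""
    for tok in special_tokens:
        if tok not in STRUCTURAL_TOKENS:
            text = text.replace(tok, "")
    return text
```

A model-name check, as suggested in point 3, would avoid the probe entirely at the cost of a hardcoded allowlist.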
…e via adjust_request hook

Replace the token-ID fallback approach with the cleaner fix from vllm-project#39081: add an `adjust_request()` hook to `ReasoningParser` that lets `Gemma4ReasoningParser` force `skip_special_tokens=False` before inference. This preserves `<|channel>` and `<channel|>` delimiters in the detokenized text so the existing string-matching parser works for both streaming and non-streaming.

Changes:
- Add `adjust_request()` no-op hook to the `ReasoningParser` base class
- Override it in `Gemma4ReasoningParser` to set `skip_special_tokens=False`
- Call `adjust_request()` in `serving.py` before rendering the request
- Revert the token-ID fallback in `extract_reasoning()` (no longer needed)
- Add a truncated-thinking handler in `gemma4_utils.py`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
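The hook pattern this commit describes might look like the following minimal sketch; the class and request shapes are hypothetical, not vLLM's exact code:

```python
# Minimal sketch of the adjust_request() hook pattern; hypothetical shapes,
# not vLLM's exact classes.
class ReasoningParser:
    def adjust_request(self, request):
        """No-op by default; model-specific parsers may mutate the request."""
        return request

class Gemma4ReasoningParser(ReasoningParser):
    def adjust_request(self, request):
        # Keep <|channel>/<|tool_call> delimiters in the detokenized text
        # so the string-matching parsers can find them.
        request.skip_special_tokens = False
        return request
```

The serving layer then calls `parser.adjust_request(request)` once before rendering, so the override runs for any model whose parser needs it.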
Add `--chat-template-kwargs enable_thinking=false` to suppress `<|channel>` thought tokens leaking into content.

Tracked by: vllm-project/vllm#38855, vllm-project/vllm#39081, aaif-goose/goose#6192

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
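The same workaround can be applied per request from an OpenAI-compatible client: vLLM forwards a `chat_template_kwargs` object from the request body to the chat template. A sketch of the request payload, where the model id is a placeholder and `chat_template_kwargs` support on the deployment is an assumption:

```python
# Hypothetical request body for an OpenAI-compatible vLLM server; the model
# id is a placeholder, and chat_template_kwargs forwarding is assumed to be
# available on the deployment.
payload = {
    "model": "google/gemma-4",  # placeholder model id
    "messages": [{"role": "user", "content": "Hello"}],
    # Mirrors the CLI flag above: suppress <|channel> thought tokens.
    "chat_template_kwargs": {"enable_thinking": False},
}
```

This keeps the server default untouched while disabling thinking for individual requests.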
This pull request has merge conflicts that must be resolved before it can be merged.
…Chat API and offline paths

- Automatic `skip_special_tokens` enforcement in the offline path: added `_adjust_params_for_parsing` in `LLM` to scan the vocabulary for registered special tokens used in structured syntax (`<|channel>`, `<|tool_call>`, etc.) and force `skip_special_tokens=False` in `LLM.chat()` when reasoning or tools are active.
- Reasoning parser request adjustment hook: added `adjust_request` to the base `ReasoningParser` class and hooked it into `OpenAIServingChat` to allow specialized parsers to mutate `ChatCompletionRequest` parameters before execution.
- Gemma4 reasoning parser implementation: overrode `adjust_request` in `Gemma4ReasoningParser` to force `skip_special_tokens=False`, ensuring delimiters survive detokenization for proper non-streaming parsing.
- Robustness in streaming reasoning extraction: updated documentation to clarify that while `skip_special_tokens` is now forced to `False` (making `<|channel>` visible), the instance-state approach in `extract_reasoning_streaming` is retained to prevent pre-reasoning content interference.
- Truncated thinking fallback: updated `parse_thinking_output` in `gemma4_utils.py` to gracefully handle incomplete thinking segments when the model hits `max_tokens` before emitting the closing `<channel|>` delimiter.

Important:
- Offline generate path limitation: offline/batch requests using `LLM.generate()` still require manually setting `skip_special_tokens=False` in `SamplingParams` if using `enable_thinking=True`.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…ng sessions

* Constrained `Gemma4ReasoningParser.adjust_request` to only set `skip_special_tokens=False` when `enable_thinking` is explicitly true in `chat_template_kwargs`.
* Prevented the previous unconditional override from bypassing the tool-specific guards defined in `Gemma4ToolParser`.
* Ensures special tokens are only preserved when reasoning is active or tools are engaged, preventing token leakage in standard chat turns.
* Confirmed correctness by running `test_gemma4_reasoning_parser.py`.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
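The constrained override this commit describes could be sketched as follows; the class and request shapes are hypothetical mirrors of the commit's description, not vLLM's exact code:

```python
# Sketch of the constrained adjust_request override; hypothetical shapes
# mirroring the commit description, not vLLM's exact code.
class Gemma4ReasoningParser:
    def adjust_request(self, request):
        kwargs = getattr(request, "chat_template_kwargs", None) or {}
        # Only preserve special tokens when thinking is explicitly enabled
        # or tools are engaged, so plain chat turns stay leak-free.
        if kwargs.get("enable_thinking") is True or getattr(request, "tools", None):
            request.skip_special_tokens = False
        return request
```

This is what closes the leakage gap from the earlier unconditional override: a standard chat turn with no `enable_thinking` and no tools keeps the default detokenization.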
…scope

* Reverted the addition of `_adjust_params_for_parsing` and its usage in `vllm/entrypoints/llm.py` to restrict PR focus to the Chat API path.
* Confirmed zero regression in online reasoning parser logic by executing `test_gemma4_reasoning_parser.py`.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…ared render path

* Moved `reasoning_parser.adjust_request` from `OpenAIServingChat` to `OpenAIServingRender.preprocess_chat` to ensure all chat-based endpoints benefit from it.
* Updated `OpenAIServingRender.__init__` to resolve the `reasoning_parser` class.
* Updated `api_server.py` to pass `reasoning_parser` from CLI args to `OpenAIServingRender`.
* Confirmed all 17 tests in `test_gemma4_reasoning_parser.py` pass.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…t_request

* Updated `adjust_request` to handle `ResponsesRequest` for consistency across OpenAI-compatible endpoints, as suggested by the reviewer bot.
* Confirmed all 17 tests in `test_gemma4_reasoning_parser.py` pass.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…er path

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
thanks @bbrowning @sfeng33 @chaunceyjiang for the feedback; everything has been addressed. @sfeng33 - I'm creating a new PR right away to make this fix for the offline chat API separately, as you suggested.
Thank you for looking into this. If I'm not mistaken, the changes in this PR are covered by #39027 as well.
thanks @sfeng33 - I hadn't followed @bbrowning's changes, and it looks like they already address everything here. I will drop this PR.
Motivation and Context
Gemma4 utilizes special tokens as structural delimiters in thinking mode (`<|channel>`, `<channel|>`) and tool calling (`<|tool_call>`, `<tool_call|>`). By default, vLLM detokenization sets `skip_special_tokens=True`, which strips these tokens from the output text. This breaks both streaming and non-streaming parsers that rely on these delimiters to extract reasoning content and tool calls. This PR ensures these tokens are preserved when needed in both the OpenAI Chat API and the offline `LLM.chat` paths.

Proposed Changes
- Add an `adjust_request` hook to `ReasoningParser` to allow model-specific parsers to mutate `ChatCompletionRequest` before inference.
- Override `adjust_request` in `Gemma4ReasoningParser` to force `skip_special_tokens=False`.
- Update `parse_thinking_output` to handle truncated thinking blocks when the model hits `max_tokens` before emitting the closer.

Testing
Limitations

- Offline/batch requests using `LLM.generate()` still require manually setting `skip_special_tokens=False` in `SamplingParams` if `enable_thinking=True` is in use.