
[Frontend] Gemma4: preserve structured tokens for reasoning and tool calling#39081

Closed
lucianommartins wants to merge 8 commits into vllm-project:main from lucianommartins:lucianommartins/gemma4

Conversation

@lucianommartins
Contributor

@lucianommartins lucianommartins commented Apr 6, 2026

Motivation and Context

Gemma4 utilizes special tokens for structural delimiters in thinking mode (<|channel>, <channel|>) and tool calling (<|tool_call>, <tool_call|>). By default, vLLM detokenization sets skip_special_tokens=True, which strips these tokens from the output text. This breaks both streaming and non-streaming parsers that rely on these delimiters to extract reasoning content and tool calls.

This PR ensures these tokens are preserved when needed in both the OpenAI Chat API and the offline LLM.chat paths.

Proposed Changes

  • Reasoning Parser Request Hook: Added an adjust_request hook to ReasoningParser to allow model-specific parsers to mutate ChatCompletionRequest before inference.
  • Gemma4 Implementation: Overrode adjust_request in Gemma4ReasoningParser to force skip_special_tokens=False.
  • Robustness: Added fallback in parse_thinking_output for truncated thinking blocks when the model hits max_tokens before emitting the closer.
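
As a rough sketch of the hook pattern described above (the class and field names mirror the PR description, but the bodies are illustrative stand-ins, not vLLM's actual implementation):

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for vLLM's request/parser types; the fields and
# gating shown here are illustrative, not the exact vLLM signatures.
@dataclass
class ChatCompletionRequest:
    skip_special_tokens: bool = True
    chat_template_kwargs: dict = field(default_factory=dict)

class ReasoningParser:
    def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        # Base class: no-op hook; model-specific parsers may override.
        return request

class Gemma4ReasoningParser(ReasoningParser):
    def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        # Preserve structural delimiters such as <|channel> in the
        # detokenized text so the string-matching parser can see them.
        if request.chat_template_kwargs.get("enable_thinking"):
            request.skip_special_tokens = False
        return request

req = ChatCompletionRequest(chat_template_kwargs={"enable_thinking": True})
req = Gemma4ReasoningParser().adjust_request(req)
print(req.skip_special_tokens)  # → False
```

The serving layer would call `adjust_request()` once per request, before tokenization, so the flag is set before any detokenization happens.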

Testing

  • Verified via local unit tests that the tool-calling and thinking tests pass and that the tokens are correctly preserved in the response.

Limitations

  • Offline/batch requests using LLM.generate() still require manual setting of skip_special_tokens=False in SamplingParams if enable_thinking=True is in use.

@mergify mergify bot added the frontend label Apr 6, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request ensures that special tokens used for structured outputs, such as Gemma4's thinking and tool call delimiters, are preserved by automatically setting skip_special_tokens=False. It introduces an adjust_request interface for reasoning parsers and updates the Gemma4 implementation and utility functions to handle these delimiters and truncated outputs. Reviewers suggested caching vocabulary lookups to improve performance and extending request adjustments to include ResponsesRequest for better consistency across API endpoints.

@bbrowning
Contributor

@sfeng33 Gemma 4 needs a way to toggle skip_special_tokens to False in its reasoning parser. I see there are some mypy issues to resolve in this PR to keep type hinting happy, but otherwise would you mind taking a look at this approach to see if it makes sense to you, since you've been in some of these areas recently with the parser refactoring?

Contributor

@sfeng33 sfeng33 left a comment


Thanks for working on this! I know it's still WIP, so just sharing some early thoughts that might be helpful as you iterate.

  1. adjust_request for reasoning parser — consider wiring it through the shared render path
    Currently the new reasoning_parser.adjust_request() is added in chat_completion/serving.py, but the Responses API and batch serving paths also instantiate a reasoning parser without calling it. Rather than adding the call to each endpoint individually, it might be worth wiring it into preprocess_chat in vllm/entrypoints/serve/render/serving.py, that way all API paths get it for free.

  2. Special token leakage with skip_special_tokens=False
    One thing to be mindful of: setting skip_special_tokens=False globally means all special tokens survive detokenization, not just the structural ones needed for parsing (<|channel>, <|tool_call>, etc.). Gemma4's tokenizer has other special tokens (e.g. <turn|>) that could leak into the user-visible content. It might be worth considering either post-processing the parsed content to strip unwanted tokens, or a more targeted detokenization approach. Just something to keep in mind; curious if you've seen this come up in testing.

  3. Offline LLM.chat path
    For the offline path changes in llm.py — since the offline LLM path doesn't currently have a parser pipeline (reasoning/tool parsing), setting skip_special_tokens=False would surface raw special tokens in output.text without any parsing.
    I'd lean toward leaving the offline path out of this PR, as it's not something we currently support for other models either. If we do want to include it, I think tying the detection to a model name check would be more robust than probing the tokenizer vocab, which could false-positive on other models that happen to register similar token strings.

Happy to discuss any of these — nice work getting the core mechanism in place!
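
The post-processing idea from point 2 above could be sketched like this. The token names are illustrative, not Gemma4's actual vocabulary, and the helper is a hypothetical suggestion rather than anything in the PR:

```python
# Hypothetical sketch: keep the structural delimiters the parsers need,
# strip other special tokens that survive skip_special_tokens=False.
STRUCTURAL_TOKENS = {"<|channel>", "<channel|>", "<|tool_call>", "<tool_call|>"}
OTHER_SPECIAL_TOKENS = {"<turn|>", "<|turn>"}  # illustrative examples

def strip_leaked_specials(text: str) -> str:
    """Remove non-structural special tokens from user-visible content."""
    for tok in OTHER_SPECIAL_TOKENS - STRUCTURAL_TOKENS:
        text = text.replace(tok, "")
    return text

print(strip_leaked_specials("Hello<turn|> world <|channel>think<channel|>"))
# → Hello world <|channel>think<channel|>
```

In practice the non-structural set would come from the tokenizer's registered special tokens rather than a hard-coded list.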

aidendle94 pushed a commit to aidendle94/vllm that referenced this pull request Apr 8, 2026
…e via adjust_request hook

Replace the token-ID fallback approach with the cleaner fix from
vllm-project#39081: add an adjust_request() hook to
ReasoningParser that lets Gemma4ReasoningParser force
skip_special_tokens=False before inference. This preserves <|channel>
and <channel|> delimiters in the detokenized text so the existing
string-matching parser works for both streaming and non-streaming.

Changes:
- Add adjust_request() no-op hook to ReasoningParser base class
- Override in Gemma4ReasoningParser to set skip_special_tokens=False
- Call adjust_request() in serving.py before rendering the request
- Revert token-ID fallback in extract_reasoning() (no longer needed)
- Add truncated thinking handler in gemma4_utils.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lucianommartins lucianommartins force-pushed the lucianommartins/gemma4 branch from fa4dafb to 33abfff Compare April 8, 2026 18:38
douhashi added a commit to douhashi/runpod-vllm-gemma that referenced this pull request Apr 8, 2026
Add --chat-template-kwargs enable_thinking=false to suppress
<|channel>thought tokens leaking into content.

Tracked by: vllm-project/vllm#38855, vllm-project/vllm#39081,
aaif-goose/goose#6192

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mergify
Contributor

mergify bot commented Apr 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lucianommartins.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 8, 2026
…Chat API and offline paths

- automatic skip_special_tokens enforcement in offline path: added `_adjust_params_for_parsing` in `LLM` to scan the vocabulary for registered special tokens used in structured syntax (`<|channel>`, `<|tool_call>`, etc.) and force `skip_special_tokens=False` in `LLM.chat()` when reasoning or tools are active
- reasoning parser request adjustment hook: added `adjust_request` to the base `ReasoningParser` class and hooked it into `OpenAIServingChat` to allow specialized parsers to mutate `ChatCompletionRequest` parameters before execution
- gemma4 reasoning parser implementation: overrode `adjust_request` in `Gemma4ReasoningParser` to force `skip_special_tokens=False`, ensuring delimiters survive detokenization for proper non-streaming parsing
- robustness in streaming reasoning extraction: updated documentation to clarify that while `skip_special_tokens` is now forced to `False` (making `<|channel>` visible), the instance-state approach in `extract_reasoning_streaming` is retained to prevent pre-reasoning content interference
- truncated thinking fallback: updated `parse_thinking_output` in `gemma4_utils.py` to gracefully handle incomplete thinking segments when the model hits `max_tokens` before emitting the closing `<channel|>` delimiter.
Important:
- offline generate path limitation: note that offline/batch requests using `LLM.generate()` still require manual setting of `skip_special_tokens=False` in `SamplingParams` if using `enable_thinking=True`

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…ng sessions

* constrained `Gemma4ReasoningParser.adjust_request` to only set `skip_special_tokens=False` when `enable_thinking` is explicitly true in `chat_template_kwargs`.
* prevented the previous unconditional override from bypassing the tool-specific guards defined in `Gemma4ToolParser`.
* ensures special tokens are only preserved when reasoning is active or tools are engaged, preventing token leakage in standard chat turns.
* confirmed correctness by running `test_gemma4_reasoning_parser.py`

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…scope

* reverted the addition of `_adjust_params_for_parsing` and its usage in `vllm/entrypoints/llm.py` to restrict PR focus to the Chat API path
* confirmed zero regression in online reasoning parser logic by executing `test_gemma4_reasoning_parser.py`

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…ared render path

* moved `reasoning_parser.adjust_request` from `OpenAIServingChat` to `OpenAIServingRender.preprocess_chat` to ensure all chat-based endpoints benefit from it
* updated `OpenAIServingRender.__init__` to resolve `reasoning_parser` class
* updated `api_server.py` to pass `reasoning_parser` from CLI args to `OpenAIServingRender`
* confirmed all 17 tests in `test_gemma4_reasoning_parser.py` pass

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…t_request

* updated `adjust_request` to handle `ResponsesRequest` for consistency across OpenAI-compatible endpoints as suggested by reviewer bot
* confirmed all 17 tests in `test_gemma4_reasoning_parser.py` pass

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
…er path

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
@lucianommartins lucianommartins force-pushed the lucianommartins/gemma4 branch from 367a5ea to ae5a871 Compare April 8, 2026 19:19
@mergify mergify bot removed the needs-rebase label Apr 8, 2026
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
@lucianommartins lucianommartins force-pushed the lucianommartins/gemma4 branch from b19d861 to c70d7f9 Compare April 8, 2026 19:48
@lucianommartins lucianommartins marked this pull request as ready for review April 8, 2026 19:55
@lucianommartins
Contributor Author

Thanks @bbrowning @sfeng33 @chaunceyjiang for the feedback.

Everything has been addressed.

@sfeng33 - I'm creating a new PR right away to apply this fix to the offline chat API separately, as you suggested.

@sfeng33
Contributor

sfeng33 commented Apr 8, 2026

Thank you for looking into this, if I'm not mistaken, I think the changes in this PR are covered by #39027 as well.

@lucianommartins
Contributor Author

Thanks @sfeng33 - I hadn't been following @bbrowning's changes, and it looks like they already cover everything here. I will drop this PR.
