adjust_request to reasoning parser, and Gemma4 fixes #39027

aarnphm merged 4 commits into vllm-project:main

Conversation
Documentation preview: https://vllm--39027.org.readthedocs.build/en/39027/

Code Review
This pull request introduces a new Jinja chat template for Gemma 4, along with infrastructure to support custom Jinja filters and normalized tool responses. However, several critical issues were identified in the implementation:

- The hardcoding of `skip_special_tokens = False` in the OpenAI serving layer is a global change that overrides user intent for all models.
- There is debug logging to a hardcoded local file (`gemma_turns.log`), which is unsuitable for production.
- Global monkey-patching of `jinja2.sandbox.ImmutableSandboxedEnvironment` is discouraged, as it creates dangerous side effects across the entire application; a more localized injection of filters is preferred.
- Exceptions during Jinja filter patching should be logged rather than swallowed silently.
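The localized filter injection the review recommends can be sketched as follows. This is an illustration of the suggested pattern, not code from the PR; the `from_json` filter name is an invented example:

```python
import json

from jinja2.sandbox import ImmutableSandboxedEnvironment


def build_chat_template_env() -> ImmutableSandboxedEnvironment:
    """Build a sandboxed Jinja environment with custom filters registered
    on this one instance, instead of monkey-patching the class globally
    for the whole process."""
    env = ImmutableSandboxedEnvironment()
    # Hypothetical filter: parse a JSON string into an object inside the template.
    env.filters["from_json"] = json.loads
    return env


env = build_chat_template_env()
template = env.from_string("{{ payload | from_json | length }}")
print(template.render(payload='{"a": 1, "b": 2}'))  # prints "2"
```

Because the filter lives on the locally constructed environment, other code that creates its own `ImmutableSandboxedEnvironment` is unaffected.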
FYI #38858 (comment)

Thanks, I commented there. We'll need to handle parsing reasoning content properly within vLLM vs. asking the user to set
Force-pushed from 039bf4e to 21518bb
@lucianommartins The draft gemma4 chat template added here at

Also, optimally we would coerce tool outputs that are JSON strings into actual JSON objects. I need to validate how much of a % difference this makes in overall multi-turn eval results, as my best results so far were based on turning JSON strings into JSON objects before the tool output reaches the chat template, so that the model sees the tool response as an actual JSON object with string token delimiters and such, instead of as a single encoded string. The previous draft of this did some hacks to get this happening in the chat template via external Python helper functions, but I'd like to drop those hacks if I can verify the model results are substantially identical even if we just give it a JSON string as one string blob for tool call outputs.

@sfeng33 @chaunceyjiang I'd like your eyes on this for the adjustments to wire in an

I'll be pulling this out of draft soon once I clean up a few more loose ends.
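The JSON-string coercion described above could be sketched as a small helper like the one below. This is a hypothetical illustration of the idea, not the PR's code; it only converts strings that actually parse as JSON objects or arrays and passes everything else through untouched:

```python
import json


def coerce_tool_output(output):
    """If a tool output is a JSON-encoded string, decode it so the chat
    template can render it as a structured object rather than as one
    opaque string blob."""
    if isinstance(output, str):
        stripped = output.strip()
        # Only attempt to decode things that look like JSON objects/arrays.
        if stripped[:1] in ("{", "["):
            try:
                return json.loads(stripped)
            except json.JSONDecodeError:
                pass
    return output


print(coerce_tool_output('{"temp": 21, "unit": "C"}'))  # decoded dict
print(coerce_tool_output("plain text result"))          # unchanged string
```

Applied just before the tool output reaches the chat template, this would let the template iterate over keys and values instead of emitting a single encoded string.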
Thanks @bbrowning - I will update my transformers PR with the nice additions from your Jinja template.
@lucianommartins opened #39081 to focus just on handling the stripping of special tokens during reasoning. Once that gets in a good place and merges, I'll remove the equivalent-ish code from this PR, as here I don't handle it for the offline inference case.
With the patch applied it worked better: about 10 tool calls went fine, but at some point it happened again. It failed to apply file changes after token leakage for some reason; I had to stop it after some time.
Note this builds on top of changes from #39114, and the pre-commit failures are because of that.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 928a3ad to 8bdf630
Just waiting on #39114 to land to get the pre-commit fixed and pull this out of draft.

This is a slightly simpler approach to wiring in skip_special_tokens for the Gemma 4 reasoning parser than #39081. I don't have a strong preference, but this has to happen for the reasoning parser to work out of the box. If #39081 merges first, I'll rebase this to remove my version of that. Otherwise, if this merges first, #39081 can either be closed or adjusted to tweak the logic I added here.

We'll also want to update the Gemma 4 Usage Guide in the recipes repo to point to our new chat template. There is huggingface/transformers#45257 to get this updated by default in the model's official chat template, but it's not clear if/when that will merge. I've confirmed that all the logic is the same between what that transformers PR renders and what we have here.
Fix multiple issues preventing Gemma4 models from working correctly with multi-turn tool calling and reasoning in vLLM:

- Add new Gemma4 chat template that properly encodes tool results using the model's native format, handles multi-turn conversations with interleaved tool calls and reasoning, and strips thinking content from prior assistant turns
- Add adjust_request() to ReasoningParser base class (mirroring ToolParser) so reasoning parsers can modify request parameters before generation, used by Gemma4 to set skip_special_tokens=False
- Fix reasoning parser to extract non-streaming thinking content and handle the "thought\n" prefix correctly in streaming
- Fix pre-existing mypy error in ReasoningParserManager.register_module
- Add unit tests for reasoning parser and chat template rendering

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
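The `adjust_request()` hook described in this commit might look roughly like the sketch below. The class shapes here are simplified stand-ins, not vLLM's actual request or parser types, and the method signature is an assumption based on the description that it mirrors ToolParser:

```python
from dataclasses import dataclass


@dataclass
class ChatCompletionRequest:
    # Simplified stand-in for vLLM's request object.
    skip_special_tokens: bool = True


class ReasoningParser:
    """Base class sketch: parsers get a chance to modify request
    parameters before generation. The default is a no-op."""

    def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        return request


class Gemma4ReasoningParser(ReasoningParser):
    def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        # Preserve boundary special tokens in the output so the parser
        # can detect reasoning/tool-call boundaries in the generated text.
        request.skip_special_tokens = False
        return request


req = Gemma4ReasoningParser().adjust_request(ChatCompletionRequest())
print(req.skip_special_tokens)  # prints "False"
```

The key design point from the commit message is that this happens per-parser rather than being hardcoded for all models in the serving layer.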
When translating from the Anthropic Messages API to Chat Completions, we were inserting entirely empty "user" turns for tool call outputs, as those come in via the user in the Messages API but get turned into a tool role in the Chat Completions API. These empty user role turns make their way into some chat templates, and for example led to issues in long multi-turn scenarios when testing with Gemma 4 models.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
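The fix this commit describes amounts to filtering out user messages whose content ends up empty after translation. A minimal sketch of the idea (a hypothetical helper, not the actual vLLM translation code):

```python
def drop_empty_user_turns(messages):
    """Remove user messages with no content. When Anthropic Messages API
    tool results are re-routed into "tool" role messages, the original
    "user" turn can be left empty; such turns should not reach the chat
    template."""
    return [
        m for m in messages
        if not (m.get("role") == "user" and not m.get("content"))
    ]


messages = [
    {"role": "user", "content": ""},             # leftover empty turn
    {"role": "tool", "content": '{"ok": true}'},
    {"role": "user", "content": "continue"},
]
print(drop_empty_user_turns(messages))  # the two non-empty turns
```

Non-user roles and user turns with real content pass through unchanged.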
Force-pushed from 8bdf630 to fd51456
sfeng33 left a comment:

Thank you for the elegant fix.
This documents the updated chat template to use with Gemma 4 models for reasoning and/or tool calling that was merged in vllm-project/vllm#39027. It also adds instructions for how to enable thinking by default, if a user prefers to always think, and replaces the deprecated `reasoning_content` field with the updated `reasoning` field.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
…r, multi-turn tool fixes

- Add adjust_request() to ReasoningParser base class and wire it through the full serving pipeline (api_server, responses, render)
- Gemma4ReasoningParser.adjust_request() sets skip_special_tokens=False unconditionally to preserve boundary tokens
- Add is_reasoning_end() with tool-call/turn-boundary detection via reverse scan for <|turn>, <|tool_call>, <|tool_response> tokens
- Fix streaming prefix stripping to return empty reasoning instead of None when "thought\n" prefix is fully consumed
- Add adjust_request() to abstract_parser delegating to both reasoning and tool parsers
- Rename _parse_gemma4_args streaming→partial; withhold trailing keys without values in partial mode
- Skip empty user messages in Anthropic Messages API translation
- Fix mypy cast in ReasoningParserManager.register_module
- Add Gemma4 tool chat template (331-line jinja)

Based on vllm-project#39027

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
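The reverse-scan boundary detection named in this commit might be sketched as follows. The token IDs below are invented placeholders (real IDs come from the model's tokenizer), and the logic is an illustration of the described approach, not the PR's implementation:

```python
# Hypothetical token IDs for the special tokens; a real parser would look
# these up from the tokenizer's vocabulary.
START_THINK_ID = 6            # opening of a thinking span
END_THINK_ID = 10             # end-of-thinking marker
BOUNDARY_IDS = {7, 8, 9}      # <|turn>, <|tool_call>, <|tool_response>


def is_reasoning_end(token_ids):
    """Walk the generated token IDs in reverse; the most recently emitted
    marker decides the state. An end-of-thinking or turn/tool boundary
    token means reasoning has ended, while a start-of-thinking token means
    we are still inside a reasoning span."""
    for tid in reversed(token_ids):
        if tid == END_THINK_ID or tid in BOUNDARY_IDS:
            return True
        if tid == START_THINK_ID:
            return False
    return False


print(is_reasoning_end([1, 6, 2, 3]))       # still thinking: False
print(is_reasoning_end([1, 6, 2, 10, 4]))   # thinking closed: True
```

Scanning backwards means the check stops at the first marker it finds, which in practice is near the end of the sequence, rather than re-scanning the whole output on every step.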
- Output Claude Code launch command with env vars after deploy
- Update TODO: reference merged vllm-project/vllm#39027 and document how to enable thinking when new image is available
- Add Claude Code connection section to README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The current vllm/vllm-openai:gemma4 image does not support this flag. Disabling thinking will be possible after an image update with --default-chat-template-kwargs from vllm-project/vllm#39027.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject#39027)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Purpose
Fix multiple issues preventing Gemma4 models from working correctly
with multi-turn tool calling and reasoning in vLLM:
- is_reasoning_end clean ups for the Gemma 4 reasoning parser

The net result of these fixes shows larger Gemma4 models are very competitive at multi-turn tool calling for their size. I won't share any specific numbers here, but all of these fixes were guided by both direct inspection of prompting and multi-turn behavior and some simple quantitative eval with the BFCL multi_turn suite.
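One of the reasoning-parser fixes in this PR, returning an empty string instead of None while the "thought\n" prefix is being consumed during streaming, can be sketched as below. This is an illustration of the behavior described in the PR, not its exact code; the helper name is invented:

```python
THOUGHT_PREFIX = "thought\n"


def strip_thought_prefix(buffer: str):
    """Incrementally strip the leading "thought\\n" marker from streamed
    reasoning text. Returns (reasoning_text, prefix_resolved). While the
    buffer is still a partial match of the marker, emit an empty string
    rather than None so downstream streaming deltas stay well-typed."""
    if buffer.startswith(THOUGHT_PREFIX):
        # Full prefix received: everything after it is reasoning content.
        return buffer[len(THOUGHT_PREFIX):], True
    if THOUGHT_PREFIX.startswith(buffer):
        # Prefix not fully received yet: nothing to emit, but not None.
        return "", False
    # No prefix at all: the buffer is plain reasoning content.
    return buffer, True


print(strip_thought_prefix("thou"))                   # ('', False)
print(strip_thought_prefix("thought\nLet me think"))  # ('Let me think', True)
```

The second element of the tuple lets a caller distinguish "still waiting for the prefix" from "prefix handled", while the reasoning text itself is always a string.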
You'll need to both enable thinking and select the correct chat template when testing Gemma 4 models with these fixes:
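The thread doesn't spell out a full launch command, so the sketch below is a hedged guess at what "enable thinking and select the correct chat template" might look like. The model name, template path, parser name, and the `enable_thinking` kwarg are all placeholders; only `--default-chat-template-kwargs` is mentioned elsewhere in this thread as coming from this PR:

```shell
# Hypothetical invocation: substitute the real model checkpoint, the
# chat template file merged in this PR, and the actual parser name.
vllm serve <gemma-4-model> \
  --chat-template <path/to/gemma4_chat_template.jinja> \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}'
```

With the `adjust_request()` change from this PR, the reasoning parser itself sets `skip_special_tokens=False`, so that flag should no longer need to be passed per-request.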
Test Plan
BFCL multi_turn suite to uncover bugs and validate fixes
Run BFCL multi_turn eval suite:
Unit Tests
Claude Code pointed at a Gemma 4 model running locally
Test Result
Unit Tests
- `pytest tests/reasoning/test_gemma4_reasoning_parser.py`: 29 passed, 2 warnings in 3.24s
- `pytest tests/renderers/test_gemma4_chat_template.py`: 14 passed, 2 warnings in 0.98s
- `pytest tests/reasoning`: 318 passed, 5 warnings in 40.41s

BFCL Results
I have BFCL results and they are far better after this change than before. I'm not sure it's my place to share those publicly here, but the results for the larger Gemma4 models (MoE and Dense) are very good for models of their size.
Claude Code usability
I was able to execute multiple complex refactoring and new code generation sessions in existing codebases with both Gemma-4-31B and Gemma-4-26B-A4B. After the latest fixes here, I'm not seeing any unparsed tool calls nor any leaked reasoning content into the session.