
[Tool] adjust_request to reasoning parser, and Gemma4 fixes#39027

Merged
aarnphm merged 4 commits into vllm-project:main from bbrowning:gemma4-multi-turn-fixes
Apr 8, 2026

Conversation

@bbrowning (Contributor) commented Apr 5, 2026

Purpose

Fix multiple issues preventing Gemma4 models from working correctly
with multi-turn tool calling and reasoning in vLLM:

  • Add new Gemma4 chat template that properly encodes tool results using the model's native format, handles multi-turn conversations with interleaved tool calls and reasoning, and strips thinking content from prior assistant turns
  • Add adjust_request() to ReasoningParser base class (mirroring ToolParser) so reasoning parsers can modify request parameters before generation, used by Gemma4 to set skip_special_tokens=False
  • Fix reasoning parser to extract non-streaming thinking content and handle the "thought\n" prefix correctly in streaming
  • Fix pre-existing mypy error in ReasoningParserManager.register_module
  • Add unit tests for reasoning parser and chat template rendering
  • Fix empty "user" turns created by our Messages API to Chat Completions translation when handling tool outputs
  • is_reasoning_end clean ups for the Gemma 4 reasoning parser
    • don't assume reasoning has ended when we scan prompts backwards across user turn boundaries or after tool responses
    • explicitly mark reasoning as ended when we start generating tool calls
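As a rough illustration of the is_reasoning_end cleanups above, here is a minimal sketch of the reverse scan. The `<|turn>`, `<|tool_call>`, and `<|tool_response>` token strings follow descriptions in this PR thread, while `<|end_thought|>` is a purely illustrative placeholder, not Gemma's real token, and this is not vLLM's actual implementation.

```python
# Illustrative sketch only: token strings are placeholders, not the real
# Gemma 4 vocabulary, and this is not vLLM's actual implementation.
def is_reasoning_end(prompt_tokens: list[str]) -> bool:
    """Reverse-scan the prompt for evidence that reasoning has ended."""
    for tok in reversed(prompt_tokens):
        if tok in ("<|end_thought|>", "<|tool_call>"):
            # Explicit end marker, or a tool call started: reasoning is over.
            return True
        if tok in ("<|turn>", "<|tool_response>"):
            # Crossed a user-turn boundary or a tool response:
            # don't assume reasoning has ended.
            return False
    return False


print(is_reasoning_end(["hi", "<|tool_call>"]))            # True
print(is_reasoning_end(["<|tool_call>", "<|turn>", "x"]))  # False
```

The key behavioral change is the second branch: hitting a turn boundary or tool response stops the scan without declaring reasoning ended.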

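To make the new hook concrete, here is a minimal sketch of what an adjust_request() on the ReasoningParser base class could look like, mirroring the ToolParser pattern described above. The class and field definitions here are simplified stand-ins, not vLLM's exact types.

```python
# Hedged sketch: simplified stand-ins for vLLM's request and parser types.
from dataclasses import dataclass


@dataclass
class ChatCompletionRequest:
    model: str
    skip_special_tokens: bool = True


class ReasoningParser:
    def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        """Default: leave the request untouched."""
        return request


class Gemma4ReasoningParser(ReasoningParser):
    def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        # The parser needs the reasoning boundary tokens preserved in the
        # output so it can find them, hence skip_special_tokens=False.
        request.skip_special_tokens = False
        return request


req = Gemma4ReasoningParser().adjust_request(ChatCompletionRequest(model="gemma-4"))
print(req.skip_special_tokens)  # False
```

Called before generation, the hook lets a parser override request parameters without the user having to set them on every request.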
The net result of these fixes is that larger Gemma4 models are very competitive at multi-turn tool calling for their size. I won't share specific numbers here, but all of these fixes were guided by direct inspection of prompting and multi-turn behavior as well as a simple quantitative eval with the BFCL multi_turn suite.

You'll need to both enable thinking and select the correct chat template when testing Gemma 4 models with these fixes:

vllm serve google/gemma-4-31B-it \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --chat-template examples/tool_chat_template_gemma4.jinja

Test Plan

BFCL multi_turn suite to uncover bugs and validate fixes

BFCL clone, setup, and model config:
git clone https://github.com/ShishirPatil/gorilla

cd gorilla/berkeley-function-call-leaderboard/

uv venv --python 3.12 --seed

source .venv/bin/activate

uv pip install -e .

cat <<EOF >> bfcl_eval/constants/model_config.py
    "google/gemma-4-E2B-it": ModelConfig(
        model_name="google/gemma-4-E2B-it",
        display_name="google/gemma-4-E2B-it (FC) (vLLM)",
        url="https://huggingface.co/google/gemma-4-E2B-it",
        org="Google",
        license="apache-2.0",
        model_handler=OpenAICompletionsHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=True,
    ),
    "google/gemma-4-26B-A4B-it": ModelConfig(
        model_name="google/gemma-4-26B-A4B-it",
        display_name="google/gemma-4-26B-A4B-it (FC) (vLLM)",
        url="https://huggingface.co/google/gemma-4-26B-A4B-it",
        org="Google",
        license="apache-2.0",
        model_handler=OpenAICompletionsHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=True,
    ),
    "google/gemma-4-31B-it": ModelConfig(
        model_name="google/gemma-4-31B-it",
        display_name="google/gemma-4-31B-it (FC) (vLLM)",
        url="https://huggingface.co/google/gemma-4-31B-it",
        org="Google",
        license="apache-2.0",
        model_handler=OpenAICompletionsHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=True,
    ),
}
EOF

Run BFCL multi_turn eval suite:

OPENAI_BASE_URL="http://localhost:8000/v1" \
OPENAI_API_KEY="fake" \
bfcl generate \
  --model google/gemma-4-31B-it \
  --num-threads 4 \
  --allow-overwrite \
  --test-category multi_turn

OPENAI_API_KEY="fake" \
bfcl evaluate

Unit Tests

# Note: this test has a pre-existing dependency on transformers 5.x
# `pip install --upgrade transformers`
pytest tests/reasoning/test_gemma4_reasoning_parser.py

pytest tests/renderers/test_gemma4_chat_template.py

# Run all reasoning parser tests, since we added `adjust_request`
# skip the ones that CI skips because they already fail
# and skip step3p5 because it requires trusting remote code 
pytest tests/reasoning \
  --ignore=tests/reasoning/test_seedoss_reasoning_parser.py \
  --ignore=tests/reasoning/test_glm4_moe_reasoning_parser.py \
  --ignore=tests/reasoning/test_step3p5_reasoning_parser.py

Claude Code pointed at a Gemma 4 model running locally

CLAUDE_CODE_USE_VERTEX=0 \
ANTHROPIC_BASE_URL="http://localhost:8000" \
ANTHROPIC_DEFAULT_OPUS_MODEL="google/gemma-4-31B-it" \
ANTHROPIC_DEFAULT_SONNET_MODEL="google/gemma-4-31B-it" \
ANTHROPIC_DEFAULT_HAIKU_MODEL="google/gemma-4-31B-it" \
ANTHROPIC_AUTH_TOKEN="dummy" \
claude \
  --model sonnet

Test Result

Unit Tests

pytest tests/reasoning/test_gemma4_reasoning_parser.py

29 passed, 2 warnings in 3.24s

pytest tests/renderers/test_gemma4_chat_template.py

14 passed, 2 warnings in 0.98s

tests/reasoning

318 passed, 5 warnings in 40.41s

BFCL Results

I have BFCL results and they are far better after this change than before. I'm not sure it's my place to share those publicly here, but the results for the larger Gemma4 models (MoE and Dense) are very good for models of their size.

Claude Code usability

I was able to execute multiple complex refactoring and new code generation sessions in existing codebases with both Gemma-4-31B and Gemma-4-26B-A4B. After the latest fixes here, I'm not seeing any unparsed tool calls nor any leaked reasoning content into the session.

mergify bot commented Apr 5, 2026

Documentation preview: https://vllm--39027.org.readthedocs.build/en/39027/

@mergify mergify bot added documentation Improvements or additions to documentation frontend tool-calling labels Apr 5, 2026
@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new Jinja chat template for Gemma 4, along with infrastructure to support custom Jinja filters and normalized tool responses. However, several critical issues were identified in the implementation. Specifically, the hardcoding of skip_special_tokens = False in the OpenAI serving layer is a global change that overrides user intent for all models. Additionally, there is debug logging to a hardcoded local file (gemma_turns.log) which is unsuitable for production. The use of global monkey-patching on jinja2.sandbox.ImmutableSandboxedEnvironment is also discouraged as it creates dangerous side effects across the entire application; a more localized injection of filters is preferred. Finally, it is recommended to log exceptions during Jinja filter patching rather than swallowing them silently.

@ywang96 (Member) commented Apr 6, 2026

FYI #38858 (comment)

@bbrowning (Contributor Author) replied:

FYI #38858 (comment)

Thanks, I commented there. We'll need to handle parsing reasoning content properly within vLLM vs asking the user to set skip_special_tokens=False on the request, as the user knob is enable_thinking=True and then it's the server's job to parse reasoning out of that when the server was configured to use the gemma4 reasoning parser.

@bbrowning bbrowning force-pushed the gemma4-multi-turn-fixes branch from 039bf4e to 21518bb Compare April 6, 2026 13:20
@bbrowning (Contributor Author) commented:

@lucianommartins The draft gemma4 chat template added here at examples/tool_chat_template_gemma4.jinja changes a number of things that we might want to get into the model's default chat template overall, so that all inference servers can benefit. I know some of these are already on your radar, but things like handling the reasoning content in multi-turn scenarios I don't know if you've considered yet.

Also, optimally we would coerce tool outputs that are JSON strings into actual JSON objects. I need to validate how much of a difference this makes in overall multi-turn eval results: my best results so far came from turning JSON strings into JSON objects before the tool output reaches the chat template, so that Gemma sees the tool response as an actual JSON object (with string token delimiters and such) instead of as a single encoded string. The previous draft did some hacks to make this happen in the chat template via external Python helper functions, but I'd like to drop those hacks if I can verify the model results are substantially identical when we just give it the JSON string as one blob for tool call outputs.
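For illustration, the JSON-string coercion described here could be sketched as a small pre-processing step before the tool output reaches the chat template (the function name is hypothetical, not part of this PR):

```python
# Hedged sketch of coercing a JSON-encoded tool output string into a real
# JSON object, falling back to the raw string when it isn't valid JSON.
import json


def coerce_tool_output(output: str):
    """Best-effort parse; hypothetical helper, not vLLM API."""
    try:
        return json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return output


print(coerce_tool_output('{"temp": 21, "unit": "C"}'))  # {'temp': 21, 'unit': 'C'}
print(coerce_tool_output("plain text result"))          # plain text result
```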

@sfeng33 @chaunceyjiang I'd like your eyes on this for the adjustments to wire in an adjust_request for reasoning parsers in case you have another preferred way to do this. So many reasoning parser unit tests fail today because we don't CI them that it's a bit hard to get a good signal here, but I'll do more testing on my end as well.

I'll be pulling this out of draft soon once I clean up a few more loose ends.

@lucianommartins (Contributor) commented:

thanks @bbrowning - I will update my transformers PR with the nice additions from your jinja.

@bbrowning (Contributor Author) commented:

@lucianommartins opened #39081 to focus just on handling the stripping of special tokens during reasoning. Once that gets in a good place and merges, I'll remove the equivalent-ish code from this PR as here I don't handle it for the offline inference case.

@drrros commented Apr 6, 2026

With the patch applied it worked better; about 10 tool calls went fine, but at some point it happened again:

                                                                                                                                                                                                                                                                                                                                                                                                                                            
FAILED (failures=7, errors=31)

  Read 3 files, listed 1 directory (ctrl+o to expand)

● <|channel>thought
  I will remove the invalid @patch decorators from tests/rest_api_backend/test_cc_write_protection.py and then run the tests to see the effect.<channel|>

● Update(tests/rest_api_backend/test_cc_write_protection.py)
  ⎿  Error editing file

● Update(tests/rest_api_backend/test_cc_write_protection.py)
  ⎿  Error editing file

● Update(tests/rest_api_backend/test_cc_write_protection.py)
  ⎿  Error editing file

● ---

● Update(tests/rest_api_backend/test_cc_write_protection.py)
  ⎿  Error editing file

● Update(tests/rest_api_backend/test_cc_write_protection.py)
  ⎿  Error editing file

● Update(tests/rest_api_backend/test_cc_write_protection.py)
  ⎿  Error editing file
  ⎿  Interrupted · What should Claude do instead?

It failed to apply file changes after the token leakage for some reason; I had to stop it after some time.

@bbrowning (Contributor Author) commented:

@drrros That sounds like we still have some edge cases to sort out in streaming reasoning parsing or perhaps our Messages API implementation. Let's debug that more specifically in #39043 since that focuses on making sure this works great for Claude Code specific tool calls.
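For reference, the streaming "thought\n" prefix handling this PR fixes can be sketched roughly like this; the function name and buffering model are simplified stand-ins for the real streaming parser, not vLLM's actual code.

```python
# Hedged sketch of the streaming "thought\n" prefix fix: once the prefix is
# fully consumed, return an empty reasoning string rather than None, so
# callers can tell "no reasoning yet" apart from "reasoning started, empty".
from typing import Optional

THOUGHT_PREFIX = "thought\n"


def strip_thought_prefix(buffer: str) -> Optional[str]:
    """Return reasoning text seen so far, or None if the prefix is still
    incomplete (we might only have 'tho' buffered mid-stream)."""
    if buffer.startswith(THOUGHT_PREFIX):
        # Prefix fully consumed: empty string, not None, even with no text.
        return buffer[len(THOUGHT_PREFIX):]
    if THOUGHT_PREFIX.startswith(buffer):
        # Buffer is a proper prefix of "thought\n"; wait for more tokens.
        return None
    # No prefix at all: everything is reasoning text.
    return buffer


print(strip_thought_prefix("tho"))          # None
print(strip_thought_prefix("thought\n"))    # (empty string)
print(strip_thought_prefix("thought\nhi"))  # hi
```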

@bbrowning (Contributor Author) commented:

Adding a few more commits to my staging area here. With these additional fixes plus #38909 and #39114 Gemma4 multi-turn reasoning and tool calling performance is getting to a really good place.

@bbrowning (Contributor Author) commented:

Note this builds on top of changes from #39114, and the pre-commit failures are because of that.

mergify bot commented Apr 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bbrowning.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 8, 2026
@bbrowning bbrowning force-pushed the gemma4-multi-turn-fixes branch from 928a3ad to 8bdf630 Compare April 8, 2026 10:50
@mergify mergify bot removed the needs-rebase label Apr 8, 2026
@bbrowning (Contributor Author) commented:

Just waiting on #39114 to land to get the pre-commit fixed and pull this out of draft.

This is a slightly simpler approach to wiring in skip_special_tokens for the Gemma 4 reasoning parser than #39081. I don't have a strong preference, but this has to happen for the reasoning parser to work out of the box. If #39081 merges first, I'll rebase this to remove my version of that; otherwise, if this merges first, #39081 can either be closed or adjusted to tweak the logic added here.

We'll also want to update the Gemma 4 Usage Guide in the recipes repo pointing to our new chat template. There is huggingface/transformers#45257 to get this updated by default in the model's official chat template, but it's not clear if/when that will merge. I've confirmed that all the logic is the same between what that transformers PR renders and what we have here.

bbrowning and others added 2 commits April 8, 2026 13:18
Fix multiple issues preventing Gemma4 models from working correctly
with multi-turn tool calling and reasoning in vLLM:

- Add new Gemma4 chat template that properly encodes tool results
  using the model's native format, handles multi-turn conversations
  with interleaved tool calls and reasoning, and strips thinking
  content from prior assistant turns
- Add adjust_request() to ReasoningParser base class (mirroring
  ToolParser) so reasoning parsers can modify request parameters
  before generation, used by Gemma4 to set skip_special_tokens=False
- Fix reasoning parser to extract non-streaming thinking content
  and handle the "thought\n" prefix correctly in streaming
- Fix pre-existing mypy error in ReasoningParserManager.register_module
- Add unit tests for reasoning parser and chat template rendering

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
When translating from Anthropic Messages API to Chat Completions, we
were inserting entirely empty "user" turns for tool call outputs, as
those come in via the user in the Messages API but get turned into a
tool role in the Chat Completions API. These empty user role turns make
their way into some chat templates, and for example led to issues in
long multi-turn scenarios when testing with Gemma 4 models.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
@bbrowning bbrowning force-pushed the gemma4-multi-turn-fixes branch from 8bdf630 to fd51456 Compare April 8, 2026 17:27
@bbrowning bbrowning marked this pull request as ready for review April 8, 2026 17:28
@aarnphm aarnphm changed the title Gemma4 multi-turn, tool calling, and reasoning fixes [Tool] adjust_request to reasoning parser, and Gemma4 fixes Apr 8, 2026
@aarnphm (Collaborator) left a comment


LGTM

@sfeng33 (Contributor) left a comment


Thank you for the elegant fix.

@aarnphm aarnphm added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 8, 2026
@aarnphm aarnphm enabled auto-merge (squash) April 8, 2026 17:36
@aarnphm aarnphm merged commit 8477fe4 into vllm-project:main Apr 8, 2026
51 checks passed
@bbrowning bbrowning deleted the gemma4-multi-turn-fixes branch April 8, 2026 19:06
bbrowning added a commit to bbrowning/vllm-recipes that referenced this pull request Apr 8, 2026
This documents the updated chat template to use with Gemma 4 models for
reasoning and/or tool calling that was merged in
vllm-project/vllm#39027 .

It also adds instructions for how to enable thinking by default, if a
user prefers to always think.

And, it replaces the deprecated `reasoning_content` field with the
updated `reasoning` field.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
aidendle94 pushed a commit to aidendle94/vllm that referenced this pull request Apr 9, 2026
…r, multi-turn tool fixes

- Add adjust_request() to ReasoningParser base class and wire it through
  the full serving pipeline (api_server, responses, render)
- Gemma4ReasoningParser.adjust_request() sets skip_special_tokens=False
  unconditionally to preserve boundary tokens
- Add is_reasoning_end() with tool-call/turn-boundary detection via
  reverse scan for <|turn>, <|tool_call>, <|tool_response> tokens
- Fix streaming prefix stripping to return empty reasoning instead of
  None when thought\n prefix is fully consumed
- Add adjust_request() to abstract_parser delegating to both reasoning
  and tool parsers
- Rename _parse_gemma4_args streaming→partial; withhold trailing keys
  without values in partial mode
- Skip empty user messages in Anthropic Messages API translation
- Fix mypy cast in ReasoningParserManager.register_module
- Add Gemma4 tool chat template (331-line jinja)

Based on vllm-project#39027

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
douhashi added a commit to douhashi/runpod-vllm-gemma that referenced this pull request Apr 9, 2026
- Output Claude Code launch command with env vars after deploy
- Update TODO: reference merged vllm-project/vllm#39027 and
  document how to enable thinking when new image is available
- Add Claude Code connection section to README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
douhashi added a commit to douhashi/runpod-vllm-gemma that referenced this pull request Apr 9, 2026
Current vllm/vllm-openai:gemma4 image does not support this flag.
Thinking disable will be possible after image update with
--default-chat-template-kwargs from vllm-project/vllm#39027.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jdebache pushed a commit to jdebache/vllm that referenced this pull request Apr 9, 2026
…roject#39027)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Elm8116 pushed a commit to Elm8116/vllm that referenced this pull request Apr 9, 2026
…roject#39027)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Apr 9, 2026
…roject#39027)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
…roject#39027)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Labels

documentation Improvements or additions to documentation frontend ready ONLY add when PR is ready to merge/full CI is needed tool-calling

Projects

Status: Done


6 participants