adjust_request to reasoning parser, and Gemma4 fixes #39027

aarnphm merged 4 commits into vllm-project:main

Conversation
Documentation preview: https://vllm--39027.org.readthedocs.build/en/39027/

Code Review
This pull request introduces a new Jinja chat template for Gemma 4, along with infrastructure to support custom Jinja filters and normalized tool responses. However, several critical issues were identified in the implementation:

- The hardcoding of `skip_special_tokens = False` in the OpenAI serving layer is a global change that overrides user intent for all models.
- There is debug logging to a hardcoded local file (`gemma_turns.log`), which is unsuitable for production.
- Global monkey-patching of `jinja2.sandbox.ImmutableSandboxedEnvironment` is discouraged, as it creates dangerous side effects across the entire application; a more localized injection of filters is preferred.
- Exceptions during Jinja filter patching should be logged rather than swallowed silently.
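The localized filter injection the review recommends can be sketched as follows. This is an illustration of the suggested pattern, not code from the PR; the `from_json` filter name is an invented example:

```python
import json

from jinja2.sandbox import ImmutableSandboxedEnvironment


def build_chat_template_env() -> ImmutableSandboxedEnvironment:
    """Build a sandboxed Jinja environment with custom filters registered
    on this one instance, instead of monkey-patching the class globally
    for the whole process."""
    env = ImmutableSandboxedEnvironment()
    # Hypothetical filter: parse a JSON string into an object inside the template.
    env.filters["from_json"] = json.loads
    return env


env = build_chat_template_env()
template = env.from_string("{{ payload | from_json | length }}")
print(template.render(payload='{"a": 1, "b": 2}'))  # prints "2"
```

Because the filter lives on the locally constructed environment, other code that creates its own `ImmutableSandboxedEnvironment` is unaffected.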
FYI #38858 (comment)

Thanks, I commented there. We'll need to handle parsing reasoning content properly within vLLM vs. asking the user to set
Force-pushed from 039bf4e to 21518bb
@lucianommartins The draft gemma4 chat template added here at

Also, optimally we would coerce tool outputs that are JSON strings into actual JSON objects. I need to validate how much of a % difference this makes in overall multi-turn eval results, as my best results so far were based on turning JSON strings into JSON objects before the tool output reaches the chat template, so that the model sees the tool response as an actual JSON object with string token delimiters and such, instead of as a single encoded string. The previous draft of this did some hacks to get this happening in the chat template via external Python helper functions, but I'd like to drop those hacks if I can verify the model results are substantially identical even if we just give it a JSON string as one string blob for tool call outputs.

@sfeng33 @chaunceyjiang I'd like your eyes on this for the adjustments to wire in an

I'll be pulling this out of draft soon once I clean up a few more loose ends.
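The JSON-string coercion described above could be sketched as a small helper like the one below. This is a hypothetical illustration of the idea, not the PR's code; it only converts strings that actually parse as JSON objects or arrays and passes everything else through untouched:

```python
import json


def coerce_tool_output(output):
    """If a tool output is a JSON-encoded string, decode it so the chat
    template can render it as a structured object rather than as one
    opaque string blob."""
    if isinstance(output, str):
        stripped = output.strip()
        # Only attempt to decode things that look like JSON objects/arrays.
        if stripped[:1] in ("{", "["):
            try:
                return json.loads(stripped)
            except json.JSONDecodeError:
                pass
    return output


print(coerce_tool_output('{"temp": 21, "unit": "C"}'))  # decoded dict
print(coerce_tool_output("plain text result"))          # unchanged string
```

Applied just before the tool output reaches the chat template, this would let the template iterate over keys and values instead of emitting a single encoded string.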
Thanks @bbrowning - I will update my transformers PR with the nice additions from your Jinja template.
@lucianommartins opened #39081 to focus just on handling the stripping of special tokens during reasoning. Once that gets in a good place and merges, I'll remove the equivalent-ish code from this PR, as here I don't handle it for the offline inference case.
With the patch applied it worked better: about 10 tool calls went fine, but at some point it happened again. It failed to apply file changes after token leakage for some reason; I had to stop it after some time.
Note this builds on top of changes from #39114, and the pre-commit failures are because of that.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 928a3ad to 8bdf630
Just waiting on #39114 to land to get the pre-commit fixed and pull this out of draft.

This is a slightly simpler approach to wiring in skip_special_tokens for the Gemma 4 reasoning parser than #39081. I don't have a strong preference, but this has to happen for the reasoning parser to work out of the box. If #39081 merges first, I'll rebase this to remove my version of that. Otherwise, if this merges first, #39081 can either be closed or adjusted to tweak the logic I added here.

We'll also want to update the Gemma 4 Usage Guide in the recipes repo to point to our new chat template. There is huggingface/transformers#45257 to get this updated by default in the model's official chat template, but it's not clear if/when that will merge. I've confirmed that all the logic is the same between what that transformers PR renders and what we have here.
Fix multiple issues preventing Gemma4 models from working correctly with multi-turn tool calling and reasoning in vLLM:

- Add new Gemma4 chat template that properly encodes tool results using the model's native format, handles multi-turn conversations with interleaved tool calls and reasoning, and strips thinking content from prior assistant turns
- Add adjust_request() to ReasoningParser base class (mirroring ToolParser) so reasoning parsers can modify request parameters before generation, used by Gemma4 to set skip_special_tokens=False
- Fix reasoning parser to extract non-streaming thinking content and handle the "thought\n" prefix correctly in streaming
- Fix pre-existing mypy error in ReasoningParserManager.register_module
- Add unit tests for reasoning parser and chat template rendering

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
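The `adjust_request()` hook described in this commit might look roughly like the sketch below. The class shapes here are simplified stand-ins, not vLLM's actual request or parser types, and the method signature is an assumption based on the description that it mirrors ToolParser:

```python
from dataclasses import dataclass


@dataclass
class ChatCompletionRequest:
    # Simplified stand-in for vLLM's request object.
    skip_special_tokens: bool = True


class ReasoningParser:
    """Base class sketch: parsers get a chance to modify request
    parameters before generation. The default is a no-op."""

    def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        return request


class Gemma4ReasoningParser(ReasoningParser):
    def adjust_request(self, request: ChatCompletionRequest) -> ChatCompletionRequest:
        # Preserve boundary special tokens in the output so the parser
        # can detect reasoning/tool-call boundaries in the generated text.
        request.skip_special_tokens = False
        return request


req = Gemma4ReasoningParser().adjust_request(ChatCompletionRequest())
print(req.skip_special_tokens)  # prints "False"
```

The key design point from the commit message is that this happens per-parser rather than being hardcoded for all models in the serving layer.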
When translating from the Anthropic Messages API to Chat Completions, we were inserting entirely empty "user" turns for tool call outputs, as those come in via the user in the Messages API but get turned into a tool role in the Chat Completions API. These empty user role turns make their way into some chat templates, and for example led to issues in long multi-turn scenarios when testing with Gemma 4 models.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
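The fix this commit describes amounts to filtering out user messages whose content ends up empty after translation. A minimal sketch of the idea (a hypothetical helper, not the actual vLLM translation code):

```python
def drop_empty_user_turns(messages):
    """Remove user messages with no content. When Anthropic Messages API
    tool results are re-routed into "tool" role messages, the original
    "user" turn can be left empty; such turns should not reach the chat
    template."""
    return [
        m for m in messages
        if not (m.get("role") == "user" and not m.get("content"))
    ]


messages = [
    {"role": "user", "content": ""},             # leftover empty turn
    {"role": "tool", "content": '{"ok": true}'},
    {"role": "user", "content": "continue"},
]
print(drop_empty_user_turns(messages))  # the two non-empty turns
```

Non-user roles and user turns with real content pass through unchanged.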
Force-pushed from 8bdf630 to fd51456
sfeng33 left a comment:

Thank you for the elegant fix.
This documents the updated chat template to use with Gemma 4 models for reasoning and/or tool calling that was merged in vllm-project/vllm#39027. It also adds instructions for how to enable thinking by default, if a user prefers to always think, and replaces the deprecated `reasoning_content` field with the updated `reasoning` field.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
…r, multi-turn tool fixes

- Add adjust_request() to ReasoningParser base class and wire it through the full serving pipeline (api_server, responses, render)
- Gemma4ReasoningParser.adjust_request() sets skip_special_tokens=False unconditionally to preserve boundary tokens
- Add is_reasoning_end() with tool-call/turn-boundary detection via reverse scan for <|turn>, <|tool_call>, <|tool_response> tokens
- Fix streaming prefix stripping to return empty reasoning instead of None when "thought\n" prefix is fully consumed
- Add adjust_request() to abstract_parser delegating to both reasoning and tool parsers
- Rename _parse_gemma4_args streaming→partial; withhold trailing keys without values in partial mode
- Skip empty user messages in Anthropic Messages API translation
- Fix mypy cast in ReasoningParserManager.register_module
- Add Gemma4 tool chat template (331-line jinja)

Based on vllm-project#39027

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
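The reverse-scan boundary detection named in this commit might be sketched as follows. The token IDs below are invented placeholders (real IDs come from the model's tokenizer), and the logic is an illustration of the described approach, not the PR's implementation:

```python
# Hypothetical token IDs for the special tokens; a real parser would look
# these up from the tokenizer's vocabulary.
START_THINK_ID = 6            # opening of a thinking span
END_THINK_ID = 10             # end-of-thinking marker
BOUNDARY_IDS = {7, 8, 9}      # <|turn>, <|tool_call>, <|tool_response>


def is_reasoning_end(token_ids):
    """Walk the generated token IDs in reverse; the most recently emitted
    marker decides the state. An end-of-thinking or turn/tool boundary
    token means reasoning has ended, while a start-of-thinking token means
    we are still inside a reasoning span."""
    for tid in reversed(token_ids):
        if tid == END_THINK_ID or tid in BOUNDARY_IDS:
            return True
        if tid == START_THINK_ID:
            return False
    return False


print(is_reasoning_end([1, 6, 2, 3]))       # still thinking: False
print(is_reasoning_end([1, 6, 2, 10, 4]))   # thinking closed: True
```

Scanning backwards means the check stops at the first marker it finds, which in practice is near the end of the sequence, rather than re-scanning the whole output on every step.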
- Output Claude Code launch command with env vars after deploy
- Update TODO: reference merged vllm-project/vllm#39027 and document how to enable thinking when new image is available
- Add Claude Code connection section to README

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The current vllm/vllm-openai:gemma4 image does not support this flag. Disabling thinking will be possible after an image update with --default-chat-template-kwargs from vllm-project/vllm#39027.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…roject#39027)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Purpose
Fix multiple issues preventing Gemma4 models from working correctly
with multi-turn tool calling and reasoning in vLLM:
- is_reasoning_end clean ups for the Gemma 4 reasoning parser

The net result of these fixes shows larger Gemma4 models are very competitive at multi-turn tool calling for their size. I won't share any specific numbers here, but all of these fixes were guided by both direct inspection of prompting and multi-turn behavior and some simple quantitative eval with the BFCL multi_turn suite.
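One of the reasoning-parser fixes in this PR, returning an empty string instead of None while the "thought\n" prefix is being consumed during streaming, can be sketched as below. This is an illustration of the behavior described in the PR, not its exact code; the helper name is invented:

```python
THOUGHT_PREFIX = "thought\n"


def strip_thought_prefix(buffer: str):
    """Incrementally strip the leading "thought\\n" marker from streamed
    reasoning text. Returns (reasoning_text, prefix_resolved). While the
    buffer is still a partial match of the marker, emit an empty string
    rather than None so downstream streaming deltas stay well-typed."""
    if buffer.startswith(THOUGHT_PREFIX):
        # Full prefix received: everything after it is reasoning content.
        return buffer[len(THOUGHT_PREFIX):], True
    if THOUGHT_PREFIX.startswith(buffer):
        # Prefix not fully received yet: nothing to emit, but not None.
        return "", False
    # No prefix at all: the buffer is plain reasoning content.
    return buffer, True


print(strip_thought_prefix("thou"))                   # ('', False)
print(strip_thought_prefix("thought\nLet me think"))  # ('Let me think', True)
```

The second element of the tuple lets a caller distinguish "still waiting for the prefix" from "prefix handled", while the reasoning text itself is always a string.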
You'll need to both enable thinking and select the correct chat template when testing Gemma 4 models with these fixes:
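The thread doesn't spell out a full launch command, so the sketch below is a hedged guess at what "enable thinking and select the correct chat template" might look like. The model name, template path, parser name, and the `enable_thinking` kwarg are all placeholders; only `--default-chat-template-kwargs` is mentioned elsewhere in this thread as coming from this PR:

```shell
# Hypothetical invocation: substitute the real model checkpoint, the
# chat template file merged in this PR, and the actual parser name.
vllm serve <gemma-4-model> \
  --chat-template <path/to/gemma4_chat_template.jinja> \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}'
```

With the `adjust_request()` change from this PR, the reasoning parser itself sets `skip_special_tokens=False`, so that flag should no longer need to be passed per-request.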
Test Plan
BFCL multi_turn suite to uncover bugs and validate fixes
Run BFCL multi_turn eval suite:
Unit Tests
Claude Code pointed at a Gemma 4 model running locally
Test Result
Unit Tests
- `pytest tests/reasoning/test_gemma4_reasoning_parser.py`: 29 passed, 2 warnings in 3.24s
- `pytest tests/renderers/test_gemma4_chat_template.py`: 14 passed, 2 warnings in 0.98s
- `pytest tests/reasoning`: 318 passed, 5 warnings in 40.41s

BFCL Results
I have BFCL results and they are far better after this change than before. I'm not sure it's my place to share those publicly here, but the results for the larger Gemma4 models (MoE and Dense) are very good for models of their size.
Claude Code usability
I was able to execute multiple complex refactoring and new code generation sessions in existing codebases with both Gemma-4-31B and Gemma-4-26B-A4B. After the latest fixes here, I'm not seeing any unparsed tool calls nor any leaked reasoning content into the session.