
[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility#45257

Draft
lucianommartins wants to merge 2 commits into huggingface:main from lucianommartins:lucianommartins/gemma4

Conversation

@lucianommartins
Contributor

@lucianommartins lucianommartins commented Apr 5, 2026

[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility

What does this PR do?

Rewrites the _patch_template_for_openai_tool_role() function in convert_gemma4_weights.py to fully support OpenAI Chat Completions tool-calling semantics for Gemma4 (E4B and 31B).

Chat template patcher

  • Forward-scan tool rendering: role: "tool" messages are skipped by the outer loop and instead rendered as <|tool_response> blocks via a forward scan from the preceding assistant turn that issued the tool_calls
  • Turn suppression: Suppresses duplicate <|turn>model when consecutive assistant messages are separated only by tool messages (multi-round tool-call loops)
  • tool_call_id resolution: Matches tool results back to the originating tool_calls array by ID to resolve function names correctly (prevents response:unknown)
  • Content-parts robustness: Handles tool response content as both plain strings and OpenAI content-parts arrays ([{type: "text", text: "..."}])
  • format_tool_response_block macro: Injects a reusable macro to centralize tool response rendering (used by both legacy Gemma native tool_responses and OpenAI-style role: "tool" paths)
  • reasoning/reasoning_content support: Renders thinking fields as <|channel>thought blocks (compatible with vLLM, DeepSeek, and o1-style inference servers)
  • Legacy compat: Preserves native tool_responses on assistant messages (Google/Gemma format)
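A plain-Python sketch of the forward-scan and tool_call_id resolution described above. The actual change is Jinja2 inside the chat template; message shapes here follow the OpenAI Chat Completions schema, and the exact `<|tool_response>` block formatting is an assumption for illustration:

```python
def render_tool_responses(messages, i):
    """For the assistant message at index i that carries tool_calls,
    forward-scan the following role:'tool' messages and render each as a
    <|tool_response> block, resolving tool_call_id back to the function
    name from the originating tool_calls array (avoids response:unknown)."""
    assistant = messages[i]
    # Map tool_call_id -> function name from the originating tool_calls array.
    id_to_name = {tc["id"]: tc["function"]["name"]
                  for tc in assistant.get("tool_calls", [])}
    blocks = []
    j = i + 1
    while j < len(messages) and messages[j]["role"] == "tool":
        msg = messages[j]
        name = id_to_name.get(msg.get("tool_call_id"), "unknown")
        content = msg["content"]
        # Content-parts robustness: accept both plain strings and OpenAI
        # content-parts arrays like [{"type": "text", "text": "..."}].
        if isinstance(content, list):
            content = "".join(p["text"] for p in content
                              if p.get("type") == "text")
        blocks.append(f"<|tool_response>{name}: {content}\n")
        j += 1
    return "".join(blocks), j  # j = index of the first non-tool message
```

Because the outer loop resumes at the returned index, each tool result is emitted exactly once, inline with the assistant turn that requested it.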

Stop tokens (eos_token_id)

  • Removes <tool_call|> (etc_token) from the stop token list
  • Keeps only <eos> + <turn|> (eot_token)
  • Enables parallel tool calls without premature truncation after the first <tool_call|>; <turn|> still terminates the model turn correctly
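A minimal sketch of the resulting stop-token list, assuming placeholder token ids (the real ids come from the Gemma4 tokenizer inside convert_gemma4_weights.py):

```python
# Placeholder ids for illustration only; real values come from the tokenizer.
TOKEN_IDS = {"<eos>": 1, "<turn|>": 106, "<tool_call|>": 108}

def build_eos_token_ids(token_ids=TOKEN_IDS):
    """Keep only <eos> and <turn|> (eot_token) as stop tokens.
    <tool_call|> (etc_token) is deliberately excluded, so emitting the
    first tool call does not end decoding and parallel tool calls can
    follow within the same model turn."""
    return [token_ids["<eos>"], token_ids["<turn|>"]]
```

The turn still terminates normally because <turn|> remains in the list.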

Testing

Validated with 17 functional test scenarios across both E4B and 31B templates:

  • Simple chat, tool declarations, single/multi/parallel tool calls
  • Multi-round tool loops (exactly 1 <|turn>model emitted)
  • Legacy tool_responses, tool_call_id resolution, content-parts arrays
  • reasoning/reasoning_content field rendering
  • add_generation_prompt correctness, Jinja2 syntax validation
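The multi-round check above ("exactly 1 <|turn>model emitted") can be sketched in plain Python. The rule itself lives in the Jinja2 template, so this is an illustrative model of the suppression logic, not the shipped code:

```python
def count_model_turn_headers(messages):
    """Count how many <|turn>model headers would be emitted: consecutive
    assistant messages separated only by role:'tool' messages share one
    model turn, so the duplicate header is suppressed."""
    count = 0
    prev_non_tool_role = None
    for m in messages:
        if m["role"] == "tool":
            continue  # tool results render inline, never as a new turn
        if m["role"] == "assistant" and prev_non_tool_role != "assistant":
            count += 1  # a genuinely new model turn
        # assistant following assistant (tool messages in between were
        # skipped above): same turn continues, header suppressed
        prev_non_tool_role = m["role"]
    return count
```

A user/assistant/user/assistant exchange still yields two headers, since the user message in between ends the model turn.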

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Models:

Library:

…atibility

- Chat Template: Added handler for OpenAI-standard 'role: "tool"' messages to render inline as <|tool_response> without initiating a new <|turn> block.
- Chat Template: Extended turn-close condition to inhibit <turn|> emission when the model has pending 'tool_calls' without corresponding responses, preserving the continuous turn structure.
- Generation Config: Updated 'eos_token_id' derivation in convert_gemma4_weights.py to prioritize the terminal '<tool_call|>' token over the starting '<|tool_response>' token, resolving post-call generation hallucinations in HuggingFace inference.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Chat template patcher (_patch_template_for_openai_tool_role):
- Inject format_tool_response_block macro after strip_thinking to DRY
  up tool-response rendering (used by both legacy and OpenAI paths)
- Replace the entire message loop instead of two point patches:
  * Skip role:'tool' messages in outer loop; render them proactively
    via forward-scan from the preceding assistant message
  * Suppress duplicate <|turn>model on consecutive assistant messages
    separated only by tool messages (multi-round tool-call loops)
  * Resolve tool_call_id back to function name from originating
    tool_calls array (prevents response:unknown fallback)
  * Handle tool response content as both plain strings and OpenAI
    content-parts arrays ([{type:'text', text:'...'}])
  * Render reasoning/reasoning_content fields as <|channel>thought
    blocks (supports both vLLM and older inference server variants)
- Preserve legacy tool_responses on assistant messages (Gemma native)
- Pre-scan loop_messages for last_user_idx to guard reasoning injection
Stop tokens (eos_token_id):
- Remove <tool_call|> (etc_token) from the stop token list
- Keep only <eos> + <turn|> (eot_token)
- Enable parallel tool calls without premature truncation after the
  first <tool_call|>; <turn|> still terminates the model turn correctly
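The reasoning-field rendering mentioned above can be sketched as follows: accept either 'reasoning' (vLLM/DeepSeek-style) or 'reasoning_content' (older inference servers) on an assistant message and wrap it in a <|channel>thought block. Everything past the <|channel>thought marker is an assumption for illustration:

```python
def render_thought_block(message):
    """Render an assistant message's thinking field, if any, as a
    <|channel>thought block; return '' when there is nothing to render."""
    thought = message.get("reasoning") or message.get("reasoning_content")
    if not thought:
        return ""
    return f"<|channel>thought\n{thought}\n"
```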

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
@github-actions
Contributor

github-actions bot commented Apr 6, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: gemma4

@zucchini-nlp
Member

cc @Rocketknight1 for chat templates and tool calling, but this seems to be only the conversion. Prob the latest changes you made before release didn't make it to the conversion script 😅

