
[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility#45257

Draft
lucianommartins wants to merge 2 commits into huggingface:main from lucianommartins:lucianommartins/gemma4

Conversation

@lucianommartins
Contributor

@lucianommartins lucianommartins commented Apr 5, 2026

[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility

What does this PR do?

Rewrites the _patch_template_for_openai_tool_role() function in convert_gemma4_weights.py to fully support OpenAI Chat Completions tool-calling semantics for Gemma4 (E4B and 31B).

Chat template patcher

  • Forward-scan tool rendering: role: "tool" messages are skipped by the outer loop and instead rendered as <|tool_response> blocks via a forward scan from the preceding assistant turn that issued the tool_calls
  • Turn suppression: Suppresses duplicate <|turn>model when consecutive assistant messages are separated only by tool messages (multi-round tool-call loops)
  • tool_call_id resolution: Matches tool results back to the originating tool_calls array by ID to resolve function names correctly (prevents response:unknown)
  • Content-parts robustness: Handles tool response content as both plain strings and OpenAI content-parts arrays ([{type: "text", text: "..."}])
  • format_tool_response_block macro: Injects a reusable macro to centralize tool response rendering (used by both legacy Gemma native tool_responses and OpenAI-style role: "tool" paths)
  • reasoning/reasoning_content support: Renders thinking fields as <|channel>thought blocks (compatible with vLLM, DeepSeek, and o1-style inference servers)
  • Legacy compat: Preserves native tool_responses on assistant messages (Google/Gemma format)
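A plain-Python sketch of the forward-scan and tool_call_id resolution described above. The actual change is Jinja2 inside the chat template; message shapes here follow the OpenAI Chat Completions schema, and the exact `<|tool_response>` block formatting is an assumption for illustration:

```python
def render_tool_responses(messages, i):
    """For the assistant message at index i that carries tool_calls,
    forward-scan the following role:'tool' messages and render each as a
    <|tool_response> block, resolving tool_call_id back to the function
    name from the originating tool_calls array (avoids response:unknown)."""
    assistant = messages[i]
    # Map tool_call_id -> function name from the originating tool_calls array.
    id_to_name = {tc["id"]: tc["function"]["name"]
                  for tc in assistant.get("tool_calls", [])}
    blocks = []
    j = i + 1
    while j < len(messages) and messages[j]["role"] == "tool":
        msg = messages[j]
        name = id_to_name.get(msg.get("tool_call_id"), "unknown")
        content = msg["content"]
        # Content-parts robustness: accept both plain strings and OpenAI
        # content-parts arrays like [{"type": "text", "text": "..."}].
        if isinstance(content, list):
            content = "".join(p["text"] for p in content
                              if p.get("type") == "text")
        blocks.append(f"<|tool_response>{name}: {content}\n")
        j += 1
    return "".join(blocks), j  # j = index of the first non-tool message
```

Because the outer loop resumes at the returned index, each tool result is emitted exactly once, inline with the assistant turn that requested it.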

Stop tokens (eos_token_id)

  • Removes <tool_call|> (etc_token) from the stop token list
  • Keeps only <eos> + <turn|> (eot_token)
  • Enables parallel tool calls without premature truncation after the first <tool_call|>; <turn|> still terminates the model turn correctly
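A minimal sketch of the resulting stop-token list, assuming placeholder token ids (the real ids come from the Gemma4 tokenizer inside convert_gemma4_weights.py):

```python
# Placeholder ids for illustration only; real values come from the tokenizer.
TOKEN_IDS = {"<eos>": 1, "<turn|>": 106, "<tool_call|>": 108}

def build_eos_token_ids(token_ids=TOKEN_IDS):
    """Keep only <eos> and <turn|> (eot_token) as stop tokens.
    <tool_call|> (etc_token) is deliberately excluded, so emitting the
    first tool call does not end decoding and parallel tool calls can
    follow within the same model turn."""
    return [token_ids["<eos>"], token_ids["<turn|>"]]
```

The turn still terminates normally because <turn|> remains in the list.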

Testing

Validated with 17 functional test scenarios across both E4B and 31B templates:

  • Simple chat, tool declarations, single/multi/parallel tool calls
  • Multi-round tool loops (exactly 1 <|turn>model emitted)
  • Legacy tool_responses, tool_call_id resolution, content-parts arrays
  • reasoning/reasoning_content field rendering
  • add_generation_prompt correctness, Jinja2 syntax validation
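The multi-round check above ("exactly 1 <|turn>model emitted") can be sketched in plain Python. The rule itself lives in the Jinja2 template, so this is an illustrative model of the suppression logic, not the shipped code:

```python
def count_model_turn_headers(messages):
    """Count how many <|turn>model headers would be emitted: consecutive
    assistant messages separated only by role:'tool' messages share one
    model turn, so the duplicate header is suppressed."""
    count = 0
    prev_non_tool_role = None
    for m in messages:
        if m["role"] == "tool":
            continue  # tool results render inline, never as a new turn
        if m["role"] == "assistant" and prev_non_tool_role != "assistant":
            count += 1  # a genuinely new model turn
        # assistant following assistant (tool messages in between were
        # skipped above): same turn continues, header suppressed
        prev_non_tool_role = m["role"]
    return count
```

A user/assistant/user/assistant exchange still yields two headers, since the user message in between ends the model turn.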

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Models:

Library:

…atibility

- Chat Template: Added handler for OpenAI-standard 'role: "tool"' messages to render inline as <|tool_response> without initiating a new <|turn> block.
- Chat Template: Extended turn-close condition to inhibit <turn|> emission when the model has pending 'tool_calls' without corresponding responses, preserving the continuous turn structure.
- Generation Config: Updated 'eos_token_id' derivation in convert_gemma4_weights.py to prioritize the terminal '<tool_call|>' token over the starting '<|tool_response>' token, resolving post-call generation hallucinations in HuggingFace inference.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Chat template patcher (_patch_template_for_openai_tool_role):
- Inject format_tool_response_block macro after strip_thinking to DRY
  up tool-response rendering (used by both legacy and OpenAI paths)
- Replace the entire message loop instead of two point patches:
  * Skip role:'tool' messages in outer loop; render them proactively
    via forward-scan from the preceding assistant message
  * Suppress duplicate <|turn>model on consecutive assistant messages
    separated only by tool messages (multi-round tool-call loops)
  * Resolve tool_call_id back to function name from originating
    tool_calls array (prevents response:unknown fallback)
  * Handle tool response content as both plain strings and OpenAI
    content-parts arrays ([{type:'text', text:'...'}])
  * Render reasoning/reasoning_content fields as <|channel>thought
    blocks (supports both vLLM and older inference server variants)
- Preserve legacy tool_responses on assistant messages (Gemma native)
- Pre-scan loop_messages for last_user_idx to guard reasoning injection
Stop tokens (eos_token_id):
- Remove <tool_call|> (etc_token) from the stop token list
- Keep only <eos> + <turn|> (eot_token)
- Enable parallel tool calls without premature truncation after the
  first <tool_call|>; <turn|> still terminates the model turn correctly
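The reasoning-field rendering mentioned above can be sketched as follows: accept either 'reasoning' (vLLM/DeepSeek-style) or 'reasoning_content' (older inference servers) on an assistant message and wrap it in a <|channel>thought block. Everything past the <|channel>thought marker is an assumption for illustration:

```python
def render_thought_block(message):
    """Render an assistant message's thinking field, if any, as a
    <|channel>thought block; return '' when there is nothing to render."""
    thought = message.get("reasoning") or message.get("reasoning_content")
    if not thought:
        return ""
    return f"<|channel>thought\n{thought}\n"
```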

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
@github-actions
Contributor

github-actions bot commented Apr 6, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: gemma4

@zucchini-nlp
Member

cc @Rocketknight1 for chat templates and tool calling, but this seems to be only the conversion. Prob the latest changes you made before release didn't make it to the conversion script 😅

