This documents the updated chat template to use with Gemma 4 models for
reasoning and/or tool calling that was merged in
vllm-project/vllm#39027.
It also adds instructions for enabling thinking by default, for users
who prefer the model to always think, and replaces the deprecated
`reasoning_content` field with the updated `reasoning` field.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Gemma 4 supports structured thinking, where the model can reason step-by-step before producing a final answer. The reasoning process is exposed via the `reasoning` field in the API response.
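As a sketch of how a client consumes this, the snippet below reads the `reasoning` field from an OpenAI-compatible chat completion response. The payload is illustrative sample data (not real Gemma 4 output), but the `choices[0].message` shape matches the OpenAI-style API that vLLM serves.

```python
# Sketch: reading the `reasoning` field from an OpenAI-compatible
# chat completion response. The payload is made-up sample data,
# not real Gemma 4 output.
import json

sample_response = json.loads("""
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning": "The user asks for 2 + 2; adding the operands gives 4.",
        "content": "2 + 2 = 4."
      }
    }
  ]
}
""")

message = sample_response["choices"][0]["message"]
reasoning = message.get("reasoning")  # step-by-step thinking; may be absent
answer = message["content"]           # final user-facing answer

if reasoning:
    print("reasoning:", reasoning)
print("answer:", answer)
```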
> ℹ️ **Note**
> The example chat template file is included in the official container and can also be downloaded from the [vLLM repository](https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_gemma4.jinja).
If you want to default to thinking enabled for all requests, add the argument `--default-chat-template-kwargs '{"enable_thinking": true}'` to the above command.
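As a minimal sketch, a serve command that always enables thinking might look like the following. The model identifier is a placeholder (not specified in this PR); the two flags are the ones documented here.

```shell
# Sketch: launch vLLM with the Gemma 4 chat template and thinking
# enabled by default. MODEL is a placeholder; substitute the actual
# Gemma 4 model name.
vllm serve "$MODEL" \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --default-chat-template-kwargs '{"enable_thinking": true}'
```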
Gemma 4 supports function calling with a dedicated tool-call protocol using custom special tokens (`<|tool_call|>`, `<tool_call|>`, etc.).
> ℹ️ **Note**
> The example chat template file is included in the official container and can also be downloaded from the [vLLM repository](https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_gemma4.jinja).
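From the client side, function calling uses a standard OpenAI-style `tools` list; the server-side parser translates the model's special-token output back into structured `tool_calls` entries. The `get_weather` function and the model name below are made-up examples for illustration.

```python
# Sketch: an OpenAI-style tool definition and request payload for an
# OpenAI-compatible vLLM server. `get_weather` is a made-up example
# function; the gemma4 tool call parser on the server converts the
# model's special-token tool-call output into structured `tool_calls`.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

request_payload = {
    "model": "gemma-4",  # placeholder model name
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
```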
|`--tool-call-parser gemma4`| Enable Gemma 4 tool call parser | Required for function calling |
|`--enable-auto-tool-choice`| Auto-detect tool calls in output | Required for function calling |
|`--chat-template examples/tool_chat_template_gemma4.jinja`| Override the model's default chat template to one optimized for reasoning and tool calling with vLLM |
|`--mm-processor-kwargs '{"max_soft_tokens": N}'`| Set default vision token budget | 280 (default), up to 1120 |
|`--async-scheduling`| Overlap scheduling with decoding | Recommended for throughput |
|`--gpu-memory-utilization 0.90`| GPU memory fraction for model + KV cache | 0.85-0.95 |
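Putting the flags above together, a full launch command might look like this sketch. The model identifier is a placeholder, and `560` is an example vision token budget chosen from within the documented 280-1120 range.

```shell
# Sketch combining the flags from the table above. MODEL is a
# placeholder; 560 is an example value for max_soft_tokens within
# the documented 280-1120 range.
vllm serve "$MODEL" \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --mm-processor-kwargs '{"max_soft_tokens": 560}' \
  --async-scheduling \
  --gpu-memory-utilization 0.90
```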