[Bugfix] Fix Gemma4 non-streaming reasoning parsing#38858
jacobzhang22 wants to merge 1 commit into vllm-project:main from
Conversation
Code Review
This pull request introduces the _get_reasoning_parser_input_text helper function to preserve reasoning markers during non-streaming chat completions by re-decoding token IDs when special tokens are detected. It also includes unit tests for the Gemma4ReasoningParser. Feedback suggests relaxing the condition in the helper function to allow for reasoning parsers that define only a start or an end token, rather than requiring both.
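A minimal, self-contained sketch of the approach the summary describes — re-decoding from token IDs when the reasoning parser's boundary token IDs appear in the output. All names here (the helper signature, the toy tokenizer, and the parser attributes) are assumptions for illustration, not the actual vLLM code:

```python
def get_reasoning_parser_input_text(output_text, token_ids, tokenizer, parser):
    """Re-decode with special tokens kept when the reasoning parser's
    boundary token IDs are present; otherwise pass the text through."""
    boundary_ids = {parser.start_token_id, parser.end_token_id}
    boundary_ids.discard(None)  # per the feedback: one marker is enough
    if boundary_ids & set(token_ids):
        return tokenizer.decode(token_ids, skip_special_tokens=False)
    return output_text


class ToyTokenizer:
    """Stand-in tokenizer whose default decode drops special tokens."""
    VOCAB = {1: "<think>", 2: "reason ", 3: "</think>", 4: "answer"}
    SPECIAL = {1, 3}

    def decode(self, ids, skip_special_tokens=True):
        keep = [i for i in ids if not (skip_special_tokens and i in self.SPECIAL)]
        return "".join(self.VOCAB[i] for i in keep)


class ToyParser:
    start_token_id = 1
    end_token_id = 3
```

With the boundary tokens present in `token_ids`, the helper returns the re-decoded string with markers intact; without them, it returns `output_text` unchanged, so the common path pays no extra decode cost.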
Co-authored-by: OpenAI Codex
Signed-off-by: Jacob <jaco8123@gmail.com>
Force-pushed from 863e9e6 to 999cb4d
Hi @jacobzhang22 - it is not a bug, it is working as expected. When you start a server with the reasoning parser, it is reasoning capable, but the behavior depends on the inference request instructions. If you request without reasoning, everything comes back as a single structure with `skip_special_tokens` on its default of `True`. If you enable thinking for a request, you must tell the server to keep the special tokens so the thoughts can be parsed. It is a design/UX decision and it is in sync with the Transformers implementation too (some code samples are in the official docs: https://ai.google.dev/gemma/docs/capabilities/thinking).
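The two request styles described above could look like the payloads below. `skip_special_tokens` is a vLLM sampling parameter accepted in chat-completions requests; the model name is a placeholder, and whether `chat_template_kwargs`/`enable_thinking` is the right switch for Gemma is an assumption here (it is model- and chat-template-dependent):

```python
# Sketch: default request vs. a thinking-enabled request that keeps
# special tokens so the reasoning markers survive decoding.
plain_request = {
    "model": "gemma",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    # defaults: skip_special_tokens=True, reasoning markers are stripped
}

thinking_request = {
    "model": "gemma",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "chat_template_kwargs": {"enable_thinking": True},  # assumed switch
    "skip_special_tokens": False,  # keep markers so thoughts can be parsed
}
```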
Thanks @lucianommartins, this is helpful context. Looking at the Gemma thinking docs, I see that the intended parsing flow depends on preserving the special tokens during decode (skip_special_tokens=False) before parsing the response. I was treating this as a vLLM bug in the non-streaming path, but it sounds like this is an intentional behavior/design choice rather than a regression. I'm going to close this PR. Thanks for clarifying.
I don't think these semantics are quite right from a user's point-of-view. In vLLM, |
@bbrowning - I understand... I'm preparing a PR to submit in the next few minutes with a proposal to address that, but it depends on some vLLM core engine changes that I will need @ywang96 and the other vLLM folks to evaluate.
fyi @bbrowning - trying to fix that here #39081 |
Purpose
Fixes #38855, where Gemma4 non-streaming chat completions fail to populate `reasoning_content` and instead return the thought trace in `content`.

The issue report suggested a parser-side token ID fix, but after inspecting the merged Gemma4 implementation from #38826, the Gemma4 parser already inherits `start_token_id`/`end_token_id` support from `BaseThinkingReasoningParser`. The actual failure is in the non-streaming OpenAI chat serving path: it passes `output.text` to the parser after special tokens have already been stripped, so Gemma4 never sees the `<|channel>`/`<channel|>` boundaries.

This PR fixes that handoff in `vllm/entrypoints/openai/chat_completion/serving.py` by reconstructing parser input from `output.token_ids` with `skip_special_tokens=False` when the reasoning parser's boundary token IDs are present. This keeps the fix narrow, preserves the existing parser contract, and avoids adding parser-side logic that would need to infer reasoning boundaries from text after the delimiter tokens have already been removed.

Test Plan
Test Result
Passed:

- `tests/entrypoints/openai/chat_completion/test_serving_chat.py`
- `pre-commit` on all changed files

The focused regression covers the before/after behavior for this bug: before the fix, `reasoning_content` was `null` and `content` contained the thought trace.
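As a self-contained illustration of the before/after split being exercised, a reference split of marker-delimited text might look like this. The `<think>`/`</think>` marker strings and the function name are placeholders, not the actual Gemma4 tokens or the test code from this PR:

```python
def parse_reasoning(text, start="<think>", end="</think>"):
    """Split a decoded completion into (reasoning, content).

    Returns (None, text) when no markers are present -- which is exactly
    what happens when special tokens were stripped before parsing, i.e.
    the buggy behavior where content carried the whole thought trace.
    """
    if start in text and end in text:
        pre, rest = text.split(start, 1)
        reasoning, final = rest.split(end, 1)
        return reasoning, (pre + final).strip()
    return None, text
```

With markers preserved, the reasoning lands in the first slot; with markers stripped, the function degrades to `(None, text)`, mirroring the `reasoning_content`-is-`null` failure mode described above.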