
[Bugfix] Fix Gemma4 non-streaming reasoning parsing#38858

Closed
jacobzhang22 wants to merge 1 commit into vllm-project:main from jacobzhang22:fix-gemma4-reasoning-parser

Conversation

@jacobzhang22

Purpose

Fixes #38855, where Gemma4 non-streaming chat completions fail to populate reasoning_content and instead return the thought trace in content.

The issue report suggested a parser-side token ID fix, but after inspecting the merged Gemma4 implementation from #38826, the Gemma4 parser already inherits start_token_id / end_token_id support from BaseThinkingReasoningParser. The actual failure is in the non-streaming OpenAI chat serving path: it passes output.text to the parser after special tokens have already been stripped, so Gemma4 never sees <|channel> / <channel|> boundaries.

This PR fixes that handoff in vllm/entrypoints/openai/chat_completion/serving.py by reconstructing parser input from output.token_ids with skip_special_tokens=False when the reasoning parser's boundary token IDs are present. This keeps the fix narrow, preserves the existing parser contract, and avoids adding parser-side logic that would need to infer reasoning boundaries from text after the delimiter tokens have already been removed.
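The handoff fix can be sketched as follows. This is a simplified stand-in, not the actual vLLM code: `FakeTokenizer` and `get_reasoning_parser_input_text` are hypothetical names, and the real serving path uses the model's tokenizer rather than a vocab dict.

```python
# Hypothetical sketch of the serving-path fix described above: when the
# reasoning parser's boundary token ids appear in the output, re-decode the
# token ids with skip_special_tokens=False so the parser still sees the
# <|channel> / <channel|> delimiters.
from dataclasses import dataclass


@dataclass
class FakeTokenizer:
    """Stand-in for a real tokenizer: maps token ids to token strings."""
    vocab: dict       # token id -> token text
    special_ids: set  # ids treated as special tokens

    def decode(self, token_ids, skip_special_tokens=True):
        pieces = []
        for tid in token_ids:
            if skip_special_tokens and tid in self.special_ids:
                continue  # default path drops special tokens, losing markers
            pieces.append(self.vocab[tid])
        return "".join(pieces)


def get_reasoning_parser_input_text(tokenizer, token_ids,
                                    start_token_id, end_token_id):
    """Keep special tokens in the decode when reasoning boundaries are present."""
    if start_token_id in token_ids or end_token_id in token_ids:
        return tokenizer.decode(token_ids, skip_special_tokens=False)
    return tokenizer.decode(token_ids, skip_special_tokens=True)


# Demo: ids 1 and 2 stand in for the reasoning boundary special tokens.
tok = FakeTokenizer(
    vocab={1: "<|channel>", 2: "<channel|>", 3: "thinking...", 4: "answer"},
    special_ids={1, 2},
)
print(get_reasoning_parser_input_text(tok, [1, 3, 2, 4], 1, 2))
# -> <|channel>thinking...<channel|>answer
```

With boundaries present the markers survive the decode, so the parser can split reasoning from the final answer; outputs without boundary ids still take the default stripped-decode path.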

Test Plan

```shell
.venv/bin/python -m pytest tests/entrypoints/openai/chat_completion/test_serving_chat.py -k gemma4_non_streaming_reasoning_uses_token_ids -v
.venv/bin/pre-commit run --files vllm/entrypoints/openai/chat_completion/serving.py tests/entrypoints/openai/chat_completion/test_serving_chat.py
```

Test Result

Passed:

  • Targeted regression test for the new non-streaming Gemma4 behavior in tests/entrypoints/openai/chat_completion/test_serving_chat.py
  • pre-commit on all changed files

The focused regression covers the before/after behavior for this bug:

  • Before: reasoning_content was null and content contained the thought trace
  • After: reasoning and final content are separated correctly
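The before/after split can be illustrated with a toy parser. The marker strings come from the PR description; the regex-based parsing here is a simplified stand-in for the real Gemma4 reasoning parser.

```python
# Hedged illustration of the before/after behavior: the parser can only
# separate reasoning from content if the delimiter markers survive decoding.
import re


def split_reasoning(text):
    """Return (reasoning_content, content) if markers are present."""
    m = re.match(r"<\|channel>(.*?)<channel\|>(.*)", text, re.DOTALL)
    if m:
        return m.group(1), m.group(2)
    return None, text


# Before the fix (special tokens stripped): everything lands in content.
print(split_reasoning("thinking...answer"))
# -> (None, 'thinking...answer')

# After the fix (tokens preserved): reasoning and content are separated.
print(split_reasoning("<|channel>thinking...<channel|>answer"))
# -> ('thinking...', 'answer')
```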

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces the _get_reasoning_parser_input_text helper function to preserve reasoning markers during non-streaming chat completions by re-decoding token IDs when special tokens are detected. It also includes unit tests for the Gemma4ReasoningParser. Feedback suggests relaxing the condition in the helper function to allow for reasoning parsers that define only a start or an end token, rather than requiring both.
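The review's suggestion can be sketched as a relaxed guard. `should_keep_special_tokens` is a hypothetical name illustrating the proposed condition, where either boundary id may be `None` for parsers that define only one of them.

```python
# Sketch of the reviewer's suggested relaxation: trigger the special-token
# re-decode if ANY defined boundary token id appears, rather than requiring
# both a start and an end token to be defined.
def should_keep_special_tokens(token_ids, start_token_id, end_token_id):
    # Collect only the boundary ids the parser actually defines.
    boundary_ids = {t for t in (start_token_id, end_token_id) if t is not None}
    # Re-decode with special tokens kept if any boundary id was generated.
    return any(t in boundary_ids for t in token_ids)
```

This keeps the fix working for parsers like Gemma4 that define both ids while also covering parsers with an end-token-only (or start-token-only) contract.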

Co-authored-by: OpenAI Codex
Signed-off-by: Jacob <jaco8123@gmail.com>
@jacobzhang22 force-pushed the fix-gemma4-reasoning-parser branch from 863e9e6 to 999cb4d on April 3, 2026 00:12
@lucianommartins
Contributor

Hi @jacobzhang22 - it is not a bug, it is working as expected.

when you start a server with the reasoning parser, it is reasoning capable but depends on the inference request instructions.

if you request without reasoning, everything comes as a single structure with skip_special_tokens on the default True.

if you enable thinking for a request, you must tell the server to keep the special tokens, then the thoughts can be parsed.

it is a design/UX decision and it is in sync with the Transformers implementation too (some code samples at the official docs: https://ai.google.dev/gemma/docs/capabilities/thinking)
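Under the design described above, a request would opt in on both fronts. This payload is illustrative: the model name is a placeholder, and the field names follow vLLM's OpenAI-compatible API (`chat_template_kwargs` and `skip_special_tokens` as extra request parameters).

```python
# Hypothetical request payload for the intended flow: enable thinking via
# chat-template kwargs AND keep special tokens so the reasoning parser can
# see the delimiters in the decoded output.
payload = {
    "model": "gemma4",  # placeholder model name
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "chat_template_kwargs": {"enable_thinking": True},
    "skip_special_tokens": False,  # keep reasoning delimiters for parsing
}
```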

@ywang96 ywang96 closed this Apr 3, 2026
@jacobzhang22
Author

Thanks @lucianommartins, this is helpful context. Looking at the Gemma thinking docs, I see that the intended parsing flow depends on preserving the special tokens during decode (skip_special_tokens=False) before parsing the response. I was treating this as a vLLM bug in the non-streaming path, but it sounds like this is an intentional behavior/design choice rather than a regression.

I’m going to close this PR. Thanks for clarifying.

@bbrowning
Contributor

> Hi @jacobzhang22 - it is not a bug, it is working as expected.
>
> when you start a server with the reasoning parser, it is reasoning capable but depends on the inference request instructions.
>
> if you request without reasoning, everything comes as a single structure with skip_special_tokens on the default True.
>
> if you enable thinking for a request, you must tell the server to keep the special tokens, then the thoughts can be parsed.
>
> it is a design/UX decision and it is in sync with the Transformers implementation too (some code samples at the official docs: https://ai.google.dev/gemma/docs/capabilities/thinking)

I don't think these semantics are quite right from a user's point-of-view. In vLLM, enable_thinking=True (either via setting it as a default chat template kwarg that applies to every request or sending it per-request) should be enough to get reasoning working as expected, including reasoning content parsed by vLLM into the reasoning property on Chat Completion message responses. We wouldn't want users to also have to set skip_special_tokens=False on their request, as that's leaking vLLM internals of how we handle reasoning to the user.

@lucianommartins
Contributor

@bbrowning - I understand... I'm baking one PR to submit in the next minutes with a proposal to address that... but it requires some vLLM core engine changes that I will need @ywang96 and the other vLLM folks to evaluate the idea.

@lucianommartins
Contributor

fyi @bbrowning - trying to fix that here #39081


Labels

bug (Something isn't working), frontend


Development

Successfully merging this pull request may close these issues.

[Bug]: Gemma4 reasoning parser fails to separate reasoning_content — <|channel> tokens stripped before parsing

4 participants