[Bugfix] Fix Gemma4 non-streaming reasoning parsing#38858
jacobzhang22 wants to merge 1 commit into vllm-project:main from
Conversation
Code Review
This pull request introduces the _get_reasoning_parser_input_text helper function to preserve reasoning markers during non-streaming chat completions by re-decoding token IDs when special tokens are detected. It also includes unit tests for the Gemma4ReasoningParser. Feedback suggests relaxing the condition in the helper function to allow for reasoning parsers that define only a start or an end token, rather than requiring both.
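A minimal, self-contained sketch of the approach the summary describes — re-decoding from token IDs when the reasoning parser's boundary token IDs appear in the output. All names here (the helper signature, the toy tokenizer, and the parser attributes) are assumptions for illustration, not the actual vLLM code:

```python
def get_reasoning_parser_input_text(output_text, token_ids, tokenizer, parser):
    """Re-decode with special tokens kept when the reasoning parser's
    boundary token IDs are present; otherwise pass the text through."""
    boundary_ids = {parser.start_token_id, parser.end_token_id}
    boundary_ids.discard(None)  # per the feedback: one marker is enough
    if boundary_ids & set(token_ids):
        return tokenizer.decode(token_ids, skip_special_tokens=False)
    return output_text


class ToyTokenizer:
    """Stand-in tokenizer whose default decode drops special tokens."""
    VOCAB = {1: "<think>", 2: "reason ", 3: "</think>", 4: "answer"}
    SPECIAL = {1, 3}

    def decode(self, ids, skip_special_tokens=True):
        keep = [i for i in ids if not (skip_special_tokens and i in self.SPECIAL)]
        return "".join(self.VOCAB[i] for i in keep)


class ToyParser:
    start_token_id = 1
    end_token_id = 3
```

With the boundary tokens present in `token_ids`, the helper returns the re-decoded string with markers intact; without them, it returns `output_text` unchanged, so the common path pays no extra decode cost.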
Co-authored-by: OpenAI Codex
Signed-off-by: Jacob <jaco8123@gmail.com>
Force-pushed from 863e9e6 to 999cb4d
Hi @jacobzhang22 - it is not a bug, it is working as expected. When you start a server with the reasoning parser, it is reasoning capable, but the behavior depends on the inference request instructions. If you request without reasoning, everything comes back as a single structure with `skip_special_tokens` on its default of `True`. If you enable thinking for a request, you must tell the server to keep the special tokens so the thoughts can be parsed. It is a design/UX decision and it is in sync with the Transformers implementation too (some code samples are in the official docs: https://ai.google.dev/gemma/docs/capabilities/thinking).
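The two request styles described above could look like the payloads below. `skip_special_tokens` is a vLLM sampling parameter accepted in chat-completions requests; the model name is a placeholder, and whether `chat_template_kwargs`/`enable_thinking` is the right switch for Gemma is an assumption here (it is model- and chat-template-dependent):

```python
# Sketch: default request vs. a thinking-enabled request that keeps
# special tokens so the reasoning markers survive decoding.
plain_request = {
    "model": "gemma",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    # defaults: skip_special_tokens=True, reasoning markers are stripped
}

thinking_request = {
    "model": "gemma",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "chat_template_kwargs": {"enable_thinking": True},  # assumed switch
    "skip_special_tokens": False,  # keep markers so thoughts can be parsed
}
```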
Thanks @lucianommartins, this is helpful context. Looking at the Gemma thinking docs, I see that the intended parsing flow depends on preserving the special tokens during decode (skip_special_tokens=False) before parsing the response. I was treating this as a vLLM bug in the non-streaming path, but it sounds like this is an intentional behavior/design choice rather than a regression. I'm going to close this PR. Thanks for clarifying.
I don't think these semantics are quite right from a user's point-of-view. In vLLM, |
@bbrowning - I understand... I'm preparing a PR to submit in the next few minutes with a proposal to address that, but it depends on some vLLM core engine changes that I will need @ywang96 and the other vLLM folks to evaluate.
fyi @bbrowning - trying to fix that here #39081 |
Purpose
Fixes #38855, where Gemma4 non-streaming chat completions fail to populate `reasoning_content` and instead return the thought trace in `content`.

The issue report suggested a parser-side token ID fix, but after inspecting the merged Gemma4 implementation from #38826, the Gemma4 parser already inherits `start_token_id`/`end_token_id` support from `BaseThinkingReasoningParser`. The actual failure is in the non-streaming OpenAI chat serving path: it passes `output.text` to the parser after special tokens have already been stripped, so Gemma4 never sees the `<|channel>`/`<channel|>` boundaries.

This PR fixes that handoff in `vllm/entrypoints/openai/chat_completion/serving.py` by reconstructing parser input from `output.token_ids` with `skip_special_tokens=False` when the reasoning parser's boundary token IDs are present. This keeps the fix narrow, preserves the existing parser contract, and avoids adding parser-side logic that would need to infer reasoning boundaries from text after the delimiter tokens have already been removed.

Test Plan
Test Result
Passed:

- `tests/entrypoints/openai/chat_completion/test_serving_chat.py`
- `pre-commit` on all changed files

The focused regression covers the before/after behavior for this bug: before the fix, `reasoning_content` was `null` and `content` contained the thought trace.
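As a self-contained illustration of the before/after split being exercised, a reference split of marker-delimited text might look like this. The `<think>`/`</think>` marker strings and the function name are placeholders, not the actual Gemma4 tokens or the test code from this PR:

```python
def parse_reasoning(text, start="<think>", end="</think>"):
    """Split a decoded completion into (reasoning, content).

    Returns (None, text) when no markers are present -- which is exactly
    what happens when special tokens were stripped before parsing, i.e.
    the buggy behavior where content carried the whole thought trace.
    """
    if start in text and end in text:
        pre, rest = text.split(start, 1)
        reasoning, final = rest.split(end, 1)
        return reasoning, (pre + final).strip()
    return None, text
```

With markers preserved, the reasoning lands in the first slot; with markers stripped, the function degrades to `(None, text)`, mirroring the `reasoning_content`-is-`null` failure mode described above.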