feat(async-grpo): add sampling parameter parity #5418
Open
kdubovikov wants to merge 2 commits into huggingface:main
What does this PR do?
This PR adds GRPO-style sampling parameter support to `AsyncGRPOConfig` and wires those options through the async rollout path so `AsyncGRPOTrainer` matches the regular `GRPOTrainer` contract more closely.

Before this change, async GRPO only exposed `temperature` and `max_completion_length` for generation. It did not support the standard GRPO sampling controls like `top_p`, `top_k`, `min_p`, `repetition_penalty`, or the generic `generation_kwargs` override path.

This PR adds support for:

- `top_p`
- `top_k`
- `min_p`
- `repetition_penalty`
- `generation_kwargs`

The values are passed from `AsyncGRPOConfig` to `AsyncGRPOTrainer`, then into `AsyncRolloutWorker`, and finally into the `/v1/completions` request payload sent to the vLLM server.

To maintain the same user-facing contract as `GRPOTrainer`, keys provided in `generation_kwargs` override the named sampling arguments on conflict.

This PR also updates the async GRPO docs to explicitly mention the supported sampling controls and their precedence behavior.
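The precedence rule described above can be illustrated with a minimal sketch. This is not the code from the PR: `build_completions_payload` is a hypothetical helper, the default values are placeholders, and skipping unset (`None`) parameters is an assumption made here for brevity. It only shows the ordering idea, where named sampling arguments are written first and `generation_kwargs` is merged last so that its keys win on conflict.

```python
from typing import Any, Optional


def build_completions_payload(
    prompt: str,
    *,
    temperature: float = 1.0,
    max_tokens: int = 256,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
    min_p: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    generation_kwargs: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: assemble a /v1/completions-style request body."""
    payload: dict[str, Any] = {
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    # Named sampling arguments go in first; unset ones are skipped (assumption).
    named = {
        "top_p": top_p,
        "top_k": top_k,
        "min_p": min_p,
        "repetition_penalty": repetition_penalty,
    }
    payload.update({key: value for key, value in named.items() if value is not None})
    # generation_kwargs is merged last, so its keys override the named arguments.
    payload.update(generation_kwargs or {})
    return payload


payload = build_completions_payload(
    "Hello",
    top_p=0.9,
    top_k=50,
    generation_kwargs={"top_p": 0.5, "seed": 7},
)
```

In this sketch the conflicting `top_p` comes from `generation_kwargs`, while `top_k` keeps its named value and `seed` is passed through as an extra key.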
Fixes # (issue)
Before submitting
AI writing disclosure
We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.
Note

Medium Risk
Changes the generation request payload sent to the vLLM server by introducing additional sampling parameters and a generic override mechanism (`generation_kwargs`), which can materially alter rollout distributions and training behavior. Risk is contained to async GRPO/vLLM integration and is covered by new unit tests verifying parameter wiring and precedence.

Overview
Adds GRPO-style sampling controls to async GRPO generation: `top_p`, `top_k`, `min_p`, `repetition_penalty`, and `generation_kwargs` are introduced on `AsyncGRPOConfig` and plumbed through `AsyncGRPOTrainer` into `AsyncRolloutWorker`.

Updates `AsyncRolloutWorker._generate_one_turn` to include these fields in the `/v1/completions` payload and to apply `generation_kwargs` after named params so overrides win. Documentation is updated to describe supported sampling options and precedence, and tests assert both config-to-worker wiring and override behavior.

Written by Cursor Bugbot for commit ca1e529. This will update automatically on new commits.
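As a rough illustration of what a config-to-worker wiring check might look like, the sketch below uses stand-in classes only. `StandInConfig`, `StandInWorker`, `make_worker`, and `SAMPLING_FIELDS` are hypothetical names, and the default values are placeholders; none of them are taken from the PR or from the real `AsyncGRPOConfig` and `AsyncRolloutWorker`.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical list of the sampling fields that should reach the worker.
SAMPLING_FIELDS = (
    "temperature",
    "top_p",
    "top_k",
    "min_p",
    "repetition_penalty",
    "generation_kwargs",
)


@dataclass
class StandInConfig:
    """Stand-in for a training config; defaults are placeholders."""

    temperature: float = 1.0
    max_completion_length: int = 256
    top_p: float = 1.0
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: float = 1.0
    generation_kwargs: Optional[dict[str, Any]] = None


class StandInWorker:
    """Stand-in for a rollout worker that records the sampling options it was given."""

    def __init__(self, **sampling: Any) -> None:
        self.sampling = sampling


def make_worker(config: StandInConfig) -> StandInWorker:
    # Forward every sampling field by name so none is silently dropped.
    return StandInWorker(**{name: getattr(config, name) for name in SAMPLING_FIELDS})


config = StandInConfig(top_p=0.9, min_p=0.05, generation_kwargs={"seed": 1})
worker = make_worker(config)
```

A wiring test in this style would then assert that each field on the worker equals the value set on the config.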