feat(async-grpo): add sampling parameter parity by kdubovikov · Pull Request #5418 · huggingface/trl

kdubovikov · 2026-03-31T15:42:47Z

What does this PR do?

This PR adds GRPO-style sampling parameter support to AsyncGRPOConfig and wires those options through the async rollout path so AsyncGRPOTrainer matches the regular GRPOTrainer contract more closely.

Before this change, async GRPO only exposed temperature and max_completion_length for generation. It did not support the standard GRPO sampling controls like top_p, top_k, min_p, repetition_penalty, or the generic generation_kwargs override path.

This PR adds support for:

top_p
top_k
min_p
repetition_penalty
generation_kwargs

The values are passed from AsyncGRPOConfig to AsyncGRPOTrainer, then into AsyncRolloutWorker, and finally into the /v1/completions request payload sent to the vLLM server.

To maintain the same user-facing contract as GRPOTrainer, keys provided in generation_kwargs override the named sampling arguments on conflict.

This PR also updates the async GRPO docs to explicitly mention the supported sampling controls and their precedence behavior.

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Medium Risk
Changes the generation request payload sent to the vLLM server by introducing additional sampling parameters and a generic override mechanism (generation_kwargs), which can materially alter rollout distributions and training behavior. Risk is contained to async GRPO/vLLM integration and is covered by new unit tests verifying parameter wiring and precedence.

Overview
Adds GRPO-style sampling controls to async GRPO generation: top_p, top_k, min_p, repetition_penalty, and generation_kwargs are introduced on AsyncGRPOConfig and plumbed through AsyncGRPOTrainer into AsyncRolloutWorker.

Updates AsyncRolloutWorker._generate_one_turn to include these fields in the /v1/completions payload and to apply generation_kwargs after named params so overrides win. Documentation is updated to describe supported sampling options and precedence, and tests assert both config-to-worker wiring and override behavior.

^{Written by Cursor Bugbot for commit ca1e529. This will update automatically on new commits. Configure here.}

Kirill Dubovikov and others added 2 commits March 31, 2026 18:40

feat(async-grpo): add sampling parameter parity

219c85c

Merge branch 'main' into async-grpo-sampling-parity

ca1e529

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(async-grpo): add sampling parameter parity#5418

feat(async-grpo): add sampling parameter parity#5418
kdubovikov wants to merge 2 commits intohuggingface:mainfrom
kdubovikov:async-grpo-sampling-parity

kdubovikov commented Mar 31, 2026 •

edited by cursor bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kdubovikov commented Mar 31, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

AI writing disclosure

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kdubovikov commented Mar 31, 2026 •

edited by cursor bot

Loading