Skip to content

feat(async-grpo): add sampling parameter parity#5418

Open
kdubovikov wants to merge 2 commits intohuggingface:mainfrom
kdubovikov:async-grpo-sampling-parity
Open

feat(async-grpo): add sampling parameter parity#5418
kdubovikov wants to merge 2 commits intohuggingface:mainfrom
kdubovikov:async-grpo-sampling-parity

Conversation

@kdubovikov
Copy link
Copy Markdown
Contributor

@kdubovikov kdubovikov commented Mar 31, 2026

What does this PR do?

This PR adds GRPO-style sampling parameter support to AsyncGRPOConfig and wires those options through the async rollout path so AsyncGRPOTrainer matches the regular GRPOTrainer contract more closely.

Before this change, async GRPO only exposed temperature and max_completion_length for generation. It did not support the standard GRPO sampling controls like top_p, top_k, min_p, repetition_penalty, or the generic generation_kwargs override path.

This PR adds support for:

  • top_p
  • top_k
  • min_p
  • repetition_penalty
  • generation_kwargs

The values are passed from AsyncGRPOConfig to AsyncGRPOTrainer, then into AsyncRolloutWorker, and finally into the /v1/completions request payload sent to the vLLM server.

To maintain the same user-facing contract as GRPOTrainer, keys provided in generation_kwargs override the named sampling arguments on conflict.

This PR also updates the async GRPO docs to explicitly mention the supported sampling controls and their precedence behavior.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Medium Risk
Changes the generation request payload sent to the vLLM server by introducing additional sampling parameters and a generic override mechanism (generation_kwargs), which can materially alter rollout distributions and training behavior. Risk is contained to async GRPO/vLLM integration and is covered by new unit tests verifying parameter wiring and precedence.

Overview
Adds GRPO-style sampling controls to async GRPO generation: top_p, top_k, min_p, repetition_penalty, and generation_kwargs are introduced on AsyncGRPOConfig and plumbed through AsyncGRPOTrainer into AsyncRolloutWorker.

Updates AsyncRolloutWorker._generate_one_turn to include these fields in the /v1/completions payload and to apply generation_kwargs after named params so overrides win. Documentation is updated to describe supported sampling options and precedence, and tests assert both config-to-worker wiring and override behavior.

Written by Cursor Bugbot for commit ca1e529. This will update automatically on new commits. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant