FIPO loss #5434

Open
kdubovikov wants to merge 4 commits into huggingface:main from kdubovikov:fipo-loss
Conversation

Contributor

@kdubovikov kdubovikov commented Apr 2, 2026

What does this PR do?

This is a port of https://github.com/qwenpilot/FIPO as a GRPOTrainer loss function.

Additional manual validation outside the committed test suite:

  • ran short GRPO/FIPO training jobs on AI-MO/NuminaMath-TIR
  • confirmed that both trainers produce nonzero updates when reward variance is present
  • confirmed that with num_iterations > 1, FIPO’s inner reuse steps show nonzero log_ratio, nonzero Future-KL, and influence weights that move away from 1.0, indicating that the FIPO-specific reweighting path is active
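The third bullet can be illustrated with a toy sketch. This assumes, hypothetically, that FIPO accumulates a discounted sum of future per-token KL and maps it to a multiplicative influence weight via `exp(-beta * future_kl)` — the actual formula lives in this PR's diff, and the function and parameter names below are illustrative only. With zero KL the weights stay exactly at 1.0; nonzero future divergence pushes them away from 1.0, which is the behavior the validation run checked for:

```python
import math

def future_kl_influence_weights(per_token_kl, gamma=0.9, beta=1.0):
    """Hypothetical sketch: discounted future-KL per token, mapped to a
    multiplicative weight on that token's advantage. Not the PR's exact code."""
    T = len(per_token_kl)
    future_kl = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        future_kl[t] = running                 # KL strictly after token t
        running = per_token_kl[t] + gamma * running
    # Tokens whose continuation diverges more from the old policy
    # get down-weighted (weight < 1.0); zero future KL gives weight 1.0.
    return [math.exp(-beta * f) for f in future_kl]
```

For example, `future_kl_influence_weights([0.0, 0.0, 0.0])` returns `[1.0, 1.0, 1.0]`, while any nonzero per-token KL produces weights below 1.0 for earlier tokens.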

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Medium Risk
Adds a new GRPO loss implementation and associated hyperparameters, which can materially change training dynamics and stability for users selecting loss_type="fipo". Core loss computation/masking paths are modified (via loss_mask), so regressions could affect loss normalization and metrics across loss types.

Overview
Adds FIPO (Future-KL Influenced Policy Optimization) as a new GRPOTrainer loss_type, computing discounted Future-KL influence weights to reweight token advantages and applying FIPO-specific dual clipping and sequence/token masking.
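The "dual clipping" mentioned above can be sketched for a single token as follows. This assumes FIPO's dual clip resembles the dual-clip PPO formulation (an extra lower bound of `c * advantage` for negative advantages); that correspondence is an assumption, not a statement about the PR's exact implementation, and the names are illustrative:

```python
def dual_clip_surrogate(ratio, adv, eps=0.2, dual_clip_c=3.0):
    """Hypothetical per-token dual-clipped surrogate objective.

    `eps` is the usual PPO clip range; `dual_clip_c` bounds how negative the
    objective can get when the advantage is negative and the ratio is large.
    """
    unclipped = ratio * adv
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps) * adv
    surrogate = min(unclipped, clipped)  # standard pessimistic PPO bound
    if adv < 0:
        # second clip: stop very large ratios from dominating the loss
        surrogate = max(surrogate, dual_clip_c * adv)
    return surrogate
```

With `ratio=10.0` and `adv=-1.0`, the standard clipped bound would give -10.0, but the dual clip floors it at `dual_clip_c * adv = -3.0`.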

Extends GRPOConfig with FIPO hyperparameters and validations, adds warnings/incompatibility checks (e.g., importance_sampling_level ignored; no Liger support), logs new fipo/* training metrics, updates tests to cover the new loss type, and documents an example FIPO recipe in the paper index.
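A configuration sketch of how the new loss type would be selected. `GRPOConfig`, `GRPOTrainer`, `loss_type`, and `num_iterations` exist in TRL today; the commented FIPO-specific parameter names are illustrative guesses, not necessarily the names this PR adds:

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-fipo",
    loss_type="fipo",        # the new loss type added by this PR
    num_iterations=2,        # >1 exercises FIPO's inner reuse steps
    # fipo_gamma=0.9,        # discount for future-KL accumulation (illustrative name)
    # fipo_beta=1.0,         # influence-weight temperature (illustrative name)
)
```

Per the compatibility notes above, `importance_sampling_level` is ignored and Liger kernels are unsupported when `loss_type="fipo"`.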

Written by Cursor Bugbot for commit 1d65c2f. This will update automatically on new commits.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


