FIPO loss #5434

Open
kdubovikov wants to merge 4 commits into huggingface:main from kdubovikov:fipo-loss
Conversation

Contributor

@kdubovikov kdubovikov commented Apr 2, 2026

What does this PR do?

This is a port of https://github.com/qwenpilot/FIPO as a GRPOTrainer loss function.

Additional manual validation outside the committed test suite:

  • ran short GRPO/FIPO training jobs on AI-MO/NuminaMath-TIR
  • confirmed that both trainers produce nonzero updates when reward variance is present
  • confirmed that with num_iterations > 1, FIPO’s inner reuse steps show nonzero log_ratio, nonzero Future-KL, and influence weights that move away from 1.0, indicating that the FIPO-specific reweighting path is active
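The third bullet can be illustrated with a toy sketch. This assumes, hypothetically, that FIPO accumulates a discounted sum of future per-token KL and maps it to a multiplicative influence weight via `exp(-beta * future_kl)` — the actual formula lives in this PR's diff, and the function and parameter names below are illustrative only. With zero KL the weights stay exactly at 1.0; nonzero future divergence pushes them away from 1.0, which is the behavior the validation run checked for:

```python
import math

def future_kl_influence_weights(per_token_kl, gamma=0.9, beta=1.0):
    """Hypothetical sketch: discounted future-KL per token, mapped to a
    multiplicative weight on that token's advantage. Not the PR's exact code."""
    T = len(per_token_kl)
    future_kl = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        future_kl[t] = running                 # KL strictly after token t
        running = per_token_kl[t] + gamma * running
    # Tokens whose continuation diverges more from the old policy
    # get down-weighted (weight < 1.0); zero future KL gives weight 1.0.
    return [math.exp(-beta * f) for f in future_kl]
```

For example, `future_kl_influence_weights([0.0, 0.0, 0.0])` returns `[1.0, 1.0, 1.0]`, while any nonzero per-token KL produces weights below 1.0 for earlier tokens.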

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Note

Medium Risk
Adds a new GRPO loss implementation and associated hyperparameters, which can materially change training dynamics and stability for users selecting loss_type="fipo". Core loss computation/masking paths are modified (via loss_mask), so regressions could affect loss normalization and metrics across loss types.

Overview
Adds FIPO (Future-KL Influenced Policy Optimization) as a new GRPOTrainer loss_type, computing discounted Future-KL influence weights to reweight token advantages and applying FIPO-specific dual clipping and sequence/token masking.
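The "dual clipping" mentioned above can be sketched for a single token as follows. This assumes FIPO's dual clip resembles the dual-clip PPO formulation (an extra lower bound of `c * advantage` for negative advantages); that correspondence is an assumption, not a statement about the PR's exact implementation, and the names are illustrative:

```python
def dual_clip_surrogate(ratio, adv, eps=0.2, dual_clip_c=3.0):
    """Hypothetical per-token dual-clipped surrogate objective.

    `eps` is the usual PPO clip range; `dual_clip_c` bounds how negative the
    objective can get when the advantage is negative and the ratio is large.
    """
    unclipped = ratio * adv
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps) * adv
    surrogate = min(unclipped, clipped)  # standard pessimistic PPO bound
    if adv < 0:
        # second clip: stop very large ratios from dominating the loss
        surrogate = max(surrogate, dual_clip_c * adv)
    return surrogate
```

With `ratio=10.0` and `adv=-1.0`, the standard clipped bound would give -10.0, but the dual clip floors it at `dual_clip_c * adv = -3.0`.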

Extends GRPOConfig with FIPO hyperparameters and validations, adds warnings/incompatibility checks (e.g., importance_sampling_level ignored; no Liger support), logs new fipo/* training metrics, updates tests to cover the new loss type, and documents an example FIPO recipe in the paper index.
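A configuration sketch of how the new loss type would be selected. `GRPOConfig`, `GRPOTrainer`, `loss_type`, and `num_iterations` exist in TRL today; the commented FIPO-specific parameter names are illustrative guesses, not necessarily the names this PR adds:

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-fipo",
    loss_type="fipo",        # the new loss type added by this PR
    num_iterations=2,        # >1 exercises FIPO's inner reuse steps
    # fipo_gamma=0.9,        # discount for future-KL accumulation (illustrative name)
    # fipo_beta=1.0,         # influence-weight temperature (illustrative name)
)
```

Per the compatibility notes above, `importance_sampling_level` is ignored and Liger kernels are unsupported when `loss_type="fipo"`.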

Written by Cursor Bugbot for commit 1d65c2f. This will update automatically on new commits.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


