Add length-normalized sigmoid loss type to DPO trainer #5406
BrownianNotion wants to merge 5 commits into huggingface:main from
Conversation
| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model, and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via `logsigmoid` to fit a logistic regression. |
| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, `beta` is the reciprocal of the margin. |
| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors argue the logit transform can overfit and propose the identity transform to optimize preferences directly; TRL exposes this as `loss_type="ipo"`. |
| `"sigmoid_norm"` | The [SimPO](https://huggingface.co/papers/2405.14734) authors address the length bias in the original sigmoid loss by normalizing by the number of non-mask tokens; TRL exposes this as `loss_type="sigmoid_norm"`. |
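As described in the table, the new loss would be selected through the existing `loss_type` field on `DPOConfig`. A minimal sketch, assuming this PR's `sigmoid_norm` value is merged; `output_dir` and the `beta` value are illustrative:

```python
from trl import DPOConfig

# `loss_type="sigmoid_norm"` is the value added by this PR; `beta` keeps its
# usual DPO meaning and is applied to the length-normalized score difference.
config = DPOConfig(
    output_dir="dpo-sigmoid-norm",  # illustrative path
    loss_type="sigmoid_norm",
    beta=0.1,
)
```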
Missing paper_index.md update for new loss type
Low Severity
This PR adds the sigmoid_norm loss type implementing the length-normalized DPO loss from the Tulu-3 paper (arXiv 2411.15124), but paper_index.md is not updated with a corresponding subsection. The project rule in AGENTS.md requires that PRs implementing methods from research papers must also add a subsection to paper_index.md. While SimPO is already listed in paper_index.md, it's associated with the CPO trainer, and the Tulu-3 paper is not referenced at all.
Triggered by project rule: ../.ai/AGENTS.md
See PR description: Tulu-3 references SimPO for the length-normalised DPO loss, but SimPO focuses on CPO, not DPO. Waiting on maintainers to indicate the best way to add this to the paper index.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
```python
chosen_avg_score = chosen_scores / chosen_mask.sum(dim=1).clamp(min=1.0)
rejected_avg_score = rejected_scores / rejected_mask.sum(dim=1).clamp(min=1.0)
delta = chosen_avg_score - rejected_avg_score
per_sequence_loss = -F.logsigmoid(self.beta * delta)
```
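Stripped of trainer plumbing, the per-sequence computation quoted above reduces to a few lines. A self-contained numeric sketch — plain Python standing in for the tensor ops, with made-up scores and masks:

```python
import math

def sigmoid_norm_loss(chosen_scores, chosen_mask, rejected_scores, rejected_mask, beta=0.1):
    """Length-normalized sigmoid DPO loss for one preference pair.

    `*_scores` are summed per-token log-prob differences (policy minus
    reference); `*_mask` flags the non-padding completion tokens, mirroring
    the PR's `mask.sum(dim=1).clamp(min=1.0)` normalization.
    """
    chosen_avg = chosen_scores / max(sum(chosen_mask), 1.0)
    rejected_avg = rejected_scores / max(sum(rejected_mask), 1.0)
    delta = chosen_avg - rejected_avg
    # -logsigmoid(beta * delta), written out with math.exp
    return -math.log(1.0 / (1.0 + math.exp(-beta * delta)))

# A longer rejected completion no longer wins just by token count:
loss = sigmoid_norm_loss(
    chosen_scores=6.0, chosen_mask=[1] * 10,      # average 0.6 over 10 tokens
    rejected_scores=8.0, rejected_mask=[1] * 40,  # average 0.2 over 40 tokens
)
print(round(loss, 4))  # → 0.6733
```

With the plain `sigmoid` loss the raw sums (6.0 vs 8.0) would make the longer rejected completion look preferred; the per-token averages flip the comparison, which is the length-bias fix the PR targets.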
Missing required paper_index.md update for new loss
Low Severity
This PR implements the length-normalized sigmoid DPO loss from a research paper (Tulu-3 / SimPO), but paper_index.md is not updated with a corresponding subsection. The project rule in AGENTS.md requires that any PR implementing a method from a research paper must also add an entry to paper_index.md. While SimPO already has an entry under CPO, the Tulu-3 paper (which is the primary source for this DPO variant) has no entry, and there is no DPO-specific subsection for this loss type.
Triggered by project rule: ../.ai/AGENTS.md
@qgallouedec would you be able to take a look when you can? Let me know how you want the paper index to be updated and if any further changes/tests are required. Thank you!


What does this PR do?
Adds the length-normalised DPO loss used for DPO in the Tulu-3/OLMo models. Source: https://arxiv.org/pdf/2411.15124, section 5.1.2, equation 6.
Let me know if a section for paper index should also be added. The source cited by Tulu-3 is SimPO, section 2.2 equation 4, but this paper focuses on CPO (and is already in the paper index) not DPO.
Fixes #2964 (issue was closed without being addressed)
Before submitting
AI writing disclosure
We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.
Note
Medium Risk
Medium risk because it extends core `DPOTrainer` loss computation and could affect training behavior when selected, though it is opt-in and covered by an added loss-type training test.
Overview
Adds a new DPO `loss_type="sigmoid_norm"` that normalizes chosen/rejected scores by completion token count before applying the sigmoid (`-logsigmoid(beta * delta)`) to reduce length bias. Updates `DPOConfig` help text, the DPO trainer docs, and the loss-type parametrized training test list to include `sigmoid_norm` (and improves the unknown-loss error message accordingly).
Reviewed by Cursor Bugbot for commit 8f55007. Bugbot is set up for automated code reviews on this repo.