Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/source/dpo_trainer.md
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,7 @@ Several formulations of the objective have been proposed in the literature. Init
| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors argue the logit transform can overfit and propose the identity transform to optimize preferences directly; TRL exposes this as `loss_type="ipo"`. |
| `"sigmoid_norm"` | The [SIMPO](https://huggingface.co/papers/2405.14734) authors address the length-bias in the original sigmoid loss by normalizing by the number of non-mask tokens; TRL exposes this as `loss_type="sigmoid_norm"`. |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing paper_index.md update for new loss type

Low Severity

This PR adds the sigmoid_norm loss type implementing the length-normalized DPO loss from the Tulu-3 paper (arXiv 2411.15124), but paper_index.md is not updated with a corresponding subsection. The project rule in AGENTS.md requires that PRs implementing methods from research papers must also add a subsection to paper_index.md. While SimPO is already listed in paper_index.md, it's associated with the CPO trainer, and the Tulu-3 paper is not referenced at all.

Fix in Cursor Fix in Web

Triggered by project rule: ../.ai/AGENTS.md

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See PR description: Tulu-3 references SimPO for the length-normalised DPO loss but SimPO focuses on CPO, not DPO. Waiting on Maintainers to indicate the best way to add this to paper index.

| `"exo_pair"` | The [EXO](https://huggingface.co/papers/2402.00856) authors propose reverse-KL preference optimization. `label_smoothing` must be strictly greater than `0.0`; a recommended value is `1e-3` (see Eq. 16 for the simplified pairwise variant). The full method uses `K>2` SFT completions and approaches PPO as `K` grows. |
| `"nca_pair"` | The [NCA](https://huggingface.co/papers/2402.05369) authors shows that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. |
| `"robust"` | The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased DPO loss under noisy preferences. Use `label_smoothing` in [`DPOConfig`] to model label-flip probability; valid values are in the range `[0.0, 0.5)`. |
Expand Down
1 change: 1 addition & 0 deletions tests/test_dpo_trainer.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,6 +266,7 @@ def test_train_model(self):
"sigmoid",
"hinge",
"ipo",
"sigmoid_norm",
"exo_pair",
"nca_pair",
"robust",
Expand Down
4 changes: 2 additions & 2 deletions trl/trainer/dpo_config.py
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ class DPOConfig(_BaseConfig):
> Parameters that control the training

loss_type (`list[str]`, *optional*, defaults to `["sigmoid"]`):
Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'exo_pair'`, `'nca_pair'`,
Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'sigmoid_norm'`, `'exo_pair'`, `'nca_pair'`,
`'robust'`, `'bco_pair'`, `'sppo_hard'`, `'aot'`, `'aot_unpaired'`, `'apo_zero'`, `'apo_down'`,
`'discopop'`, `'sft'`. If multiple loss types are provided, they will be combined using the weights
specified in `loss_weights`.
Expand Down Expand Up @@ -209,7 +209,7 @@ class DPOConfig(_BaseConfig):
loss_type: list[str] = field(
default_factory=lambda: ["sigmoid"],
metadata={
"help": "Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'exo_pair'`, "
"help": "Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'sigmoid_norm'`, `'exo_pair'`, "
"`'nca_pair'`, `'robust'`, `'bco_pair'`, `'sppo_hard'`, `'aot'`, `'aot_unpaired'`, `'apo_zero'`, "
"`'apo_down'`, `'discopop'`, `'sft'`. If multiple loss types are provided, they will be combined using "
"the weights specified in `loss_weights`.",
Expand Down
9 changes: 8 additions & 1 deletion trl/trainer/dpo_trainer.py
Original file line number Diff line number Diff line change
Expand Up @@ -1273,6 +1273,13 @@ def _compute_loss(self, model, inputs, return_outputs):
# (Eq. 17) of the paper where beta is the regularization parameter for the IPO loss, denoted by τ.
per_sequence_loss = (ipo_delta - 1 / (2 * self.beta)) ** 2

elif loss_type == "sigmoid_norm":
chosen_mask, rejected_mask = completion_mask.chunk(2, dim=0)
chosen_avg_score = chosen_scores / chosen_mask.sum(dim=1).clamp(min=1.0)
rejected_avg_score = rejected_scores / rejected_mask.sum(dim=1).clamp(min=1.0)
delta = chosen_avg_score - rejected_avg_score
per_sequence_loss = -F.logsigmoid(self.beta * delta)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing required paper_index.md update for new loss

Low Severity

This PR implements the length-normalized sigmoid DPO loss from a research paper (Tulu-3 / SimPO), but paper_index.md is not updated with a corresponding subsection. The project rule in AGENTS.md requires that any PR implementing a method from a research paper must also add an entry to paper_index.md. While SimPO already has an entry under CPO, the Tulu-3 paper (which is the primary source for this DPO variant) has no entry, and there is no DPO-specific subsection for this loss type.

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by project rule: ../.ai/AGENTS.md


elif loss_type == "exo_pair":
# Implements EXO-pref from the paper https://huggingface.co/papers/2402.00856, (Eq. 16)
# Minimize KL(p_fθ || p_rh) for K=2; p_fθ = softmax(βπ * (log πθ − log π_ref)) over {chosen, rejected}
Expand Down Expand Up @@ -1373,7 +1380,7 @@ def _compute_loss(self, model, inputs, return_outputs):

else:
raise ValueError(
f"Unknown loss type: {loss_type}. Should be one of ['sigmoid', 'hinge', 'ipo', 'exo_pair', "
f"Unknown loss type: {loss_type}. Should be one of ['sigmoid', 'hinge', 'ipo', 'sigmoid_norm', 'exo_pair', "
"'nca_pair', 'robust', 'bco_pair', 'sppo_hard', 'aot', 'aot_unpaired', 'apo_zero', 'apo_down', "
"'discopop', 'sft']"
)
Expand Down
Loading