Skip to content
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/source/dpo_trainer.md
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,7 @@ Several formulations of the objective have been proposed in the literature. Init
| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors argue the logit transform can overfit and propose the identity transform to optimize preferences directly; TRL exposes this as `loss_type="ipo"`. |
| `"sigmoid_norm"` | The [SIMPO](https://huggingface.co/papers/2405.14734) authors address the length-bias in the original sigmoid loss by normalizing by the number of non-mask tokens; TRL exposes this as `loss_type="sigmoid_norm"`. |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing paper_index.md update for new loss type

Low Severity

This PR adds the sigmoid_norm loss type implementing the length-normalized DPO loss from the Tulu-3 paper (arXiv 2411.15124), but paper_index.md is not updated with a corresponding subsection. The project rule in AGENTS.md requires that PRs implementing methods from research papers must also add a subsection to paper_index.md. While SimPO is already listed in paper_index.md, it's associated with the CPO trainer, and the Tulu-3 paper is not referenced at all.

Fix in Cursor Fix in Web

Triggered by project rule: ../.ai/AGENTS.md

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See PR description: Tulu-3 references SimPO for the length-normalised DPO loss but SimPO focuses on CPO, not DPO. Waiting on Maintainers to indicate the best way to add this to paper index.

| `"exo_pair"` | The [EXO](https://huggingface.co/papers/2402.00856) authors propose reverse-KL preference optimization. `label_smoothing` must be strictly greater than `0.0`; a recommended value is `1e-3` (see Eq. 16 for the simplified pairwise variant). The full method uses `K>2` SFT completions and approaches PPO as `K` grows. |
| `"nca_pair"` | The [NCA](https://huggingface.co/papers/2402.05369) authors shows that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. |
| `"robust"` | The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased DPO loss under noisy preferences. Use `label_smoothing` in [`DPOConfig`] to model label-flip probability; valid values are in the range `[0.0, 0.5)`. |
Expand Down
1 change: 1 addition & 0 deletions tests/test_dpo_trainer.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,6 +266,7 @@ def test_train_model(self):
"sigmoid",
"hinge",
"ipo",
"sigmoid_norm",
"exo_pair",
"nca_pair",
"robust",
Expand Down
4 changes: 2 additions & 2 deletions trl/trainer/dpo_config.py
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ class DPOConfig(_BaseConfig):
> Parameters that control the training

loss_type (`list[str]`, *optional*, defaults to `["sigmoid"]`):
Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'exo_pair'`, `'nca_pair'`,
Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'sigmoid_norm'`, `'exo_pair'`, `'nca_pair'`,
`'robust'`, `'bco_pair'`, `'sppo_hard'`, `'aot'`, `'aot_unpaired'`, `'apo_zero'`, `'apo_down'`,
`'discopop'`, `'sft'`. If multiple loss types are provided, they will be combined using the weights
specified in `loss_weights`.
Expand Down Expand Up @@ -209,7 +209,7 @@ class DPOConfig(_BaseConfig):
loss_type: list[str] = field(
default_factory=lambda: ["sigmoid"],
metadata={
"help": "Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'exo_pair'`, "
"help": "Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'sigmoid_norm'`, `'exo_pair'`, "
"`'nca_pair'`, `'robust'`, `'bco_pair'`, `'sppo_hard'`, `'aot'`, `'aot_unpaired'`, `'apo_zero'`, "
"`'apo_down'`, `'discopop'`, `'sft'`. If multiple loss types are provided, they will be combined using "
"the weights specified in `loss_weights`.",
Expand Down
9 changes: 8 additions & 1 deletion trl/trainer/dpo_trainer.py
Original file line number Diff line number Diff line change
Expand Up @@ -1273,6 +1273,13 @@ def _compute_loss(self, model, inputs, return_outputs):
# (Eq. 17) of the paper where beta is the regularization parameter for the IPO loss, denoted by τ.
per_sequence_loss = (ipo_delta - 1 / (2 * self.beta)) ** 2

elif loss_type == "sigmoid_norm":
chosen_mask, rejected_mask = completion_mask.chunk(2, dim=0)
chosen_avg_score = chosen_scores / chosen_mask.sum(dim=1).clamp(min=1.0)
rejected_avg_score = rejected_scores / rejected_mask.sum(dim=1).clamp(min=1.0)
delta = chosen_avg_score - rejected_avg_score
per_sequence_loss = -F.logsigmoid(self.beta * delta)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing required paper_index.md update for new loss

Low Severity

This PR implements the length-normalized sigmoid DPO loss from a research paper (Tulu-3 / SimPO), but paper_index.md is not updated with a corresponding subsection. The project rule in AGENTS.md requires that any PR implementing a method from a research paper must also add an entry to paper_index.md. While SimPO already has an entry under CPO, the Tulu-3 paper (which is the primary source for this DPO variant) has no entry, and there is no DPO-specific subsection for this loss type.

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by project rule: ../.ai/AGENTS.md


elif loss_type == "exo_pair":
# Implements EXO-pref from the paper https://huggingface.co/papers/2402.00856, (Eq. 16)
# Minimize KL(p_fθ || p_rh) for K=2; p_fθ = softmax(βπ * (log πθ − log π_ref)) over {chosen, rejected}
Expand Down Expand Up @@ -1373,7 +1380,7 @@ def _compute_loss(self, model, inputs, return_outputs):

else:
raise ValueError(
f"Unknown loss type: {loss_type}. Should be one of ['sigmoid', 'hinge', 'ipo', 'exo_pair', "
f"Unknown loss type: {loss_type}. Should be one of ['sigmoid', 'hinge', 'ipo', 'sigmoid_norm', 'exo_pair', "
"'nca_pair', 'robust', 'bco_pair', 'sppo_hard', 'aot', 'aot_unpaired', 'apo_zero', 'apo_down', "
"'discopop', 'sft']"
)
Expand Down
Loading