Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/source/dpo_trainer.md
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,7 @@ Several formulations of the objective have been proposed in the literature. Init
| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors argue the logit transform can overfit and propose the identity transform to optimize preferences directly; TRL exposes this as `loss_type="ipo"`. |
| `"sigmoid_norm"` | The [SIMPO](https://arxiv.org/pdf/2405.14734) authors address the length-bias in the original sigmoid loss by normalizing by the number of non-mask tokens; TRL exposes this as `loss_type="sigmoid_norm"`. |
| `"exo_pair"` | The [EXO](https://huggingface.co/papers/2402.00856) authors propose reverse-KL preference optimization. `label_smoothing` must be strictly greater than `0.0`; a recommended value is `1e-3` (see Eq. 16 for the simplified pairwise variant). The full method uses `K>2` SFT completions and approaches PPO as `K` grows. |
| `"nca_pair"` | The [NCA](https://huggingface.co/papers/2402.05369) authors shows that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. |
| `"robust"` | The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased DPO loss under noisy preferences. Use `label_smoothing` in [`DPOConfig`] to model label-flip probability; valid values are in the range `[0.0, 0.5)`. |
Expand Down
1 change: 1 addition & 0 deletions tests/test_dpo_trainer.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,6 +266,7 @@ def test_train_model(self):
"sigmoid",
"hinge",
"ipo",
"sigmoid_norm",
"exo_pair",
"nca_pair",
"robust",
Expand Down
4 changes: 2 additions & 2 deletions trl/trainer/dpo_config.py
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ class DPOConfig(_BaseConfig):
> Parameters that control the training

loss_type (`list[str]`, *optional*, defaults to `["sigmoid"]`):
Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'exo_pair'`, `'nca_pair'`,
Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'sigmoid_norm'`, `'exo_pair'`, `'nca_pair'`,
`'robust'`, `'bco_pair'`, `'sppo_hard'`, `'aot'`, `'aot_unpaired'`, `'apo_zero'`, `'apo_down'`,
`'discopop'`, `'sft'`. If multiple loss types are provided, they will be combined using the weights
specified in `loss_weights`.
Expand Down Expand Up @@ -209,7 +209,7 @@ class DPOConfig(_BaseConfig):
loss_type: list[str] = field(
default_factory=lambda: ["sigmoid"],
metadata={
"help": "Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'exo_pair'`, "
"help": "Type of loss to use. Possible values are: `'sigmoid'`, `'hinge'`, `'ipo'`, `'sigmoid_norm'`, `'exo_pair'`, "
"`'nca_pair'`, `'robust'`, `'bco_pair'`, `'sppo_hard'`, `'aot'`, `'aot_unpaired'`, `'apo_zero'`, "
"`'apo_down'`, `'discopop'`, `'sft'`. If multiple loss types are provided, they will be combined using "
"the weights specified in `loss_weights`.",
Expand Down
9 changes: 8 additions & 1 deletion trl/trainer/dpo_trainer.py
Original file line number Diff line number Diff line change
Expand Up @@ -1273,6 +1273,13 @@ def _compute_loss(self, model, inputs, return_outputs):
# (Eq. 17) of the paper where beta is the regularization parameter for the IPO loss, denoted by τ.
per_sequence_loss = (ipo_delta - 1 / (2 * self.beta)) ** 2

elif loss_type == "sigmoid_norm":
chosen_mask, rejected_mask = completion_mask.chunk(2, dim=0)
chosen_avg_score = chosen_scores / chosen_mask.sum(dim=1).clamp(min=1.0)
rejected_avg_score = rejected_scores / rejected_mask.sum(dim=1).clamp(min=1.0)
delta = chosen_avg_score - rejected_avg_score
per_sequence_loss = -F.logsigmoid(self.beta * delta_score)

elif loss_type == "exo_pair":
# Implements EXO-pref from the paper https://huggingface.co/papers/2402.00856, (Eq. 16)
# Minimize KL(p_fθ || p_rh) for K=2; p_fθ = softmax(βπ * (log πθ − log π_ref)) over {chosen, rejected}
Expand Down Expand Up @@ -1373,7 +1380,7 @@ def _compute_loss(self, model, inputs, return_outputs):

else:
raise ValueError(
f"Unknown loss type: {loss_type}. Should be one of ['sigmoid', 'hinge', 'ipo', 'exo_pair', "
f"Unknown loss type: {loss_type}. Should be one of ['sigmoid', 'hinge', 'ipo', 'sigmoid_norm', 'exo_pair', "
"'nca_pair', 'robust', 'bco_pair', 'sppo_hard', 'aot', 'aot_unpaired', 'apo_zero', 'apo_down', "
"'discopop', 'sft']"
)
Expand Down
Loading