
DPO transformers v0.29 fixes#3560

Merged
winglian merged 10 commits into axolotl-ai-cloud:main from BrownianNotion:dpo-transformers-v0.29-fixes
Mar 31, 2026

Conversation

@BrownianNotion
Contributor

@BrownianNotion BrownianNotion commented Mar 30, 2026

Description

Summary of changes:

  1. Deprecate `dpo_norm_loss`. As outlined in #3548 ("DPO dpo_norm_loss no longer works on trl==0.29.0"), the 0.29 refactors in TRL's DPOTrainer break Axolotl's implementation. The goal is to add this back once TRL natively supports it; a PR is already open: huggingface/trl#5406 ("Add length-normalized sigmoid loss type to DPO trainer").
  2. Rename `chosen/rejected_input_ids` to `chosen/rejected_ids` for consistency with TRL; see huggingface/trl#5179 ("Rename input keys in RewardTrainer collator from chosen/rejected_input_ids to chosen/rejected_ids").
  3. Deprecate `rpo_alpha`. RPO is now configured by passing the list `loss_type=["sigmoid", "sft"]`; see https://github.com/huggingface/trl/blob/main/docs/source/paper_index.md#iterative-reasoning-preference-optimization
  4. Replace the deprecated `tokenize_row` override with `_tokenize` to handle BOS tokens. The old override worked around several BOS token bugs; the only one that remains is the double BOS token bug for tokenizers that have a `bos_token`, such as Llama's. The new `_tokenize` method handles this.
  5. Update IPO's `loss_type` to a list (previously a string). In TRL 0.29, DPOTrainer's `loss_type` takes a list of strings rather than a single string, allowing multiple losses to be combined. Note: combined losses still need to be supported in Axolotl; I have opened a separate issue for this: #3565 ("Support DPO loss_type and loss_weights").

I recommend reviewing commit by commit.
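Items 3 and 5 boil down to passing a list-valued `loss_type` through to TRL's `DPOConfig`. A hedged sketch of the resulting kwargs (assuming trl >= 0.29; the exact wiring lives in Axolotl's `set_training_args_kwargs`):

```python
# Hypothetical kwargs forwarded to trl.DPOConfig (trl >= 0.29):
ipo_kwargs = {"loss_type": ["ipo"]}  # was loss_type="ipo" before 0.29
rpo_kwargs = {"loss_type": ["sigmoid", "sft"]}  # replaces rpo_alpha

print(ipo_kwargs)
print(rpo_kwargs)
```

Per-loss weighting via `loss_weights` is not shown here; exposing it in Axolotl is tracked separately in #3565.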

Motivation and Context

Breaking changes were introduced in TRL v0.29.0 for DPO, so parts of Axolotl need to be updated to interface with the new code, e.g. #3548.

How has this been tested?

Unit tests

AI Usage Disclaimer

All fixes written completely by me. Claude helped find some but not all of the bugs.

@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough


Removes deprecated DPO/RL configuration parameters (rpo_alpha, dpo_norm_loss) and updates the rejected-sequence field name from rejected_input_ids to rejected_ids across multiple trainers and prompt strategies. Refactors DPOTrainer tokenization logic and adds a utility for double BOS token removal.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration Deprecation**<br>`src/axolotl/utils/schemas/config.py`, `src/axolotl/utils/schemas/deprecated.py` | Marked `dpo_norm_loss` and `rpo_alpha` as deprecated in `AxolotlInputConfig` with v0.15.1 deprecation messages. Added a field validator in `DeprecatedParameters` to warn when `dpo_norm_loss` is provided. |
| **DPO Configuration & Arguments**<br>`src/axolotl/core/trainers/dpo/args.py`, `src/axolotl/core/trainers/dpo/__init__.py` | Removed `dpo_norm_loss` and `rpo_alpha` fields from `AxolotlDPOConfig`. Updated `set_training_args_kwargs` to pass `loss_type` as a list for IPO and removed `dpo_norm_loss` forwarding. |
| **Field Rename: Rejected Sequences**<br>`src/axolotl/core/trainers/base.py`, `src/axolotl/prompt_strategies/bradley_terry/chat_template.py`, `src/axolotl/prompt_strategies/orpo/chat_template.py` | Renamed the rejected token field from `rejected_input_ids` to `rejected_ids` across trainer concatenation and prompt strategy tokenization output contracts. |
| **Trainer Refactoring**<br>`src/axolotl/core/trainers/dpo/trainer.py`, `src/axolotl/core/builders/rl.py` | Replaced the static `tokenize_row` override with an instance `_tokenize` override including double BOS token removal logic. Removed the `concatenated_forward` override and its conditional `dpo_norm_loss` handling. Removed `rpo_alpha` copying from the RL builder. |
| **Utility & Testing**<br>`src/axolotl/utils/data/utils.py`, `tests/e2e/test_dpo.py`, `tests/test_prompt_tokenizers.py`, `tests/utils/data/test_utils.py` | Added a `remove_double_bos_token` utility function. Updated test assertions for the renamed `rejected_ids` field and removed the `rpo_alpha` config parameter. Added comprehensive test coverage for double BOS token removal. |
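The rejected-sequence rename can be illustrated with a hypothetical tokenization output (only the `chosen_ids`/`rejected_ids` keys come from the PR; the helper name and surrounding structure are illustrative):

```python
def tokenize_preference_pair(chosen: list[int], rejected: list[int]) -> dict:
    """Illustrative output contract after the rename: preference sequences
    are keyed chosen_ids / rejected_ids, matching TRL's convention."""
    return {
        "chosen_ids": chosen,      # was chosen_input_ids
        "rejected_ids": rejected,  # was rejected_input_ids
    }

row = tokenize_preference_pair([1, 5, 9], [1, 5, 3])
print(sorted(row))  # ['chosen_ids', 'rejected_ids']
```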

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • support for QAT w RL (DPO) #2776: Conflicts in DPO trainer handling—this PR removes rpo_alpha/dpo_norm_loss and related logic while the other PR adds/expands them.

Suggested labels

ready to merge

Suggested reviewers

  • winglian
  • djsaunde
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 37.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title accurately summarizes the main change: addressing breaking changes from TRL (Transformer Reinforcement Learning) v0.29.0, which is the primary focus across all modified files in this PR.
  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
src/axolotl/utils/data/utils.py (1)

354-363: In-place mutation and type assumption on dict values.

The function mutates example in-place by reassigning example[key]. This is fine if callers expect mutation, but could be surprising. Additionally, the loop assumes all values in example are sliceable (lists). If example contains non-list metadata (e.g., a scalar length field), this will raise a TypeError.

Consider either:

  1. Documenting that mutation is intentional and all values must be lists, or
  2. Adding a safeguard for non-list values
💡 Optional safeguard for non-list values
```diff
 def remove_double_bos_token(example: dict[str, list], bos_token_id: int | None):
     """Remove double bos tokens that may occur when retokenizing preprocessed data
-    for tokenizers and chat templates that have a bos_token - eg. DPO + Llama.
+    for tokenizers and chat templates that have a bos_token - eg. DPO + Llama.
+
+    Note: Mutates `example` in-place. All values must be list-like.
     """
     if bos_token_id is not None:
         input_ids = example["input_ids"]
         if len(input_ids) >= 2 and input_ids[0] == input_ids[1] == bos_token_id:
             for key in example:
-                example[key] = example[key][1:]
+                if isinstance(example[key], list):
+                    example[key] = example[key][1:]
     return example
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/utils/data/utils.py` around lines 354 - 363, The function
remove_double_bos_token mutates example in-place and assumes every example[key]
is a sliceable list, which can raise TypeError for scalar metadata; change it to
either return a new dict or guard mutations: when bos_token_id is not None and a
double-BOS is detected (in remove_double_bos_token), iterate keys and only
slice/modify values that are instances of list (or collections.abc.Sequence) and
leave non-list values unchanged (or copy them into the new dict if you choose to
return a new object); ensure the function's docstring is updated to state
whether mutation is intentional and that only list-like fields are affected, and
reference the input_ids check and keys loop (example["input_ids"] and for key in
example) when applying the guard.
tests/utils/data/test_utils.py (1)

544-582: Consider adding edge case tests for short sequences.

The tests cover the main scenarios well. Consider adding tests for edge cases:

  • Empty input_ids list (would fail on len(input_ids) >= 2 check)
  • Single-element input_ids list
  • Exactly two elements where both are bos_token_id (result would be single-element list)

Also, these tests use assert statements while the rest of the file uses self.assertEqual - minor style inconsistency.

💡 Suggested additional test case
```python
def test_remove_bos_token_boundary_length_two(self):
    """Test when input_ids has exactly two elements both being bos_token_id."""
    input_ids = [0, 0]
    labels = [1, 2]

    example = {
        "input_ids": input_ids,
        "labels": labels,
    }

    example = remove_double_bos_token(example, 0)
    self.assertEqual(example["input_ids"], [0])
    self.assertEqual(example["labels"], [2])

def test_short_input_ids_no_error(self):
    """Test that short input_ids (len < 2) don't cause errors."""
    example = {"input_ids": [0], "labels": [1]}
    result = remove_double_bos_token(example, 0)
    self.assertEqual(result["input_ids"], [0])
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/data/test_utils.py` around lines 544 - 582, Add unit tests in
TestRemoveDoubleBOSToken to cover short-sequence edge cases and fix style: add
three new test methods that call remove_double_bos_token to verify behavior for
(1) empty input_ids and labels (ensure it returns unchanged and does not error),
(2) single-element input_ids (len==1) with bos_token_id and non-bos and assert
it returns the same sequence, and (3) boundary case of exactly two elements both
equal to bos_token_id to assert it collapses to a single-element result; use
self.assertEqual instead of bare assert to match existing style and reference
the existing TestRemoveDoubleBOSToken class and remove_double_bos_token function
to locate where to add these tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/core/trainers/dpo/trainer.py`:
- Around line 58-74: The _tokenize method currently assumes processing_class has
bos_token_id which is only guaranteed on PreTrainedTokenizerBase; update
_tokenize (and rename the parameter input to inputs to avoid shadowing) to first
resolve a tokenizer that exposes bos_token_id—e.g., if
isinstance(processing_class, PreTrainedTokenizerBase) use processing_class, else
try getattr(processing_class, "tokenizer", None) or getattr(processing_class,
"tokenizer", "processor", None) and then check hasattr(tokenizer,
"bos_token_id"); only call remove_double_bos_token(result, bos_id) when bos_id
is present, otherwise return result unchanged; keep references to the existing
_tokenize method, ProcessorMixin, PreTrainedTokenizerBase,
remove_double_bos_token, and bos_token_id to locate the change.
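The resolution order described above can be sketched as follows (a hypothetical helper written with duck typing to stay dependency-free; the actual fix in `_tokenize` may check `PreTrainedTokenizerBase` explicitly and be structured differently):

```python
def resolve_bos_token_id(processing_class):
    """Return a bos_token_id from the processing class itself, or from a
    wrapped .tokenizer attribute (ProcessorMixin-style wrappers), else None."""
    if hasattr(processing_class, "bos_token_id"):
        return processing_class.bos_token_id
    tokenizer = getattr(processing_class, "tokenizer", None)
    return getattr(tokenizer, "bos_token_id", None)

class FakeTokenizer:
    bos_token_id = 1

class FakeProcessor:
    tokenizer = FakeTokenizer()

print(resolve_bos_token_id(FakeTokenizer()))  # 1
print(resolve_bos_token_id(FakeProcessor()))  # 1
print(resolve_bos_token_id(object()))         # None
```

When the helper returns None, `_tokenize` would skip `remove_double_bos_token` and return the result unchanged, as the comment suggests.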

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 07aa515d-4a0e-437c-94b8-b7cb6c06969d

📥 Commits

Reviewing files that changed from the base of the PR and between 00dee05 and 98abe54.

📒 Files selected for processing (13)
  • src/axolotl/core/builders/rl.py
  • src/axolotl/core/trainers/base.py
  • src/axolotl/core/trainers/dpo/__init__.py
  • src/axolotl/core/trainers/dpo/args.py
  • src/axolotl/core/trainers/dpo/trainer.py
  • src/axolotl/prompt_strategies/bradley_terry/chat_template.py
  • src/axolotl/prompt_strategies/orpo/chat_template.py
  • src/axolotl/utils/data/utils.py
  • src/axolotl/utils/schemas/config.py
  • src/axolotl/utils/schemas/deprecated.py
  • tests/e2e/test_dpo.py
  • tests/test_prompt_tokenizers.py
  • tests/utils/data/test_utils.py
💤 Files with no reviewable changes (2)
  • src/axolotl/core/builders/rl.py
  • tests/e2e/test_dpo.py

Collaborator

@NanoCode012 NanoCode012 left a comment


Thanks for the cleanup, took a glance and noted the below.

Comment on lines +297 to +300

```python
dpo_norm_loss: bool | None = Field(
    default=None,
    deprecated="Deprecated in v0.15.1 due to breaking changes in TRL >=v0.29.0. Will be readded upon TRL support.",
)
```
Collaborator


We should remove this, as this class inherits DeprecatedParameters and this would be a duplicate. Same for the other change below.


```python
@with_temp_dir
def test_dpo_nll_lora(self, temp_dir):
    cfg = DictDefault(
```
Collaborator


I couldn't remember if this test was specifically for rpo_alpha. If it is, we'd need to adjust it to keep that coverage, or remove it?

Contributor Author


Thanks for catching this; the configs are identical without the parameter, so you're probably right. Will remove.

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 57.57576% with 14 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/core/trainers/dpo/trainer.py | 33.33% | 6 Missing ⚠️ |
| src/axolotl/utils/schemas/deprecated.py | 61.53% | 5 Missing ⚠️ |
| src/axolotl/core/trainers/base.py | 0.00% | 2 Missing ⚠️ |
| ...rc/axolotl/prompt_strategies/orpo/chat_template.py | 0.00% | 1 Missing ⚠️ |


@winglian winglian merged commit a81feab into axolotl-ai-cloud:main Mar 31, 2026
18 checks passed
