
DPO transformers v0.29 fixes#3560

Merged
winglian merged 10 commits into axolotl-ai-cloud:main from BrownianNotion:dpo-transformers-v0.29-fixes
Mar 31, 2026

Conversation

@BrownianNotion
Contributor

@BrownianNotion BrownianNotion commented Mar 30, 2026

Description

Summary of changes:

  1. Deprecate `dpo_norm_loss`. As outlined in #3548 ("DPO dpo_norm_loss no longer works on trl==0.29.0"), the 0.29 refactors in TRL's DPOTrainer break Axolotl's implementation. The goal is to add this back once TRL natively supports it; a PR is already open: huggingface/trl#5406 ("Add length-normalized sigmoid loss type to DPO trainer").
  2. Rename `chosen/rejected_input_ids` to `chosen/rejected_ids` for consistency with TRL; see huggingface/trl#5179 ("Rename input keys in RewardTrainer collator from chosen/rejected_input_ids to chosen/rejected_ids").
  3. Deprecate `rpo_alpha`. RPO is now configured by passing the list `loss_type=["sigmoid", "sft"]`; see https://github.com/huggingface/trl/blob/main/docs/source/paper_index.md#iterative-reasoning-preference-optimization
  4. Replace the deprecated `tokenize_row` override with `_tokenize` to handle BOS tokens. The old override worked around several BOS token bugs; the only one that remains is the double BOS token bug for tokenizers that have a `bos_token`, such as Llama's. The new `_tokenize` method handles this.
  5. Update IPO's `loss_type` to a list (previously a string). In TRL 0.29, DPOTrainer's `loss_type` takes a list of strings rather than a single string, allowing multiple losses to be combined. Note: combined losses still need to be supported in Axolotl; I have opened a separate issue for this: #3565 ("Support DPO loss_type and loss_weights").

I recommend reviewing commit by commit.
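Items 3 and 5 boil down to passing a list-valued `loss_type` through to TRL's `DPOConfig`. A hedged sketch of the resulting kwargs (assuming trl >= 0.29; the exact wiring lives in Axolotl's `set_training_args_kwargs`):

```python
# Hypothetical kwargs forwarded to trl.DPOConfig (trl >= 0.29):
ipo_kwargs = {"loss_type": ["ipo"]}  # was loss_type="ipo" before 0.29
rpo_kwargs = {"loss_type": ["sigmoid", "sft"]}  # replaces rpo_alpha

print(ipo_kwargs)
print(rpo_kwargs)
```

Per-loss weighting via `loss_weights` is not shown here; exposing it in Axolotl is tracked separately in #3565.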

Motivation and Context

Breaking changes were introduced in TRL v0.29.0 for DPO, so parts of Axolotl need to be updated to interface with the new code, e.g. #3548.

How has this been tested?

Unit tests

AI Usage Disclaimer

All fixes written completely by me. Claude helped find some but not all of the bugs.

@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough


Removes deprecated DPO/RL configuration parameters (rpo_alpha, dpo_norm_loss) and updates the rejected-sequence field name from rejected_input_ids to rejected_ids across multiple trainers and prompt strategies. Refactors DPOTrainer tokenization logic and adds a utility for double BOS token removal.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration Deprecation**<br>`src/axolotl/utils/schemas/config.py`, `src/axolotl/utils/schemas/deprecated.py` | Marked `dpo_norm_loss` and `rpo_alpha` as deprecated in `AxolotlInputConfig` with v0.15.1 deprecation messages. Added a field validator in `DeprecatedParameters` to warn when `dpo_norm_loss` is provided. |
| **DPO Configuration & Arguments**<br>`src/axolotl/core/trainers/dpo/args.py`, `src/axolotl/core/trainers/dpo/__init__.py` | Removed `dpo_norm_loss` and `rpo_alpha` fields from `AxolotlDPOConfig`. Updated `set_training_args_kwargs` to pass `loss_type` as a list for IPO and removed `dpo_norm_loss` forwarding. |
| **Field Rename: Rejected Sequences**<br>`src/axolotl/core/trainers/base.py`, `src/axolotl/prompt_strategies/bradley_terry/chat_template.py`, `src/axolotl/prompt_strategies/orpo/chat_template.py` | Renamed the rejected token field from `rejected_input_ids` to `rejected_ids` across trainer concatenation and prompt strategy tokenization output contracts. |
| **Trainer Refactoring**<br>`src/axolotl/core/trainers/dpo/trainer.py`, `src/axolotl/core/builders/rl.py` | Replaced the static `tokenize_row` override with an instance `_tokenize` override including double BOS token removal logic. Removed the `concatenated_forward` override and its conditional `dpo_norm_loss` handling. Removed `rpo_alpha` copying from the RL builder. |
| **Utility & Testing**<br>`src/axolotl/utils/data/utils.py`, `tests/e2e/test_dpo.py`, `tests/test_prompt_tokenizers.py`, `tests/utils/data/test_utils.py` | Added a `remove_double_bos_token` utility function. Updated test assertions for the renamed `rejected_ids` field and removed the `rpo_alpha` config parameter. Added comprehensive test coverage for double BOS token removal. |
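The rejected-sequence rename can be illustrated with a hypothetical tokenization output (only the `chosen_ids`/`rejected_ids` keys come from the PR; the helper name and surrounding structure are illustrative):

```python
def tokenize_preference_pair(chosen: list[int], rejected: list[int]) -> dict:
    """Illustrative output contract after the rename: preference sequences
    are keyed chosen_ids / rejected_ids, matching TRL's convention."""
    return {
        "chosen_ids": chosen,      # was chosen_input_ids
        "rejected_ids": rejected,  # was rejected_input_ids
    }

row = tokenize_preference_pair([1, 5, 9], [1, 5, 3])
print(sorted(row))  # ['chosen_ids', 'rejected_ids']
```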

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • support for QAT w RL (DPO) #2776: Conflicts in DPO trainer handling—this PR removes rpo_alpha/dpo_norm_loss and related logic while the other PR adds/expands them.

Suggested labels

ready to merge

Suggested reviewers

  • winglian
  • djsaunde
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 37.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title accurately summarizes the main change: addressing breaking changes from TRL (Transformer Reinforcement Learning) v0.29.0, which is the primary focus across all modified files in this PR.
  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
src/axolotl/utils/data/utils.py (1)

354-363: In-place mutation and type assumption on dict values.

The function mutates example in-place by reassigning example[key]. This is fine if callers expect mutation, but could be surprising. Additionally, the loop assumes all values in example are sliceable (lists). If example contains non-list metadata (e.g., a scalar length field), this will raise a TypeError.

Consider either:

  1. Documenting that mutation is intentional and all values must be lists, or
  2. Adding a safeguard for non-list values
💡 Optional safeguard for non-list values
```diff
 def remove_double_bos_token(example: dict[str, list], bos_token_id: int | None):
     """Remove double bos tokens that may occur when retokenizing preprocessed data
-    for tokenizers and chat templates that have a bos_token - eg. DPO + Llama.
+    for tokenizers and chat templates that have a bos_token - eg. DPO + Llama.
+
+    Note: Mutates `example` in-place. All values must be list-like.
     """
     if bos_token_id is not None:
         input_ids = example["input_ids"]
         if len(input_ids) >= 2 and input_ids[0] == input_ids[1] == bos_token_id:
             for key in example:
-                example[key] = example[key][1:]
+                if isinstance(example[key], list):
+                    example[key] = example[key][1:]
     return example
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/utils/data/utils.py` around lines 354 - 363, The function
remove_double_bos_token mutates example in-place and assumes every example[key]
is a sliceable list, which can raise TypeError for scalar metadata; change it to
either return a new dict or guard mutations: when bos_token_id is not None and a
double-BOS is detected (in remove_double_bos_token), iterate keys and only
slice/modify values that are instances of list (or collections.abc.Sequence) and
leave non-list values unchanged (or copy them into the new dict if you choose to
return a new object); ensure the function's docstring is updated to state
whether mutation is intentional and that only list-like fields are affected, and
reference the input_ids check and keys loop (example["input_ids"] and for key in
example) when applying the guard.
tests/utils/data/test_utils.py (1)

544-582: Consider adding edge case tests for short sequences.

The tests cover the main scenarios well. Consider adding tests for edge cases:

  • Empty input_ids list (would fail on len(input_ids) >= 2 check)
  • Single-element input_ids list
  • Exactly two elements where both are bos_token_id (result would be single-element list)

Also, these tests use assert statements while the rest of the file uses self.assertEqual - minor style inconsistency.

💡 Suggested additional test case
```python
def test_remove_bos_token_boundary_length_two(self):
    """Test when input_ids has exactly two elements both being bos_token_id."""
    input_ids = [0, 0]
    labels = [1, 2]

    example = {
        "input_ids": input_ids,
        "labels": labels,
    }

    example = remove_double_bos_token(example, 0)
    self.assertEqual(example["input_ids"], [0])
    self.assertEqual(example["labels"], [2])

def test_short_input_ids_no_error(self):
    """Test that short input_ids (len < 2) don't cause errors."""
    example = {"input_ids": [0], "labels": [1]}
    result = remove_double_bos_token(example, 0)
    self.assertEqual(result["input_ids"], [0])
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/data/test_utils.py` around lines 544 - 582, Add unit tests in
TestRemoveDoubleBOSToken to cover short-sequence edge cases and fix style: add
three new test methods that call remove_double_bos_token to verify behavior for
(1) empty input_ids and labels (ensure it returns unchanged and does not error),
(2) single-element input_ids (len==1) with bos_token_id and non-bos and assert
it returns the same sequence, and (3) boundary case of exactly two elements both
equal to bos_token_id to assert it collapses to a single-element result; use
self.assertEqual instead of bare assert to match existing style and reference
the existing TestRemoveDoubleBOSToken class and remove_double_bos_token function
to locate where to add these tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/core/trainers/dpo/trainer.py`:
- Around line 58-74: The _tokenize method currently assumes processing_class has
bos_token_id which is only guaranteed on PreTrainedTokenizerBase; update
_tokenize (and rename the parameter input to inputs to avoid shadowing) to first
resolve a tokenizer that exposes bos_token_id—e.g., if
isinstance(processing_class, PreTrainedTokenizerBase) use processing_class, else
try getattr(processing_class, "tokenizer", None) or getattr(processing_class,
"tokenizer", "processor", None) and then check hasattr(tokenizer,
"bos_token_id"); only call remove_double_bos_token(result, bos_id) when bos_id
is present, otherwise return result unchanged; keep references to the existing
_tokenize method, ProcessorMixin, PreTrainedTokenizerBase,
remove_double_bos_token, and bos_token_id to locate the change.
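The resolution order described above can be sketched as follows (a hypothetical helper written with duck typing to stay dependency-free; the actual fix in `_tokenize` may check `PreTrainedTokenizerBase` explicitly and be structured differently):

```python
def resolve_bos_token_id(processing_class):
    """Return a bos_token_id from the processing class itself, or from a
    wrapped .tokenizer attribute (ProcessorMixin-style wrappers), else None."""
    if hasattr(processing_class, "bos_token_id"):
        return processing_class.bos_token_id
    tokenizer = getattr(processing_class, "tokenizer", None)
    return getattr(tokenizer, "bos_token_id", None)

class FakeTokenizer:
    bos_token_id = 1

class FakeProcessor:
    tokenizer = FakeTokenizer()

print(resolve_bos_token_id(FakeTokenizer()))  # 1
print(resolve_bos_token_id(FakeProcessor()))  # 1
print(resolve_bos_token_id(object()))         # None
```

When the helper returns None, `_tokenize` would skip `remove_double_bos_token` and return the result unchanged, as the comment suggests.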

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 07aa515d-4a0e-437c-94b8-b7cb6c06969d

📥 Commits

Reviewing files that changed from the base of the PR and between 00dee05 and 98abe54.

📒 Files selected for processing (13)
  • src/axolotl/core/builders/rl.py
  • src/axolotl/core/trainers/base.py
  • src/axolotl/core/trainers/dpo/__init__.py
  • src/axolotl/core/trainers/dpo/args.py
  • src/axolotl/core/trainers/dpo/trainer.py
  • src/axolotl/prompt_strategies/bradley_terry/chat_template.py
  • src/axolotl/prompt_strategies/orpo/chat_template.py
  • src/axolotl/utils/data/utils.py
  • src/axolotl/utils/schemas/config.py
  • src/axolotl/utils/schemas/deprecated.py
  • tests/e2e/test_dpo.py
  • tests/test_prompt_tokenizers.py
  • tests/utils/data/test_utils.py
💤 Files with no reviewable changes (2)
  • src/axolotl/core/builders/rl.py
  • tests/e2e/test_dpo.py

Collaborator

@NanoCode012 NanoCode012 left a comment


Thanks for the cleanup, took a glance and noted the below.

Comment on lines +297 to +300

```python
dpo_norm_loss: bool | None = Field(
    default=None,
    deprecated="Deprecated in v0.15.1 due to breaking changes in TRL >=v0.29.0. Will be readded upon TRL support.",
)
```
Collaborator


We should remove this, as this class inherits DeprecatedParameters and this would be a duplicate. Same for the other change below.


```python
@with_temp_dir
def test_dpo_nll_lora(self, temp_dir):
    cfg = DictDefault(
```
Collaborator


I couldn't remember if this test was specifically for rpo_alpha. If it is, we'd need to adjust it to keep that coverage, or remove it?

Contributor Author


Thanks for catching this; the configs are identical without the parameter, so you're probably right. Will remove.

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 57.57576% with 14 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/core/trainers/dpo/trainer.py | 33.33% | 6 Missing ⚠️ |
| src/axolotl/utils/schemas/deprecated.py | 61.53% | 5 Missing ⚠️ |
| src/axolotl/core/trainers/base.py | 0.00% | 2 Missing ⚠️ |
| ...rc/axolotl/prompt_strategies/orpo/chat_template.py | 0.00% | 1 Missing ⚠️ |


@winglian winglian merged commit a81feab into axolotl-ai-cloud:main Mar 31, 2026
18 checks passed
