class JEPOConfig(TrainingArguments):
    r"""
    Configuration class for the [`JEPOTrainer`], which serves as a variation of GRPO for unverifiable RL training.
    JEPO [https://arxiv.org/pdf/2503.19618]
Missing paper_index.md update for JEPO paper
Medium Severity
This PR implements the JEPO algorithm from a research paper ("Beyond Verifiable Rewards: Scaling RL for Language Models to Unverifiable Data"), but does not add a corresponding subsection to paper_index.md. The project rule in .ai/AGENTS.md states: "If a PR implements a method, algorithm, or training approach from a research paper, it must also add a corresponding subsection to paper_index.md."
Triggered by project rule: ../.ai/AGENTS.md
if self.loss_type == 'unnorm_jepo':
    loss = -per_token_adv.sum(dim=1).sum() / self.num_generations  # sum over the tokens and average over the batch
else:
    loss = -(per_token_adv.sum(dim=1) / cot_mask.sum(dim=1)).sum() / self.num_generations  # sum over the tokens and average over the batch
Division by zero when cot_mask is all zeros
Medium Severity
In _compute_loss, the norm_jepo loss path divides by cot_mask.sum(dim=1) and the supervised loss divides by answer_mask.sum(dim=1). When a sample doesn't have the correct format (no CoT or answer extracted), these sums can be zero, producing NaN values that propagate through the loss and corrupt gradients.
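A minimal repro of the failure mode described above, plus one defensive fix. The tensor values and the `clamp(min=1.0)` guard are illustrative assumptions, not code from the PR; the point is that an all-zero `cot_mask` row turns the normalization into `0/0`:

```python
import torch

# Hypothetical batch of 2 generations, 3 tokens each. The second sample
# has no extracted CoT, so its mask row is all zeros (and its per-token
# advantages are masked to zero as well).
per_token_adv = torch.tensor([[0.5, 0.2, 0.0],
                              [0.0, 0.0, 0.0]])
cot_mask = torch.tensor([[1.0, 1.0, 0.0],
                         [0.0, 0.0, 0.0]])
num_generations = 2

# Mirrors the norm_jepo path: 0/0 on the second row yields NaN, which
# the final sum propagates to the whole loss.
naive = -(per_token_adv.sum(dim=1) / cot_mask.sum(dim=1)).sum() / num_generations

# One fix: clamp the denominator so empty-mask samples contribute 0, not NaN.
safe = -(per_token_adv.sum(dim=1) / cot_mask.sum(dim=1).clamp(min=1.0)).sum() / num_generations

print(torch.isnan(naive).item())   # True
print(torch.isfinite(safe).item()) # True
```

The same guard would apply to the supervised-loss division by `answer_mask.sum(dim=1)`; an alternative is to drop malformed samples from the batch before normalizing.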
    year = 2025,
    eprint = {arXiv:2503.19618},
}"""),
}
Missing paper_index.md update for JEPO paper
Low Severity
This PR implements the JEPO method from the paper "Beyond Verifiable Rewards" (2503.19618) but does not add a corresponding subsection to paper_index.md. The project rules require that any PR implementing a research paper must update paper_index.md.
Triggered by project rule: ../.ai/AGENTS.md
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 7 total unresolved issues (including 6 from previous reviews).
    advantage = torch.log(mean_reward) - torch.log(variance)
else:
    # If only one sample is applicable, use log(mean_reward).
    advantage = torch.log(mean_reward)
torch.log on potentially zero rewards produces NaN
Medium Severity
_compute_jepo_advantages calls torch.log(mean_reward) and torch.log(variance) where these values are sums of token-level probabilities. If the model assigns near-zero probability to all answer tokens, mean_reward or variance can be zero or extremely small, producing -inf or NaN. These NaN advantages then propagate into the loss computation.
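A sketch of the underflow and one possible guard. The epsilon value and `clamp` placement are assumptions for illustration, not the PR's code; when all token probabilities underflow to zero, `log(0) - log(0)` produces `-inf - (-inf) = NaN`:

```python
import torch

eps = 1e-8  # assumed floor; any small positive constant works

# Hypothetical per-sample statistics: the second sample's answer tokens
# all received ~zero probability, so its reward sums underflow to 0.
mean_reward = torch.tensor([0.7, 0.0])
variance = torch.tensor([0.1, 0.0])

naive = torch.log(mean_reward) - torch.log(variance)
safe = torch.log(mean_reward.clamp(min=eps)) - torch.log(variance.clamp(min=eps))

print(torch.isfinite(naive).all().item())  # False
print(torch.isfinite(safe).all().item())   # True
```

Clamping keeps the advantage finite but saturated for degenerate samples; masking such samples out of the advantage computation entirely is another option.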


What does this PR do?
Fixes # (issue)
Before submitting
AI writing disclosure
We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Note
Medium Risk
Introduces a large new trainer with non-trivial reward/advantage and loss computation paths (plus optional vLLM/FSDP/DeepSpeed integration), so subtle training/runtime issues are possible despite being mostly additive.
Overview
Adds a new JEPO training flow to TRL by introducing `JEPOConfig` and `JEPOTrainer`, and exporting them via the top-level `trl` and `trl.trainer` lazy import structures. `JEPOTrainer` implements JEPO-style generation, reward aggregation, JEPO-specific advantage computation (including CoT-based fabricated completions), and a JEPO loss (optionally combined with supervised and KL terms), with support hooks for standard `transformers` generation and optional vLLM execution.

Includes a new `examples/notebooks/jepo_math.ipynb` notebook demonstrating dataset prep, reward/CoT helpers, and end-to-end JEPO training on a math dataset.

Written by Cursor Bugbot for commit 6bf8c8e. This will update automatically on new commits.
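For context on the export path the overview mentions, here is a minimal stand-in for the lazy-import pattern used across Hugging Face libraries: attribute access on the package triggers the real submodule import. The `_LazyModule` class, the `fake_trl` name, and the `json`/`math` entries are simplified placeholders, not TRL's actual implementation:

```python
import importlib
import types


class _LazyModule(types.ModuleType):
    """Simplified sketch: maps exported names to the submodule that
    defines them, importing that submodule only on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # e.g. {"trl.trainer.jepo_trainer": ["JEPOTrainer", "JEPOConfig"]}
        self._name_to_module = {
            attr: mod for mod, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        module = importlib.import_module(self._name_to_module[attr])
        return getattr(module, attr)


# Stand-in registration, analogous to adding the new JEPO classes to
# trl's _import_structure (here backed by stdlib modules for the demo):
lazy = _LazyModule("fake_trl", {"math": ["sqrt"], "json": ["dumps"]})
print(lazy.sqrt(9.0))  # 3.0 -- "math" was only imported at this access
```

Registering the new names in both `trl` and `trl.trainer` structures is what makes `from trl import JEPOTrainer` work without eagerly importing the trainer module.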