New feature: On-policy distillation #346
Closed
sfc-gh-thonguyen wants to merge 6 commits into
Closed this PR because the Modal test on the original PR, #344, has been unblocked.
(This PR duplicates #344, which was opened from a fork branch and therefore failed the GPU Modal test.)
Based on this blog post: https://thinkingmachines.ai/blog/on-policy-distillation/. I figured ArcticTraining would be an appropriate place for this feature.
Training was validated on the GSM8K dataset with a Qwen3-1.7B student model and a Qwen3-8B teacher.


Lower teacher perplexity means the teacher is less surprised by the student's answers. A higher teacher logprob means the teacher agrees more with the student's answers.
Lower reverse KL and a smaller logprob gap mean the student's answers are converging to the teacher's.
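
For readers who want to see how these metrics relate, here is a minimal sketch, not the code in this PR. It assumes we already have, for each token the student sampled, its log-probability under the student and under the teacher. The function name `distillation_metrics` and the dictionary keys are hypothetical. The reverse KL here is the common sampled-token estimate (mean of student logprob minus teacher logprob); the PR may compute its logged reverse KL and logprob gap differently.

```python
import math


def distillation_metrics(student_logprobs, teacher_logprobs):
    """Summarize how well the teacher agrees with student-sampled tokens.

    Both arguments are per-token log-probabilities (natural log) of the
    SAME tokens, which were sampled from the student (on-policy).
    Hypothetical helper for illustration only.
    """
    if not student_logprobs or len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("need two non-empty sequences of equal length")

    n = len(student_logprobs)

    # Mean teacher logprob: higher means the teacher agrees with the student.
    teacher_mean = sum(teacher_logprobs) / n

    # Teacher perplexity: exp of the negative mean logprob; lower means the
    # teacher is less surprised by the student's tokens.
    teacher_ppl = math.exp(-teacher_mean)

    # Assumed estimator: because tokens come from the student, the mean of
    # (student logprob - teacher logprob) estimates KL(student || teacher).
    gaps = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    reverse_kl = sum(gaps) / n

    return {
        "teacher_logprob": teacher_mean,
        "teacher_perplexity": teacher_ppl,
        "reverse_kl_estimate": reverse_kl,
    }


if __name__ == "__main__":
    # Toy numbers, not results from the training run.
    student = [math.log(0.5), math.log(0.25)]
    teacher = [math.log(0.25), math.log(0.25)]
    print(distillation_metrics(student, teacher))
```

Under this estimator, a student that assigns the same probabilities as the teacher to its own samples gets an estimate of zero, which matches the intuition that the gap shrinks as the student converges to the teacher.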
Full dashboard: https://snowflake.wandb.io/thongnguyen/on-policy-distillation-gsm8k/runs/zuwzrd11?nw=nwuserthongnguyen
Once this PR is in we can make the claim ArcticTraining supports RL :)