new feature: On policy distillation #346

Closed
sfc-gh-thonguyen wants to merge 6 commits into main from thong/on_policy_distillation

@sfc-gh-thonguyen
Collaborator

(This is a duplicate of #344. That PR was opened from a fork branch, which caused the GPU Modal test to fail.)

This is based on the blog post https://thinkingmachines.ai/blog/on-policy-distillation/. I figured ArcticTraining would be an appropriate place for this feature.

Training was validated on the GSM8K dataset with a Qwen3-1.7B student model and a Qwen3-8B teacher.
[Images: teacher perplexity and teacher logprob plots]
Lower teacher perplexity means the teacher is less surprised by the student's answers. Higher teacher logprob means the teacher assigns higher probability to the student's answers, i.e. it agrees with them more.
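For readers unfamiliar with how these two metrics relate, here is a minimal sketch. It assumes the teacher's per-token log-probabilities of the student-sampled tokens are already available as a list of floats. The function name `teacher_metrics` is hypothetical and is not taken from this PR's code.

```python
import math


def teacher_metrics(teacher_logprobs):
    """Return (mean teacher logprob, teacher perplexity) for one sampled answer.

    `teacher_logprobs` holds the teacher's natural-log probability of each
    token the student generated (hypothetical input, for illustration only).
    """
    if not teacher_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(teacher_logprobs) / len(teacher_logprobs)
    # Perplexity is the exponential of the negative mean log-probability,
    # so a higher mean logprob always gives a lower perplexity.
    perplexity = math.exp(-mean_logprob)
    return mean_logprob, perplexity
```

Because perplexity is a monotone function of the mean logprob, the two curves carry the same signal in different units.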

[Images: reverse KL and logprob gap plots]

Lower reverse KL and a smaller logprob gap mean the student's answers are converging toward the teacher's.
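As a rough illustration of the reverse KL quantity, the sketch below computes KL(student || teacher) exactly for one next-token distribution, and also estimates it from student-sampled tokens as the mean difference between student and teacher log-probabilities. Both function names are hypothetical. How this PR actually computes its "reverse KL" and "logprob gap" metrics is not shown in this conversation, so treat this as an assumption about the general technique rather than the PR's implementation.

```python
import math


def reverse_kl(student_probs, teacher_probs):
    """Exact KL(student || teacher) for one next-token distribution.

    Assumes both lists are aligned over the same vocabulary and that the
    teacher assigns non-zero probability wherever the student does.
    """
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0  # terms with p == 0 contribute nothing
    )


def sampled_reverse_kl(student_logprobs, teacher_logprobs):
    """Monte Carlo estimate of reverse KL on tokens sampled from the student.

    Each entry is the log-probability that model gave to the sampled token.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-probability lists must have the same length")
    if not student_logprobs:
        raise ValueError("need at least one token")
    gaps = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(gaps) / len(gaps)
```

Both values are zero when the student and teacher agree exactly, and the exact form is never negative.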

Full dashboard: https://snowflake.wandb.io/thongnguyen/on-policy-distillation-gsm8k/runs/zuwzrd11?nw=nwuserthongnguyen

Once this PR is in, we can claim that ArcticTraining supports RL :)

@sfc-gh-thonguyen
Collaborator Author

Closing this PR because the Modal test on the original PR, #344, is now unblocked.
