
new feature: On policy distillation #344

Open
sfc-gh-thonguyen wants to merge 7 commits into snowflakedb:main from sfc-gh-thonguyen:thong/on_policy_distillation

Conversation

sfc-gh-thonguyen (Collaborator) commented Jan 28, 2026

Based on this blog post https://thinkingmachines.ai/blog/on-policy-distillation/ -- I figured ArcticTraining would be an appropriate place to have this feature.

Training was validated on the GSM8K dataset with a Qwen3-1.7B student and a Qwen3-8B teacher.
[plots: teacher perplexity and teacher logprob over training]
Lower teacher perplexity means the teacher is less surprised by the student's answers. Higher teacher logprob means the teacher agrees more with the student's answers.
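
For reference, both metrics fall out of scoring the student's sampled tokens with the teacher. A minimal sketch of how they can be computed -- tensor names and shapes here are my assumptions, not the PR's actual code:

import torch
import torch.nn.functional as F

def teacher_metrics(teacher_logits, student_tokens, mask):
    """Score the student's sampled tokens under the teacher.

    teacher_logits: [batch, seq, vocab] teacher logits at each position
    student_tokens: [batch, seq] token ids the student generated
    mask:           [batch, seq] 1.0 on generated (non-prompt) tokens
    """
    logprobs = F.log_softmax(teacher_logits, dim=-1)
    # Teacher log-probability of each token the student actually produced.
    tok_lp = logprobs.gather(-1, student_tokens.unsqueeze(-1)).squeeze(-1)
    mean_lp = (tok_lp * mask).sum() / mask.sum()
    return {
        "teacher_logprob": mean_lp.item(),  # higher = more agreement
        "teacher_perplexity": torch.exp(-mean_lp).item(),  # lower = less surprised
    }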

[plot: reverse KL over training]

Lower reverse KL means the student's answers are converging to the teacher's.
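
The training signal here is the per-token reverse KL, KL(student || teacher), evaluated on sequences the student sampled itself. A minimal sketch under the same assumed tensor layout, computing the full-vocabulary KL at each position (the PR's exact estimator may differ):

def reverse_kl_loss(student_logits, teacher_logits, mask):
    """Mean per-token KL(student || teacher) over the student's own samples."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v)) at each position.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return (kl * mask).sum() / mask.sum()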

[plot: student perplexity over training]

The student's perplexity initially jumped up, meaning it is learning new behavior, and then slowly ramped down, meaning it grows more confident as training progresses.
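
Assuming "student perplexity" is the student's perplexity on its own sampled tokens (exp of the mean NLL), it can be computed with the same gather pattern as the teacher metrics above:

# Student's perplexity over its own samples (assumed definition).
student_logp = F.log_softmax(student_logits, dim=-1)
tok_lp = student_logp.gather(-1, student_tokens.unsqueeze(-1)).squeeze(-1)
student_ppl = torch.exp(-(tok_lp * mask).sum() / mask.sum())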

Full dashboard: https://snowflake.wandb.io/thongnguyen/on-policy-distillation-gsm8k/runs/zuwzrd11?nw=nwuserthongnguyen

Once this PR is in, we can claim that ArcticTraining supports RL :)

Collaborator


This file should probably live with the trainer code; config/ is for base config classes.

Comment on lines +354 to +377
def step(self, batch: Dict[str, torch.Tensor]) -> None:
"""Execute a single training step.

Overrides the base step to handle the unique requirements of
on-policy distillation (generation + training).
"""
self.model.train()

loss = self.loss(batch)

self.backward(loss)

def maybe_item(v):
return v.item() if torch.is_tensor(v) else v

self.metrics.record("loss", maybe_item(loss))

self.model.step()

self.checkpoint()

# Update step counters
self.global_step = self.model.global_steps
self.global_step_this_run = self.global_step - self.global_step_at_start_this_run
Collaborator


This appears to be an exact copy of the step method in the base trainer class. Why do we redefine it here?
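
If the base implementation really is identical, the simplest fix is to drop the override and inherit it. A minimal sketch -- BaseTrainer is a stand-in name here, not ArcticTraining's actual class:

class BaseTrainer:
    def step(self, batch):
        ...  # loss -> backward -> metrics -> model.step() -> checkpoint

class OnPolicyDistillationTrainer(BaseTrainer):
    # No step() override needed; only the generation and loss logic differ.
    def loss(self, batch):
        ...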
