Zhewei smart packing by sfc-gh-zhyao · Pull Request #327 · snowflakedb/ArcticTraining

sfc-gh-zhyao · 2025-12-15T23:58:36Z

No description provided.

sfc-gh-zhyao · 2025-12-16T00:00:11Z

Same dataset with 32K max seq length ---
Previously:
Tokenizing messages (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6624/6624 [00:04<00:00, 1654.35 examples/s]
Saving the dataset (1/1 shards): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6624/6624 [00:00<00:00, 66188.46 examples/s]
Filtering dataset by max length (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6624/6624 [00:01<00:00, 3423.66 examples/s]
Packing dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6624/6624 [00:02<00:00, 3053.66 examples/s]
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 416/416 [00:00<00:00, 1872.14 examples/s]

After:
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 402/402 [00:00<00:00, 1792.24 examples/s]

402/416 ≈ 0.966 ==> about 3.4% fewer packed sequences to train on for this dataset. For 10M samples, the absolute saving should be much larger (testing now).

sfc-gh-zhyao · 2025-12-16T03:34:01Z

For a 6.2M sample setting ---
Previously:
Saving the dataset (681/1545 shards): 44%|█████████████████████████████████████████████████████████████████▎ | 132114/299574 [09:36<43:34, 64.06 examples/s]

After:
Saving the dataset (1545/1545 shards): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 283802/283802 [22:18<00:00, 212.00 examples/s]

==> 299574/283802 ≈ 1.056, i.e. ~5.5% gain (about 5.3% fewer packed sequences)
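The percentages quoted for the two runs can be reproduced from the packed-sequence counts in the logs. The snippet below is only a sanity check of that arithmetic. The helper name `packing_reduction` is hypothetical and not part of ArcticTraining. It reports both the fraction of packed sequences removed (after/before) and the relative gain (before/after), since the two differ slightly.

```python
def packing_reduction(before: int, after: int) -> tuple[float, float]:
    """Return (fraction of packed sequences removed, relative gain).

    `before` and `after` are the packed-sequence counts reported by the
    final "Saving the dataset" progress bars.
    """
    reduction = 1 - after / before  # share of packed sequences eliminated
    gain = before / after - 1       # how much larger the old count was
    return reduction, gain


# Counts taken from the logs in this thread.
runs = [
    ("32K max seq length run", 416, 402),
    ("6.2M sample run", 299574, 283802),
]
for name, before, after in runs:
    reduction, gain = packing_reduction(before, after)
    print(f"{name}: {reduction:.2%} fewer packed sequences, {gain:.2%} gain")
```

Note that the "previously" count for the 6.2M run is read from the total of a progress bar captured at 44%, so it assumes that total is the final packed count.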

sfc-gh-zhyao · 2025-12-16T03:37:01Z

Testing with a very large number of samples is very costly. However, I see no issue with raising batch_size = int(min(batch_size, 1e3)) to batch_size = int(min(batch_size, 1e4)): 10K samples per batch should still be small enough, and the larger batch can further improve packing efficiency.
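To illustrate why the batch-size cap matters, here is a minimal sketch of packing done independently within each batch. It is not the code from this PR: first-fit-decreasing is swapped in as a stand-in packing heuristic, and the names `pack_first_fit_decreasing`, `pack_in_batches`, and `cap` are hypothetical. It assumes every sample already fits within `max_len` (the logs show a filtering step before packing). With a larger cap, each batch offers more candidates for filling the leftover space in a packed sequence, which tends to reduce the total count, although this heuristic does not guarantee it.

```python
import random


def pack_first_fit_decreasing(lengths, max_len):
    """Pack sample lengths into bins of capacity max_len (stand-in heuristic)."""
    bins = []  # each bin lists the sample lengths packed together
    free = []  # remaining capacity of each bin
    for n in sorted(lengths, reverse=True):
        for i, room in enumerate(free):
            if n <= room:  # first existing bin with enough room
                bins[i].append(n)
                free[i] -= n
                break
        else:  # no bin had room, so open a new one
            bins.append([n])
            free.append(max_len - n)
    return bins


def pack_in_batches(lengths, max_len, batch_size, cap=1e4):
    """Pack each batch separately; `cap` mirrors the 1e3 -> 1e4 change."""
    batch_size = int(min(batch_size, cap))
    packed = []
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        packed.extend(pack_first_fit_decreasing(batch, max_len))
    return packed


# Synthetic lengths, for illustration only (not the datasets from this PR).
rng = random.Random(0)
sample_lengths = [rng.randint(256, 16384) for _ in range(3000)]
for cap in (1e2, 1e3, 1e4):
    n_packed = len(pack_in_batches(sample_lengths, 32768, batch_size=10**6, cap=cap))
    print(f"cap={int(cap)}: {n_packed} packed sequences")
```

The per-batch cost of this sketch grows roughly with batch size times the number of bins, which is one reason to keep a cap at all rather than packing the whole dataset at once.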

sfc-gh-zhyao added 2 commits December 15, 2025 23:43

a better way for packing

b21f6ac

add pack max back

7152ba0

sfc-gh-zhyao requested a review from sfc-gh-jrasley as a code owner December 15, 2025 23:58

sfc-gh-zhyao closed this Dec 16, 2025

sfc-gh-zhyao reopened this Dec 16, 2025

increase the batch size

a61720b

sfc-gh-truwase mentioned this pull request Jan 29, 2026

Improve sample packing #347

Open

sfc-gh-zhyao wants to merge 3 commits into main from zhewei-smart-packing
