Conversation
Add delta weight synchronization support to AsyncGRPO

- Add `huggingface-hub` as dependency
- Introduce sparse weight patching via `DeltaWeightTransferEngine`
- Add `ULPChangeDetector` for optimizer-level change tracking
- Add config parameters for delta sync control (repo, anchor interval, checksum verification)
- Support both anchor checkpoints and delta patches via HF Hub (Xet storage)

Implements a two-phase delta sync workflow: a non-blocking upload to HF Hub while inference continues, followed by a signal to vLLM to fetch and apply the patch. Adds ULP change detection to selectively sync only the modified parameters, using element-level masks. Simplifies the delta engine API by removing the anchor/checksum logic; it now uses HF Hub directly, without intermediate configuration objects.
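The two-phase workflow above can be sketched roughly as follows. This is an illustrative toy, not TRL's actual implementation: the `TwoPhaseDeltaSync` class, its method names, and the dict standing in for the HF Hub repo are all hypothetical.

```python
import threading
import queue

class TwoPhaseDeltaSync:
    """Toy sketch of the two-phase delta sync described in the commit message.

    Phase 1 uploads the delta in a background thread while rollouts continue;
    phase 2 signals the inference side, which then fetches and applies it.
    A plain dict stands in for the HF Hub repo.
    """

    def __init__(self):
        self.hub = {}                    # stand-in for the HF Hub repo
        self.signal = queue.Queue()      # trainer -> vLLM notification channel

    def begin_upload(self, step: int, delta) -> threading.Thread:
        def _upload():
            self.hub[step] = delta       # phase 1: non-blocking upload
            self.signal.put(step)        # phase 2: tell inference to fetch
        t = threading.Thread(target=_upload, daemon=True)
        t.start()                        # trainer returns immediately
        return t

    def inference_fetch(self, timeout: float = 5.0):
        step = self.signal.get(timeout=timeout)  # block until signaled
        return step, self.hub[step]              # fetch and apply the patch
```

In the real setup the upload goes to an HF Hub repo and the fetch happens inside vLLM, but the control flow (background upload, then an explicit fetch signal) is the same shape.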
Remove ULP prediction logic, diagnostic logging config, and checkpoint chain reconstruction. Keep only ground-truth bf16 change detection via optimizer hooks and sparse patch metadata.
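Ground-truth change detection with sparse patch metadata can be sketched as below. The function names and patch layout are hypothetical (the PR's actual format is not shown here); the sketch uses float32 NumPy arrays rather than bf16 tensors for simplicity.

```python
import numpy as np

def make_sparse_patch(prev: np.ndarray, curr: np.ndarray) -> dict:
    """Compare post-step weights against the last synced snapshot and keep
    only the elements that actually changed (ground-truth detection)."""
    mask = prev != curr                        # element-level change mask
    idx = np.flatnonzero(mask)                 # flat indices of changed elements
    return {"indices": idx, "values": curr.ravel()[idx], "shape": curr.shape}

def apply_sparse_patch(weights: np.ndarray, patch: dict) -> np.ndarray:
    """Apply a sparse patch on the inference side to reconstruct the weights."""
    out = weights.copy().ravel()
    out[patch["indices"]] = patch["values"]
    return out.reshape(patch["shape"])
```

When only a small fraction of parameters change between syncs, shipping `(indices, values)` pairs is far smaller than a full state dict, which is the point of the sparse patches.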
- Move anchor/delta decision from trainer to rollout worker
- Remove change detector from streaming iter; only check for validated masks
- Migrate from HfApi to bucket_id and HF Bucket APIs
- Simplify upload/download paths and remove revision parameter
- Refactor `_send_weights_delta` with clearer empty/non-empty logic
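The empty/non-empty split mentioned in the last bullet might look roughly like this. The function signature, the `upload`/`notify` callables, and the patch dict layout are all illustrative assumptions, not the PR's actual `_send_weights_delta`.

```python
def send_weights_delta(patches: dict, upload, notify, step: int) -> bool:
    """Sketch: skip the upload entirely when nothing changed since last sync.

    patches maps parameter name -> {"indices": [...], "values": [...]}.
    upload(step, patches) pushes the sparse patches; notify(step, empty=...)
    signals the inference side whether there is anything to fetch.
    """
    non_empty = {name: p for name, p in patches.items() if len(p["indices"]) > 0}
    if not non_empty:
        notify(step, empty=True)   # nothing to fetch; inference keeps its weights
        return False
    upload(step, non_empty)        # push only parameters that actually changed
    notify(step, empty=False)      # signal vLLM to fetch and apply the patch
    return True
```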
qgallouedec
left a comment
That looks good, but I'm not sure I understand the big picture. Why do we push the weights (the diffs, actually) if it's only to download them again? Shouldn't we just keep the diff locally?
Let's ask them directly
I think the goal of this technique is mainly to completely disaggregate the inference and trainer servers. For example, if the inference is running on some HF Space, you can still exchange weights. It also gives you a checkpoint mechanism for the weights for free (although I need to check the Xet atomicity semantics)
What does this PR do?
Training converges correctly on the immediate-EOS sanity check 👍
Still have some optimizations to implement: both the trainer and vLLM currently hold a CPU bf16 snapshot of the model.
Before submitting
AI writing disclosure
We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.