[Feature] Support CUDA Graph under mixed mode DeepEP communication#7344
lizexu123 wants to merge 8 commits into PaddlePaddle:develop
Motivation
Modifications
Error log:

DeepEP/csrc/kernels/internode_ll.cu:553 operation would make the legacy stream depend on a capturing blocking stream

Root cause:
What this fix changes:
This could have been implemented very simply, as in sglang/python/sglang/srt/distributed/parallel_state.py:483-510:

```python
with torch.cuda.stream(stream):
    # PyTorch's current stream → stream ✓
    # c10's TLS → stream ✓
    # DeepEP's at::cuda::getCurrentCUDAStream() → stream ✓
```

However, Paddle's paddle.device.stream_guard() only updates Paddle's own GPUContext; it does not update c10's thread-local state. As a result, DeepEP (which reads the current stream through at::cuda::getCurrentCUDAStream()) still sees the legacy stream during graph capture. That is why we have to call c10::cuda::setCurrentCUDAStream() manually via ctypes to bridge the gap.
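To make the mismatch concrete, here is a minimal pure-Python model of the problem. It does not call CUDA, Paddle, or PyTorch; all names (`paddle_only_stream_guard`, `bridged_stream_guard`, the two `threading.local` registries) are illustrative stand-ins for Paddle's GPUContext and c10's thread-local "current stream" state. A guard that updates only one registry leaves a DeepEP-style consumer, which reads the other registry, on the legacy stream; a bridged guard (in the real PR, the bridge is a ctypes call into c10::cuda::setCurrentCUDAStream()) keeps the two views consistent:

```python
import threading
from contextlib import contextmanager

# Two independent "current stream" registries, standing in for
# Paddle's GPUContext and c10's thread-local state.
_paddle_tls = threading.local()
_c10_tls = threading.local()

def paddle_current_stream():
    return getattr(_paddle_tls, "stream", "legacy")

def c10_current_stream():
    # What a DeepEP-style extension would observe via
    # at::cuda::getCurrentCUDAStream().
    return getattr(_c10_tls, "stream", "legacy")

@contextmanager
def paddle_only_stream_guard(stream):
    """Models paddle.device.stream_guard(): updates only Paddle's registry."""
    prev = paddle_current_stream()
    _paddle_tls.stream = stream
    try:
        yield
    finally:
        _paddle_tls.stream = prev

@contextmanager
def bridged_stream_guard(stream):
    """Models the fix: mirror the stream into both registries."""
    prev_p, prev_c = paddle_current_stream(), c10_current_stream()
    _paddle_tls.stream = stream
    _c10_tls.stream = stream
    try:
        yield
    finally:
        _paddle_tls.stream = prev_p
        _c10_tls.stream = prev_c

with paddle_only_stream_guard("capture_stream"):
    # DeepEP still sees the legacy stream -> capture-time error.
    print(paddle_current_stream(), c10_current_stream())

with bridged_stream_guard("capture_stream"):
    # Both sides now agree on the capture stream.
    print(paddle_current_stream(), c10_current_stream())
```

Under the first guard the two registries disagree (`capture_stream` vs `legacy`), which is exactly the state that makes a kernel launched on the "current" c10 stream depend on the legacy stream while capture is in progress; under the bridged guard they agree.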
Usage or Command
Accuracy Tests
Checklist