Hey, I would like to know how to fix this NCCL collective operation timeout.
I tried following the official instructions, but that didn't work. Here is the log:
[rank0]:[E624 09:29:06.968323585 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=54, OpType=ALLREDUCE, NumelIn=6572160, NumelOut=6572160, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
2025-06-24 17:29:06 [rank0]:[E624 09:29:06.969438582 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 54, last enqueued NCCL work: 84, last completed NCCL work: 53.
2025-06-24 17:29:21 [rank0]:[E624 09:29:21.005912011 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 54, last enqueued NCCL work: 84, last completed NCCL work: 53.
2025-06-24 17:29:21 [rank0]:[E624 09:29:21.005960733 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
2025-06-24 17:29:21 [rank0]:[E624 09:29:21.005970290 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
2025-06-24 17:29:21 [rank0]:[E624 09:29:21.007533481 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=54, OpType=ALLREDUCE, NumelIn=6572160, NumelOut=6572160, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
2025-06-24 17:29:21 Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1729647352509/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
2025-06-24 17:29:21 frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ff28ed6b446 in /opt/conda/envs/recon/lib/python3.10/site-packages/torch/lib/libc10.so)
2025-06-24 17:29:21 frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7ff23d4267f2 in /opt/conda/envs/recon/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
2025-06-24 17:29:21 frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ff23d42dc33 in /opt/conda/envs/recon/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
2025-06-24 17:29:21 frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ff23d42f69d in /opt/conda/envs/recon/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
2025-06-24 17:29:21 frame #4: <unknown function> + 0x145c0 (0x7ff29676f5c0 in /opt/conda/envs/recon/lib/python3.10/site-packages/torch/lib/libtorch.so)
2025-06-24 17:29:21 frame #5: <unknown function> + 0x94ac3 (0x7ff2a0e94ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
2025-06-24 17:29:21 frame #6: <unknown function> + 0x126a40 (0x7ff2a0f26a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
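
To make the log easier to read, the key fields of the watchdog message can be pulled out with a few lines of Python. This is only a minimal sketch: `parse_watchdog_line` and `WATCHDOG_RE` are hypothetical names made up for illustration, not part of PyTorch or NCCL, and the regex assumes the message format shown in the log above.

```python
import re

# Assumption: the watchdog message follows the format seen in the log above.
WATCHDOG_RE = re.compile(
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<ran_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Hypothetical helper: extract the fields of a watchdog timeout message.

    Returns a dict of parsed fields, or None if the line does not match.
    """
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    timeout_ms = int(match["timeout_ms"])
    return {
        "seq": int(match["seq"]),
        "op": match["op"],
        "numel_in": int(match["numel_in"]),
        "numel_out": int(match["numel_out"]),
        "timeout_ms": timeout_ms,
        "ran_ms": int(match["ran_ms"]),
        # Convenience field: the configured timeout expressed in minutes.
        "timeout_minutes": timeout_ms / 60000,
    }


# The message text copied from the first log line above.
line = (
    "[Rank 0] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=54, OpType=ALLREDUCE, NumelIn=6572160, "
    "NumelOut=6572160, Timeout(ms)=600000) ran for 600011 milliseconds "
    "before timing out."
)
info = parse_watchdog_line(line)
print(info)
```

Read this way, the log says that on rank 0 the all-reduce with sequence number 54 (6572160 elements) did not finish within the 600000 ms (10 minute) limit, while work up to 84 had already been enqueued and only work up to 53 had completed. The limit itself can be raised through the `timeout` argument of `torch.distributed.init_process_group`, but that probably only helps if the collective is slow rather than stuck waiting on another rank.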