Add support to multi-plane NICs and multi-NIC per PE to DeepEP v1 #650

Open
aahouzi wants to merge 2 commits into deepseek-ai:main from aahouzi:multi-nic-fix
@aahouzi commented May 28, 2026

Description

On multi-plane NICs (e.g. dual-plane CX-8) or with multiple NICs per PE, setting NVSHMEM's NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1 was not enough for DeepEP to actually distribute traffic across planes and reach the full CX-8 line rate, even though the shmem_put_bw perftest did reach line rate on the same NICs.

Investigation suggests that the DeepEP v1 kernels only addressed the first half of the QP pool, so all RDMA traffic stayed on a single plane even though NVSHMEM had correctly allocated QPs on every available plane. On a DGX B300 NVL8 + CX-8 dual-plane cluster, the result is that DeepEP uses only plane p0 of each of the 8 NICs:

image
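
The symptom can be reproduced in a small host-side model. This is a minimal sketch, assuming the per-peer pool is laid out NIC by NIC (as the Solution section describes) and assuming the legacy kernels reduce their QP id modulo `num_rc_per_pe`. The function names and the values 4 and 2 are illustrative and are not DeepEP or NVSHMEM APIs.

```python
def nic_of_qp(qp_index: int, num_rc_per_pe: int) -> int:
    """Return the NIC (or plane) that owns a slot in a NIC-major QP pool."""
    return qp_index // num_rc_per_pe


def legacy_qp_index(qp_id: int, num_rc_per_pe: int) -> int:
    """Assumed legacy behaviour: the index never leaves the first NIC's block."""
    return qp_id % num_rc_per_pe


# Illustrative values for a dual-plane NIC (assumption, not measured).
num_rc_per_pe, num_devs = 4, 2

used_nics = {
    nic_of_qp(legacy_qp_index(qp_id, num_rc_per_pe), num_rc_per_pe)
    for qp_id in range(num_rc_per_pe * num_devs)
}
print(used_nics)  # {0}
```

Under these assumptions every QP id maps into the first `num_rc_per_pe` slots, which matches the observation that only plane p0 carries traffic.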

Solution

This PR fixes the QP selection logic in both legacy DeepEP v1 kernels (Internode and LL) so that traffic is distributed equally across all planes/NICs available to each PE. The same logic should also work for multi-NIC per PE setups.

With multi-port enabled, NVSHMEM allocates num_rc_per_pe × num_devs QPs per peer, laid out as [NIC 0 QPs | NIC 1 QPs | ... | NIC N-1 QPs]. The new QP selection logic:

  • Internode: each channel binds to one NIC for its whole lifetime, so different channels spread traffic across all NICs assigned to the PE while RC ordering is preserved.
  • LL: the QP choice is bound to expert_local_idx. Since different warp groups handle different expert indices, traffic naturally spreads across all NICs assigned to the PE. This requires num_local_experts >= num_devs; with fewer local experts than NICs, some NICs stay unused.
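
The selection described above can be sketched on the host as follows. This is a hypothetical model, not the kernel code in this PR: `select_qp` and `nics_used` are illustrative names, the key is `channel_id` for Internode or `expert_local_idx` for LL, and the exact arithmetic (NIC from `key % num_devs`, RC slot from `key // num_devs`) is an assumption that is merely consistent with the layout and constraints stated here.

```python
def select_qp(key: int, num_rc_per_pe: int, num_devs: int) -> int:
    """Map a channel id (Internode) or local expert index (LL) to a QP slot.

    Assumed layout: [NIC 0 QPs | NIC 1 QPs | ... | NIC N-1 QPs].
    A fixed key always yields the same NIC, so ordering on its RC QP holds.
    """
    nic = key % num_devs                     # which NIC/plane this key binds to
    rc = (key // num_devs) % num_rc_per_pe   # slot inside that NIC's block
    return nic * num_rc_per_pe + rc


def nics_used(keys, num_rc_per_pe: int, num_devs: int) -> list:
    """Sorted list of NICs touched by a set of keys."""
    return sorted({select_qp(k, num_rc_per_pe, num_devs) // num_rc_per_pe
                   for k in keys})


# Four Internode channels on a dual-plane NIC with 4 RC QPs per NIC.
print([select_qp(c, 4, 2) for c in range(4)])  # [0, 4, 1, 5]

# LL with a single local expert and two NICs: the second NIC stays idle.
print(nics_used(range(1), 4, 2))  # [0]
```

In this model, setting `num_devs = 1` reduces the result to `key % num_rc_per_pe`, which is how the sketch represents the single-NIC case; whether that equals the original kernel expression is an assumption.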

The fix is also fully backward compatible: if the user does not set NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1, qp_id reverts to the value it originally had for 1 NIC per PE.

Users are also not required to provide a NIC-to-PE mapping via NVSHMEM_HCA_PE_MAPPING: whenever NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1 is set, NVSHMEM's default PCI-path detection can correctly identify when 2 NICs are closer to the same PE and assign them to it.

Results

  • Setup: DGX B300 NVL8 with CX-8 dual-plane

  • Internode test for 2N and 8N:

image
  • LL test for 4N and 8N across num_local_experts={4,8}:
image
  • NIC counters during the run show all planes fully utilized and balanced:
image

Requirements

  • NVSHMEM ≥ 3.7, which is now released
  • Multi-port feature enabled via: NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1
  • This PR builds on top of #288 to actually distribute traffic across the num_rc_per_pe × num_devs QP pool.
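
One way to enable the multi-port feature for a launched job is to set the variable in the process environment before NVSHMEM initializes. The helper below is a minimal sketch; `multi_port_env` is a hypothetical name, and how the environment is passed to the actual launcher depends on the deployment.

```python
import os


def multi_port_env(base=None) -> dict:
    """Return a copy of the environment with NVSHMEM multi-port enabled.

    `base` defaults to the current process environment; the input mapping
    is never modified in place.
    """
    env = dict(os.environ) if base is None else dict(base)
    env["NVSHMEM_IBGDA_ENABLE_MULTI_PORT"] = "1"
    return env


env = multi_port_env({"PATH": "/usr/bin"})
print(env["NVSHMEM_IBGDA_ENABLE_MULTI_PORT"])  # 1
```

The returned mapping can be handed to `subprocess.run(cmd, env=...)` for whatever launch command the cluster uses.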
