Add support to multi-plane NICs and multi-NIC per PE to DeepEP v1 #650
Open
aahouzi wants to merge 2 commits into
Description
On multi-plane NICs (e.g. CX-8 dual-plane) or with multiple NICs per PE, enabling NVSHMEM's `NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1` was not enough to actually distribute traffic across planes in DeepEP and reach the full CX-8 line rate, even though the `shmem_put_bw` perftest was able to reach the line rate of the CX-8 NICs.

After some investigation, it appears that the DeepEP v1 kernels only addressed the first half of the QP pool, so all RDMA traffic stayed on a single plane even when NVSHMEM had correctly allocated QPs on all available planes. On a DGX B300 NVL8 + CX-8 dual-plane cluster, the result is that only plane p0 of each of the 8 NICs is utilized by DeepEP.
Solution
This PR fixes the QP selection logic in both legacy DeepEP v1 kernels (Internode + LL), so traffic is distributed equally across all available planes/NICs per PE. Also, the PR should work well for multi-NIC per PE scenarios.
With multi-port enabled, NVSHMEM allocates `num_rc_per_pe × num_devs` QPs per peer, laid out as `[NIC 0 QPs | NIC 1 QPs | ... | NIC N-1 QPs]`. The new QP selection logic:
`expert_local_idx`, and since different warp groups have different expert indices, traffic naturally ends up spread across all NICs assigned to the PE. This requires `num_local_experts >= num_devs`, since with fewer local experts than NICs, some NICs stay unused.

The fix is also fully backward compatible: if a user doesn't set `NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1`, `qp_id` reverts to the original value it had for 1 NIC per PE.

Users are also not required to provide any NIC-to-PE mapping via `NVSHMEM_HCA_PE_MAPPING`, as the default PCI-path of NVSHMEM can correctly detect when 2 NICs are closer to the same PE and assign them accordingly whenever the user sets `NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1`.
Results
Setup: DGX B300 NVL8 with CX-8 dual-plane
Internode test for 2N and 8N:
`num_local_experts={4,8}`
Requirements
`NVSHMEM_IBGDA_ENABLE_MULTI_PORT=1` `num_rc_per_pe × num_devs` QP pool.
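For readers who want to reason about the pool layout described under Solution, here is a minimal host-side sketch of one way an index into a `[NIC 0 QPs | NIC 1 QPs | ... | NIC N-1 QPs]` pool could be derived from `expert_local_idx`. It is not the code in this PR: `QpPool`, `select_qp`, and `nics_used` are hypothetical names, and the round-robin mapping is an assumption chosen only to illustrate why `num_local_experts >= num_devs` matters and why a single-NIC setup falls back to the plain per-NIC index.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of the per-peer QP pool; these names are
// illustrative and do not come from DeepEP or NVSHMEM.
struct QpPool {
    int num_rc_per_pe;  // QPs allocated per NIC (or plane) for one peer
    int num_devs;       // NICs (or planes) assigned to this PE
};

// Assumed mapping: pick the NIC round-robin by expert index, then pick a QP
// inside that NIC's contiguous block of the pool.
inline int select_qp(const QpPool& pool, int expert_local_idx) {
    assert(pool.num_rc_per_pe > 0 && pool.num_devs > 0 && expert_local_idx >= 0);
    const int nic = expert_local_idx % pool.num_devs;
    const int slot = (expert_local_idx / pool.num_devs) % pool.num_rc_per_pe;
    return nic * pool.num_rc_per_pe + slot;
}

// Counts how many distinct NICs receive traffic for a given expert count.
inline int nics_used(const QpPool& pool, int num_local_experts) {
    std::vector<bool> seen(pool.num_devs, false);
    int used = 0;
    for (int e = 0; e < num_local_experts; ++e) {
        const int nic = select_qp(pool, e) / pool.num_rc_per_pe;
        if (!seen[nic]) {
            seen[nic] = true;
            ++used;
        }
    }
    return used;
}
```

Under these assumptions, with `num_rc_per_pe = 4` and `num_devs = 2`, experts 0 to 3 land on QPs 0, 4, 1, and 5, so both halves of the pool are used. With `num_devs = 1` the NIC term vanishes and the result is simply `expert_local_idx % num_rc_per_pe`, and with a single local expert on two NICs `nics_used` reports 1, which mirrors the unused-NIC case noted in the description.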