[HybridEP] Add dense top-k routing scan path #673
Draft
harryzhou2000 wants to merge 8 commits into
…chmarks, --only-bf16, permute/unpermute kernel profiling
Teach the Hybrid-EP metadata preprocessing path to accept dense int16 topk_idx rows in addition to the existing sparse bool routing map. The scan kernel now templates on TOPK, derives per-rank/per-node routing by range-checking dense expert IDs, reconstructs local expert routing data where needed, and includes TOPK in the JIT cache key so sparse and dense preprocessing kernels do not alias.

Plumb the dense routing mode through HybridEpConfigInstance, pybind, the JIT compiler, Executor::allgather_routing_map, dispatch(), and dispatch_with_permute(). Dense routing converts topk_idx to contiguous int16, keeps handle reuse working without requiring topk_idx again, and masks -1 dropped-token sentinels when reconstructing probs on the Python side. The API rejects expert counts beyond the int16 representable range to avoid silent wraparound.

Handle collectives for the dense layout: NCCL allgather views int16 tensors as int8 because NCCL does not directly support int16, and the custom NVLink allgather now sizes its backing buffer for the larger of sparse bool rows and dense topk rows. If a routing row is not 16B-aligned, the executor falls back to NCCL instead of launching the custom uint4 allgather kernel.

Extend test_hybrid_ep.py to compare dense-routing dispatch against the existing sparse reference for with-probs and no-probs cases, then run combine with the dense handle to validate the metadata maps are usable beyond dispatch.
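For reviewers, the relationship between the two routing layouts can be sketched in plain Python. This is a host-side reference model only, not the scan kernel or the Python wrapper from this PR. The helper names (`check_expert_count`, `dense_to_sparse`, `routes_to_rank`, `row_is_16b_aligned`) are hypothetical, and the exact int16 threshold, the contiguous experts-per-rank layout, and the one-byte bool element size are assumptions made for illustration.

```python
from typing import List

INT16_MAX = 2**15 - 1  # largest expert ID representable as int16
DROPPED = -1           # sentinel marking a dropped token slot


def check_expert_count(num_experts: int) -> None:
    """Reject expert counts whose largest ID would not fit in int16 (assumed threshold)."""
    if num_experts - 1 > INT16_MAX:
        raise ValueError(f"num_experts={num_experts} exceeds the int16 range")


def dense_to_sparse(topk_idx: List[List[int]], num_experts: int) -> List[List[bool]]:
    """Expand dense top-k rows into a [num_tokens][num_experts] bool routing map."""
    check_expert_count(num_experts)
    routing_map = []
    for row in topk_idx:
        sparse_row = [False] * num_experts
        for expert in row:
            if expert == DROPPED:
                continue  # dropped slots contribute no routing
            if not 0 <= expert < num_experts:
                raise ValueError(f"expert ID {expert} out of range")
            sparse_row[expert] = True
        routing_map.append(sparse_row)
    return routing_map


def routes_to_rank(row: List[int], rank: int, experts_per_rank: int) -> bool:
    """Range-check dense expert IDs against the experts owned by `rank`.

    Assumes each rank owns a contiguous block of expert IDs. The -1 sentinel
    never matches because every block starts at a non-negative ID.
    """
    lo = rank * experts_per_rank
    hi = lo + experts_per_rank
    return any(lo <= expert < hi for expert in row)


def row_is_16b_aligned(elems_per_row: int, elem_size: int) -> bool:
    """True when one routing row occupies a whole number of 16-byte words."""
    return (elems_per_row * elem_size) % 16 == 0


# Two tokens, top-2 routing, 8 experts split across 2 ranks.
topk_idx = [[0, 5], [3, DROPPED]]
sparse = dense_to_sparse(topk_idx, num_experts=8)
per_rank = [[routes_to_rank(row, r, 4) for r in range(2)] for row in topk_idx]
```

Under these assumptions, a dense row with TOPK = 8 occupies 8 × 2 = 16 bytes and passes the alignment check, while TOPK = 6 gives 12 bytes and would take the NCCL fallback described above.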
Signed-off-by: Harry Zhou <hhanyu@nvidia.com>
Summary
Performance
Primary impact is scan preprocessing: dense scan is significantly faster on the target dense top-k path. End-to-end dispatch/combine bandwidth is effectively unchanged versus upstream hybrid-ep in single-node B300 tests, with no observed regression.
Testing
TORCH_CUDA_ARCH_LIST="10.0;10.3+PTX".
tests/test_hybrid_ep.py --num-processes 8 on B300 for default config.