[HybridEP] Add dense top-k routing scan path by harryzhou2000 · Pull Request #673 · deepseek-ai/DeepEP

harryzhou2000 · 2026-06-29T06:36:17Z

Summary

Add dense int16 top-k routing input for HybridEP metadata preprocessing.
Optimize dense scan with local-expert and rank bitsets.
Improve HybridEP test diagnostics and benchmark labels, including with/without-probs paths.
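To make the two routing layouts concrete, here is a minimal pure-Python sketch of how one dense top-k row relates to a sparse bool routing row, a per-token rank bitset, and a per-expert probability row. It is not DeepEP code. The function names, the contiguous expert-to-rank layout (`experts_per_rank`), and the list-based representation are assumptions for illustration. The `-1` dropped-slot sentinel follows the commit description for this PR.

```python
from typing import List

DROPPED = -1  # sentinel for a dropped top-k slot (per the commit description)


def dense_to_sparse_row(topk_row: List[int], num_experts: int) -> List[bool]:
    """Expand one dense top-k row into a bool row of length num_experts."""
    row = [False] * num_experts
    for expert in topk_row:
        if expert == DROPPED:
            continue  # a dropped slot routes nowhere
        if not 0 <= expert < num_experts:
            raise ValueError(f"expert id {expert} out of range")
        row[expert] = True
    return row


def rank_bitset(topk_row: List[int], experts_per_rank: int) -> int:
    """Return an int whose bit r is set when the token routes to rank r.

    Assumes experts are assigned to ranks in contiguous blocks
    (a hypothetical layout chosen for this sketch).
    """
    bits = 0
    for expert in topk_row:
        if expert != DROPPED:
            bits |= 1 << (expert // experts_per_rank)
    return bits


def scatter_probs(topk_row: List[int], topk_probs: List[float],
                  num_experts: int) -> List[float]:
    """Scatter per-slot probs into a per-expert row, masking dropped slots."""
    out = [0.0] * num_experts
    for expert, prob in zip(topk_row, topk_probs):
        if expert != DROPPED:
            out[expert] = prob
    return out


if __name__ == "__main__":
    row = [5, 0, DROPPED, 6]  # K = 4 with one dropped slot
    print(dense_to_sparse_row(row, num_experts=8))
    print(bin(rank_bitset(row, experts_per_rank=2)))
    print(scatter_probs(row, [0.5, 0.25, 0.9, 0.25], num_experts=8))
```

The bitset form lets a scan test "does this token touch rank r" with a single mask operation instead of re-walking the K slots for every rank, which is the general idea behind a bitset-based lookup. The actual kernel in this PR may organize the work differently.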

Performance

Primary impact is scan preprocessing: dense scan is significantly faster on the target dense top-k path. End-to-end dispatch/combine bandwidth is effectively unchanged versus upstream hybrid-ep in single-node B300 tests, with no observed regression.

Testing

Built on B300 with TORCH_CUDA_ARCH_LIST="10.0;10.3+PTX".
Ran tests/test_hybrid_ep.py --num-processes 8 on B300 with the default config.
Ran the same test for H=512/E=8/K=8 and H=4096/E=8/K=8.
BF16 and FP8 correctness passed, including dense top-k scan and dense top-k scan+permute paths.
Re-ran updated test script against upstream hybrid-ep; dense-routing checks skip cleanly when unsupported.

Teach the Hybrid-EP metadata preprocessing path to accept dense int16 topk_idx rows in addition to the existing sparse bool routing map. The scan kernel now templates on TOPK, derives per-rank/per-node routing by range-checking dense expert IDs, reconstructs local expert routing data where needed, and includes TOPK in the JIT cache key so sparse and dense preprocessing kernels do not alias.

Plumb the dense routing mode through HybridEpConfigInstance, pybind, the JIT compiler, Executor::allgather_routing_map, dispatch(), and dispatch_with_permute(). Dense routing converts topk_idx to contiguous int16, keeps handle reuse working without requiring topk_idx again, and masks -1 dropped-token sentinels when reconstructing probs on the Python side. The API rejects expert counts beyond the int16 representable range to avoid silent wraparound.

Handle collectives for the dense layout: NCCL allgather views int16 tensors as int8 because NCCL does not directly support int16, and the custom NVLink allgather now sizes its backing buffer for the larger of sparse bool rows and dense topk rows. If a routing row is not 16B-aligned, the executor falls back to NCCL instead of launching the custom uint4 allgather kernel.

Extend test_hybrid_ep.py to compare dense-routing dispatch against the existing sparse reference for with-probs and no-probs cases, then run combine with the dense handle to validate the metadata maps are usable beyond dispatch.
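The size and alignment rules described above can be sketched as plain arithmetic. This is only an illustration under stated assumptions: a sparse row is taken to be one byte per expert, a dense row two bytes per top-k slot, and the int16 bound is taken to mean that the largest expert ID must fit in a signed 16-bit value. All function names are hypothetical, and the exact checks in the PR may differ.

```python
INT16_MAX = 2**15 - 1  # largest value a signed 16-bit integer can hold
VECTOR_BYTES = 16      # size of one CUDA uint4 (four 32-bit words)


def check_expert_count(num_experts: int) -> None:
    """Reject expert counts whose largest ID would not fit in int16.

    Assumed bound for this sketch: IDs run 0..num_experts-1.
    """
    if num_experts < 1 or num_experts - 1 > INT16_MAX:
        raise ValueError(f"num_experts={num_experts} does not fit int16 IDs")


def row_bytes(num_experts: int, topk: int, dense: bool) -> int:
    """Bytes per routing row: int16 per slot (dense) or bool per expert (sparse)."""
    return topk * 2 if dense else num_experts


def buffer_row_bytes(num_experts: int, topk: int) -> int:
    """Size the shared buffer for the larger of the two row layouts."""
    return max(row_bytes(num_experts, topk, dense=False),
               row_bytes(num_experts, topk, dense=True))


def pick_allgather(nbytes: int) -> str:
    """Use the vectorized path only when a row is a whole number of uint4s."""
    return "custom" if nbytes % VECTOR_BYTES == 0 else "nccl"


def int8_view_numel(topk: int) -> int:
    """Element count after viewing an int16 row as int8 (byte count unchanged)."""
    return topk * 2


if __name__ == "__main__":
    # Illustrative numbers only, not the PR's test configuration.
    num_experts, topk = 64, 6
    check_expert_count(num_experts)
    for dense in (False, True):
        nbytes = row_bytes(num_experts, topk, dense)
        print("dense" if dense else "sparse", nbytes, pick_allgather(nbytes))
    print("buffer row bytes:", buffer_row_bytes(num_experts, topk))
```

With these illustrative numbers the 64-byte sparse row is a multiple of 16 bytes, while the 12-byte dense row is not and would take the fallback path. Sizing the buffer by the maximum of the two layouts lets one allocation serve either mode.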

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

harryzhou2000 added 8 commits June 28, 2026 21:08

Improve test_hybrid_ep.py: check_bitwise diagnostics, probs=False benchmarks, --only-bf16, permute/unpermute kernel profiling

149ff76

Optimize sparse prob communication

62f7232

Fix sparse prob copy edge cases

db9880b

[HybridEP] Optimize dense scan local expert lookup

482a958

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

[HybridEP] Use rank bitsets in dense scan

10358ab

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

[HybridEP] Standardize benchmark probability labels

88f1b6a

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

[HybridEP] Test dense topk permute routing

18d401a

Signed-off-by: Harry Zhou <hhanyu@nvidia.com>

harryzhou2000 wants to merge 8 commits into deepseek-ai:hybrid-ep from harryzhou2000:hhanyu/hybrid-ep-dense-routing-scan-opt
