[HybridEP] Add dense top-k routing scan path #673

Draft
harryzhou2000 wants to merge 8 commits into deepseek-ai:hybrid-ep from harryzhou2000:hhanyu/hybrid-ep-dense-routing-scan-opt

@harryzhou2000

Summary

  • Add dense int16 top-k routing input for HybridEP metadata preprocessing.
  • Optimize dense scan with local-expert and rank bitsets.
  • Improve HybridEP test diagnostics and benchmark labels, including with/without-probs paths.
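
The bitset idea in the second bullet can be modelled roughly as below. This is an illustrative pure-Python sketch, not the kernel code: the function name `row_bitsets`, its parameters, and the contiguous expert-to-rank layout are all assumptions made for the example.

```python
def row_bitsets(topk_row, experts_per_rank, local_rank):
    """Fold one dense top-k row into a rank bitset and a local-expert bitset.

    Assumption: experts are laid out contiguously, so expert e lives on
    rank e // experts_per_rank. Entries of -1 mark dropped tokens.
    """
    local_start = local_rank * experts_per_rank
    rank_bits = 0
    local_bits = 0
    for e in topk_row:
        if e < 0:  # dropped-token sentinel contributes nothing
            continue
        rank_bits |= 1 << (e // experts_per_rank)
        offset = e - local_start
        if 0 <= offset < experts_per_rank:
            local_bits |= 1 << offset
    return rank_bits, local_bits


# 8 experts over 4 ranks (2 per rank); this process is rank 1 (experts 2 and 3).
rank_bits, local_bits = row_bitsets([3, 6, -1, 2], experts_per_rank=2, local_rank=1)
```

With both masks in hand, "does this token go to rank r" and "does it hit local expert j" become single bit tests instead of repeated scans over the top-k slots.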

Performance

The primary impact is on scan preprocessing: the dense scan is significantly faster on the target dense top-k path. End-to-end dispatch/combine bandwidth is effectively unchanged versus upstream hybrid-ep in single-node B300 tests, with no observed regression.

Testing

  • Built on B300 with TORCH_CUDA_ARCH_LIST="10.0;10.3+PTX".
  • Ran tests/test_hybrid_ep.py --num-processes 8 on B300 with the default config.
  • Ran the same test with H=512/E=8/K=8 and with H=4096/E=8/K=8.
  • BF16 and FP8 correctness passed, including dense top-k scan and dense top-k scan+permute paths.
  • Re-ran updated test script against upstream hybrid-ep; dense-routing checks skip cleanly when unsupported.

…chmarks, --only-bf16, permute/unpermute kernel profiling
Teach the Hybrid-EP metadata preprocessing path to accept dense int16 topk_idx rows in addition to the existing sparse bool routing map. The scan kernel now templates on TOPK, derives per-rank/per-node routing by range-checking dense expert IDs, reconstructs local expert routing data where needed, and includes TOPK in the JIT cache key so sparse and dense preprocessing kernels do not alias.
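
As a rough illustration of how the two input layouts relate, the sketch below expands a dense top-k row into the equivalent bool routing-map row and derives per-rank routing by range-checking expert IDs. It is a plain-Python model, not the scan kernel; the helper names and the contiguous expert-to-rank layout are assumptions.

```python
def sparse_row_from_dense(topk_row, num_experts):
    """Expand a dense top-k row into the equivalent bool routing-map row."""
    row = [False] * num_experts
    for e in topk_row:
        if 0 <= e < num_experts:  # also skips -1 dropped-token sentinels
            row[e] = True
    return row


def routes_to_rank(topk_row, rank, experts_per_rank):
    """Range-check dense expert IDs against the experts owned by `rank`.

    Assumption: rank r owns experts [r * experts_per_rank, (r + 1) * experts_per_rank).
    """
    lo = rank * experts_per_rank
    hi = lo + experts_per_rank
    return any(lo <= e < hi for e in topk_row)


topk_row = [5, 0, -1]  # TOPK = 3, one slot dropped
sparse = sparse_row_from_dense(topk_row, num_experts=8)
per_rank = [routes_to_rank(topk_row, r, experts_per_rank=2) for r in range(4)]
```

The dense form inspects TOPK entries per token rather than one flag per expert, which is why the range check is attractive when TOPK is small relative to the expert count.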

Plumb the dense routing mode through HybridEpConfigInstance, pybind, the JIT compiler, Executor::allgather_routing_map, dispatch(), and dispatch_with_permute(). Dense routing converts topk_idx to contiguous int16, keeps handle reuse working without requiring topk_idx again, and masks -1 dropped-token sentinels when reconstructing probs on the Python side. The API rejects expert counts beyond the int16 representable range to avoid silent wraparound.
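
The sentinel masking and the int16 range check might look roughly like the following. This is a hedged pure-Python sketch: `check_num_experts` and `dense_probs_row` are hypothetical names, the per-expert probs layout is assumed, and the exact limit the real API enforces may differ.

```python
INT16_MAX = 2**15 - 1  # largest expert ID an int16 can hold


def check_num_experts(num_experts):
    """Reject expert counts whose largest ID would not fit in int16 (assumed limit)."""
    if num_experts - 1 > INT16_MAX:
        raise ValueError(f"num_experts={num_experts} exceeds the int16 ID range")


def dense_probs_row(topk_row, topk_probs, num_experts):
    """Scatter per-slot probabilities into a per-expert row, skipping -1 slots."""
    check_num_experts(num_experts)
    row = [0.0] * num_experts
    for e, p in zip(topk_row, topk_probs):
        if e == -1:  # dropped token: unmasked, row[-1] would be silently overwritten
            continue
        row[e] = p
    return row


probs = dense_probs_row([2, -1, 0], [0.5, 0.3, 0.2], num_experts=4)
```

The mask matters because a negative index is a valid position in Python and PyTorch indexing, so an unmasked -1 would corrupt the last expert's entry without raising.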

Handle collectives for the dense layout: NCCL allgather views int16 tensors as int8 because NCCL does not directly support int16, and the custom NVLink allgather now sizes its backing buffer for the larger of sparse bool rows and dense topk rows. If a routing row is not 16B-aligned, the executor falls back to NCCL instead of launching the custom uint4 allgather kernel.
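
The collective-selection logic described above can be sketched as follows. This is a simplified model under stated assumptions: a sparse row is taken to be one byte per expert, a dense row two bytes per top-k slot, and "16B-aligned" is read as "row size is a multiple of 16 bytes". The function names are hypothetical.

```python
from array import array

ALIGN_BYTES = 16  # width of one uint4 vector access


def row_bytes(num_experts, topk, dense):
    """Bytes per routing row: int16 per top-k slot (dense) or one byte per expert (sparse)."""
    return topk * 2 if dense else num_experts


def pick_allgather(num_experts, topk, dense):
    """Use the custom kernel only when a row is a multiple of 16 bytes; otherwise NCCL."""
    aligned = row_bytes(num_experts, topk, dense) % ALIGN_BYTES == 0
    return "custom" if aligned else "nccl"


def buffer_row_bytes(num_experts, topk):
    """Size the shared backing buffer for whichever layout needs more bytes per row."""
    return max(row_bytes(num_experts, topk, False), row_bytes(num_experts, topk, True))


# Reinterpreting int16 storage as raw bytes, analogous to viewing the tensor as int8.
ids = array("h", [3, -1, 7, 0])
as_bytes = memoryview(ids).cast("B")
```

Under these assumptions a dense row with K=8 is exactly 16 bytes and can take the custom path, while K=6 (12 bytes) would fall back to NCCL.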

Extend test_hybrid_ep.py to compare dense-routing dispatch against the existing sparse reference for with-probs and no-probs cases, then run combine with the dense handle to validate the metadata maps are usable beyond dispatch.
Signed-off-by: Harry Zhou <hhanyu@nvidia.com>