Rebase SM90 SBO features by b8zhong · Pull Request #51 · sgl-project/DeepGEMM

b8zhong · 2026-06-22T18:13:22Z

Requested in sgl-project/sglang#25664 (I will test it myself on H200 later).

Co-Authored-By: rainj-me <rain-jiang@outlook.com>

…el (deepseek-ai#27)

Two related additions for the DeepSeek-V4-Pro mega-MoE path:

1. **FP4 (E2M1) activations + `kind::mxf4` mainloop opt-in** for `fp8_fp4_mega_moe`.
   - `DG_USE_FP4_ACTS=1` halves the symm-buffer x-slot footprint (E2M1 nibbles vs E4M3 bytes); SF slot unchanged (still `hidden/32` UE8M0 bytes under gran_k=32).
   - `use_mxf4_kind=true` switches the L1+L2 mainloops to `cta_group::2 kind::mxf4` (2-CTA cluster) with dense FP4 smem layout (`_ALIGN8B`, 2 nibbles/byte). Per-stage A/B byte footprint halves → num_stages doubles for the same smem budget.
   - Threads `cumulative_local_expert_recv_stats` through the public mega-MoE API for per-rank expert counters used by sglang's expert-distribution recorder.
   - Block-m heuristic: under `use_mxf4_kind`, bumps `block_m=16 → 32` for the smallest-tokens-per-expert bucket so `load_block_m * block_k / 2` meets the 1024-byte smem alignment.
   - Multi-block_m support via `kCandidateBlockM` array + LCM-aligned pool padding; replaces the static `block_m=192` heuristic with token-density dispatch (8/16/32/64/96/128/192).
2. **`mega_moe_pre_dispatch` kernel**: BF16 → quant + topk-copy + pad-fill in one launch, gated on `kUseFp4Acts` + `kUsePDL`. Templated on `(kGroupSize, kUseFp4Acts, kUsePDL)`. Uses bucketize-style E2M1 encoder for byte-exact match against the `per_token_cast_to_fp4` host helper.
   - New: `deep_gemm.mega_moe_pre_dispatch(x, topk_idx, topk_weights, buf_x, buf_x_sf, buf_topk_idx, buf_topk_weights, num_tokens, group_size, use_fp4_acts)`
   - Test: `tests/test_mega_moe_pre_dispatch.py` — single-GPU bytewise check against host `per_token_cast_to_fp{8,4}` + pad-fill assertion.

Validated end-to-end on 8× B300 with DeepSeek-V4-Pro at 8K input bench:
- FP4 acts + MXF4 kind path produces matching tokens vs the FP8 baseline (rel-RMSE ≤ 0.5 sentinel; GSM8K accuracy parity within run-to-run variance).

PR also includes existing FP4-mega-MoE supporting changes that are required by the kernel:
- `cluster_sync_with_relaxed_arrive` helper (used twice in `sm100_fp8_fp4_mega_moe.cuh`).
- `cvt_pack_f32_to_e2m1x2` / `cvt_pack_f32x4_to_e2m1x4` PTX wrappers.
- `SM100_MMA_MXF4_2x1SM_SS` 2-CTA cluster MMA wrapper.
- Generalized `red_add(int*, int)` for the `cumulative_local_expert_recv_stats` counter.
- `st.L1::no_allocate.relaxed.sys.global.u64` (correctness fix: previous generic-address variant could miss the global state space).

Co-authored-by: pranjalssh <adkz.photos@gmail.com>
(cherry picked from commit bca278e)
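To make the "bucketize-style E2M1 encoder" and the "2 nibbles/byte" layout concrete, here is a minimal pure-Python sketch. It is not the kernel or the `per_token_cast_to_fp4` helper: the function names are hypothetical, the tie-breaking at bucket boundaries (ties go to the smaller magnitude) and the low-nibble-first packing order are assumptions, and scale factors are left out entirely.

```python
from bisect import bisect_left

# The eight non-negative magnitudes representable in FP4 E2M1.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Midpoints between adjacent magnitudes; a value is bucketed by how many it exceeds.
BOUNDARIES = [0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0]


def encode_e2m1(x: float) -> int:
    """Return a 4-bit code: bit 3 is the sign, bits 0-2 index E2M1_VALUES."""
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), 6.0)  # saturate at the largest representable magnitude
    # Assumption: a value exactly on a boundary goes to the smaller magnitude.
    return sign | bisect_left(BOUNDARIES, mag)


def decode_e2m1(code: int) -> float:
    """Inverse of encode_e2m1 for a single 4-bit code."""
    value = E2M1_VALUES[code & 0x7]
    return -value if code & 0x8 else value


def pack_nibbles(codes: list[int]) -> bytes:
    """Pack two 4-bit codes per byte (assumed order: first code in the low nibble)."""
    assert len(codes) % 2 == 0, "need an even number of codes"
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


if __name__ == "__main__":
    row = [1.0, -3.2, 0.3, 7.5]
    codes = [encode_e2m1(v) for v in row]
    packed = pack_nibbles(codes)
    print([decode_e2m1(c) for c in codes], len(packed), "bytes for", len(row), "values")
```

Four values pack into two bytes, which is the halving of the x-slot footprint relative to one E4M3 byte per value.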

…bine path) (deepseek-ai#28)

* Add DG_USE_FP8_COMBINE: FP8 + per-row UE8M0 SF on the second a2a (combine path)

  The mega-MoE second all-to-all (combine) currently ships BF16 over NVLink: each token, each topk slot = kHidden * 2 bytes. This commit adds an env-gated FP8 path that ships FP8 E4M3 + a per-(token, N=128) UE8M0 SF byte — kHidden + kHidden/128 bytes per token per slot, half the NVLink bytes.

  Wiring:
  - New `kUseFp8Combine` template flag (default false → keeps BF16 path byte-identical when off).
  - New `combine_sf_buffer` symm-buffer slot, sized kHidden/128 bytes per (token, slot) when on, zero when off.
  - Host: `DG_USE_FP8_COMBINE=1` env flag in `mega.hpp`. Independent of `DG_USE_FP4_ACTS` / `DG_USE_MXF4_KIND` (those control the dispatch a2a + mainloops; this controls the combine a2a only).

  Producer side (L2 epilogue write-back, sm100_fp8_fp4_mega_moe.cuh):
  - Read 8 BF16 from smem (existing STSM target).
  - Compute per-row amax via `__shfl_xor_sync` reduction over the 16 lanes that share each row tile. Use a 16-lane mask (NOT 0xffffffff) — the outer `if (m_idx_in_block >= valid_m) break` may cause the OTHER half-warp to exit on padding rows, and a full-warp shfl would deadlock.
  - Compute UE8M0 SF (E4M3 finfo_max=448, mirrors `get_e4m3_sf_and_sf_inv`).
  - Cast 8 BF16 → 8 FP8 via `__nv_fp8x4_e4m3(float4)` ×2; pack into uint64.
  - Write 8 FP8 bytes to remote (vs 16 BF16 bytes). Lane 0 of the 16-lane group writes the SF byte to `combine_sf_buffer`.

  Consumer side (combine reduce):
  - Per-slot SF base ptr cached at slot start.
  - TMA-load FP8 chunk (kNumChunkBytes / 2 bytes when kUseFp8Combine).
  - Per uint4 (16 FP8): __ldg the SF byte for the segment; FP8 → FP16x2 via `cvt.rn.f16x2.e4m3x2`, FP16 → FP32 via `cvt.f32.f16`, then `__fmaf_rn(val, sf, acc)` for the accumulate-with-dequant.
  - BF16 store-buffer layout for FP8 path: 2 BF16 uint4 per input uint4 (16 elements → 2 × 8 BF16 stripes), at indices (j*32+lane)*2 + {0,1}. Total store uint4/lane same as BF16 path (kNumChunkUint4Bf16 / 32).

  Validation:
  - Microbench (`ptx/d_combine_reduce_v{1,2}_*`):
    - v1 BF16 baseline: 6,895 cycles/token, max_abs=0 (perfect).
    - v2 FP8 + UE8M0 SF: correctness PASS (max_abs=0 vs host reference that uses the same FP8 quant), 50% NVLink bytes savings.
  - Single-GPU iso bench (8x B300, fp4_mxf4 vs fp4_mxf4+combine):
    - b=128: 364 us → 359 us (+1.5%)
    - b=512: 377 us → 386 us (-2.2%)
    - b=2048: 710 us → 739 us (-3.9%)

    Single-GPU is compute-bound (no NVLink saving); production is the point of the change.
  - E2E DeepSeek-V4-Pro on 8x B300 (b=8192 input, 1024 output):
    - b=512: 91.92 s (FP8) → 78.37 s (FP4+MXF4+FP8combine) — +17.3%
    - b=2048: 259.4 s (FP8) → 238.2 s — +8.9%
    - b=4096: 489.5 s (FP8) → 444.2 s — +10.2%

  Sentinel test (FP4 acts vs FP8 acts): rel-RMSE <= 0.5 still passes.

  Numerical: rel-RMSE on synthetic random init = 0.027 (combine FP8 vs BF16 baseline, w/o SwiGLU clamping → tail outliers). Real activations post-SwiGLU + topk-weighting are bounded; production accuracy parity preserved (same GSM8K results as FP4 baseline).

* Combine reduce: HFMA path (FP16 accumulator + fma.f16x2)

  Switch the FP8 combine reduce inner loop from FP32 accumulator + scalar fma to FP16x2 accumulator + hfma.f16x2. Halves the per-element op count and halves the accumulator register pressure (94 regs vs 138 regs).

  Inner loop, before:
  - cvt.rn.f16x2.e4m3x2 (FP8x2 → FP16x2)
  - cvt.f32.f16 ×2 (FP16 → FP32)
  - fma.rn.f32 ×2 (acc += sf_f32 * f32_val)
  - = 5 ops per FP8x2 (= 2 elements)

  After:
  - cvt.rn.f16x2.e4m3x2 (FP8x2 → FP16x2)
  - fma.rn.f16x2 (acc_fp16x2 += sf_pair * f16x2)
  - = 2 ops per FP8x2

  SF in FP16: UE8M0 byte → 1.0 * 2^(byte-127), packed as FP16 with bias 15. Out-of-range SFs (byte < 112 or > 142) clamp to 0 / FP16-max — production activations post-SwiGLU + topk-weighting fit comfortably in FP16 range.

  End cast: FP16x2 → __half22float2 → __float22bfloat162_rn for the gmem write-back (BF16 output unchanged).

  Microbench (`ptx/d_combine_reduce_v3_fp8_hfma`):
  - v1 BF16 baseline: 6,895 cycles/token
  - v2 FP8 + FP32 acc: 10,797 cycles/token (+57% vs v1)
  - v3 FP8 + FP16 HFMA: **5,799 cycles/token (-16% vs v1, -46% vs v2)**

  E2E DeepSeek-V4-Pro 8x B300, 8K input + 1024 output:

  | batch | FP4+MXF4 | combine FP32 | combine HFMA |
  |------:|---------:|-------------:|-------------:|
  | 512 | — | 7,526 | 7,350 |
  | 2048 | 9,814 | 9,903 | **9,992** |
  | 4096 | 10,418 | 10,622 | **10,699** |

  HFMA wins at 2048/4096; ~tie at 512. Worth keeping as the default.

  Numerical: v3 microbench correctness max_abs=0.0625, rel_rmse=3.8e-4 vs the FP32 reference. Production activations: still within sentinel tolerance (rel-RMSE ≤ 0.5 vs FP8 baseline).

* Revert "Combine reduce: HFMA path (FP16 accumulator + fma.f16x2)"

  This reverts commit 48e8101.

---------

Co-authored-by: pranjalssh <adkz.photos@gmail.com>
(cherry picked from commit 8fc78b4)
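The scale-factor arithmetic and the byte accounting in the FP8 combine commit can be sketched in plain Python. This is an illustration under stated assumptions, not a port of `get_e4m3_sf_and_sf_inv`: the function names, the `min_amax` clamp, the round-up-to-power-of-two rule, and the example hidden size of 7168 are all hypothetical. Only the 448 maximum, the `2^(byte-127)` decoding, and the `kHidden + kHidden/128` versus `kHidden * 2` byte counts come from the commit message.

```python
import math

E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude, as quoted in the commit message


def ue8m0_byte_for_amax(amax: float, min_amax: float = 1e-4) -> int:
    """Encode the smallest power-of-two scale 2**e with amax / 2**e <= 448 as e + 127.

    Assumption: the scale is rounded up to a power of two and amax is clamped
    from below by min_amax so that all-zero rows still get a finite scale.
    """
    sf = max(amax, min_amax) / E4M3_MAX
    mantissa, exponent = math.frexp(sf)  # sf == mantissa * 2**exponent, 0.5 <= mantissa < 1
    e = exponent - 1 if mantissa == 0.5 else exponent  # exact ceil(log2(sf))
    return max(0, min(254, e + 127))  # 255 is reserved for NaN in UE8M0


def ue8m0_to_float(byte: int) -> float:
    """Decode a UE8M0 byte to its power-of-two scale."""
    return 2.0 ** (byte - 127)


def combine_bytes_per_token_slot(hidden: int, use_fp8: bool, group: int = 128) -> int:
    """Bytes shipped per (token, topk slot) on the combine all-to-all."""
    if use_fp8:
        return hidden + hidden // group  # FP8 payload + one SF byte per 128 elements
    return hidden * 2  # BF16 payload


if __name__ == "__main__":
    hidden = 7168  # hypothetical hidden size, for illustration only
    bf16 = combine_bytes_per_token_slot(hidden, use_fp8=False)
    fp8 = combine_bytes_per_token_slot(hidden, use_fp8=True)
    print(f"BF16: {bf16} B, FP8+SF: {fp8} B, ratio {fp8 / bf16:.4f}")
    print("SF byte for amax=449:", ue8m0_byte_for_amax(449.0))
```

With one SF byte per 128 elements the FP8 payload is slightly more than half of the BF16 payload (a ratio of about 0.504 for any hidden size divisible by 128).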

Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

Co-authored-by: Brayden Zhong <brayden@radixark.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…with setuptools>=77 Co-authored-by: Brayden Zhong <brayden@radixark.ai>

Add host-side assertions and fix SM100 FP8/FP4 (paged) MQA logits kernels for num_heads == 16.

Co-authored-by: yinding <yinding@bytedance.com>

Clarify pull request rebase instructions and testing location.

Port the Single Batch Overlap (SBO) signal mechanism from the archive-release branch onto the refactored dev base.
- sm90_fp8_gemm_1d2d.cuh: kEnableOverlap template param + signal pointer; emit a per-(group, m-block) completion counter via ptx::atomic_add_rel after the TMA store lands (ptx::tma_store_wait<0>).
- GemmDesc: max_block_n / enable_overlap; SM90 heuristic caps block_n.
- sm90 masked launch + gemm.hpp API return (block_m, signal_threshold); wired through both the pybind (_C) and tvm-ffi bindings + Python wrappers.
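The signal mechanism described above can be pictured with a small host-side simulation. This is a sketch under assumptions, not the kernel or its test helper: it assumes the counter array is laid out as `group * num_m_blocks + m_block` and that `signal_threshold` equals the number of N-blocks per M-block, neither of which is stated in the commit message. All function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def simulate_signals(masked_m, max_m, n, block_m, block_n):
    """Mimic the per-(group, m-block) completion counters of a masked grouped GEMM.

    Each finished (m-block, n-block) tile adds 1 to its m-block's counter,
    standing in for the atomic add issued after the TMA store has completed.
    """
    num_m_blocks = ceil_div(max_m, block_m)
    signal_threshold = ceil_div(n, block_n)  # assumed meaning of the returned threshold
    signal = [0] * (len(masked_m) * num_m_blocks)
    for group, valid_m in enumerate(masked_m):
        for m_block in range(ceil_div(valid_m, block_m)):  # only blocks with valid rows run
            for _n_block in range(signal_threshold):
                signal[group * num_m_blocks + m_block] += 1
    return signal, signal_threshold


def finished_m_blocks(signal, signal_threshold, group, num_m_blocks):
    """A consumer may read an m-block once its counter reaches the threshold."""
    base = group * num_m_blocks
    return [i for i in range(num_m_blocks) if signal[base + i] >= signal_threshold]


if __name__ == "__main__":
    signal, threshold = simulate_signals(
        masked_m=[100, 0], max_m=256, n=512, block_m=64, block_n=128
    )
    print(threshold, signal, finished_m_blocks(signal, threshold, group=0, num_m_blocks=4))
```

In this picture a downstream stage that overlaps with the GEMM polls the counters and starts consuming an M-block as soon as its counter reaches the threshold, rather than waiting for the whole kernel to finish.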

SM90-guarded overlap test for m-grouped masked GEMM plus check_signal helper; mirrored into both the source and sgl_deep_gemm test trees.

Fridge003 and others added 27 commits April 25, 2026 11:47

support tvm ffi interfaces

ffb3d29

Co-Authored-By: rainj-me <rain-jiang@outlook.com>

rebase on 0426 upstream

c82e589

Relax timeout to 180s

3d6ab9e

pin tvm to 0.1.9

f958b89

Support wheel compilation (deepseek-ai#26)

5af4329

Merge branch 'deepseek-ai:main' into release-0426

66ea23e

[Misc] Remove verbose import message of legacy

b97540e

Add tvm-ffi wrapper for w4a4 megamoe

8ab0096

Bump to v0.1.0

89f2f00

update readme

23105b2

Export the PDL utils of DeepGEMM (deepseek-ai#34)

46a2294

Co-authored-by: b8zhong <b8zhong@users.noreply.github.com>

Expose BF16 grouped GEMM wrappers

48eb6a6

Fix version of dev branch to 0.0.0

6c9eaca

Fix IMA guard in paged MQA logits scheduler (deepseek-ai#38)

86d705d

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Co-authored-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

Fix various issues in DeepGEMM tests (deepseek-ai#39)

09fc810

Co-authored-by: Brayden Zhong <brayden@radixark.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Change license in pyproject.toml to avoid build and publish failures …

7020fd8

…with setuptools>=77 Co-authored-by: Brayden Zhong <brayden@radixark.ai>

Move run tests script for DeepGEMM (deepseek-ai#42)

c872baf

Support num_heads == 16 in MQA logits (deepseek-ai#43)

2176dff

Add host-side assertions and fix SM100 FP8/FP4 (paged) MQA logits kernels for num_heads == 16.

Sm90 mega moe on sgl dev (deepseek-ai#36)

35d4d8c

Co-authored-by: yinding <yinding@bytedance.com>

Move MegaMoE Hopper test into sgl_deep_gemm tests (deepseek-ai#45)

a36f7fd

Update test modification instruction

b5238ad

Clarify pull request rebase instructions and testing location.

Add Hopper mega moe test to runner (deepseek-ai#46)

bdabf6c

chore: bump apache-tvm-ffi 0.1.9 -> 0.1.11 (deepseek-ai#47)

77c9522

feat: add test for signal GEMM

f4945a5

SM90-guarded overlap test for m-grouped masked GEMM plus check_signal helper; mirrored into both the source and sgl_deep_gemm test trees.

b8zhong mentioned this pull request Jun 22, 2026

Deprecate Hopper SBO feature sgl-project/sglang#25664

Open

Fridge003 force-pushed the dev branch from 77c9522 to 731e7c7 Compare June 30, 2026 00:43

b8zhong wants to merge 27 commits into sgl-project:dev from bzhng-development:sbo-rebase-on-dev


9 participants
