feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE inference on Blackwell GPUs with memory copy optimizations (#4490)
Conversation
Pull request overview
This PR adds an SM100 (Blackwell) MoE GEMM backend for TurboMind by integrating a cuBLAS grouped batched GEMM path and adjusting build/runtime logic so SM100 builds can coexist with SM90 CUTLASS kernels while working around SM100-specific memcpy instability.
Changes:
- Add a `cublasGemmGroupedBatchedEx`-based grouped GEMM kernel for BF16/FP16 MoE on SM100, with workspace reuse and reduced per-launch overhead.
- Update build/arch plumbing (SM100 arch, split SM90 kernels into a separate target, conditional registration/compilation defines).
- Adjust MoE weight/layout and copy behavior for SM100 (skip tiled conversion for grouped BF16/FP16, unfuse GatedSiLU, avoid `cuMemcpyBatchAsync` on SM100).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
Summary per file:
| File | Description |
|---|---|
| src/turbomind/models/llama/LlamaLinear.cu | Extends MoE gather to cover BF16/FP16 for unfused SM100 grouped path. |
| src/turbomind/models/llama/LlamaDenseWeight.cc | Disables fused GatedSiLU under SM100 grouped BF16/FP16 constraints. |
| src/turbomind/kernels/gemm/registry.h | Adds registry hook for SM100 grouped cuBLAS kernel. |
| src/turbomind/kernels/gemm/registry.cu | Conditionally registers SM90 kernels and SM100 grouped cuBLAS kernel. |
| src/turbomind/kernels/gemm/moe_utils_v2.h | Adds #pragma once. |
| src/turbomind/kernels/gemm/cublas.cu | Implements CublasGroupedKernel using cublasGemmGroupedBatchedEx. |
| src/turbomind/kernels/gemm/convert_v3.cu | Skips tiled conversion for SM100 grouped BF16/FP16 to match cuBLAS expectations. |
| src/turbomind/kernels/gemm/CMakeLists.txt | Splits SM90 kernels into gemm2_sm90, enables SM100 grouped cuBLAS via compile defs. |
| src/turbomind/kernels/gemm/arch.h | Adds Sm100 and updates compatibility ranges. |
| src/turbomind/core/copy.cc | Adds SM100 workaround to avoid cuMemcpyBatchAsync. |
| CMakeLists.txt | Adds 100a-real CUDA arch for B200 when CUDA ≥ 12.8. |
```cpp
// weight descriptor as Adesc; weight has no valid offsets -> Adesc.offsets=(nil) and Launch fails.
if (desc.group_axis != 0) {
    return false;
}
```
CublasGroupedKernel::is_feasible() does not verify desc.order_a/order_b/order_c against the layout assumptions in Launch (row-major A & D, col-major interpreted weight). Without these checks, this kernel could be selected for grouped GEMMs with different operand orders and then compute incorrect results. Add explicit order checks (and/or reuse Kernel::is_feasible logic for orders) while still allowing both kHalf and kBfloat16.
Suggested change:

```cpp
    }
}
// Launch assumes row-major A and D/C, with B interpreted as column-major weight.
if (desc.order_a != Order::kRowMajor || desc.order_b != Order::kColMajor
    || desc.order_c != Order::kRowMajor) {
    return false;
}
```
```cpp
// Use pre-allocated workspace for device pointer arrays (no cudaMalloc/Free per call)
const size_t one_array   = active_count * sizeof(void*);
const size_t total_bytes = 3 * one_array;
TM_CHECK_LE(total_bytes, workspace.tensormaps_size);
```
TM_CHECK_LE(total_bytes, workspace.tensormaps_size) will abort the process if the workspace is too small. Since this is a runtime property (depends on active_count), consider returning a nonzero error code with a clear message instead, so callers can fail gracefully (similar to the other argument validation in this function).
Suggested change (replace the hard `TM_CHECK_LE` with a graceful error return):

```cpp
if (total_bytes > workspace.tensormaps_size) {
    fprintf(stderr,
            "[TM][GEMM] CublasGrouped: workspace.tensormaps_size too small "
            "(required=%zu, available=%zu, active_count=%d)\n",
            total_bytes,
            workspace.tensormaps_size,
            active_count);
    return 1;
}
```
```cmake
set(_has_sm100 FALSE)
set(_archs_100 "${CMAKE_CUDA_ARCHITECTURES}")
list(FILTER _archs_100 INCLUDE REGEX "^100")
if(_archs_100 AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.5")
    set(_has_sm100 TRUE)
```
_has_sm100 is set but never used. If it’s not needed, remove it; if it is meant to drive later logic, wire it up so the intent is clear (unused variables in CMake can hide configuration bugs).
Suggested change (drop the unused flag):

```cmake
set(_archs_100 "${CMAKE_CUDA_ARCHITECTURES}")
list(FILTER _archs_100 INCLUDE REGEX "^100")
if(_archs_100 AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.5")
```
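Alternatively, if the flag is meant to drive later logic, it could be wired into the target explicitly. A hedged sketch of that option; the target name `gemm2` and the placement are assumptions, not the PR's actual CMake tree:

```cmake
# Sketch only: make the detected SM100 capability drive the compile definition,
# so the variable's purpose is visible at the point of use.
set(_has_sm100 FALSE)
set(_archs_100 "${CMAKE_CUDA_ARCHITECTURES}")
list(FILTER _archs_100 INCLUDE REGEX "^100")
if(_archs_100 AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "12.5")
    set(_has_sm100 TRUE)
endif()

if(_has_sm100)
    # 'gemm2' is a placeholder target name for illustration.
    target_compile_definitions(gemm2 PRIVATE ENABLE_CUBLAS_GROUPED)
endif()
```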
src/turbomind/kernels/gemm/cublas.cu (outdated)
```cpp
const uintptr_t kBadB = 0x320936400ULL;
if (B == nullptr || reinterpret_cast<uintptr_t>(B) == kBadB) {
    fprintf(stderr, "[TM][GEMM] CublasGrouped: B null or bad (B=%p)\n", (void*)B);
    return 1;
}
cudaPointerAttributes attr{};
if (cudaPointerGetAttributes(&attr, B) != cudaSuccess || attr.type != cudaMemoryTypeDevice) {
    fprintf(stderr, "[TM][GEMM] CublasGrouped: B not device ptr (attr.type=%d)\n", (int)attr.type);
    return 1;
}
```
B is guaranteed to be a device pointer here, so these checks (including the hard-coded kBadB address) are unnecessary and should be removed.
CI failed.
The Ninja link step uses the usual single-pass static linking behavior.
Motivation
TurboMind’s existing MoE path relies on CUTLASS-style fused/grouped kernels that target SM90. On NVIDIA Blackwell (SM100, e.g. B200), that path is not a drop-in replacement: building SM90 kernels for SM100 toolchains is problematic, and MoE inference needs a stable, vendor-supported grouped GEMM.
This PR adds a cuBLAS Grouped Batched GEMM path (`cublasGemmGroupedBatchedEx`, CUDA 12.5+) for BF16/FP16 MoE FFN on SM100, so models such as Qwen3.5 MoE can run on Blackwell. It also reduces per-launch overhead in the grouped cuBLAS launcher (fewer synchronizations, no per-call device malloc for pointer arrays, reuse of pre-allocated workspace) and applies a safe fallback where `cuMemcpyBatchAsync` is known to misbehave on SM100.

Goal: correct and more efficient MoE inference on Blackwell without breaking existing architectures (H100 and below keep their current kernel selection).
Modification
Build / arch
- Add the `100a-real` CUDA arch (B200) when using CUDA ≥ 12.8.
- Split SM90 CUTLASS kernels into a separate target (`gemm2_sm90`) compiled only for `90/90a`, so SM100-only builds remain valid while H100 compatibility is preserved when SM90 objects are still linked.
- Define `ENABLE_CUBLAS_GROUPED` when targeting SM100 and CUDA ≥ 12.5; register `CublasGroupedKernel` in the GEMM registry for arch ≥ 1000.
- `cublas.cu`: add `CublasGroupedKernel` wrapping `cublasGemmGroupedBatchedEx` with the documented row-major ↔ col-major mapping for MoE (ragged M per expert).
- Reuse `workspace.tensormaps` for the device-side A/B/C pointer tables (no `cudaMallocAsync` per call), use stream ordering instead of extra barriers where safe, set `cublasSetWorkspace` from `workspace.partials`, and build the active groups in a single pass.
- `convert_v3.cu`: on SM100, skip tiled weight conversion for grouped BF16/FP16 so weights stay in the layout expected by grouped cuBLAS.
- `LlamaDenseWeight.cc`: on the SM100 grouped path, disable fused GatedSiLU so the activation runs outside the plain GEMM epilogue.
- `LlamaLinear.cu`: extend the MoE token gather to BF16/half when the unfused grouped path is required (aligned with the FP8 gather + scale dispatch behavior).
- `copy.cc`: on SM100+, avoid `cuMemcpyBatchAsync` (crash workaround); use sequential `cudaMemcpyAsync` via the existing `core::Copy`, with a cached compute-capability check so the device is not queried on every `Run()`.
- `moe_utils_v2.h`: add `#pragma once`.
- `arch.h`: add `Sm100` and compatibility wiring.
No intentional API or config break for Python users or existing TurboMind deployments.
- Builds link `gemm2_sm90` automatically when the CMake logic enables it for H100 compatibility; artifact size may increase slightly for fat binaries.
- Downstream forks that patch MoE or GEMM registration should rebase carefully; others need no code changes.
Use cases (Optional)
- MoE inference (e.g. Qwen3.5 MoE) on Blackwell via `lmdeploy` when built with CUDA 12.5+ and SM100 in `CMAKE_CUDA_ARCHITECTURES`.
- (Optional doc follow-up: mention Blackwell + grouped cuBLAS MoE in TurboMind build notes or the supported-hardware table if the project maintains one.)
Checklist
- Run `pre-commit run --all-files` (or project CI) before merge; fix any reported issues.
- Verify `cublasGemmGroupedBatchedEx` on SM100.