Skip activation kernels when tensor size is zero #2848
timmoon10 wants to merge 6 commits into NVIDIA:main
Conversation
Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci
Greptile Summary

This PR fixes obscure CUDA errors when activation/quantization kernels are called with zero-element tensors by adding early-return guards before kernel launches across the FP8, MXFP8, and NVFP4 cast paths. The fix is consistent and correct for the non-IS_DBIAS paths.

Confidence Score: 5/5

Safe to merge — the zero-size guards are correctly placed and the fix resolves the reported CUDA errors without breaking existing paths. All remaining findings are P2. The single issue (dead NVTE_ERROR code in the IS_DBIAS zero-size guard) does not affect correctness or runtime behavior — the code path is unreachable and the function correctly errors via the preceding NVTE_CHECK. All non-IS_DBIAS paths are clean early returns, tests cover both {0,N} and {N,0} shapes, and the fix is consistent across FP8/MXFP8/NVFP4. transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh and group_quantize_mxfp8.cuh contain the dead NVTE_ERROR, but it is harmless.
| Filename | Overview |
|---|---|
| transformer_engine/common/cast/fp8/gated_fp8.cuh | Adds early return before kernel launch when rows==0 or cols==0, correctly placed before DIVUP and grid/block calculations. |
| transformer_engine/common/cast/fp8/quantize_fp8.cuh | Adds N==0 early return before isFullTile validation check, correctly bypassing tile-alignment assertions that would trivially pass for N=0 anyway. |
| transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh | Adds zero-size guard and explicit IS_DBIAS rejection for empty tensors; the NVTE_ERROR inside the IS_DBIAS branch of the zero-size guard is dead code since the preceding NVTE_CHECK always fires first. |
| transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh | Same IS_DBIAS dead-code pattern as quantize_mxfp8.cuh — NVTE_ERROR at line 873 is unreachable because the preceding NVTE_CHECK fires first on every empty-tensor path. |
| transformer_engine/common/cast/mxfp8/gated_mxfp8.cuh | Adds rows==0 / cols==0 early return before kernel launch. |
| transformer_engine/common/cast/mxfp8/dequantize_mxfp8.cuh | Adds rows==0 / cols==0 early return before kernel launch. |
| transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh | Adds N==0 early return before kernel launch. |
| transformer_engine/common/cast/nvfp4/group_quantize_transpose_nvfp4.cuh | Adds zero-size guard before rows%32 alignment check and kernel launch; placement is correct. |
| transformer_engine/common/cast/nvfp4/quantize_nvfp4.cuh | Adds rows==0 / cols==0 early return before kernel launch. |
| transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh | Adds zero-size guard before rows%32 alignment check; placement is correct and consistent with group_quantize_transpose. |
| transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh | Adds rows==0 / cols==0 early return before kernel launch. |
| tests/cpp/operator/test_act.cu | Adds {0,128} and {128,0} zero-dimension test cases and correctly gates the amax comparison on N*H>0 to avoid comparing against uninitialized reference values. |
| tests/cpp/operator/test_cast.cu | Adds zero-dimension shapes to test suite and skips amax/scale_inv checks when full_size==0, preventing false failures on empty tensors. |
| tests/cpp/operator/test_cast_gated_swiglu.cu | Adds {0,128}/{128,0} test shapes and gates the FP8 amax check on input_size>0. |
| tests/cpp/operator/test_cast_mxfp8_gated_swiglu.cu | Adds {0,128}/{128,0} test matrix sizes for MXFP8 gated SwiGLU. |
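The guards listed in the table all follow the same shape: check for an empty tensor before any grid/block arithmetic. A minimal host-side C++ sketch of the pattern (the `divup` helper, `dispatch_quantize` name, and return value are simplified stand-ins, not the actual Transformer Engine API):

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the DIVUP macro used by the real kernels.
constexpr std::size_t divup(std::size_t n, std::size_t d) { return (n + d - 1) / d; }

// Sketch of the guard pattern: return before any grid math or kernel launch
// when either dimension is zero, so CUDA never sees an empty launch config.
bool dispatch_quantize(std::size_t rows, std::size_t cols) {
  if (rows == 0 || cols == 0) {
    return false;  // early return: nothing to do, no kernel launched
  }
  const std::size_t grid_x = divup(cols, 32);
  const std::size_t grid_y = divup(rows, 32);
  assert(grid_x > 0 && grid_y > 0);  // launch configuration is always valid here
  // <<<quantize_kernel, (grid_x, grid_y)>>> would be launched here in the real code
  return true;
}
```

Placing the guard before the `DIVUP` calls (as the table notes for gated_fp8.cuh) is what keeps the grid dimensions strictly positive on every path that reaches a launch.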
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Kernel dispatch function called] --> B{IS_DBIAS?}
    B -- Yes --> C[Compute dbias_rows / dbias_cols]
    C --> D{dbias_rows > 0 AND dbias_cols > 0?}
    D -- No --> E[NVTE_CHECK error\nIS_DBIAS + empty tensor unsupported]
    D -- Yes --> F{workspace ptr == nullptr?}
    F -- Yes --> G[Set workspace shape and return\nworkspace-size query phase]
    F -- No --> H{rows == 0 OR cols == 0?}
    H -- Yes --> I[return early]
    H -- No --> J[Launch CUDA kernel]
    B -- No --> K{rows == 0 OR cols == 0?}
    K -- Yes --> I
    K -- No --J
```
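The branching in the flowchart can be mirrored in a host-side C++ sketch. The `dispatch` function and `Action` enum are illustrative names, not actual Transformer Engine signatures, and the dbias_rows/dbias_cols check is flattened into a single rows/cols check for brevity:

```cpp
#include <cassert>
#include <cstddef>

// Host-side sketch of the dispatch flow in the flowchart; names are
// illustrative, not the actual Transformer Engine API.
enum class Action { ErrorEmptyDbias, WorkspaceQuery, EarlyReturn, LaunchKernel };

Action dispatch(bool is_dbias, std::size_t rows, std::size_t cols, void *workspace) {
  if (is_dbias) {
    // dbias reduces over rows, so an empty input has no defined result:
    // the real code fails via NVTE_CHECK here.
    if (rows == 0 || cols == 0) return Action::ErrorEmptyDbias;
    // Workspace-size query phase: set the workspace shape and return.
    if (workspace == nullptr) return Action::WorkspaceQuery;
  }
  // Zero-size guard: skip the CUDA kernel launch entirely for empty tensors.
  if (rows == 0 || cols == 0) return Action::EarlyReturn;
  return Action::LaunchKernel;
}
```

Note that on the IS_DBIAS path the empty-tensor error fires before the zero-size guard can ever be reached, which is exactly the dead-code finding in the summary above.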
Reviews (3): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
```diff
 auto err = cudaGetLastError();
 ASSERT_EQ(err, cudaSuccess) << cudaGetErrorString(err);
-if (isFp8Type(otype)) {
+if (isFp8Type(otype) && full_size > 0) {
```
So this problem shows up not only for activations, but also for a regular cast?
Yep, many of our kernels are not robust to empty tensors. I still expect to see problems in the FP8 block-scale quantization kernels and transpose kernels.
```cpp
// Skip kernel if tensor size is zero
if (elts_total == 0) {
  if constexpr (IS_DBIAS) {
    NVTE_ERROR("Invalid grouped tensor shape for DBias computation (first_logical_dim=",
```
In this case, couldn't we output dbias as a zero tensor instead of throwing an error?
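For reference, one way the suggested behavior could look, sketched with a hypothetical host-side helper (the real code would zero the device buffer, e.g. with cudaMemsetAsync, rather than use std::memset on host memory):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical sketch of the suggestion: when the input tensor is empty,
// fill the dbias output with zeros instead of raising NVTE_ERROR.
// A sum over zero rows is zero, so this is the mathematically natural result.
void handle_empty_dbias(std::vector<float> &dbias_out) {
  std::memset(dbias_out.data(), 0, dbias_out.size() * sizeof(float));
}
```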
/te-ci
Signed-off-by: Tim Moon <tmoon@nvidia.com>
for more information, see https://pre-commit.ci
/te-ci
Description
We have encountered some obscure CUDA errors when calling SwiGLU on a tensor with no entries. This PR handles this case by skipping the kernel launch when the tensor has no entries. Since the kernel implementation is heavily templated, this also fixes this bug for some quantization cases, although I haven't handled them exhaustively.
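For context on the failure mode: a zero-element tensor yields a zero grid dimension, and CUDA rejects a launch whose grid has any zero dimension (typically surfacing as cudaErrorInvalidConfiguration), which is the class of obscure errors described above. A host-side sketch of the arithmetic, with `divup` and `grid_dim` standing in for the DIVUP macro and grid computation in the kernels:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the DIVUP macro used when computing CUDA grid dimensions.
constexpr std::size_t divup(std::size_t n, std::size_t d) { return (n + d - 1) / d; }

// Grid dimension along one axis for a given tile size. When the tensor
// extent is 0, this is 0, an invalid CUDA launch configuration, which is
// why this PR skips the launch entirely for empty tensors.
constexpr std::size_t grid_dim(std::size_t extent, std::size_t tile) {
  return divup(extent, tile);
}
```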
Type of change
Changes
Checklist: