
Skip activation kernels when tensor size is zero#2848

Open
timmoon10 wants to merge 6 commits into NVIDIA:main from timmoon10:tmoon/activations-with-zero-size-tensor

Conversation

@timmoon10
Collaborator

Description

We have encountered some obscure CUDA errors when calling SwiGLU on a tensor with no entries. This PR skips the kernel launch when the tensor is empty. Since the kernel implementation is heavily templated, the fix also covers several quantization cases, although I haven't handled them exhaustively.
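The guard pattern can be sketched as a host-side launcher (hypothetical names; this is not the actual Transformer Engine API, just the shape of the fix):

```cpp
#include <cstddef>

// Hypothetical launcher illustrating the fix. The zero-size check runs
// before any grid/block math or kernel launch, so an empty tensor never
// reaches the CUDA runtime. Returns whether a kernel would be launched.
bool launch_activation_kernel(std::size_t rows, std::size_t cols) {
  // Skip the kernel launch entirely when the tensor has no entries.
  if (rows == 0 || cols == 0) {
    return false;  // nothing launched
  }
  // ... compute grid/block dimensions and launch the CUDA kernel here ...
  return true;  // kernel launched
}
```

Placing the check before the grid-dimension arithmetic matters: a zero-sized dimension would otherwise flow into DIVUP-style computations and produce an invalid launch configuration.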

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Skip activation kernels and quantization kernels when tensor size is zero.
  • Add empty-tensor test cases in activation unit tests.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 requested a review from Oleg-Goncharov April 8, 2026 03:47
@timmoon10 timmoon10 added the bug Something isn't working label Apr 8, 2026
@timmoon10
Collaborator Author

/te-ci

@greptile-apps
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes obscure CUDA errors when activation/quantization kernels are called with zero-element tensors by adding early-return guards before kernel launches across FP8, MXFP8, and NVFP4 cast paths. The fix is consistent and correct for the non-IS_DBIAS paths; for IS_DBIAS, empty tensors are explicitly rejected with an error, which is a reasonable design choice given the workspace-size-query ambiguity inherent to zero-sized allocations.

Confidence Score: 5/5

Safe to merge — the zero-size guards are correctly placed and the fix resolves the reported CUDA errors without breaking existing paths.

All remaining findings are P2. The single issue (dead NVTE_ERROR code in the IS_DBIAS zero-size guard) does not affect correctness or runtime behavior — the code path is unreachable and the function correctly errors via the preceding NVTE_CHECK. All non-IS_DBIAS paths are clean early returns, tests cover both {0,N} and {N,0} shapes, and the fix is consistent across FP8/MXFP8/NVFP4.

transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh and group_quantize_mxfp8.cuh contain the dead NVTE_ERROR, but it is harmless.
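The dead-code pattern can be sketched as follows (a hypothetical `dispatch` function with stand-in macros for NVTE_CHECK / NVTE_ERROR; the real macros and dispatch code differ):

```cpp
#include <cstddef>
#include <stdexcept>

// Illustrative stand-ins for NVTE_CHECK / NVTE_ERROR (assumed here to
// throw on failure; the actual Transformer Engine macros differ).
#define MY_CHECK(cond, msg) \
  do { if (!(cond)) throw std::runtime_error(msg); } while (0)
#define MY_ERROR(msg) throw std::runtime_error(msg)

// Sketch of the dispatch shape described above. Returns true if a
// kernel would be launched.
template <bool IS_DBIAS>
bool dispatch(std::size_t rows, std::size_t cols) {
  if constexpr (IS_DBIAS) {
    // Fires first on every empty-tensor path when IS_DBIAS is set...
    MY_CHECK(rows > 0 && cols > 0, "IS_DBIAS with empty tensor unsupported");
  }
  if (rows == 0 || cols == 0) {
    if constexpr (IS_DBIAS) {
      // ...which makes this branch unreachable: the check above already threw.
      MY_ERROR("unreachable");
    }
    return false;  // non-IS_DBIAS: clean early return, no kernel launch
  }
  // ... launch CUDA kernel ...
  return true;
}
```

Since the preceding check rejects every IS_DBIAS empty-tensor path, the inner error is unreachable; the function still errors correctly, which is why the finding is only P2.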

Vulnerabilities

No security concerns identified.

Important Files Changed

Filename Overview
transformer_engine/common/cast/fp8/gated_fp8.cuh Adds early return before kernel launch when rows==0 or cols==0, correctly placed before DIVUP and grid/block calculations.
transformer_engine/common/cast/fp8/quantize_fp8.cuh Adds N==0 early return before isFullTile validation check, correctly bypassing tile-alignment assertions that would trivially pass for N=0 anyway.
transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh Adds zero-size guard and explicit IS_DBIAS rejection for empty tensors; the NVTE_ERROR inside the IS_DBIAS branch of the zero-size guard is dead code since the preceding NVTE_CHECK always fires first.
transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh Same IS_DBIAS dead-code pattern as quantize_mxfp8.cuh — NVTE_ERROR at line 873 is unreachable because the preceding NVTE_CHECK fires first on every empty-tensor path.
transformer_engine/common/cast/mxfp8/gated_mxfp8.cuh Adds rows==0/cols==0 early-return guard before the kernel launch.
transformer_engine/common/cast/mxfp8/dequantize_mxfp8.cuh Adds rows==0/cols==0 early-return guard before the kernel launch.
transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh Adds N==0 early-return guard before the kernel launch.
transformer_engine/common/cast/nvfp4/group_quantize_transpose_nvfp4.cuh Adds zero-size guard before rows%32 alignment check and kernel launch; placement is correct.
transformer_engine/common/cast/nvfp4/quantize_nvfp4.cuh Adds rows==0/cols==0 early-return guard before the kernel launch.
transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh Adds zero-size guard before rows%32 alignment check; placement is correct and consistent with group_quantize_transpose.
transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh Adds rows==0/cols==0 early-return guard before the kernel launch.
tests/cpp/operator/test_act.cu Adds {0,128} and {128,0} zero-dimension test cases and correctly gates the amax comparison on N*H>0 to avoid comparing against uninitialized reference values.
tests/cpp/operator/test_cast.cu Adds zero-dimension shapes to test suite and skips amax/scale_inv checks when full_size==0, preventing false failures on empty tensors.
tests/cpp/operator/test_cast_gated_swiglu.cu Adds {0,128}/{128,0} test shapes and gates the FP8 amax check on input_size>0.
tests/cpp/operator/test_cast_mxfp8_gated_swiglu.cu Adds {0,128}/{128,0} test matrix sizes for MXFP8 gated SwiGLU.
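The test-side gating in the rows above can be sketched as a small predicate (hypothetical helper; the actual tests inline this condition):

```cpp
#include <cstddef>

// Hypothetical helper mirroring the test-side gating described above.
// For an empty tensor there is no valid reference amax, so the
// comparison is skipped rather than matched against an uninitialized
// reference value.
bool should_compare_amax(bool is_fp8_output, std::size_t full_size) {
  return is_fp8_output && full_size > 0;
}
```

This keeps the empty-tensor cases in the test matrix while only checking quantities that are actually defined for them.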

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Kernel dispatch function called] --> B{IS_DBIAS?}
    B -- Yes --> C[Compute dbias_rows / dbias_cols]
    C --> D{dbias_rows > 0 AND dbias_cols > 0?}
    D -- No --> E[NVTE_CHECK error\nIS_DBIAS + empty tensor unsupported]
    D -- Yes --> F{workspace ptr == nullptr?}
    F -- Yes --> G[Set workspace shape and return\nworkspace-size query phase]
    F -- No --> H{rows == 0 OR cols == 0?}
    H -- Yes --> I[return early]
    H -- No --> J[Launch CUDA kernel]
    B -- No --> K{rows == 0 OR cols == 0?}
    K -- Yes --> I
    K -- No --> J
```

Reviews (3): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

```diff
 auto err = cudaGetLastError();
 ASSERT_EQ(err, cudaSuccess) << cudaGetErrorString(err);
-if (isFp8Type(otype)) {
+if (isFp8Type(otype) && full_size > 0) {
```
Collaborator

So this problem shows up not only for activations, but also for a regular cast?

Collaborator Author

Yep, many of our kernels are not robust to empty tensors. I still expect to see problems in the FP8 block-scale quantization kernels and transpose kernels.

Oleg-Goncharov
Oleg-Goncharov previously approved these changes Apr 8, 2026
Collaborator

@Oleg-Goncharov Oleg-Goncharov left a comment


LGTM

```cpp
// Skip kernel if tensor size is zero
if (elts_total == 0) {
  if constexpr (IS_DBIAS) {
    NVTE_ERROR("Invalid grouped tensor shape for DBias computation (first_logical_dim=",
```
Collaborator

In this case, couldn't we output dbias as a zero tensor instead of throwing an error?

@timmoon10
Collaborator Author

/te-ci

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

/te-ci


Labels

bug Something isn't working


3 participants