
Skip activation kernels when tensor size is zero#2848

Open
timmoon10 wants to merge 6 commits into NVIDIA:main from timmoon10:tmoon/activations-with-zero-size-tensor

Conversation

@timmoon10
Collaborator

Description

We have encountered some obscure CUDA errors when calling SwiGLU on a tensor with no entries. This PR skips the kernel launch when the tensor is empty. Since the kernel implementation is heavily templated, the fix also covers several quantization cases, although I haven't handled them exhaustively.
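The guard pattern can be sketched as a host-side launcher (hypothetical names; this is not the actual Transformer Engine API, just the shape of the fix):

```cpp
#include <cstddef>

// Hypothetical launcher illustrating the fix. The zero-size check runs
// before any grid/block math or kernel launch, so an empty tensor never
// reaches the CUDA runtime. Returns whether a kernel would be launched.
bool launch_activation_kernel(std::size_t rows, std::size_t cols) {
  // Skip the kernel launch entirely when the tensor has no entries.
  if (rows == 0 || cols == 0) {
    return false;  // nothing launched
  }
  // ... compute grid/block dimensions and launch the CUDA kernel here ...
  return true;  // kernel launched
}
```

Placing the check before the grid-dimension arithmetic matters: a zero-sized dimension would otherwise flow into DIVUP-style computations and produce an invalid launch configuration.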

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Skip activation kernels and quantization kernels when tensor size is zero.
  • Add empty-tensor test cases in activation unit tests.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 requested a review from Oleg-Goncharov April 8, 2026 03:47
@timmoon10 timmoon10 added the bug Something isn't working label Apr 8, 2026
@timmoon10
Collaborator Author

/te-ci

@greptile-apps
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes obscure CUDA errors when activation/quantization kernels are called with zero-element tensors by adding early-return guards before kernel launches across FP8, MXFP8, and NVFP4 cast paths. The fix is consistent and correct for the non-IS_DBIAS paths; for IS_DBIAS, empty tensors are explicitly rejected with an error, which is a reasonable design choice given the workspace-size-query ambiguity inherent to zero-sized allocations.

Confidence Score: 5/5

Safe to merge — the zero-size guards are correctly placed and the fix resolves the reported CUDA errors without breaking existing paths.

All remaining findings are P2. The single issue (dead NVTE_ERROR code in the IS_DBIAS zero-size guard) does not affect correctness or runtime behavior — the code path is unreachable and the function correctly errors via the preceding NVTE_CHECK. All non-IS_DBIAS paths are clean early returns, tests cover both {0,N} and {N,0} shapes, and the fix is consistent across FP8/MXFP8/NVFP4.

transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh and group_quantize_mxfp8.cuh contain the dead NVTE_ERROR, but it is harmless.
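The dead-code pattern can be sketched as follows (a hypothetical `dispatch` function with stand-in macros for NVTE_CHECK / NVTE_ERROR; the real macros and dispatch code differ):

```cpp
#include <cstddef>
#include <stdexcept>

// Illustrative stand-ins for NVTE_CHECK / NVTE_ERROR (assumed here to
// throw on failure; the actual Transformer Engine macros differ).
#define MY_CHECK(cond, msg) \
  do { if (!(cond)) throw std::runtime_error(msg); } while (0)
#define MY_ERROR(msg) throw std::runtime_error(msg)

// Sketch of the dispatch shape described above. Returns true if a
// kernel would be launched.
template <bool IS_DBIAS>
bool dispatch(std::size_t rows, std::size_t cols) {
  if constexpr (IS_DBIAS) {
    // Fires first on every empty-tensor path when IS_DBIAS is set...
    MY_CHECK(rows > 0 && cols > 0, "IS_DBIAS with empty tensor unsupported");
  }
  if (rows == 0 || cols == 0) {
    if constexpr (IS_DBIAS) {
      // ...which makes this branch unreachable: the check above already threw.
      MY_ERROR("unreachable");
    }
    return false;  // non-IS_DBIAS: clean early return, no kernel launch
  }
  // ... launch CUDA kernel ...
  return true;
}
```

Since the preceding check rejects every IS_DBIAS empty-tensor path, the inner error is unreachable; the function still errors correctly, which is why the finding is only P2.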

Vulnerabilities

No security concerns identified.

Important Files Changed

Filename Overview
transformer_engine/common/cast/fp8/gated_fp8.cuh Adds early return before kernel launch when rows==0 or cols==0, correctly placed before DIVUP and grid/block calculations.
transformer_engine/common/cast/fp8/quantize_fp8.cuh Adds N==0 early return before isFullTile validation check, correctly bypassing tile-alignment assertions that would trivially pass for N=0 anyway.
transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh Adds zero-size guard and explicit IS_DBIAS rejection for empty tensors; the NVTE_ERROR inside the IS_DBIAS branch of the zero-size guard is dead code since the preceding NVTE_CHECK always fires first.
transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh Same IS_DBIAS dead-code pattern as quantize_mxfp8.cuh — NVTE_ERROR at line 873 is unreachable because the preceding NVTE_CHECK fires first on every empty-tensor path.
transformer_engine/common/cast/mxfp8/gated_mxfp8.cuh Adds rows==0/cols==0 early-return guard before the kernel launch.
transformer_engine/common/cast/mxfp8/dequantize_mxfp8.cuh Adds rows==0/cols==0 early-return guard before the kernel launch.
transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh Adds N==0 early-return guard before the kernel launch.
transformer_engine/common/cast/nvfp4/group_quantize_transpose_nvfp4.cuh Adds zero-size guard before rows%32 alignment check and kernel launch; placement is correct.
transformer_engine/common/cast/nvfp4/quantize_nvfp4.cuh Adds rows==0/cols==0 early-return guard before the kernel launch.
transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh Adds zero-size guard before rows%32 alignment check; placement is correct and consistent with group_quantize_transpose.
transformer_engine/common/cast/nvfp4/specialized/quantize_transpose_nvfp4_tuned_1D.cuh Adds rows==0/cols==0 early-return guard before the kernel launch.
tests/cpp/operator/test_act.cu Adds {0,128} and {128,0} zero-dimension test cases and correctly gates the amax comparison on N*H>0 to avoid comparing against uninitialized reference values.
tests/cpp/operator/test_cast.cu Adds zero-dimension shapes to test suite and skips amax/scale_inv checks when full_size==0, preventing false failures on empty tensors.
tests/cpp/operator/test_cast_gated_swiglu.cu Adds {0,128}/{128,0} test shapes and gates the FP8 amax check on input_size>0.
tests/cpp/operator/test_cast_mxfp8_gated_swiglu.cu Adds {0,128}/{128,0} test matrix sizes for MXFP8 gated SwiGLU.
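The test-side gating in the rows above can be sketched as a small predicate (hypothetical helper; the actual tests inline this condition):

```cpp
#include <cstddef>

// Hypothetical helper mirroring the test-side gating described above.
// For an empty tensor there is no valid reference amax, so the
// comparison is skipped rather than matched against an uninitialized
// reference value.
bool should_compare_amax(bool is_fp8_output, std::size_t full_size) {
  return is_fp8_output && full_size > 0;
}
```

This keeps the empty-tensor cases in the test matrix while only checking quantities that are actually defined for them.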

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Kernel dispatch function called] --> B{IS_DBIAS?}
    B -- Yes --> C[Compute dbias_rows / dbias_cols]
    C --> D{dbias_rows > 0 AND dbias_cols > 0?}
    D -- No --> E[NVTE_CHECK error\nIS_DBIAS + empty tensor unsupported]
    D -- Yes --> F{workspace ptr == nullptr?}
    F -- Yes --> G[Set workspace shape and return\nworkspace-size query phase]
    F -- No --> H{rows == 0 OR cols == 0?}
    H -- Yes --> I[return early]
    H -- No --> J[Launch CUDA kernel]
    B -- No --> K{rows == 0 OR cols == 0?}
    K -- Yes --> I
    K -- No --> J
```

Reviews (3): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

```diff
 auto err = cudaGetLastError();
 ASSERT_EQ(err, cudaSuccess) << cudaGetErrorString(err);
-if (isFp8Type(otype)) {
+if (isFp8Type(otype) && full_size > 0) {
```
Collaborator

So this problem shows up not only for activations, but also for a regular cast?

Collaborator Author

Yep, many of our kernels are not robust to empty tensors. I still expect to see problems in the FP8 block-scale quantization kernels and transpose kernels.

Oleg-Goncharov
Oleg-Goncharov previously approved these changes Apr 8, 2026
Collaborator

@Oleg-Goncharov Oleg-Goncharov left a comment


LGTM

```cpp
// Skip kernel if tensor size is zero
if (elts_total == 0) {
  if constexpr (IS_DBIAS) {
    NVTE_ERROR("Invalid grouped tensor shape for DBias computation (first_logical_dim=",
```
Collaborator

In this case, couldn't we output dbias as a zero tensor instead of throwing an error?

@timmoon10
Collaborator Author

/te-ci

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

/te-ci


Labels

bug Something isn't working


3 participants