[feat] add CUTLASS kernel backend for HSTU attention #465
Merged: tiankongdeguiji merged 25 commits into alibaba:master from tiankongdeguiji:feat/cutlass-hstu-attn on Apr 8, 2026
Changes from 24 commits
e624365 [feat] add CUTLASS kernel backend for HSTU attention (tiankongdeguiji)
410210f [fix] handle CUTLASS kernel unsupported param combinations (tiankongdeguiji)
adb3c60 [feat] add hstu_attn wheel to extra requirements (tiankongdeguiji)
322aa91 [feat] add hstu_attn wheels for cp310/cp312 (tiankongdeguiji)
23dfe50 [docs] move CUTLASS usage guide from FAQ to dlrm_hstu.md (tiankongdeguiji)
6fb33eb [docs] simplify CUTLASS docs into kernel description (tiankongdeguiji)
ab97406 [docs] use repo.html style install for hstu_attn like dynamicemb (tiankongdeguiji)
06645a1 [docs] add cu126 option for hstu_attn install (tiankongdeguiji)
e114ab6 [feat] add logging warning when CUTLASS falls back to Triton (tiankongdeguiji)
6028d32 [feat] add logging warning in cutlass_cached_hstu_mha fallback (tiankongdeguiji)
d3d079d [feat] add CUTLASS provider to hstu_attention_bench.py (tiankongdeguiji)
785befc [fix] move CUTLASS fallback logic to dispatch layer, add export test (tiankongdeguiji)
07d27e8 [chore] bump version to 1.1.7 (tiankongdeguiji)
97e3e25 [fix] register cutlass_hstu_mha as torch.library custom_op for AOT ex… (tiankongdeguiji)
c437893 [fix] make CUTLASS dispatch FX-safe and handle fp32 inputs during export (tiankongdeguiji)
590bb4f [refactor] use AutocastWrapper to propagate mixed_precision through e… (tiankongdeguiji)
9956957 [fix] register cutlass_hstu_mha custom op at aot_utils import time (tiankongdeguiji)
9f39c32 [fix] avoid AOTI multi-thread predict deadlock in CUTLASS custom op (tiankongdeguiji)
a22f36b [refactor] drop sparse-side autocast wrap, add export_config.mixed_pr… (tiankongdeguiji)
f7d020b [cleanup] dedup mixed_precision handling, extract helpers, unify TRT/… (tiankongdeguiji)
27420b0 Merge remote-tracking branch 'origin/master' into feat/cutlass-hstu-attn (tiankongdeguiji)
c4898fe [chore] use tzrec.oss-accelerate URL for hstu_attn wheel (tiankongdeguiji)
f8bd059 [feat] CUTLASS kernel falls back to TRITON for non-attention sub-ops (tiankongdeguiji)
ef3e24e [refactor] CUTLASS fallback handled per-op at each op entry, drop sub… (tiankongdeguiji)
55b33e4 [cleanup] address review nits in cutlass_hstu_attention / hstu_attention (tiankongdeguiji)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,7 @@ | ||
| dynamicemb @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/dynamicemb/cu129/dynamicemb-0.0.1%2B20260407.97b80bf.cu129-cp310-cp310-linux_x86_64.whl ; python_version=="3.10" | ||
| dynamicemb @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/dynamicemb/cu129/dynamicemb-0.0.1%2B20260407.97b80bf.cu129-cp311-cp311-linux_x86_64.whl ; python_version=="3.11" | ||
| dynamicemb @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/dynamicemb/cu129/dynamicemb-0.0.1%2B20260407.97b80bf.cu129-cp312-cp312-linux_x86_64.whl ; python_version=="3.12" | ||
| hstu_attn @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/hstu/cu129/hstu_attn-0.1.0%2Bbea6b4b.cu12.9-cp310-cp310-linux_x86_64.whl ; python_version=="3.10" | ||
| hstu_attn @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/hstu/cu129/hstu_attn-0.1.0%2Bbea6b4b.cu12.9-cp311-cp311-linux_x86_64.whl ; python_version=="3.11" | ||
| hstu_attn @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/hstu/cu129/hstu_attn-0.1.0%2Bbea6b4b.cu12.9-cp312-cp312-linux_x86_64.whl ; python_version=="3.12" | ||
| torch_fx_tool @ https://tzrec.oss-accelerate.aliyuncs.com/third_party/rtp/torch_fx_tool-0.0.1%2B20251201.8c109c4-py3-none-any.whl |
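
Each wheel line in the hunk above carries a `python_version` environment marker, so an installer only picks the line that matches the running interpreter, while the unmarked `torch_fx_tool` line always applies. The sketch below illustrates that selection with a deliberately simplified matcher. It is not pip's marker evaluation: it only understands `python_version=="X.Y"`, and the function name `applicable` and the example URLs are hypothetical.

```python
import re
from typing import List

# Abbreviated requirement lines in the same "name @ url ; marker" shape as the
# diff above. The URLs are hypothetical placeholders, not the real wheel URLs.
REQUIREMENTS = [
    'hstu_attn @ https://example.invalid/hstu_attn-cp310.whl ; python_version=="3.10"',
    'hstu_attn @ https://example.invalid/hstu_attn-cp311.whl ; python_version=="3.11"',
    'torch_fx_tool @ https://example.invalid/torch_fx_tool-py3.whl',
]

# Simplified stand-in for marker parsing: only python_version=="X.Y" is handled.
_MARKER = re.compile(r'python_version\s*==\s*"([^"]+)"')


def applicable(lines: List[str], python_version: str) -> List[str]:
    """Return the requirement parts whose marker matches python_version."""
    selected = []
    for line in lines:
        requirement, _, marker = line.partition(";")
        match = _MARKER.search(marker)
        # A line without a marker applies to every interpreter version.
        if match is None or match.group(1) == python_version:
            selected.append(requirement.strip())
    return selected
```

Calling `applicable(REQUIREMENTS, "3.11")` keeps the cp311 `hstu_attn` line and the `torch_fx_tool` line. An interpreter version with no matching wheel line would get only `torch_fx_tool`.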
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -18,4 +18,4 @@ class Kernel(Enum): | ||
| TRITON = "TRITON" | ||
| PYTORCH = "PYTORCH" | ||
| CUDA = "CUDA" | ||
| CUTLASS = "CUTLASS" | ||
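
Several commit titles in this PR describe CUTLASS falling back to Triton with a logging warning when a parameter combination is unsupported. The sketch below shows what such a dispatch-level fallback might look like around the `Kernel` enum from the hunk above. Only the four enum members come from the diff. The helper names `cutlass_supported` and `resolve_kernel`, and the head-dimension condition, are hypothetical, since the real support checks are not shown on this page.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class Kernel(Enum):
    # Members as listed in the diff hunk above.
    TRITON = "TRITON"
    PYTORCH = "PYTORCH"
    CUDA = "CUDA"
    CUTLASS = "CUTLASS"


def cutlass_supported(head_dim: int) -> bool:
    """Hypothetical support check; the actual constraints are an assumption."""
    return head_dim in (32, 64, 128, 256)


def resolve_kernel(requested: Kernel, head_dim: int) -> Kernel:
    """Return the kernel to run, falling back from CUTLASS to TRITON if needed."""
    if requested is Kernel.CUTLASS and not cutlass_supported(head_dim):
        logger.warning(
            "CUTLASS does not support head_dim=%d; falling back to TRITON",
            head_dim,
        )
        return Kernel.TRITON
    # Any other requested kernel is passed through unchanged.
    return requested
```

Resolving the fallback in one place before the op runs means callers can request `Kernel.CUTLASS` unconditionally and still get a working kernel, with the warning recording when the substitution happened.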
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| # Copyright (c) 2025, Alibaba Group; | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. |