
[NPU Roadmap, Updated to 2026-Q2] NPU support for Liger-Kernel #969

@zheliuyu


2026 Q2 Roadmap

Thanks to the Liger-Kernel team for their strong support of our work in Q1 2026.

We will keep contributing to the Liger-Kernel + NPU community in Q2. This roadmap details the future plans for NPU native support. Welcome to join in the discussion.

Integration into training software

To promote Liger-Kernel, we need to integrate it into real production scenarios and verify its performance in actual training and inference tasks.

We selected four widely used training frameworks: VeOmni, LLaMA-Factory, verl, and ms-swift. We will continue to demonstrate the effectiveness of Liger-Kernel in these frameworks.

| Software | APIs | First PR |
| --- | --- | --- |
| VeOmni | LigerRMSNorm, liger_rotary_pos_emb, LigerSwiGLUMLP | ByteDance-Seed/VeOmni#415 |
| LLaMA-Factory | apply_liger_kernel_to_qwen3 | hiyouga/LlamaFactory#10386 |
| verl | - | in progress |
| ms-swift | - | in progress |
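Broadly, the `apply_liger_kernel_to_*` entry points that these frameworks call work by patching selected model layers in place, so the framework's own modeling code stays unchanged. A minimal, self-contained sketch of that pattern (every class and function name below is a hypothetical stand-in, not Liger-Kernel code):

```python
# Sketch of the monkey-patching pattern behind apply_liger_kernel_to_*:
# swap a reference layer class on an imported modeling module for a fused one.
# All names here are illustrative stand-ins.

class ReferenceRMSNorm:
    """Stand-in for a framework's eager RMSNorm layer."""
    def forward(self, x):
        return ("reference", x)

class FusedRMSNorm:
    """Stand-in for a fused Triton/NPU RMSNorm kernel wrapper."""
    def forward(self, x):
        return ("fused", x)

def apply_fused_kernels(model_module, *, rms_norm=True):
    """Replace selected layer classes on the given modeling module."""
    if rms_norm:
        model_module.RMSNorm = FusedRMSNorm

class FakeModeling:
    """Stand-in for e.g. a transformers modeling module."""
    RMSNorm = ReferenceRMSNorm

apply_fused_kernels(FakeModeling)
layer = FakeModeling.RMSNorm()
print(layer.forward(1.0)[0])  # the patched (fused) class is now in use
```

Because the patch happens at the module level, any model the framework builds afterwards picks up the fused layers automatically.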

Performance Enhancement: Part 2

Following this plan, some kernels have not yet met the performance target and will continue to be optimized in Q2.

Skills

Due to the differences in programming logic between triton-ascend and triton, migrating kernels to NPU inevitably incurs a time cost for performance optimization.

To address this issue, we have developed a series of skills for AI programming tools. For example: https://gitcode.com/Ascend/agent-skills/blob/master/skills/triton-operator-performance-optim/SKILL.md

Next, we will translate them into English and demonstrate the effectiveness of using these skills.

Support for more NPU machine types

Similarly, this includes three aspects: enabling functionality, ensuring accuracy within the error tolerance, and performance optimization.

Current support:

  • Atlas 900 A2 POD (64G), which can be used for benchmark tasks.
  • Atlas 800I A2 (32G), used by CI machines. This machine type is not used for benchmark work.

Planned support:

  • Atlas 800T A3 (64G)

Improvement of NPU CI work

Currently, the NPU CI status badge can only be displayed in the README.

We plan to find a way to display the CI status badge in NPU-related PRs. This would approximate native CI integration and help gate PR quality.

2026 Q1 Roadmap

Thanks very much to the Liger-Kernel team for accepting our first native support PR.
This roadmap details the future plans for NPU native support. Welcome to join in the discussion.

NPU Native Support

This shows how Liger-Kernel works on NPU.

Unit Test Coverage Improvement: Functionality & Precision

The accuracy of all kernels in Liger-Kernel must be within the acceptable tolerance. This task can be accomplished by checking the actual execution results of all test cases in the ./test folder.
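The "acceptable tolerance" check in such test cases typically follows the usual elementwise rtol/atol comparison (as in `torch.testing.assert_close`). A pure-Python sketch of that check, so it runs without torch or an NPU:

```python
# Hedged sketch: elementwise tolerance check in the style of
# torch.testing.assert_close, i.e. |actual - expected| <= atol + rtol * |expected|.
# The sample values are illustrative, not real kernel outputs.

def assert_close(actual, expected, rtol=1e-3, atol=1e-3):
    """Raise AssertionError if any element exceeds the combined tolerance."""
    for a, e in zip(actual, expected):
        if abs(a - e) > atol + rtol * abs(e):
            raise AssertionError(f"{a} vs {e} exceeds rtol={rtol}, atol={atol}")

reference = [1.0000, 2.0000, 3.0000]   # stand-in for the eager PyTorch output
npu_kernel = [1.0004, 1.9998, 3.0009]  # stand-in for the fused NPU kernel output
assert_close(npu_kernel, reference)     # passes: every element is in tolerance
print("within tolerance")
```

Tightening or loosening `rtol`/`atol` per dtype (e.g. looser for bf16 than fp32) is the usual way such suites keep the bar meaningful across precisions.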

We have done the following work to ensure that all test cases under ./test/transformers have passed.

List of each kernel's first PR:

Additionally, the test cases under ./test/transformers can serve as the foundation for NPU CI to guard future pull requests.

Due to certain policy restrictions, Liger-Kernel cannot be natively integrated into NPU CI. Therefore, we have marked the NPU CI status badge in the README.

We will look for a more convenient solution in the future.

Performance Enhancement: Part 1

Unit tests can verify API functionality and precision. However, as third-party devices may not fully align with Triton's usage patterns, a performance optimization process is required.

Regarding the evaluation criterion, we designed it as follows: the speedup ratio of each kernel relative to huggingface/torch should be greater than 1.
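The criterion above can be sketched as a simple timing harness: time the reference and the kernel, take a robust statistic, and require the ratio to exceed 1. The two timed callables here are trivial stand-ins, not real kernels:

```python
# Hedged sketch of the "speedup > 1" acceptance bar: median-of-N timing of a
# reference implementation vs. an optimized one. Both workloads are stand-ins.
import time

def median_time(fn, warmup=3, iters=20):
    """Median wall-clock time of fn over `iters` runs, after warmup."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2]

ref = lambda: sum(i * i for i in range(20_000))   # stands in for the torch baseline
fast = lambda: sum(i * i for i in range(5_000))   # stands in for the fused kernel

speedup = median_time(ref) / median_time(fast)
print(f"speedup = {speedup:.2f}")
assert speedup > 1.0  # the acceptance bar stated in the roadmap
```

On real hardware one would use device-side timing (e.g. device events and synchronization) rather than wall-clock time, but the pass/fail logic is the same.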

The progress of this task can be tracked in this RFC issue.

Benchmark Enhancement

To enable benchmarks to run reliably across devices with different memory capacities (e.g., 32G and 64G NPUs), we collaborated with the community to redesign the benchmark framework. The key improvements include:

  1. Device memory awareness: Introduced runtime memory probing (estimate_kernel_peak_memory) to automatically detect available device memory and determine safe execution parameters, ensuring benchmarks can run successfully on different device types without OOM.

  2. Standardized benchmark dimensions: Defined two orthogonal benchmark dimensions, D1 (non-model dimension sweep, e.g., sequence length) and D2 (model config sweep across real-world architectures from MODEL_REGISTRY), providing a more comprehensive and structured view of kernel performance.
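The memory-awareness idea in point 1 can be sketched as follows: estimate the peak memory of each sweep point and keep only the points that fit the probed device budget. `estimate_kernel_peak_memory` is the name used in the roadmap; its real signature is not shown there, so the one below (and the cost model inside it) is an assumption for illustration:

```python
# Hedged sketch of device-memory-aware benchmark sizing. The signature and
# cost model of estimate_kernel_peak_memory are assumptions, not the real API.

def estimate_kernel_peak_memory(batch, seq_len, hidden, dtype_bytes=2):
    # Hypothetical model: activations plus one copy retained for backward.
    return 2 * batch * seq_len * hidden * dtype_bytes

def safe_seq_lens(candidates, device_mem_bytes, batch=8, hidden=4096, headroom=0.9):
    """Keep only sweep points whose estimated peak fits within the device budget."""
    budget = device_mem_bytes * headroom  # leave headroom for allocator overhead
    return [s for s in candidates
            if estimate_kernel_peak_memory(batch, s, hidden) <= budget]

GiB = 1024 ** 3
sweep = [1024, 4096, 16384, 65536, 262144]
print(safe_seq_lens(sweep, 32 * GiB))  # 32G device: the longest point is dropped
print(safe_seq_lens(sweep, 64 * GiB))  # 64G device: all points survive
```

This is what lets the same benchmark suite run on both the 32G CI machines and the 64G benchmark machines without OOM.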

Phase 1 (Foundation) and Phase 2 (Model-config sweep) have been completed. Phase 3 (Rollout and visualization) is planned for future work.

Co-Author: @momochen
