
[NPU Roadmap, Updated to 2026-Q2] NPU support for Liger-Kernel #969

@zheliuyu


2026 Q2 Roadmap

Thanks to the Liger-Kernel team for their strong support of our work in Q1 2026.

We will keep contributing to the Liger-Kernel + NPU community in Q2. This roadmap details the future plans for NPU native support. Welcome to join in the discussion.

Integration into training software

To promote Liger-Kernel, we need to integrate it into real production scenarios and verify its performance in actual training and inference tasks.

We selected four widely used training frameworks: VeOmni, LLaMA-Factory, verl, and ms-swift. We will continue to demonstrate the effectiveness of Liger-Kernel in these frameworks.

| Software | APIs | First PR |
| --- | --- | --- |
| VeOmni | LigerRMSNorm, liger_rotary_pos_emb, LigerSwiGLUMLP | ByteDance-Seed/VeOmni#415 |
| LLaMA-Factory | apply_liger_kernel_to_qwen3 | hiyouga/LlamaFactory#10386 |
| verl | - | in progress |
| ms-swift | - | in progress |
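Broadly, the `apply_liger_kernel_to_*` entry points that these frameworks call work by patching selected model layers in place, so the framework's own modeling code stays unchanged. A minimal, self-contained sketch of that pattern (every class and function name below is a hypothetical stand-in, not Liger-Kernel code):

```python
# Sketch of the monkey-patching pattern behind apply_liger_kernel_to_*:
# swap a reference layer class on an imported modeling module for a fused one.
# All names here are illustrative stand-ins.

class ReferenceRMSNorm:
    """Stand-in for a framework's eager RMSNorm layer."""
    def forward(self, x):
        return ("reference", x)

class FusedRMSNorm:
    """Stand-in for a fused Triton/NPU RMSNorm kernel wrapper."""
    def forward(self, x):
        return ("fused", x)

def apply_fused_kernels(model_module, *, rms_norm=True):
    """Replace selected layer classes on the given modeling module."""
    if rms_norm:
        model_module.RMSNorm = FusedRMSNorm

class FakeModeling:
    """Stand-in for e.g. a transformers modeling module."""
    RMSNorm = ReferenceRMSNorm

apply_fused_kernels(FakeModeling)
layer = FakeModeling.RMSNorm()
print(layer.forward(1.0)[0])  # the patched (fused) class is now in use
```

Because the patch happens at the module level, any model the framework builds afterwards picks up the fused layers automatically.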

Performance Enhancement: Part 2

Following this plan, some kernels have not yet met the performance target and will continue to be optimized in Q2.

Skills

Due to the differences in programming logic between triton-ascend and triton, migrating kernels to NPU inevitably incurs a time cost for performance optimization.

To address this issue, we have developed a series of skills for AI programming tools. For example: https://gitcode.com/Ascend/agent-skills/blob/master/skills/triton-operator-performance-optim/SKILL.md

Next, we will translate them into English and demonstrate the effectiveness of using these skills.

Support for more NPU machine types

Similarly, this includes three aspects: enabling functionality, ensuring accuracy within the error tolerance, and performance optimization.

Current support:

  • Atlas 900 A2 POD (64G), which can be used for benchmark tasks.
  • Atlas 800I A2 (32G), used by CI machines. This machine type is not used for benchmark work.

Planned support:

  • Atlas 800T A3 (64G)

Improvement of NPU CI work

Currently, the NPU CI status badge can only be displayed in the README.

We plan to find a way to display the CI status badge in NPU-related PRs. This would approximate native CI integration and help gate PR quality.

2026 Q1 Roadmap

Thanks very much to the Liger-Kernel team for accepting our first native support PR.
This roadmap details the future plans for NPU native support. Welcome to join in the discussion.

NPU Native Support

This shows how Liger-Kernel works on NPU.

Unit Test Coverage Improvement: Functionality & Precision

The accuracy of all kernels in Liger-Kernel must be within the acceptable tolerance. This task can be accomplished by checking the actual execution results of all test cases in the ./test folder.
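The "acceptable tolerance" check in such test cases typically follows the usual elementwise rtol/atol comparison (as in `torch.testing.assert_close`). A pure-Python sketch of that check, so it runs without torch or an NPU:

```python
# Hedged sketch: elementwise tolerance check in the style of
# torch.testing.assert_close, i.e. |actual - expected| <= atol + rtol * |expected|.
# The sample values are illustrative, not real kernel outputs.

def assert_close(actual, expected, rtol=1e-3, atol=1e-3):
    """Raise AssertionError if any element exceeds the combined tolerance."""
    for a, e in zip(actual, expected):
        if abs(a - e) > atol + rtol * abs(e):
            raise AssertionError(f"{a} vs {e} exceeds rtol={rtol}, atol={atol}")

reference = [1.0000, 2.0000, 3.0000]   # stand-in for the eager PyTorch output
npu_kernel = [1.0004, 1.9998, 3.0009]  # stand-in for the fused NPU kernel output
assert_close(npu_kernel, reference)     # passes: every element is in tolerance
print("within tolerance")
```

Tightening or loosening `rtol`/`atol` per dtype (e.g. looser for bf16 than fp32) is the usual way such suites keep the bar meaningful across precisions.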

We have done the following work to ensure that all test cases under ./test/transformers have passed.

List of each kernel's first PR:

Additionally, the test cases under ./test/transformers can serve as the foundation for NPU CI to guard future pull requests.

Due to certain policy restrictions, Liger-Kernel cannot be natively integrated into NPU CI. Therefore, we have marked the NPU CI status badge in the README.

We will look for a more convenient solution in the future.

Performance Enhancement: Part 1

Unit tests can verify API functionality and precision. However, as third-party devices may not fully align with Triton's usage patterns, a performance optimization process is required.

Regarding the evaluation criterion, we designed it as follows: the speedup ratio of each kernel relative to huggingface/torch should be greater than 1.
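The criterion above can be sketched as a simple timing harness: time the reference and the kernel, take a robust statistic, and require the ratio to exceed 1. The two timed callables here are trivial stand-ins, not real kernels:

```python
# Hedged sketch of the "speedup > 1" acceptance bar: median-of-N timing of a
# reference implementation vs. an optimized one. Both workloads are stand-ins.
import time

def median_time(fn, warmup=3, iters=20):
    """Median wall-clock time of fn over `iters` runs, after warmup."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2]

ref = lambda: sum(i * i for i in range(20_000))   # stands in for the torch baseline
fast = lambda: sum(i * i for i in range(5_000))   # stands in for the fused kernel

speedup = median_time(ref) / median_time(fast)
print(f"speedup = {speedup:.2f}")
assert speedup > 1.0  # the acceptance bar stated in the roadmap
```

On real hardware one would use device-side timing (e.g. device events and synchronization) rather than wall-clock time, but the pass/fail logic is the same.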

The progress of this task can be tracked in this RFC issue.

Benchmark Enhancement

To enable benchmarks to run reliably across devices with different memory capacities (e.g., 32G and 64G NPUs), we collaborated with the community to redesign the benchmark framework. The key improvements include:

  1. Device memory awareness: Introduced runtime memory probing (estimate_kernel_peak_memory) to automatically detect available device memory and determine safe execution parameters, ensuring benchmarks can run successfully on different device types without OOM.

  2. Standardized benchmark dimensions: Defined two orthogonal benchmark dimensions, D1 (non-model dimension sweep, e.g., sequence length) and D2 (model config sweep across real-world architectures from MODEL_REGISTRY), providing a more comprehensive and structured view of kernel performance.
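The memory-awareness idea in point 1 can be sketched as follows: estimate the peak memory of each sweep point and keep only the points that fit the probed device budget. `estimate_kernel_peak_memory` is the name used in the roadmap; its real signature is not shown there, so the one below (and the cost model inside it) is an assumption for illustration:

```python
# Hedged sketch of device-memory-aware benchmark sizing. The signature and
# cost model of estimate_kernel_peak_memory are assumptions, not the real API.

def estimate_kernel_peak_memory(batch, seq_len, hidden, dtype_bytes=2):
    # Hypothetical model: activations plus one copy retained for backward.
    return 2 * batch * seq_len * hidden * dtype_bytes

def safe_seq_lens(candidates, device_mem_bytes, batch=8, hidden=4096, headroom=0.9):
    """Keep only sweep points whose estimated peak fits within the device budget."""
    budget = device_mem_bytes * headroom  # leave headroom for allocator overhead
    return [s for s in candidates
            if estimate_kernel_peak_memory(batch, s, hidden) <= budget]

GiB = 1024 ** 3
sweep = [1024, 4096, 16384, 65536, 262144]
print(safe_seq_lens(sweep, 32 * GiB))  # 32G device: the longest point is dropped
print(safe_seq_lens(sweep, 64 * GiB))  # 64G device: all points survive
```

This is what lets the same benchmark suite run on both the 32G CI machines and the 64G benchmark machines without OOM.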

Phase 1 (Foundation) and Phase 2 (Model-config sweep) have been completed. Phase 3 (Rollout and visualization) is planned for future work.

Co-Author: @momochen
