Sync with Microsoft ONNX Runtime - 01072026 #1175
Merged
### Description

Turbo quant implementation for ORT WebGPU. It uses a Hadamard matrix for rotation instead of a regular matrix, which deviates from the paper.

| seq_length | turbo_quant | prefill_tps | generation_tps | working_set_gb | gpu_memory_gb | % saving (gpu_memory_gb) |
|-----------|-------------|-------------|----------------|----------------|---------------|----------------|
| 1024 | ❌ Off | 2007.21 | 108.48 | 1.84 | 4.14 | |
| | ✅ On | 2053.70 | 113.06 | 1.64 | 3.37 | **18.6%** |
| 2048 | ❌ Off | 1778.32 | 111.26 | 1.95 | 5.49 | |
| | ✅ On | 1763.69 | 111.83 | 1.90 | 3.75 | **31.7%** |
| 4096 | ❌ Off | 1373.88 | 29.78 | 1.61 | 7.21 | |
| | ✅ On | 1367.89 | 29.51 | 2.30 | 4.96 | **31.2%** |
| 8192 | ❌ Off | 948.90 | 84.17 | 2.44 | 10.14 | |
| | ✅ On | 943.96 | 82.36 | 2.31 | 6.50 | **35.9%** |

- The Hadamard transform is kept as its own class and standalone shader, [hadamard_transform.h](https://github.com/microsoft/onnxruntime/pull/28059/changes#diff-eb3f846a5a284367f10b2059cc662e19d1b029b416c0c8373e20add9dbfe8afa), used to rotate/unrotate Q. It can be used by other features in the future, such as activation quantization.
- TurboQuantHadamard applies the Hadamard transform and then quantizes using the centroid lookup for q4.
- Dequantization is fused into the various flash attention kernels.

Callers for LLMs, such as gen-ai, have to set `kvCacheQuantizationBits:4` in the EP provider options and pass in past and present KV cache input and output tensors that have a reduced head size. With TurboQuant, each value shrinks from, say, 16 bits to 4 bits, and in addition a 32-bit scale is stored at the front of each token per head.

----------------------------------------------------------------

Note on impact on quality.
Evaluating 4-bit KV quantization with Phi4 mini

| Metric | kv0 (no quant) | kv4 (4-bit) | Δ (kv4 − kv0) |
|---|---:|---:|---:|
| **Mean quality score (0–5)** | **3.64** | **3.36** | **−0.28** |
| Head-to-head wins | **71** | 37 | n/a |
| Ties | n/a | n/a | 92 |
| Broken/failed (score ≤ 1) | 13 | **17** | +4 |
| Hard-broken (score = 0) | 2 | 2 | 0 |

**Verdict:** Under graded rubric scoring, **4-bit KV quantization shows a small but consistent quality penalty** (≈ −0.28 on a 5-point scale). kv0 wins roughly 2× as many head-to-head matchups as kv4 (71 vs 37), though nearly half of all prompts (92/200) tie. The degradation is **mild and uneven**, not catastrophic: it concentrates in specific use cases rather than degrading everything.

## Quality-score distribution

| Score | kv0 | kv4 |
|---:|---:|---:|
| 5 (perfect) | **62** | 42 |
| 4 (good) | 56 | 55 |
| 3 (bearable) | 44 | 54 |
| 2 (significant issues) | 25 | 32 |
| 1 (serious issues) | 11 | 15 |
| 0 (broken) | 2 | 2 |

The main shift is at the **top end**: kv0 earns 62 perfect scores vs kv4's 42 (−20). Those lost 5s mostly slide down to 3s (+10) and 2s (+7). kv4 doesn't produce dramatically more total failures; it produces fewer *flawless* answers.

4-bit KV quant is acceptable for latency/memory-sensitive deployments where a ~0.3-point average quality dip is tolerable, **except** for tag-generation and content-detection workloads, where kv0 (no quant) is meaningfully better. If those use cases matter, keep KV quant off or pair it with `repetition_penalty > 1.0` to suppress the tag-loop failures that dominate kv4's losses.
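For readers who want a concrete picture of the rotate-then-quantize step from the description at the top of this change, here is a minimal CPU sketch in Python. It is not the WebGPU shader: the 16-entry codebook (`CENTROIDS`, evenly spaced), the choice of the per-vector maximum as the scale, and all function names are assumptions made for illustration, since the description does not give the actual centroid values or memory layout.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in out]


# Hypothetical q4 codebook: 16 evenly spaced centroids in [-1, 1].
CENTROIDS = [-1.0 + 2.0 * i / 15.0 for i in range(16)]


def quantize_head(vec):
    """Rotate one token's head vector, then map each value to a 4-bit centroid index."""
    rotated = fwht(vec)
    scale = max(abs(v) for v in rotated) or 1.0  # assumed scale choice
    codes = []
    for v in rotated:
        target = v / scale
        codes.append(min(range(16), key=lambda c: abs(target - CENTROIDS[c])))
    return scale, codes


def dequantize_head(scale, codes):
    """Look up centroids, rescale, and unrotate (the orthonormal transform is its own inverse)."""
    return fwht([CENTROIDS[c] * scale for c in codes])


def packed_bytes_per_token_head(head_size):
    """4 bytes for the 32-bit scale plus 4 bits per value (assumed packing)."""
    return 4 + head_size // 2
```

Under these assumptions, a hypothetical head size of 128 would take `packed_bytes_per_token_head(128) == 68` bytes per token per head, compared with 256 bytes at 16 bits per value.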
Building with VS 2026 and CUDA 13.3 on Windows fails because a warning is treated as an error:

```
CCCL cub/config.cuh has a #pragma warning(pop) without matching push.
```

This fixes it.
…soft#29196)

### Description

- Disconnect input edges that are not used by the fused SimplifiedLayerNormalization node before calling FinalizeNodeFusion.
- Remove newly dead input producers only after their final consumer has been fused.
- Add a full optimization-loop regression test for CPU EP fallback with a shared Cast-produced Pow exponent.
- Validate that the Pow exponent is statically known to be scalar/one-element 2.0 before applying SimplifiedLayerNorm fusion.
- Support epsilon on either input of Add, since Add is commutative.

### Motivation and Context

A mixed-precision Cast can turn the Pow exponent from an initializer-only input into a graph edge. SimplifiedLayerNormFusion previously passed that edge to FinalizeNodeFusion, which attempted to move it to the replacement node. Since SimplifiedLayerNormalization does not have a corresponding exponent input, graph initialization failed in GetIndexFromName.

The fix removes only inputs that are not part of the replacement node and preserves shared producers until they have no remaining consumers.

Additional validation was added based on review feedback to make the fusion semantically safer: the Pow exponent must be proven to be 2.0, and epsilon is now identified by graph connectivity instead of assuming a fixed Add input index.

Fixes microsoft#29153.
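The two validation points above (exponent must be exactly 2.0, epsilon may sit on either Add input) can be illustrated numerically. The following is a minimal Python sketch, not the optimizer code: the function names are hypothetical, and the unfused chain is assumed to be the usual Pow, ReduceMean, Add, Sqrt, Div, Mul sequence.

```python
import math


def pattern_unfused(x, scale, exponent, epsilon, epsilon_first=False):
    """Evaluate the unfused chain: Pow -> ReduceMean -> Add -> Sqrt -> Div -> Mul."""
    powed = [v ** exponent for v in x]
    mean = sum(powed) / len(powed)
    # Add is commutative, so epsilon may be either operand.
    shifted = (epsilon + mean) if epsilon_first else (mean + epsilon)
    root = math.sqrt(shifted)
    return [v / root * s for v, s in zip(x, scale)]


def simplified_layer_norm(x, scale, epsilon):
    """What the fused node computes: x / sqrt(mean(x^2) + epsilon) * scale."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + epsilon)
    return [v * inv_rms * s for v, s in zip(x, scale)]


def exponent_allows_fusion(exponent_values):
    """exponent_values is a list of floats when statically known, otherwise None."""
    return (
        exponent_values is not None
        and len(exponent_values) == 1
        and exponent_values[0] == 2.0
    )
```

With an exponent of 2.0 the two functions agree regardless of which side epsilon is on; with any other exponent they diverge, which is why the fusion is only safe when the exponent is provably 2.0.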
This pull request strengthens security around loading ONNX models by adding a defense-in-depth check that rejects Constant nodes with dense tensor attributes referencing internal ORT in-memory address markers. It also introduces a regression test to ensure this attack vector is blocked.

**Security hardening:**

* Added an explicit check in `ConstantNodeProtoToTensorProto` to reject Constant node tensor attributes with ORT in-memory address markers, preventing crafted models from propagating unsafe pointers.

**Testing:**

* Added a regression test `RejectInMemoryMarkerOnConstantNodeTensorAttribute` to verify that models containing Constant nodes with such in-memory markers are rejected during model load.
## Fix TreeEnsemble target id validation

### Problem

`TreeEnsemble` opset 5 normalizes `leaf_targetids` into the internal v3-shaped attributes without going through the v3 attribute constructor. Invalid target ids could therefore reach the N-output aggregators, where Min/Max indexed the `predictions` vector without a bounds check.

### Fix

- Centralize target/class id validation in `TreeEnsembleCommon::Init()` so all normalized entry paths are covered.
- Reject non-positive target/class counts, negative target/class ids, and ids greater than or equal to the target/class count.
- Add defense-in-depth target index checks in Sum/Min/Max N-output aggregators before indexing `predictions`.
- Add v5 negative tests and update affected v3 regressor negative test expectations.

### Testing

- `lintrunner` on changed files
- rebuilt `onnxruntime_provider_test`
- targeted provider tests:
  - `MLOpTest.TreeEnsembleFloat`
  - `MLOpTest.TreeEnsembleDouble`
  - `MLOpTest.TreeEnsembleSetMembership`
  - `MLOpTest.TreeEnsembleLeafOnly`
  - `MLOpTest.TreeEnsembleMinLeafTargetIdsOutsideBoundary`
  - `MLOpTest.TreeEnsembleMaxLeafTargetIdsOutsideBoundary`
  - `MLOpTest.TreeEnsembleNegativeLeafTargetIds`
  - `MLOpTest.TreeEnsembleZeroTargets`
  - `MLOpTest.TreeEnsembleLeafLike`
  - `MLOpTest.TreeEnsembleBigSet`
  - `MLOpTest.TreeEnsembleIssue25400`
  - `MLOpTest.TreeRegressorNegativeTargetIds`
  - `MLOpTest.TreeRegressorOutsideBoundaryTargetIds`
  - `MLOpTest.TreeEnsembleRegressorTargetIdsOutsideBoundary`

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
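The validation rules listed under Fix can be sketched as follows. This is a Python illustration rather than the C++ in `TreeEnsembleCommon::Init()`; the function names and the use of `None` for "no value yet" are hypothetical.

```python
def check_target_ids(target_ids, n_targets):
    """Return an error message for the first violated rule, or None if all ids are valid."""
    if n_targets <= 0:
        return "target/class count must be positive"
    for position, target_id in enumerate(target_ids):
        if target_id < 0:
            return f"negative target id {target_id} at position {position}"
        if target_id >= n_targets:
            return f"target id {target_id} at position {position} is >= {n_targets}"
    return None


def aggregate_min(predictions, target_id, leaf_value):
    """Min aggregation with a defense-in-depth bounds check before indexing.

    Returns True if the prediction slot was updated, False if the id was rejected.
    """
    if not 0 <= target_id < len(predictions):
        return False
    current = predictions[target_id]
    predictions[target_id] = leaf_value if current is None else min(current, leaf_value)
    return True
```

The second check looks redundant once initialization validates every id, but it keeps the aggregator safe if a future entry path skips the centralized validation. In Python an unchecked negative index would silently wrap around; in C++ it would read or write out of bounds.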
Prepare for cuda plugin ep 0.10 version release
…utputs (microsoft#29398)

### Description

CUDA graph tests were missing `synchronize_inputs()`/`synchronize_outputs()` calls around inference runs, creating a stream race between the default CUDA stream (host↔device copies) and the EP's compute stream.

**`test_inference.cc`** (`CApiTest.basic_cuda_graph`, `RunWithCudaGraphAnnotation`):

- Add `binding.SynchronizeInputs()` before each run
- Add `binding.SynchronizeOutputs()` after each run, before reading results via `cudaMemcpy`

**`onnxruntime_test_python_cudagraph.py`** (`run_model_with_cuda_graph`, `test_arena_with_cuda_graph`):

- Add `io_binding.synchronize_inputs()` / `io_binding.synchronize_outputs()` around every `run_with_iobinding()` call, matching the pattern already used in `run_model_with_cuda_graph_annotation`

```python
# Pattern now consistent across all CUDA graph test helpers:
io_binding.synchronize_inputs()   # ensure H2D copies visible to EP stream
session.run_with_iobinding(io_binding, ro)
io_binding.synchronize_outputs()  # ensure EP computation done before D2H read
np.testing.assert_allclose(y_ortvalue.numpy(), expected_y, ...)
```

### Motivation and Context

Without stream synchronization, the CUDA EP (running on a non-default stream) can race against `cudaMemcpy` calls issued on the default stream for input uploads and output reads. This is the same class of bug fixed in `CApiTest.basic_cuda_graph`; these tests had the identical omission.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description

Removes the deprecated `enable_skip_layer_norm_strict_mode` CUDA provider option and its associated "strict mode" code path from the CUDA `SkipLayerNorm` kernel.

Strict mode previously routed `SkipLayerNorm` through the `LayerNormalization` kernel to gain fp32-accumulation accuracy at the cost of performance. After microsoft#28682 made `SkipLayerNorm`/`EmbedLayerNorm` CUDA kernels always accumulate in fp32, the strict-mode path is redundant: the default kernel already provides the same accuracy with better performance. This change deletes the now-dead branch and the plumbing that fed it.

### Key Changes

- **`skip_layer_norm.cc`**: Remove the `strict_` branch that called `HostApplyLayerNorm`; always launch `LaunchSkipLayerNormKernel`. Drop the now-unused `layer_norm_impl.h` include and the input/skip same-shape strict-mode shape check.
- **`skip_layer_norm.h`**: Remove the `strict_` member.
- **`cuda_execution_provider.h`**: Remove `IsSkipLayerNormInStrictMode()`.
- **`cuda_kernel_adapter.h`**: Remove `GetCudaKernelAdapterSkipLayerNormStrictMode()` shim.
- **`cuda_provider_options.h`**: Keep `enable_skip_layer_norm_strict_mode` field for ABI/back-compat but mark it deprecated and ignored.
- **`skiplayernorm_op_test.cc`**: Drop the redundant strict-mode test passes; tests now run a single default path.

The provider option is retained (ignored) to preserve backward compatibility: existing configs that set it continue to work without error.

### Motivation

Follow-up cleanup to microsoft#28682, which switched the CUDA kernels to fp32 accumulation, making strict mode obsolete.

### Testing

- `skiplayernorm_op_test.cc` covers fp16/fp32/bf16 default path; strict-mode passes removed.
### Description

Add a CUDA QMoE Split-K2 two-pass FC1 interleaved-SwiGLU GEMV implementation for supported fp16 INT4 decode-shaped workloads. The first pass computes two K-split partials into QMoE workspace using the selected GEMV accumulator type, and the second pass reduces the partials in fp32, applies optional bias, and writes the SwiGLU output for FC2. FC2 stays on the existing `moe_gemv_kernel` path.

The route now uses two binary environment controls. `ORT_MOE_GEMV_FP32_ACCUM=1` enables fp32 accumulation, and `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` enables Split-K2. Both default to `0`.

| Accumulation control | Split-K2 control | Route |
|---|---|---|
| unset or `0` | unset or `0` | fp16 accumulation, single-kernel FC1 SwiGLU |
| unset or `0` | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` | fp16 accumulation, Split-K2 FC1 SwiGLU |
| `ORT_MOE_GEMV_FP32_ACCUM=1` | unset or `0` | fp32 accumulation, single-kernel FC1 SwiGLU |
| `ORT_MOE_GEMV_FP32_ACCUM=1` | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` | fp32 accumulation, Split-K2 FC1 SwiGLU |

This PR also:

- keeps Split-K2 narrowly gated to fp16 INT4 interleaved-SwiGLU GEMV with activation/bias scale type matching the activation type;
- adds QMoE workspace plumbing for the Split-K2 partials;
- updates the focused QMoE profiler and Nsight wrapper with matching `--fp32-accum` and `--splitk2-swiglu` controls;
- adds focused benchmark coverage that explicitly forces the Split-K2 route under the fp16-default policy;
- documents the routing policy, measurements, binary knobs, and future autotune direction.

### Motivation and Context

GPT-OSS-20B single-token decode spends visible time in the QMoE FC1 interleaved-SwiGLU GEMV path. Split-K2 improves FC1 parallelism by splitting the K dimension and reducing the partials in a lightweight second pass.

Under the fp32-accumulation route, Split-K2 reduced FC1 kernel work from about `21.42 us` to `17.59 + 2.39 = 19.98 us` in Nsight, and repeated CUDA-graph GPT-OSS decode pairs showed about `+0.9%` to `+1.6%` throughput improvement. A later 3-pair CUDA-graph run averaged `332.099536 tok/s` for Split-K2 versus `327.857928 tok/s` with Split-K2 disabled (`+1.29%` throughput, `-1.28%` latency), with no MMLU smoke regression signal.

After the normal fp16 QMoE GEMV path changed to fp16 accumulation by default, the single-kernel fp16 route became faster on the focused GPT-OSS, Qwen3.6-35B-A3B, and Gemma4-26B-A4B helper configurations. The fp16 Split-K2 variant is still kept because it is faster than the fp32 Split-K2 route in those focused runs and may be selected by future per-shape autotuning.

### Validation

- Built and synced CUDA provider:
  - `cmake --build /home/tianlei/onnxruntime/build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)`
- Lint/format:
  - `lintrunner -a ...`
  - `git diff --check`
- Focused CUDA test:
  - `ORT_QMOE_GEMV_BENCHMARK=1 pytest -q onnxruntime/test/python/transformers/test_qmoe_cuda.py::TestQMoEGemvBenchmark::test_splitk2_swiglu_decode_latency`
  - result: `1 passed`
- Focused GPT-OSS helper route checks:
  - default fp16: valid output, `ORT_MOE_GEMV_SPLITK2_SWIGLU=0`, latency `0.062995 ms`
  - fp16 accumulation with Split-K2: valid output, `ORT_MOE_GEMV_SPLITK2_SWIGLU=1`, latency `0.063945 ms`
  - fp32 accumulation without Split-K2: valid output, `ORT_MOE_GEMV_SPLITK2_SWIGLU=0`, latency `0.071311 ms`
  - fp32 accumulation with Split-K2: valid output, `ORT_MOE_GEMV_SPLITK2_SWIGLU=1`, latency `0.071726 ms`
- Nsight route verification:
  - default fp16 dispatched `moe_gemv_interleaved_swiglu_kernel` and `moe_gemv_kernel`
  - fp16 accumulation with Split-K2 dispatched `moe_gemv_splitk_partials_kernel`, `moe_gemv_splitk_reduce_swiglu_kernel`, and `moe_gemv_kernel`
  - fp32 accumulation without Split-K2 dispatched the single FC1 SwiGLU kernel and FC2
  - fp32 accumulation with Split-K2 enabled dispatched Split-K2 partial/reduce kernels and FC2
- Additional focused helper checks, all with valid output:
  - Qwen3.6-35B-A3B: fp16 Split-K2 `0.049207 ms`, fp16 single-kernel `0.047403 ms`, fp32 Split-K2 `0.052055 ms`
  - Gemma4-26B-A4B: fp16 Split-K2 `0.053503 ms`, fp16 single-kernel `0.050732 ms`, fp32 Split-K2 `0.059571 ms`
- 1000-sample `match_mmlu` smoke on GPT-OSS-20B INT4 QMoE:
  - Split-K2 route: `0.8380`
  - Split-K2 disabled: `0.8350`
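The two-pass structure and the routing table from the description can be sketched on the CPU as below. This is a plain-float Python illustration, not the CUDA kernels: INT4 dequantization and accumulator precision are omitted, the SwiGLU form (SiLU of the gate times the linear value) and the gate/linear interleaving order are assumptions, and every function name is hypothetical. Only the two environment variable names come from the description.

```python
import math


def gemv_rows(weights, x, k_begin, k_end):
    """Dot product of each weight row with x over the K range [k_begin, k_end)."""
    return [sum(row[k] * x[k] for k in range(k_begin, k_end)) for row in weights]


def swiglu_interleaved(fc1, bias=None):
    """Assumes rows alternate (gate, linear); applies optional bias, then SiLU(gate) * linear."""
    out = []
    for i in range(0, len(fc1), 2):
        gate = fc1[i] + (bias[i] if bias else 0.0)
        linear = fc1[i + 1] + (bias[i + 1] if bias else 0.0)
        out.append(gate / (1.0 + math.exp(-gate)) * linear)
    return out


def fc1_single_kernel(weights, x, bias=None):
    """Single pass over the whole K dimension."""
    return swiglu_interleaved(gemv_rows(weights, x, 0, len(x)), bias)


def fc1_splitk2(weights, x, bias=None):
    """Pass 1: two K-split partials. Pass 2: reduce, add bias, apply SwiGLU."""
    mid = len(x) // 2
    partial0 = gemv_rows(weights, x, 0, mid)
    partial1 = gemv_rows(weights, x, mid, len(x))
    reduced = [a + b for a, b in zip(partial0, partial1)]
    return swiglu_interleaved(reduced, bias)


def select_route(env):
    """Mirror the routing table: both controls default to off unless set to '1'."""
    accum = "fp32" if env.get("ORT_MOE_GEMV_FP32_ACCUM") == "1" else "fp16"
    fc1 = "Split-K2" if env.get("ORT_MOE_GEMV_SPLITK2_SWIGLU") == "1" else "single-kernel"
    return accum, fc1
```

Both FC1 variants produce the same values up to floating-point rounding; the split version simply exposes two independent halves of the K loop, which is where the extra parallelism on the GPU would come from.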
…osoft#29253)

### Description

`MaxpoolWithMask::Compute` now validates that the pooling kernel rank equals the input spatial rank (and that the rank is one of the supported values {1, 2, 3}) before allocating the output, returning a clear error for malformed inputs instead of proceeding with a mismatched configuration.

### Changes

- Add an explicit check in `MaxpoolWithMask::Compute` that the kernel rank matches the input spatial rank and is within the supported {1, 2, 3} range, with a descriptive error message.
- DRY up the rank check.
- Add `MaxPoolWithMask_KernelRankMismatch` and `MaxPoolWithMask_KernelRankTooLarge` unit tests covering the new validation.

### Motivation

Improves input validation and error diagnostics for malformed `MaxpoolWithMask` inputs. CPU-only; no behavior change for valid inputs.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
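The rank check described above amounts to a few lines. The Python sketch below is an illustration only: it assumes the input shape is laid out as batch, channels, then spatial dimensions, and the function name and error wording are hypothetical rather than taken from `MaxpoolWithMask::Compute`.

```python
SUPPORTED_SPATIAL_RANKS = (1, 2, 3)


def check_kernel_rank(input_shape, kernel_shape):
    """Return an error message if the kernel rank is unusable, or None if it is valid."""
    # Assumed layout: (N, C, spatial...), so the spatial rank is the total rank minus 2.
    spatial_rank = len(input_shape) - 2
    kernel_rank = len(kernel_shape)
    if kernel_rank != spatial_rank:
        return (
            f"kernel rank {kernel_rank} does not match "
            f"input spatial rank {spatial_rank}"
        )
    if kernel_rank not in SUPPORTED_SPATIAL_RANKS:
        return f"kernel rank {kernel_rank} is not one of {SUPPORTED_SPATIAL_RANKS}"
    return None
```

Running this check before any output allocation means a malformed model fails with a readable message instead of reaching pooling code that was selected for a different dimensionality.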
This pull request refactors how ONNX initializers are handled in the CoreML provider code, centralizing and standardizing the creation of initializer objects. The main change is to consistently use the `ModelBuilder::CreateInitializer` method instead of directly constructing `Initializer` objects. Additionally, the signature of the `CreateCoreMLWeight` function is updated to require the `ModelBuilder` instance. Some helper functions are also modified to pass along the model path where necessary. These changes improve maintainability, facilitate future enhancements, and ensure consistent handling of model paths and initializers.

### Refactoring Initializer Creation

* Replaced direct construction of `Initializer` objects with calls to `model_builder.CreateInitializer` throughout multiple operator builder implementations (e.g., Conv, Gemm, Pad, PRelu, GatherND, Reduction, Reshape, Split). This ensures consistent handling of initializers and centralizes any future logic changes.
* Updated `CreateCoreMLWeight` to require the `ModelBuilder` parameter and updated all call sites accordingly. This enables the function to use the centralized initializer creation logic.

### Model Path Handling

* Updated helper functions and operator builder logic (notably in the Pad and Reshape operators) to pass the model path when creating `Initializer` objects, ensuring correct handling of external data and model-relative paths.

### Helper Function Updates

* Modified utility and internal functions (such as `GetTensorDataTransposed` and `GetPaddingAxesData`) to accept the `ModelBuilder` or model path as parameters, propagating the new initializer creation pattern throughout the codebase.

These changes collectively improve code consistency, reliability, and future extensibility for handling ONNX initializers in the CoreML backend.

Co-authored-by: lucka-me <lucka-me@users.noreply.github.com>

---------

Co-authored-by: lucka-me <i@lucka.moe>
### Description
Fix the VSINPU EP build error.
`${CMAKE_CURRENT_BINARY_DIR}` is needed so that `#include "onnxruntime_config.h"` is found.
Signed-off-by: Kee <xuke537@hotmail.com>
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.