
Sync with Microsoft ONNX Runtime - 01072026 #1175

Merged
hdharpure9922 merged 13 commits into ovep-develop from sync_msft_01072026 on Jul 1, 2026


@ai-fw-intg


Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

sushraja-msft and others added 13 commits June 29, 2026 08:33
### Description
TurboQuant implementation for ORT WebGPU. It uses a Hadamard matrix for
the rotation instead of a regular matrix, which deviates from the paper.

| seq_length | turbo_quant | prefill_tps | generation_tps | working_set_gb | gpu_memory_gb | % saving |
|-----------|-------------|-------------|----------------|----------------|---------------|----------------|
| 1024 | ❌ Off| 2007.21 | 108.48 | 1.84 | 4.14 | |
| | ✅ On| 2053.70 | 113.06 | 1.64 | 3.37 | **18.6%** |
| 2048 | ❌ Off | 1778.32 | 111.26 | 1.95 | 5.49 | |
| | ✅ On | 1763.69 | 111.83 | 1.90 | 3.75 | **31.7%** |
| 4096 |❌ Off | 1373.88 | 29.78 | 1.61 | 7.21 | |
| |✅ On | 1367.89 | 29.51 | 2.30 | 4.96 | **31.2%** |
| 8192 | ❌ Off | 948.90 | 84.17 | 2.44 | 10.14 | |
| | ✅ On | 943.96 | 82.36 | 2.31 | 6.50 | **35.9%** |
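
The `% saving` column appears to be the relative drop in `gpu_memory_gb` when TurboQuant is on. A minimal sketch that recomputes it from the table values, assuming that definition (the PR text does not state the formula):

```python
# gpu_memory_gb values copied from the table above: (off, on) per sequence length.
gpu_memory_gb = {
    1024: (4.14, 3.37),
    2048: (5.49, 3.75),
    4096: (7.21, 4.96),
    8192: (10.14, 6.50),
}


def percent_saving(off_gb, on_gb):
    """Relative GPU memory reduction, in percent (assumed definition)."""
    return 100.0 * (off_gb - on_gb) / off_gb


savings = {seq: percent_saving(off, on) for seq, (off, on) in gpu_memory_gb.items()}
for seq, pct in savings.items():
    print(f"seq_length={seq}: {pct:.1f}% saving")
```

Under this assumption the recomputed values agree with the reported column to one decimal place.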

- The Hadamard transform is kept as its own class and standalone shader,
[hadamard_transform.h](https://github.com/microsoft/onnxruntime/pull/28059/changes#diff-eb3f846a5a284367f10b2059cc662e19d1b029b416c0c8373e20add9dbfe8afa),
used to rotate/unrotate Q. It can be used by other features in the
future, such as activation quantization.

- TurboQuantHadamard applies the Hadamard transform and then quantizes
using the centroid lookup for q4.
- Dequantization is fused into the various flash attention kernels.
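
To make the rotate-then-quantize idea concrete, here is a CPU-side sketch in plain Python. It is not the WebGPU shader: the 16-entry `CENTROIDS` codebook is a hypothetical uniform grid (the real centroids are not given in this description), and the function names are illustrative only. It relies on the fact that the orthonormal Sylvester-ordered Hadamard transform is its own inverse.

```python
import math

# Hypothetical 16-entry codebook on [-1, 1]; the real q4 centroids are not in the PR text.
CENTROIDS = [-1.0 + 2.0 * i / 15.0 for i in range(16)]


def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize_q4(x):
    """Rotate, normalize by the max magnitude, and pick the nearest centroid index."""
    rotated = hadamard_transform(x)
    scale = max(abs(v) for v in rotated)
    normalized = [v / scale if scale else 0.0 for v in rotated]
    codes = [min(range(16), key=lambda i: abs(CENTROIDS[i] - v)) for v in normalized]
    return scale, codes


def dequantize_q4(scale, codes):
    """Look up centroids, rescale, and unrotate (the transform is self-inverse)."""
    rotated = [CENTROIDS[c] * scale for c in codes]
    return hadamard_transform(rotated)
```

With a uniform codebook, each rotated value is off by at most `scale / 15`, and because the rotation is orthonormal the same L2 error bound carries over to the unrotated vector.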

Callers for LLMs, such as gen-ai, have to set `kvCacheQuantizationBits:4`
in the EP provider options and pass in past and present KV cache
input/output tensors that have a reduced head size.

With TurboQuant, storage per value drops from, say, 16 bits to 4 bits,
which reduces the head size; in addition, a 32-bit scale is stored at
the front of each token per head.
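
As a worked example of that arithmetic, the sketch below computes bytes per token per head. The head size of 128 is an assumed illustrative value, and the helper name is hypothetical; only the 16-bit, 4-bit, and 32-bit figures come from the description above.

```python
def kv_bytes_per_token_per_head(head_size, bits_per_value, scale_bits=0):
    """Bytes needed to store one token's K (or V) vector for one head."""
    total_bits = head_size * bits_per_value + scale_bits
    if total_bits % 8:
        raise ValueError("total size is not a whole number of bytes")
    return total_bits // 8


head_size = 128  # assumed example value, not taken from the PR
fp16_bytes = kv_bytes_per_token_per_head(head_size, 16)
q4_bytes = kv_bytes_per_token_per_head(head_size, 4, scale_bits=32)
ratio = q4_bytes / fp16_bytes
print(fp16_bytes, q4_bytes, ratio)
```

For this assumed head size the quantized entry takes 68 bytes instead of 256, a little over a quarter of the original; the fixed 32-bit scale is why the ratio is slightly above 4/16.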

----------------------------------------------------------------

Note on quality impact: evaluating 4-bit KV quantization with Phi4
mini.

| Metric | kv0 (no quant) | kv4 (4-bit) | Δ (kv4 − kv0) |
|---|---:|---:|---:|
| **Mean quality score (0–5)** | **3.64** | **3.36** | **−0.28** |
| Head-to-head wins | **71** | 37 | — |
| Ties | — | — | 92 |
| Broken/failed (score ≤ 1) | 13 | **17** | +4 |
| Hard-broken (score = 0) | 2 | 2 | 0 |

**Verdict:** Under graded rubric scoring, **4-bit KV quantization shows
a small but consistent quality penalty** (≈ −0.28 on a 5-point scale).
kv0 wins roughly 2× as many head-to-head matchups as kv4 (71 vs 37),
though nearly half of all prompts (92/200) tie. The degradation is
**mild and uneven**, not catastrophic — it concentrates in specific use
cases rather than degrading everything.

## Quality-score distribution

| Score | kv0 | kv4 |
|---:|---:|---:|
| 5 (perfect) | **62** | 42 |
| 4 (good) | 56 | 55 |
| 3 (bearable) | 44 | 54 |
| 2 (significant issues) | 25 | 32 |
| 1 (serious issues) | 11 | 15 |
| 0 (broken) | 2 | 2 |

The main shift is at the **top end**: kv0 earns 62 perfect scores vs
kv4's 42 (−20). Those lost 5s mostly slide down to 3s (+10) and 2s (+7).
kv4 doesn't produce dramatically more total failures — it produces fewer
*flawless* answers.
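
The summary metrics can be cross-checked against the distribution table. The sketch below recomputes the mean score and the failed count (score ≤ 1) from the per-score counts; the helper names are illustrative.

```python
# Counts per score (0-5) copied from the distribution table above.
kv0 = {5: 62, 4: 56, 3: 44, 2: 25, 1: 11, 0: 2}
kv4 = {5: 42, 4: 55, 3: 54, 2: 32, 1: 15, 0: 2}


def mean_score(dist):
    """Count-weighted average score."""
    total = sum(dist.values())
    return sum(score * count for score, count in dist.items()) / total


def failed(dist):
    """Number of prompts scoring 1 or below."""
    return sum(count for score, count in dist.items() if score <= 1)


print(f"kv0 mean={mean_score(kv0):.3f}, failed={failed(kv0)}")
print(f"kv4 mean={mean_score(kv4):.3f}, failed={failed(kv4)}")
```

Both distributions sum to 200 prompts, and the recomputed means and failure counts are consistent with the rounded figures in the summary table.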

4-bit KV quant is acceptable for latency/memory-sensitive deployments
where a ~0.3-point average quality dip is tolerable, **except** for
tag-generation and content-detection workloads, where kv0 (no quant) is
meaningfully better. If those use cases matter, keep KV quant off or
pair it with `repetition_penalty > 1.0` to suppress the tag-loop
failures that dominate kv4's losses.
There is a warning-as-error build failure when using VS 2026 and CUDA 13.3 on Windows:
```
CCCL cub/config.cuh has a #pragma warning(pop) without matching push.
```
This fixes it.
…soft#29196)

### Description
- Disconnect input edges that are not used by the fused
SimplifiedLayerNormalization node before calling FinalizeNodeFusion.
- Remove newly dead input producers only after their final consumer has
been fused.
- Add a full optimization-loop regression test for CPU EP fallback with
a shared Cast-produced Pow exponent.
- Validate that the Pow exponent is statically known to be
scalar/one-element 2.0 before applying SimplifiedLayerNorm fusion.
- Support epsilon on either input of Add, since Add is commutative.

### Motivation and Context
A mixed-precision Cast can turn the Pow exponent from an
initializer-only input into a graph edge. SimplifiedLayerNormFusion
previously passed that edge to FinalizeNodeFusion, which attempted to
move it to the replacement node. Since SimplifiedLayerNormalization does
not have a corresponding exponent input, graph initialization failed in
GetIndexFromName.

The fix removes only inputs that are not part of the replacement node
and preserves shared producers until they have no remaining consumers.

Additional validation was added based on review feedback to make the
fusion semantically safer: the Pow exponent must be proven to be 2.0,
and epsilon is now identified by graph connectivity instead of assuming
a fixed Add input index.
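
The two added checks can be illustrated with a small Python sketch. This is not the ORT C++ code: the function names and the dict-of-constants representation are hypothetical, chosen only to show the "exponent must be a static one-element 2.0" test and the order-independent epsilon lookup.

```python
def is_static_scalar_two(shape, values):
    """True only if the tensor is statically known, has one element, and equals 2.0."""
    if values is None:  # exponent is not a compile-time constant
        return False
    count = 1
    for dim in shape:
        if not isinstance(dim, int) or dim < 0:  # symbolic or unknown dimension
            return False
        count *= dim
    return count == 1 and len(values) == 1 and float(values[0]) == 2.0


def split_add_inputs(add_inputs, constants):
    """Return (epsilon, data_input_name) whichever Add input holds the constant."""
    a, b = add_inputs
    if a in constants and b not in constants:
        return constants[a], b
    if b in constants and a not in constants:
        return constants[b], a
    return None  # ambiguous: do not fuse
```

Returning `None` in the ambiguous case mirrors the conservative intent described above: when the pattern cannot be proven, the fusion is skipped rather than guessed.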

Fixes microsoft#29153.
This pull request strengthens security around loading ONNX models by
adding a defense-in-depth check that rejects Constant nodes with dense
tensor attributes referencing internal ORT in-memory address markers. It
also introduces a regression test to ensure this attack vector is
blocked.

**Security hardening:**

* Added an explicit check in `ConstantNodeProtoToTensorProto` to reject
Constant node tensor attributes with ORT in-memory address markers,
preventing crafted models from propagating unsafe pointers.

**Testing:**

* Added a regression test
`RejectInMemoryMarkerOnConstantNodeTensorAttribute` to verify that
models containing Constant nodes with such in-memory markers are
rejected during model load.
## Fix TreeEnsemble target id validation

### Problem
`TreeEnsemble` opset 5 normalizes `leaf_targetids` into the internal
v3-shaped attributes without going through the v3 attribute constructor.
Invalid target ids could therefore reach the N-output aggregators, where
Min/Max indexed the `predictions` vector without a bounds check.

### Fix
- Centralize target/class id validation in `TreeEnsembleCommon::Init()`
so all normalized entry paths are covered.
- Reject non-positive target/class counts, negative target/class ids,
and ids greater than or equal to the target/class count.
- Add defense-in-depth target index checks in Sum/Min/Max N-output
aggregators before indexing `predictions`.
- Add v5 negative tests and update affected v3 regressor negative test
expectations.
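
A minimal Python sketch of the two layers of checking described in the fix: up-front validation of counts and ids, plus a bounds check inside an aggregator before indexing `predictions`. The function names and the `(target_id, weight)` leaf representation are hypothetical and do not mirror the C++ types.

```python
def validate_target_ids(target_ids, n_targets):
    """Reject non-positive counts, negative ids, and ids >= n_targets."""
    if n_targets <= 0:
        raise ValueError(f"target count must be positive, got {n_targets}")
    for index, target_id in enumerate(target_ids):
        if target_id < 0 or target_id >= n_targets:
            raise ValueError(
                f"target id {target_id} at position {index} is outside [0, {n_targets})"
            )


def max_aggregate(n_targets, leaves):
    """Per-target maximum over (target_id, weight) pairs; None if a target is untouched."""
    predictions = [None] * n_targets
    for target_id, weight in leaves:
        # Defense in depth: re-check the index even if validation already ran.
        if not 0 <= target_id < n_targets:
            raise IndexError(f"target id {target_id} is outside [0, {n_targets})")
        if predictions[target_id] is None or weight > predictions[target_id]:
            predictions[target_id] = weight
    return predictions
```

The explicit range check matters in Python too: without it, a negative id would silently index from the end of the list instead of failing.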

### Testing
- `lintrunner` on changed files
- rebuilt `onnxruntime_provider_test`
- targeted provider tests:
  - `MLOpTest.TreeEnsembleFloat`
  - `MLOpTest.TreeEnsembleDouble`
  - `MLOpTest.TreeEnsembleSetMembership`
  - `MLOpTest.TreeEnsembleLeafOnly`
  - `MLOpTest.TreeEnsembleMinLeafTargetIdsOutsideBoundary`
  - `MLOpTest.TreeEnsembleMaxLeafTargetIdsOutsideBoundary`
  - `MLOpTest.TreeEnsembleNegativeLeafTargetIds`
  - `MLOpTest.TreeEnsembleZeroTargets`
  - `MLOpTest.TreeEnsembleLeafLike`
  - `MLOpTest.TreeEnsembleBigSet`
  - `MLOpTest.TreeEnsembleIssue25400`
  - `MLOpTest.TreeRegressorNegativeTargetIds`
  - `MLOpTest.TreeRegressorOutsideBoundaryTargetIds`
  - `MLOpTest.TreeEnsembleRegressorTargetIdsOutsideBoundary`

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Prepare for cuda plugin ep 0.10 version release
…utputs (microsoft#29398)

### Description

CUDA graph tests were missing
`synchronize_inputs()`/`synchronize_outputs()` calls around inference
runs, creating a stream race between the default CUDA stream
(host↔device copies) and the EP's compute stream.

**`test_inference.cc`** (`CApiTest.basic_cuda_graph`,
`RunWithCudaGraphAnnotation`):
- Add `binding.SynchronizeInputs()` before each run
- Add `binding.SynchronizeOutputs()` after each run, before reading
results via `cudaMemcpy`

**`onnxruntime_test_python_cudagraph.py`** (`run_model_with_cuda_graph`,
`test_arena_with_cuda_graph`):
- Add `io_binding.synchronize_inputs()` /
`io_binding.synchronize_outputs()` around every `run_with_iobinding()`
call, matching the pattern already used in
`run_model_with_cuda_graph_annotation`

```python
# Pattern now consistent across all CUDA graph test helpers:
io_binding.synchronize_inputs()   # ensure H2D copies visible to EP stream
session.run_with_iobinding(io_binding, ro)
io_binding.synchronize_outputs()  # ensure EP computation done before D2H read
np.testing.assert_allclose(y_ortvalue.numpy(), expected_y, ...)
```

### Motivation and Context

Without stream synchronization, the CUDA EP (running on a non-default
stream) can race against `cudaMemcpy` calls issued on the default stream
for input uploads and output reads. This is the same class of bug fixed
in `CApiTest.basic_cuda_graph`; these tests had the identical omission.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description

Removes the deprecated `enable_skip_layer_norm_strict_mode` CUDA
provider option and its associated "strict mode" code path from the CUDA
`SkipLayerNorm` kernel.

Strict mode previously routed `SkipLayerNorm` through the
`LayerNormalization` kernel to gain fp32-accumulation accuracy at the
cost of performance. After microsoft#28682 made `SkipLayerNorm`/`EmbedLayerNorm`
CUDA kernels always accumulate in fp32, the strict-mode path is
redundant: the default kernel already provides the same accuracy with
better performance. This change deletes the now-dead branch and the
plumbing that fed it.

### Key Changes

- **`skip_layer_norm.cc`**: Remove the `strict_` branch that called
`HostApplyLayerNorm`; always launch `LaunchSkipLayerNormKernel`. Drop
the now-unused `layer_norm_impl.h` include and the input/skip same-shape
strict-mode shape check.
- **`skip_layer_norm.h`**: Remove the `strict_` member.
- **`cuda_execution_provider.h`**: Remove
`IsSkipLayerNormInStrictMode()`.
- **`cuda_kernel_adapter.h`**: Remove
`GetCudaKernelAdapterSkipLayerNormStrictMode()` shim.
- **`cuda_provider_options.h`**: Keep
`enable_skip_layer_norm_strict_mode` field for ABI/back-compat but mark
it deprecated and ignored.
- **`skiplayernorm_op_test.cc`**: Drop the redundant strict-mode test
passes; tests now run a single default path.

The provider option is retained (ignored) to preserve backward
compatibility — existing configs that set it continue to work without
error.

### Motivation

Follow-up cleanup to microsoft#28682, which switched the CUDA kernels to fp32
accumulation, making strict mode obsolete.

### Testing

- `skiplayernorm_op_test.cc` covers fp16/fp32/bf16 default path;
strict-mode passes removed.
### Description

Add a CUDA QMoE Split-K2 two-pass FC1 interleaved-SwiGLU GEMV
implementation for supported fp16 INT4 decode-shaped workloads. The
first pass computes two K-split partials into QMoE workspace using the
selected GEMV accumulator type, and the second pass reduces the partials
in fp32, applies optional bias, and writes the SwiGLU output for FC2.
FC2 stays on the existing `moe_gemv_kernel` path.
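
The two-pass structure can be sketched numerically in plain Python. This is a conceptual model, not the CUDA kernel: the interleaved layout (even rows are gate, odd rows are linear) and the plain SiLU-gated SwiGLU formula are assumptions, and the real kernels work on INT4 weights with fp16 or fp32 accumulation.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def _bias_and_swiglu(acc, bias):
    """Apply optional bias, then combine interleaved (gate, linear) row pairs."""
    if bias is not None:
        acc = [a + b for a, b in zip(acc, bias)]
    return [silu(acc[i]) * acc[i + 1] for i in range(0, len(acc), 2)]


def fc1_single_pass(weights, x, bias=None):
    """One pass: full-K dot product per row, then SwiGLU."""
    acc = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return _bias_and_swiglu(acc, bias)


def fc1_splitk2(weights, x, bias=None):
    """Two passes: K-split partials into a workspace, then reduce + bias + SwiGLU."""
    half = len(x) // 2
    # Pass 1: each row produces two partial sums, one per half of K.
    workspace = [
        (
            sum(w * v for w, v in zip(row[:half], x[:half])),
            sum(w * v for w, v in zip(row[half:], x[half:])),
        )
        for row in weights
    ]
    # Pass 2: reduce the partials, then finish exactly like the single-pass path.
    acc = [p0 + p1 for p0, p1 in workspace]
    return _bias_and_swiglu(acc, bias)
```

Both paths compute the same result up to floating-point summation order; the split exposes twice as many independent dot products in the first pass, which is the parallelism the description refers to.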

The route now uses two binary environment controls.
`ORT_MOE_GEMV_FP32_ACCUM=1` enables fp32 accumulation, and
`ORT_MOE_GEMV_SPLITK2_SWIGLU=1` enables Split-K2. Both default to `0`.

| Accumulation control | Split-K2 control | Route |
|---|---|---|
| unset or `0` | unset or `0` | fp16 accumulation, single-kernel FC1 SwiGLU |
| unset or `0` | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` | fp16 accumulation, Split-K2 FC1 SwiGLU |
| `ORT_MOE_GEMV_FP32_ACCUM=1` | unset or `0` | fp32 accumulation, single-kernel FC1 SwiGLU |
| `ORT_MOE_GEMV_FP32_ACCUM=1` | `ORT_MOE_GEMV_SPLITK2_SWIGLU=1` | fp32 accumulation, Split-K2 FC1 SwiGLU |
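
The routing table can be read as a function of the two environment variables. A small sketch, where `select_fc1_route` is a hypothetical helper and treating any value other than `1` as off is an assumption (the description only specifies `0`, `1`, and unset):

```python
def select_fc1_route(env):
    """Map the two binary controls to the FC1 route named in the table."""
    fp32_accum = env.get("ORT_MOE_GEMV_FP32_ACCUM", "0") == "1"
    splitk2 = env.get("ORT_MOE_GEMV_SPLITK2_SWIGLU", "0") == "1"
    accumulation = "fp32" if fp32_accum else "fp16"
    kernel = "Split-K2 FC1 SwiGLU" if splitk2 else "single-kernel FC1 SwiGLU"
    return f"{accumulation} accumulation, {kernel}"


# Example: pass os.environ in real use; a plain dict is enough for illustration.
print(select_fc1_route({"ORT_MOE_GEMV_SPLITK2_SWIGLU": "1"}))
```

The two controls are independent, which is why the table has exactly four rows.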

This PR also:

- keeps Split-K2 narrowly gated to fp16 INT4 interleaved-SwiGLU GEMV
with activation/bias scale type matching the activation type;
- adds QMoE workspace plumbing for the Split-K2 partials;
- updates the focused QMoE profiler and Nsight wrapper with matching
`--fp32-accum` and `--splitk2-swiglu` controls;
- adds focused benchmark coverage that explicitly forces the Split-K2
route under the fp16-default policy;
- documents the routing policy, measurements, binary knobs, and future
autotune direction.

### Motivation and Context

GPT-OSS-20B single-token decode spends visible time in the QMoE FC1
interleaved-SwiGLU GEMV path. Split-K2 improves FC1 parallelism by
splitting the K dimension and reducing the partials in a lightweight
second pass.

Under the fp32-accumulation route, Split-K2 reduced FC1 kernel work from
about `21.42 us` to `17.59 + 2.39 = 19.98 us` in Nsight, and repeated
CUDA-graph GPT-OSS decode pairs showed about `+0.9%` to `+1.6%`
throughput improvement. A later 3-pair CUDA-graph run averaged
`332.099536 tok/s` for Split-K2 versus `327.857928 tok/s` with Split-K2
disabled (`+1.29%` throughput, `-1.28%` latency), with no MMLU smoke
regression signal.
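
The percentages quoted above follow from the raw numbers. A short check, assuming per-token latency is the reciprocal of throughput:

```python
# Nsight FC1 kernel times in microseconds, from the paragraph above.
single_kernel_us = 21.42
splitk2_us = 17.59 + 2.39  # partials pass + reduce pass
fc1_reduction_pct = 100.0 * (single_kernel_us - splitk2_us) / single_kernel_us

# 3-pair CUDA-graph decode throughput in tokens per second.
splitk2_tps = 332.099536
baseline_tps = 327.857928
throughput_gain_pct = 100.0 * (splitk2_tps / baseline_tps - 1.0)
# Latency per token is assumed to be 1 / throughput.
latency_change_pct = 100.0 * (baseline_tps / splitk2_tps - 1.0)

print(f"FC1 kernel time reduction: {fc1_reduction_pct:.2f}%")
print(f"throughput: {throughput_gain_pct:+.2f}%, latency: {latency_change_pct:+.2f}%")
```

The FC1 kernel time drops by roughly 6.7%, which translates into the smaller end-to-end gain because FC1 is only one part of each decode step.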

After the normal fp16 QMoE GEMV path changed to fp16 accumulation by
default, the single-kernel fp16 route became faster on the focused
GPT-OSS, Qwen3.6-35B-A3B, and Gemma4-26B-A4B helper configurations. The
fp16 Split-K2 variant is still kept because it is faster than the fp32
Split-K2 route in those focused runs and may be selected by future
per-shape autotuning.

### Validation

- Built and synced CUDA provider:
  - `cmake --build /home/tianlei/onnxruntime/build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)`
- Lint/format:
  - `lintrunner -a ...`
  - `git diff --check`
- Focused CUDA test:
  - `ORT_QMOE_GEMV_BENCHMARK=1 pytest -q onnxruntime/test/python/transformers/test_qmoe_cuda.py::TestQMoEGemvBenchmark::test_splitk2_swiglu_decode_latency`
  - result: `1 passed`
- Focused GPT-OSS helper route checks:
  - default fp16: valid output, `ORT_MOE_GEMV_SPLITK2_SWIGLU=0`, latency `0.062995 ms`
  - fp16 accumulation with Split-K2: valid output, `ORT_MOE_GEMV_SPLITK2_SWIGLU=1`, latency `0.063945 ms`
  - fp32 accumulation without Split-K2: valid output, `ORT_MOE_GEMV_SPLITK2_SWIGLU=0`, latency `0.071311 ms`
  - fp32 accumulation with Split-K2: valid output, `ORT_MOE_GEMV_SPLITK2_SWIGLU=1`, latency `0.071726 ms`
- Nsight route verification:
  - default fp16 dispatched `moe_gemv_interleaved_swiglu_kernel` and `moe_gemv_kernel`
  - fp16 accumulation with Split-K2 dispatched `moe_gemv_splitk_partials_kernel`, `moe_gemv_splitk_reduce_swiglu_kernel`, and `moe_gemv_kernel`
  - fp32 accumulation without Split-K2 dispatched the single FC1 SwiGLU kernel and FC2
  - fp32 accumulation with Split-K2 enabled dispatched Split-K2 partial/reduce kernels and FC2
- Additional focused helper checks, all with valid output:
  - Qwen3.6-35B-A3B: fp16 Split-K2 `0.049207 ms`, fp16 single-kernel `0.047403 ms`, fp32 Split-K2 `0.052055 ms`
  - Gemma4-26B-A4B: fp16 Split-K2 `0.053503 ms`, fp16 single-kernel `0.050732 ms`, fp32 Split-K2 `0.059571 ms`
- 1000-sample `match_mmlu` smoke on GPT-OSS-20B INT4 QMoE:
  - Split-K2 route: `0.8380`
  - Split-K2 disabled: `0.8350`
…osoft#29253)

### Description

`MaxpoolWithMask::Compute` now validates that the pooling kernel rank
equals the input spatial rank (and that the rank is one of the supported
values {1, 2, 3}) before allocating the output, returning a clear error
for malformed inputs instead of proceeding with a mismatched
configuration.

### Changes

- Add an explicit check in `MaxpoolWithMask::Compute` that the kernel
rank matches the input spatial rank and is within the supported {1, 2,
3} range, with a descriptive error message.
- DRY up the rank check.
- Add `MaxPoolWithMask_KernelRankMismatch` and
`MaxPoolWithMask_KernelRankTooLarge` unit tests covering the new
validation.
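
The shape of the new check can be sketched in Python. The function name is hypothetical, and the `[N, C, spatial...]` input layout is an assumption about how the spatial rank is derived; the actual check lives in the C++ `MaxpoolWithMask::Compute`.

```python
SUPPORTED_RANKS = (1, 2, 3)


def check_kernel_rank(input_shape, kernel_shape):
    """Return the pooling rank, or raise if it does not match the input or is unsupported."""
    spatial_rank = len(input_shape) - 2  # assumes layout [N, C, spatial...]
    kernel_rank = len(kernel_shape)
    if kernel_rank != spatial_rank:
        raise ValueError(
            f"kernel rank {kernel_rank} does not match input spatial rank {spatial_rank}"
        )
    if kernel_rank not in SUPPORTED_RANKS:
        raise ValueError(f"kernel rank {kernel_rank} is not one of {SUPPORTED_RANKS}")
    return kernel_rank
```

Running the check before any output allocation means a malformed model fails with a descriptive message instead of continuing with a mismatched configuration.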

### Motivation

Improves input validation and error diagnostics for malformed
`MaxpoolWithMask` inputs. CPU-only; no behavior change for valid inputs.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request refactors how ONNX initializers are handled in the
CoreML provider code, centralizing and standardizing the creation of
initializer objects. The main change is to consistently use the
`ModelBuilder::CreateInitializer` method instead of directly
constructing `Initializer` objects. Additionally, the signature of the
`CreateCoreMLWeight` function is updated to require the `ModelBuilder`
instance. Some helper functions are also modified to pass along the
model path where necessary. These changes improve maintainability,
facilitate future enhancements, and ensure consistent handling of model
paths and initializers.

### Refactoring Initializer Creation

* Replaced direct construction of `Initializer` objects with calls to
`model_builder.CreateInitializer` throughout multiple operator builder
implementations (e.g., Conv, Gemm, Pad, PRelu, GatherND, Reduction,
Reshape, Split). This ensures consistent handling of initializers and
centralizes any future logic changes.
[[1]](diffhunk://#diff-ea3b27d64c46d0b499f8500dd6b0181bcd92459d842a65c1a1c3b1932da612e0L49-R49)
[[2]](diffhunk://#diff-ea3b27d64c46d0b499f8500dd6b0181bcd92459d842a65c1a1c3b1932da612e0L84-R84)
[[3]](diffhunk://#diff-a13aebda6cfc3bb814da5ebbfc333b98ed3dd1d93b820ea33d16c69166d0e729L75-R75)
[[4]](diffhunk://#diff-b3cadc48997d5529b309453c5a312a0e584ffeb9fff5a249605a4c354be84ba6L161-R162)
[[5]](diffhunk://#diff-b3cadc48997d5529b309453c5a312a0e584ffeb9fff5a249605a4c354be84ba6L231-R232)
[[6]](diffhunk://#diff-cb42426502a0e82f16cdf4b0f30bb56cd0e51fd1ca13d5b087a6578d229e5ee9L83-R89)
[[7]](diffhunk://#diff-cb42426502a0e82f16cdf4b0f30bb56cd0e51fd1ca13d5b087a6578d229e5ee9L143-R146)
[[8]](diffhunk://#diff-cb42426502a0e82f16cdf4b0f30bb56cd0e51fd1ca13d5b087a6578d229e5ee9L173-R176)
[[9]](diffhunk://#diff-283e957ad5c83ee7afaaf49820dffca49ede30c7540c288e650349e582573d7bL49-R50)
[[10]](diffhunk://#diff-9ff46f17a77d7f2a8cf1ed5be519b434a81fa8119c73a5c1632d0d2accc8ef1eL69-R69)
[[11]](diffhunk://#diff-9ff46f17a77d7f2a8cf1ed5be519b434a81fa8119c73a5c1632d0d2accc8ef1eL113-R113)
[[12]](diffhunk://#diff-d56880aad90246a9bf69c993b0cc4ffc192eb1519eef62666b0bb02677f186a3L63-R63)

* Updated `CreateCoreMLWeight` to require the `ModelBuilder` parameter
and updated all call sites accordingly. This enables the function to use
the centralized initializer creation logic.
[[1]](diffhunk://#diff-6cec11c3a506483921fc8274758a47b01d8711ba2781139c51b2eca97410c536L92-R95)
[[2]](diffhunk://#diff-be4bf80e91cbe935e8b5d2f9b6b628ce3f550a7e5e1cf375c5d4854994955da7L39-R41)
[[3]](diffhunk://#diff-05313190254d7c959c84f5babc80baa88acc7169fba8f003e85528c9fc27b885L284-R290)
[[4]](diffhunk://#diff-9d82116868ea177194a09e13a3974a31799d54fda55d75dc96a804960ddf4681L87-R90)
[[5]](diffhunk://#diff-b3cadc48997d5529b309453c5a312a0e584ffeb9fff5a249605a4c354be84ba6L217-R221)

### Model Path Handling

* Updated helper functions and operator builder logic (notably in the
Pad and Reshape operators) to pass the model path when creating
`Initializer` objects, ensuring correct handling of external data and
model-relative paths.
[[1]](diffhunk://#diff-cb42426502a0e82f16cdf4b0f30bb56cd0e51fd1ca13d5b087a6578d229e5ee9R42-R50)
[[2]](diffhunk://#diff-cb42426502a0e82f16cdf4b0f30bb56cd0e51fd1ca13d5b087a6578d229e5ee9L303-R306)
[[3]](diffhunk://#diff-cb42426502a0e82f16cdf4b0f30bb56cd0e51fd1ca13d5b087a6578d229e5ee9L324-R329)
[[4]](diffhunk://#diff-283e957ad5c83ee7afaaf49820dffca49ede30c7540c288e650349e582573d7bL93-R96)

### Helper Function Updates

* Modified utility and internal functions (such as
`GetTensorDataTransposed` and `GetPaddingAxesData`) to accept the
`ModelBuilder` or model path as parameters, propagating the new
initializer creation pattern throughout the codebase.
[[1]](diffhunk://#diff-b3cadc48997d5529b309453c5a312a0e584ffeb9fff5a249605a4c354be84ba6R73-R75)
[[2]](diffhunk://#diff-b3cadc48997d5529b309453c5a312a0e584ffeb9fff5a249605a4c354be84ba6L141-R147)

These changes collectively improve code consistency, reliability, and
future extensibility for handling ONNX initializers in the CoreML
backend.

Co-authored-by: lucka-me <lucka-me@users.noreply.github.com>

---------

Co-authored-by: lucka-me <i@lucka.moe>
### Description
Fix VSINPU ep build error

`${CMAKE_CURRENT_BINARY_DIR}` is added so that `#include "onnxruntime_config.h"`
is found.

Signed-off-by: Kee <xuke537@hotmail.com>

@hdharpure9922 hdharpure9922 left a comment


LGTM

@hdharpure9922 hdharpure9922 merged commit 34ee98f into ovep-develop Jul 1, 2026
7 of 8 checks passed
@hdharpure9922 hdharpure9922 deleted the sync_msft_01072026 branch July 1, 2026 06:18