---
title: Arm Backend Known Issues
category: DEBUGGING
backends: [Arm]
last_validated: 2026-04-15
source_issues: [1004, 1110, 1161, 1163, 1230, 11913, 11999, 12237, 10899, 12270, 12959, 12991, 13022, 13399, 13557, 13842, 13901, 15805, 15870, 16090, 16225, 16374, 16426, 16541, 16629, 16739, 16779, 16784, 16864, 16899, 16902, 17241, 17397, 17437, 17489, 17667, 17668, 17753, 17902, 18306, 18319, 18491, 18500, 18873]
---

# Arm Backend Known Issues

## Submodule / Setup Issues

### git.mlplatform.org SSL and availability

The Arm backend's `ethos-u-core-driver` submodule is hosted on `git.mlplatform.org`, which has recurring issues (note: `serialization_lib` has been removed from the repo):

- **SSL certificate verification failures** — `fatal: unable to access ... server certificate verification failed`
- **HTTP 500 errors** — server outages
- These failures block ALL submodule init, not just Arm submodules [Source: #1004, #1163]

**Fix:** Remove the Arm submodule if not using the Arm backend:
```bash
git submodule deinit backends/arm/third-party/ethos-u-core-driver/
```
Or disable SSL verification (not recommended): `git config --global http.sslVerify "false"` [Source: #1004]

### install_executorch.sh failures on macOS

Build failures during the pip wheel build on macOS may be caused by CMake version conflicts. Some users report that downgrading CMake to 3.25 and re-running the install script (which then upgrades CMake again) resolves the issue. This is likely a caching/state issue. [Source: #10151]

**Best fix:** Use a clean environment and v0.6+. [Source: #10151]

## Operator / Compilation Issues

### Dynamic shapes not supported

The Arm backend cannot handle models with dynamic shapes. `SymFloat` or `SymInt` objects in the graph cause assertion failures in `get_first_fake_tensor()`:

```
AssertionError: Found zuf38 in meta["val"] of _local_scalar_dense_2, expected to find FakeTensor
```
or:
```
TypeError: Expected a FakeTensor ... but got SymFloat
```

**Workaround:** Fix all input shapes at export time. For YOLO models, remove the dynamic anchor generation. [Source: #12237]
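
A minimal sketch of a fully static export (the model and shapes here are placeholders): omitting `dynamic_shapes` specializes every dimension to the example input's size, so no `SymInt`/`SymFloat` reaches the partitioner.

```python
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example = (torch.randn(1, 3, 224, 224),)  # fixed N, C, H, W

# No dynamic_shapes argument: torch.export specializes all dims to 1x3x224x224.
exported = torch.export.export(model, example)
```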

### Attribute mutation during export

Models that mutate attributes (like YOLO's `self.anchors`) fail with strict export:
```
AssertionError: Mutating module attribute anchors during export.
```

**Fix:** Use `strict=False` in `torch.export.export_for_training()`. [Source: #12237]
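
A minimal sketch of the suggested call; `model` and `example_inputs` are placeholders for your own export flow:

```python
import torch

model = torch.nn.Linear(4, 4).eval()   # placeholder for the mutating model
example_inputs = (torch.randn(1, 4),)

# Non-strict export avoids the tracer's attribute-mutation assertion.
ep = torch.export.export_for_training(model, example_inputs, strict=False)
```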

### NHWC memory format conversion

TOSA requires channels-last (NHWC) format. The `Permute_Memory_Format_Pass` handles this, but it was historically a work in progress with incomplete shape updates for neighboring operators. [Source: #1110]

### Vela compiler internal errors

Early versions had issues with Vela rejecting TOSA output:
- `AttributeError: 'ReshapeAttribute' object has no attribute 'NewshapeAsNumpy'` — case sensitivity bug in Vela
- Linear layers could fail until the TOSA-to-Vela mapping was revised [Source: #1161]

### Missing quantized op kernels

Running quantized models without delegation requires linking the quantized op library:
```
RuntimeError: Missing out variants: {'quantized_decomposed::dequantize_per_tensor', ...}
```

**Fix:** Build and link `quantized_ops_lib`. Performance without NPU delegation will be poor. [Source: #1161]

## Build Issues

### c10/macros/cmake_macros.h not found

When building backends as separate CMake projects (e.g., the MediaTek LLaMA runner), you may see:
```
fatal error: 'c10/macros/cmake_macros.h' file not found
```

**Fix:** Define `C10_USING_CUSTOM_GENERATED_MACROS` in the CMakeLists.txt. This is needed whenever a separate CMake project sets up ExecuTorch include paths directly rather than using the `executorch_core` target's public compile definitions. [Source: #11999]

### Selective build for baremetal

`libportable_kernels` for Arm baremetal may not enable selective build by default. Enable it with these CMake flags:
```bash
-DEXECUTORCH_SELECT_OPS_FROM_MODEL="<model>.pte"
-DEXECUTORCH_DTYPE_SELECTIVE_BUILD=ON
```
[Source: #11913]

## Performance Profiling

### Vela estimator vs FVP profiling

The Vela compiler includes a performance estimator, but its estimates can differ significantly from actual FVP (Fixed Virtual Platform) profiling results. Always validate performance on FVP or real hardware. [Source: #18319]

### Non-delegated performance

Running quantized models on a Cortex-M CPU without Ethos-U delegation has "tragic" performance (as noted by the core team). Always use delegation for production workloads. [Source: #1161]

## Preserved Ops API

Cadence and Arm backends need `to_edge_with_preserved_ops` (experimental) to prevent decomposition of ops like `aten.rms_norm`. This API is being promoted to official status (see the sketch after this list):
- `preserve_ops` will be added to `EdgeCompileConfig`
- View/mutation ops can be preserved if consumed by a delegate backend
- View/mutation ops should NOT be preserved if they remain in the portable graph [Source: #12306]
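
A hypothetical usage sketch, assuming the promoted `preserve_ops` field lands on `EdgeCompileConfig` as described above (the field name and op list are assumptions, not the final API):

```python
import torch
from executorch.exir import EdgeCompileConfig, to_edge

class Norm(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.rms_norm(x, (x.shape[-1],))

ep = torch.export.export(Norm().eval(), (torch.randn(2, 8),))

# Assumed API: keep aten.rms_norm intact instead of decomposing it,
# so a delegate that consumes it directly can still match the node.
edge = to_edge(ep, compile_config=EdgeCompileConfig(
    preserve_ops=[torch.ops.aten.rms_norm.default],
))
```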

## Quantizer Issues

### Observer sharing bug at Conv-ReLU + residual junctions

The Arm Ethos quantizer incorrectly shares observers across `add`, `permute`, and `relu` at residual connections. This causes quantization errors in models with skip connections (e.g., ResNet, MobileNet). Root cause: `quantization_annotator.py` doesn't properly handle shared quantization specs at add nodes. [Source: #12959]

### SharedQuantizationSpec infinite recursion

Using `SharedQuantizationSpec` with certain topologies (e.g., `minimum → eq` chains) causes `RecursionError`. Fixed upstream in pytorch/ao#3011. [Source: #13842]

### LeakyReLU fails with device mismatch

The Arm quantizers (VGF, Ethos-U) fail on `nn.LeakyReLU` because the `negative_slope` constant gets placed on the wrong device. The XNNPACK quantizer doesn't have this issue. Root cause: kwargs removal in `quantization_annotator.py`. [Source: #16541]

### ReLU(inplace=True) with 16-bit activation

`ReLU(inplace=True)` with the `a16w8` quantization config fails at `to_edge_transform_and_lower` with `Expected tensor aten_convolution_default in aten.clamp`. Fixed on the main branch. [Source: #16629]

### FuseQuantizedActivationPass INT16 failure

`FuseQuantizedActivationPass` does not handle INT16 symmetric quantization correctly in some cases. [Source: #17437]

### aot_arm_compiler.py Conv2d quantization failure

`aot_arm_compiler.py` may not quantize `Conv2d` for the `cortex-m55+int8` target in certain configurations. [Source: #17902]

### Name filter doesn't match nodes correctly

`arm_quantizer.py`'s `module_name_filter` assumes names start with `"L['self']."`, which may not be present. Fixed on main. [Source: #15870]

### GroupNorm decomposition failure

`DecomposeGroupNormPass(ArmPass)` fails when running `prepare_pt2e` on models with `torch.nn.GroupNorm`. May be related to dynamic shape handling. [Source: #16090]

## Vela Compiler Issues

### Custom config file crashes with trailing spaces

Custom `[System_Config.*]` sections crash Vela with `IndexError` if config lines have trailing spaces. Fixed in Vela 4.5.0. [Source: #15805]

### `--optimise Size` produces incorrect results

Vela with the `--optimise Size` flag can produce different (wrong) results compared to the default optimization. [Source: #16864]

### reduce_mean not fully delegated

Operator support checks for views/reshapes are overly pessimistic — they reject view nodes with an axis product > 65536 even when no transpose is needed. This prevents full delegation of `reduce_mean` to the NPU. [Source: #16779]

### Vela internal errors on certain models

Vela may crash internally on certain model structures. The Vela team is actively investigating. [Source: #13022]

## Delegation Issues

### conv→relu→permute→reshape(5D) crashes partitioner

This specific graph pattern crashes during `to_edge_transform_and_lower` for Ethos-U. [Source: #16739]

### PReLU unsupported on Ethos-U

`torch.nn.PReLU` decomposes to `torch.where(x > 0, x, weights * x)`, which isn't supported by the Ethos-U backend. No workaround exists. [Source: #16902]
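
A small check of the decomposition the partitioner sees, grounded in the description above (`nn.PReLU`'s default single slope broadcasts over all elements):

```python
import torch

x = torch.randn(4)
prelu = torch.nn.PReLU()  # one learnable slope, broadcast over all channels

# The where-form is what PReLU decomposes to, and it is what Ethos-U rejects.
decomposed = torch.where(x > 0, x, prelu.weight * x)
assert torch.allclose(prelu(x), decomposed)
```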

### BatchNorm2d without preceding Conv not delegated

Standalone `BatchNorm2d` (not fused with a Conv) fails Ethos-U delegation, though it works in the TFLite→Vela flow. Workaround: manually decompose it to `mul + add`, as in the sketch below. [Source: #17241, #17397]
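
A minimal sketch of that manual decomposition, folding the inference-mode BatchNorm statistics into a per-channel scale and shift (module and buffer names are illustrative):

```python
import torch

def fold_batchnorm(bn: torch.nn.BatchNorm2d) -> torch.nn.Module:
    """Rewrite an eval-mode BatchNorm2d as the mul + add it computes at inference."""
    inv_std = torch.rsqrt(bn.running_var + bn.eps)
    w = bn.weight if bn.weight is not None else torch.ones_like(inv_std)
    b = bn.bias if bn.bias is not None else torch.zeros_like(inv_std)

    class MulAdd(torch.nn.Module):
        def __init__(self):
            super().__init__()
            # y = (x - mean) * inv_std * w + b  ==  x * scale + shift
            self.register_buffer("scale", (w * inv_std).reshape(1, -1, 1, 1))
            self.register_buffer(
                "shift", (b - bn.running_mean * w * inv_std).reshape(1, -1, 1, 1)
            )

        def forward(self, x):
            return x * self.scale + self.shift

    return MulAdd()

bn = torch.nn.BatchNorm2d(8).eval()
x = torch.randn(1, 8, 4, 4)
assert torch.allclose(bn(x), fold_batchnorm(bn)(x), atol=1e-6)
```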

### GRU / RNN layers not supported

GRU decomposition fails during Ethos-U lowering. LSTM support via CMSIS-NN is planned but not yet implemented. [Source: #12270, #17753]

### RewriteConvPass crashes on non-fuseable conv→relu branches

**Symptom**:
```
ValueError: RewriteConvPass: No output quantization parameter found in node tosa_conv2d_default
original_aten=aten.convolution.default
```
Occurs during `to_edge_transform_and_lower` when a delegated `conv → relu/clamp` branch has an activation whose output quantization has `zero_point != qmin` (non-fuseable). [Source: #18491]

**Root Cause**: `FoldAndAnnotateQParamsPass` places `output_qparams` on the downstream `clamp` node rather than the `conv` node in the non-fuseable case. `RewriteConvPass` unconditionally calls `get_output_qparams(conv)`, which crashes because the conv doesn't own its output quantization.

**Fix**: Fixed by PR #18778, which makes `RewriteConvPass` check for `output_qparams` on successor activation nodes when the conv itself has no output qparams. [Source: #18491]

### Quantized sigmoid TABLE generation bug with qmin=-127

**Symptom**: Quantized `aten.sigmoid.default` produces incorrect outputs when lowered to a TOSA TABLE with `qmin=-127, qmax=127, dtype=torch.int8`. The generated 256-entry LUT has duplicate entries and off-by-one shifts. [Source: #18873]

**Root Cause**: `InsertTableOpsPass.generate_8bit_table_values()` uses `torch.linspace(start=-127, end=127, steps=256, dtype=torch.int8)`, which cannot produce 256 distinct values in a 255-code range, causing code `0` to be duplicated.

**Status**: Open issue. The fix should use the full int8 domain `[-128, 127]` as the table input regardless of `qmin`/`qmax`, or use an explicit integer range instead of `torch.linspace`. [Source: #18873]
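
A quick demonstration of the pigeonhole problem described above (256 samples squeezed into 255 integer codes), plus the explicit-range alternative the status note suggests:

```python
import torch

# The buggy table input: 256 samples in the 255 codes of [-127, 127].
buggy = torch.linspace(start=-127, end=127, steps=256, dtype=torch.int8)
print(buggy.unique().numel())  # 255 -- code 0 appears twice

# Suggested alternative: the full 256-code int8 domain via an explicit range.
fixed = torch.arange(-128, 128, dtype=torch.int64).to(torch.int8)
print(fixed.unique().numel())  # 256 distinct codes
```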

### ConvTranspose2d fallback failure

`ConvTranspose2d` fails to fall back to CPU when it can't run on the NPU, producing a "Non-passthrough operation could not run on NPU" error. [Source: #17668]

### Ethos-U base_addr mismatch

The Ethos-U backend may use `base_addr` values that don't match ExecuTorch's planned memory pool, causing output buffers to remain unchanged on real hardware despite reported successful execution. Works on FVP but fails on real MCUs. [Source: #16784]

## Performance Issues

### Softmax decomposition slow on NPU

The softmax decomposition uses `aten::amax`, which runs on the elementwise engine (not the MACs). The Vela performance estimator is unreliable for cycle counts — always validate on FVP or real hardware. [Source: #18319]

### LayerNorm quantization accuracy

LayerNorm quantization is sensitive to epsilon values. For transformer models (DeiT-tiny, etc.), accuracy drops in the TOSA quantized pipeline may be caused by epsilon sensitivity. Use the `--stable_softmax` flag for the numerically stable algorithm. [Source: #16426, #18306, #18316]

### amax support added for U55

`amax` op support was added for Ethos-U55 (via a Vela update). To use it, set `ArmPassPipelineConfig` in the compile spec with `stable_softmax=True`. [Source: #17211]

## Setup / Build Issues

### Dependency conflicts in setup.sh

`examples/arm/setup.sh` has known dependency conflicts between ethos-u-vela (flatbuffers==24.12.23) and tosa-tools (flatbuffers==23.5.26). These are known and the backend still works. [Source: #10899, #12991]

### No module named 'tosa' after pip install

`pip install executorch` does not install the tosa dependencies. Run `examples/arm/setup.sh` after pip install. Future: `pip install executorch[ethos-u]`. [Source: #13901]

### ARM GitLab access issues (resolved)

`git.gitlab.arm.com` had recurring access issues. Resolved with improved IP access management. [Source: #13557]

### Cross-compilation flatc issues

Remove manual `FLATBUFFERS_FLATC_EXECUTABLE` args — newer ExecuTorch builds handle the host flatc automatically. [Source: #10964]

### strided_copy in output graph

When sample inputs are transposed (e.g., NHWC numpy arrays), `aten.as_strided_copy` appears in the graph. This is inserted by `ExportedProgram.run_decompositions()` and is often a no-op that can be removed. [Source: #16374]
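
One way to sidestep this, assuming the strided sample input is the trigger as the report above suggests, is to materialize the example as a contiguous array before export (shapes here are illustrative):

```python
import numpy as np
import torch

nhwc = np.random.rand(1, 224, 224, 3).astype(np.float32)
nchw = np.transpose(nhwc, (0, 3, 1, 2))  # a non-contiguous view

# ascontiguousarray copies into standard layout, so the exported graph
# traces a plain contiguous tensor instead of a strided one.
example = (torch.from_numpy(np.ascontiguousarray(nchw)),)
```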

## Runner Issues

### Object lifetime bug in arm_executor_runner.cpp

`BufferCleanup` used `free()` on memory that came from `ArmMemoryAllocator` (static pools). The bug was hidden by the FVP but crashes on real hardware. Fixed in PR #16339. [Source: #16225]

### FVP log format issues

The Arm GNU compiler may not support C99 format specifiers (`%zd`) by default, causing garbled FVP output. Use `%ld` instead. [Source: #13038]

### int8 I/O with ML Toolkit

When using the `QuantizeInputs`/`QuantizeOutputs` passes, the PTE expects int8 I/O. The ML Toolkit (MLEK) preprocessing may feed float data, causing type mismatches. [Source: #16899]
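
A minimal sketch of quantizing float preprocessing output to int8 on the host side with the standard affine formula; the scale and zero-point values are placeholders for the ones recorded in your PTE's input quantization parameters:

```python
import numpy as np

SCALE, ZERO_POINT = 0.02, -5  # placeholders: read from your model's input qparams

def quantize_input(x: np.ndarray) -> np.ndarray:
    """q = clamp(round(x / scale) + zero_point, -128, 127), matching
    quantize_per_tensor semantics for int8 inputs."""
    q = np.round(x / SCALE) + ZERO_POINT
    return np.clip(q, -128, 127).astype(np.int8)
```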

### Cortex-M quantization operators incorrect results

When using the Arm backend without Ethos-U delegation, the Cortex-M quantization operators (`cortex_m_dequantize`, etc.) can produce incorrect results if the calibration data is not representative. The default calibration in `aot_arm_compiler` uses `torch.randn(32, 2, 2)`, which may not be appropriate. [Source: #13399]
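
A minimal calibration sketch under the usual PT2E flow: feed representative samples through the prepared model so its observers record realistic activation ranges instead of random data (`prepared_model` is assumed to come from `prepare_pt2e`):

```python
import torch

def calibrate(prepared_model: torch.nn.Module,
              calib_inputs: list[torch.Tensor]) -> None:
    """Run representative inputs through a prepare_pt2e'd model so its
    observers see real activation statistics before convert_pt2e."""
    with torch.no_grad():
        for x in calib_inputs:
            prepared_model(x)
```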