
Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497)

Open
apullin wants to merge 2 commits into pytorch:main from apullin:export-D97528110

Conversation

@apullin
Contributor

@apullin apullin commented Mar 25, 2026

Summary:

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass

Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.
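The skip check described above amounts to a set-membership scan over the graph's call_function nodes. A minimal runnable sketch, using stand-in Node/Graph classes instead of the real torch.fx types (all class names here are illustrative, not the actual ExecuTorch implementation):

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str         # e.g. "call_function", "placeholder", "output"
    target: object  # the op overload, for call_function nodes


@dataclass
class Graph:
    nodes: list = field(default_factory=list)


class TargetedPass:
    # Set of op overloads this pass rewrites; empty means "always run".
    targeted_ops: frozenset = frozenset()

    def should_run(self, graph: Graph) -> bool:
        if not self.targeted_ops:
            return True  # unannotated pass: never skipped
        # Skip the pass entirely when no call_function node targets one
        # of its declared ops.
        return any(
            n.target in self.targeted_ops
            for n in graph.nodes
            if n.op == "call_function"
        )


class FuseAddPass(TargetedPass):
    targeted_ops = frozenset({"aten.add.Tensor"})


g = Graph([Node("placeholder", None), Node("call_function", "aten.mul.Tensor")])
print(FuseAddPass().should_run(g))  # False: no add in the graph, pass is skipped
```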

### 2. Fast-copy for cold nodes

When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.
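The decision of when the fast path applies can be sketched as a pure predicate; the helper name below is hypothetical, but the guard mirrors the two conditions in the text (node is a target of the pass, or node lacks `val` metadata):

```python
def needs_full_dispatch(node_op, node_target, node_meta, targeted_ops):
    """Return True when a node must go through full FakeTensor dispatch."""
    # Hot node: the pass may rewrite it in call_operator(), so dispatch fully.
    if node_op == "call_function" and node_target in targeted_ops:
        return True
    # Safety guard: without cached `val` metadata there is nothing to carry
    # forward, so fall back to full dispatch instead of propagating None.
    if "val" not in node_meta:
        return True
    # Cold node with metadata: safe to clone cheaply via graph.node_copy().
    return False


# A cold node that still lacks `val` (e.g. inserted before super().call()):
print(needs_full_dispatch("call_function", "aten.mul", {}, {"aten.add"}))            # True
# A cold node with cached metadata takes the ~20x cheaper copy path:
print(needs_full_dispatch("call_function", "aten.mul", {"val": ...}, {"aten.add"}))  # False
```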

### 3. FakeTensor cache extension

The context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). This avoids redundant
re-dispatches for non-builtin ops across 50+ passes.
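The save/extend/restore pattern such a context manager needs is standard; here is a self-contained sketch against a stand-in namespace set, since the real hook point is internal to FakeTensor's dispatch cache and is not reproduced here:

```python
from contextlib import contextmanager

# Stand-in for the builtin namespace allow-list the real cache consults.
CACHEABLE_NAMESPACES = {"aten", "prims"}


@contextmanager
def extend_cacheable_namespaces(extra):
    """Temporarily add op namespaces to the cacheable set; restore on exit."""
    saved = set(CACHEABLE_NAMESPACES)
    CACHEABLE_NAMESPACES.update(extra)
    try:
        yield
    finally:
        # Restore even if the body raised, so one pass cannot leak
        # extra namespaces into the next.
        CACHEABLE_NAMESPACES.clear()
        CACHEABLE_NAMESPACES.update(saved)


with extend_cacheable_namespaces({"quantized_decomposed", "tosa"}):
    print("tosa" in CACHEABLE_NAMESPACES)  # True: tosa ops hit the cache here
print("tosa" in CACHEABLE_NAMESPACES)      # False: restored after the block
```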

### 4. __init_subclass__ auto-discovery on ArmPass

Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time; no manual annotation needed.
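Class-definition-time discovery like this is a natural fit for Python's `__init_subclass__`. A minimal sketch; only the legacy attribute names come from the text, and the class and op names are hypothetical:

```python
class ArmPassSketch:
    targeted_ops: frozenset = frozenset()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.targeted_ops:
            return  # an explicit annotation on the subclass wins
        # Mirror the first pre-existing op-set attribute into targeted_ops.
        for attr in ("_TARGET_OPS", "_supported_ops", "_EDGE_OPS", "_ATEN_OPS"):
            ops = getattr(cls, attr, None)
            if ops:
                cls.targeted_ops = frozenset(ops)
                break


class DecomposeFooPass(ArmPassSketch):
    _TARGET_OPS = {"aten.foo.default"}


print(DecomposeFooPass.targeted_ops)  # frozenset({'aten.foo.default'})
```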

### 5. targeted_ops annotations on ~60 ARM passes

Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with `should_run()` and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds a `should_run()` hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a `targeted_ops` class attribute against the graph's call_function nodes.

Also adds:

- `_fast_copy_node` path in ExportInterpreter.run_node that uses `graph.node_copy`
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  `targeted_ops`. Per-node cost drops from ~0.4 ms to ~0.02 ms.
- `_extend_faketensor_cache_builtins` context manager that extends the FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- `__init_subclass__` on ArmPass for auto-discovery of `targeted_ops` from
  existing `_TARGET_OPS`, `_supported_ops`, `_EDGE_OPS`/`_ATEN_OPS` attributes
- `targeted_ops` annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):

  lower(): 185 s -> 96 s = -89 s (-48%)

Differential Revision: D97528110

@pytorch-bot

pytorch-bot Bot commented Mar 25, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18497

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 5 New Failures, 7 Unrelated Failures

As of commit 350a0a4 with merge base bd5752a:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Mar 25, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Mar 25, 2026

@apullin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97528110.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot changed the title Add should_run() + fast-copy infrastructure with targeted_ops annotations Add should_run() + fast-copy infrastructure with targeted_ops annotations, [executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations Mar 25, 2026
@apullin apullin changed the title Add should_run() + fast-copy infrastructure with targeted_ops annotations, [executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations Add should_run() + fast-copy infrastructure with targeted_ops annotations Mar 25, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
@meta-codesync meta-codesync Bot changed the title Add should_run() + fast-copy infrastructure with targeted_ops annotations Add should_run() + fast-copy infrastructure with targeted_ops annotations, [executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) Mar 25, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
@meta-codesync meta-codesync Bot changed the title Add should_run() + fast-copy infrastructure with targeted_ops annotations, [executorch][arm] Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) Mar 25, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
@apullin force-pushed the export-D97528110 branch 3 times, most recently from 485f99d to 417b280, March 25, 2026 23:40
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
@apullin force-pushed the export-D97528110 branch 2 times, most recently from ad7b73c to f401907, March 26, 2026 06:34
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 26, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 26, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
…ions (pytorch#18497)

@apullin apullin changed the title Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) Major speedup for model lowering: Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) Apr 2, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Apr 13, 2026
…ions (pytorch#18497)

apullin pushed a commit to apullin/executorch that referenced this pull request Apr 13, 2026
…tructure with targeted_ops annotations (pytorch#18497)

Summary:
Adds stock model (non-sleep) profiling tests to the modai lowering
profiling suite. These serve as a baseline/validation for the ExportPass
speedup work (D97528110) without requiring sleep/FBLearner dependencies.

## New profiling functions (sleepmodels_lowering_profile.py)

- _profile_arm_model_lowering(): generic helper using the same modai
  pipeline (Input → recipe → PTQ → Manager → export → lower) so timings
  are directly comparable to the sleep model profiling
- profile_resnet8_lowering(): ResNet8 (MLPerf Tiny CIFAR-10), ~77K params,
  32x32 input — small residual CNN with skip connections
- profile_mobilenet_v1_025_lowering(): MobileNetV1-0.25 (MLPerf Tiny VWW),
  ~217K params, 96x96 input — depthwise-separable CNN

## New test methods

- test_profile_resnet8_lowering()
- test_profile_mobilenet_v1_025_lowering()

Both confirmed passing:
  ResNet8: https://www.internalfb.com/intern/testinfra/testrun/20266198338067913
  MobileNetV1-0.25: https://www.internalfb.com/intern/testinfra/testrun/32088147347033640

## Buck changes

- fbcode/executorch/examples/models/TARGETS + BUCK: add mlperf_tiny target
  (wraps xplat/executorch/examples/models/mlperf_tiny/*.py)
- fbcode/healthtech/common/tests/BUCK: add //executorch/examples/models:mlperf_tiny dep

Differential Revision: D101254299
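A generic stage-timing helper in the spirit of _profile_arm_model_lowering() might look like the sketch below. The dict-of-callables interface and the stage names in the usage are illustrative assumptions, not the actual helper from the profiling suite.

```python
import time
from typing import Callable, Dict


def profile_stages(stages: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each pipeline stage in order and record wall-clock seconds."""
    timings: Dict[str, float] = {}
    for name, fn in stages.items():
        t0 = time.perf_counter()
        fn()  # e.g. export, quantize, lower in the real pipeline
        timings[name] = time.perf_counter() - t0
    return timings
```

Keeping every model on the same helper is what makes the ResNet8 and MobileNetV1-0.25 timings directly comparable to the sleep-model numbers.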
@apullin apullin force-pushed the export-D97528110 branch from fc5cb5e to a408af0 Compare May 12, 2026 16:00
@apullin apullin requested a review from lucylq as a code owner May 12, 2026 16:00
apullin pushed a commit to apullin/executorch that referenced this pull request May 12, 2026
…tructure with targeted_ops annotations (pytorch#18497)

@github-actions github-actions Bot added ciflow/trunk module: arm Issues related to arm backend labels May 12, 2026
@pytorch-bot

pytorch-bot Bot commented May 12, 2026

Workflows were awaiting approval. CI has now been triggered for the ciflow labels on this PR.

apullin pushed a commit to apullin/executorch that referenced this pull request May 12, 2026
…tructure with targeted_ops annotations (pytorch#18497)

@apullin apullin force-pushed the export-D97528110 branch from a408af0 to 5c00029 Compare May 12, 2026 16:06
@meta-codesync meta-codesync Bot changed the title Major speedup for model lowering: Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) Add should_run() + fast-copy infrastructure with targeted_ops annotations (#18497) May 12, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request May 12, 2026
…ions (pytorch#18497)

@apullin apullin force-pushed the export-D97528110 branch from 5c00029 to 7432007 Compare May 12, 2026 20:45
apullin pushed a commit to apullin/executorch that referenced this pull request May 12, 2026
…ions (pytorch#18497)

@apullin apullin force-pushed the export-D97528110 branch from 7432007 to 350a0a4 Compare May 12, 2026 20:52
Labels: ciflow/trunk, CLA Signed, fb-exported, meta-exported, module: arm
