Minor speedup for model lowering: Skip redundant run_decompositions when no ops match decomp table (#18496)

Open
apullin wants to merge 3 commits into pytorch:main from apullin:export-D96489903

Conversation

@apullin
Contributor

@apullin apullin commented Mar 25, 2026

Summary:

Adds an early-exit check to _gen_edge_manager_for_partitioners: before
calling program.run_decompositions(table), scan the graph for ops that
appear in the decomposition table. If none are found, skip the call
entirely.

Each run_decompositions call performs a full re-export of the program
via make_fx(), re-tracing every node through FakeTensor dispatch.
On the EDGE_DO_NOT_DECOMP path this function is called up to 3 times;
the early-exit eliminates at least one redundant call where the previous
pass already decomposed all matching ops.

The check recursively walks control flow submodules (cond/map/scan) to
avoid incorrectly skipping when decomposable ops are nested.
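The scan described above can be sketched as follows. This is a hypothetical model, not the actual ExecuTorch code: the real check walks a torch.fx graph, while here a node is reduced to an (op_kind, target) pair and control-flow bodies to named subgraphs, and the function name is invented for illustration.

```python
def graph_has_decomposable_ops(graph, decomp_table, submodules):
    """Return True if any call_function node targets an op in the table.

    Recurses into control-flow submodules (cond/map/scan bodies) so a
    nested decomposable op is not missed by the early-exit check.
    """
    for op_kind, target in graph:
        if op_kind == "call_function" and target in decomp_table:
            return True
        if op_kind == "get_attr" and target in submodules:
            # Control-flow bodies live in submodules; walk them too.
            if graph_has_decomposable_ops(
                submodules[target], decomp_table, submodules
            ):
                return True
    return False

# A conv in the outer graph, a layer_norm hidden inside a cond branch:
table = {"aten.layer_norm.default"}
outer = [("call_function", "aten.conv2d.default"), ("get_attr", "true_branch")]
subs = {"true_branch": [("call_function", "aten.layer_norm.default")]}
print(graph_has_decomposable_ops(outer, table, subs))  # True: must decompose
print(graph_has_decomposable_ops(outer, table, {}))    # False: safe to skip
```

When the function returns False for the whole program, the expensive `program.run_decompositions(table)` re-export is skipped.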

Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes.

  lower() before:  82 s
  lower() after:   71 s
  Delta:          -11 s  (-13 %)

Differential Revision: D96489903

@pytorch-bot

pytorch-bot Bot commented Mar 25, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18496

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 30 New Failures, 7 Unrelated Failures

As of commit 12a4ff1 with merge base bd5752a:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the "CLA Signed" label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) Mar 25, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Mar 25, 2026

@apullin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96489903.

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-codesync Bot changed the title from "Skip redundant run_decompositions when no ops match decomp table" to "Skip redundant run_decompositions when no ops match decomp table (#18496)" Mar 25, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 25, 2026
meta-codesync Bot changed the title to "Skip redundant run_decompositions when no ops match decomp table" Mar 26, 2026
meta-codesync Bot changed the title to "Skip redundant run_decompositions when no ops match decomp table (#18496)" Mar 30, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
@apullin apullin force-pushed the export-D96489903 branch 2 times, most recently from 559036a to 77d036d Compare March 30, 2026 21:27
@apullin apullin requested a review from digantdesai as a code owner March 30, 2026 21:27
apullin pushed a commit to apullin/executorch that referenced this pull request Mar 30, 2026
apullin changed the title from "Skip redundant run_decompositions when no ops match decomp table (#18496)" to "Minor speedup for model lowering: Skip redundant run_decompositions when no ops match decomp table (#18496)" Apr 2, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Apr 13, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Apr 17, 2026
apullin pushed a commit to apullin/executorch that referenced this pull request Apr 20, 2026
Andrew Pullin added 2 commits May 12, 2026 08:22
Summary:
Adds stock model (non-sleep) profiling tests to the modai lowering
profiling suite. These serve as a baseline/validation for the ExportPass
speedup work (D97528110) without requiring sleep/FBLearner dependencies.

## New profiling functions (sleepmodels_lowering_profile.py)

- _profile_arm_model_lowering(): generic helper using the same modai
  pipeline (Input → recipe → PTQ → Manager → export → lower) so timings
  are directly comparable to the sleep model profiling
- profile_resnet8_lowering(): ResNet8 (MLPerf Tiny CIFAR-10), ~77K params,
  32x32 input — small residual CNN with skip connections
- profile_mobilenet_v1_025_lowering(): MobileNetV1-0.25 (MLPerf Tiny VWW),
  ~217K params, 96x96 input — depthwise-separable CNN

## New test methods

- test_profile_resnet8_lowering()
- test_profile_mobilenet_v1_025_lowering()

Both confirmed passing:
  ResNet8: https://www.internalfb.com/intern/testinfra/testrun/20266198338067913
  MobileNetV1-0.25: https://www.internalfb.com/intern/testinfra/testrun/32088147347033640

## Buck changes

- fbcode/executorch/examples/models/TARGETS + BUCK: add mlperf_tiny target
  (wraps xplat/executorch/examples/models/mlperf_tiny/*.py)
- fbcode/healthtech/common/tests/BUCK: add //executorch/examples/models:mlperf_tiny dep

Differential Revision: D101254299
…ions (pytorch#18497)

Summary:
Pull Request resolved: pytorch#18497

Adds infrastructure for skipping and fast-copying unchanged nodes during
ExportPass execution, then annotates ~60 ARM backend passes to use it.

## Changes

### 1. should_run() hook on ExportPass / ArmPass
Subclasses that declare a `targeted_ops` class attribute (a set of op
overloads) can be skipped entirely when the graph contains none of their
target ops. ArmPass provides a default implementation via inheritance.
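The `should_run()` contract can be sketched with plain-Python stand-ins (hedged: the real hook lives on ExportPass/ArmPass and inspects a torch.fx graph's call_function nodes; the class names here are invented for illustration).

```python
class SketchExportPass:
    targeted_ops = None  # subclasses may declare a set of op overloads

    def should_run(self, graph_ops):
        """Skip the pass when none of its target ops appear in the graph."""
        if self.targeted_ops is None:
            return True  # no declaration: run unconditionally (safe default)
        return bool(self.targeted_ops & set(graph_ops))

class DecomposeLayerNormPass(SketchExportPass):
    targeted_ops = {"aten.layer_norm.default"}

p = DecomposeLayerNormPass()
print(p.should_run({"aten.conv2d.default", "aten.relu.default"}))  # False: skip
print(p.should_run({"aten.layer_norm.default"}))                   # True: run
```

Passes without a `targeted_ops` declaration keep the old always-run behavior, so the hook is opt-in and backwards compatible.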

### 2. Fast-copy for cold nodes
When a pass declares `targeted_ops`, nodes whose ops are NOT in the set
are copied into the new graph via `graph.node_copy()` instead of full
FakeTensor dispatch. Per-node cost drops from ~0.4 ms to ~0.02 ms (~20x).

Includes a safety guard: nodes without `val` metadata (e.g. nodes
inserted by `call()` overrides before `super().call()`) fall back to
full dispatch instead of propagating None.
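The per-node decision, including the safety guard, can be modeled like this (assumed names; the real fast path runs inside ExportPass's interpreter and uses torch.fx's `graph.node_copy` for the cheap branch):

```python
def choose_node_path(op_kind, target, meta, targeted_ops):
    """Return "fast_copy" for cold nodes, "dispatch" otherwise.

    A node takes full FakeTensor dispatch when it is one of the pass's
    target ops, or when it lacks "val" metadata (the safety guard for
    nodes inserted by call() overrides before super().call()).
    """
    hot = op_kind == "call_function" and target in targeted_ops
    has_val = meta.get("val") is not None
    if not hot and has_val:
        return "fast_copy"  # ~0.02 ms: structural copy into the new graph
    return "dispatch"       # ~0.4 ms: full FakeTensor re-dispatch

targets = {"aten.layer_norm.default"}
print(choose_node_path("call_function", "aten.conv2d.default",
                       {"val": 1}, targets))      # fast_copy (cold node)
print(choose_node_path("call_function", "aten.layer_norm.default",
                       {"val": 1}, targets))      # dispatch (target op)
print(choose_node_path("call_function", "aten.conv2d.default",
                       {}, targets))              # dispatch (no val metadata)
```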

### 3. FakeTensor cache extension
Context manager `_extend_faketensor_cache_builtins()` temporarily extends
the FakeTensor dispatch cache to cover ExecuTorch op namespaces
(quantized_decomposed, tosa, dim_order_ops, cortex_m). Avoids redundant
re-dispatches for non-builtin ops across 50+ passes.

### 4. __init_subclass__ auto-discovery on ArmPass
Subclasses with existing `_TARGET_OPS`, `_supported_ops`, or
`_EDGE_OPS`/`_ATEN_OPS` attributes get `targeted_ops` populated
automatically at class definition time — no manual annotation needed.
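A minimal sketch of that auto-discovery hook, with the attribute names taken from the summary above (the merge-all behavior and class names are assumptions, not the actual implementation):

```python
class SketchArmPass:
    targeted_ops = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.__dict__.get("targeted_ops") is not None:
            return  # an explicit annotation on the subclass wins
        # Collect ops from whichever legacy attributes the pass already has.
        found = set()
        for attr in ("_TARGET_OPS", "_supported_ops", "_EDGE_OPS", "_ATEN_OPS"):
            ops = getattr(cls, attr, None)
            if ops:
                found |= set(ops)
        if found:
            cls.targeted_ops = found  # populated at class definition time

class FuseBatchNormPass(SketchArmPass):
    _TARGET_OPS = {"aten.batch_norm.default"}

print(FuseBatchNormPass.targeted_ops)  # {'aten.batch_norm.default'}
```

Because `__init_subclass__` fires when each subclass is defined, existing passes pick up `targeted_ops` without any per-pass edits.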

### 5. targeted_ops annotations on ~60 ARM passes
Each annotation is a one-liner declaring the ops the pass checks in
`call_operator()`. Combined with should_run() and fast-copy, this
achieves the measured speedup below.

## Benchmark

Model: small CNN feature extractor (~50K params, 9 conv layers with
LayerNorm, targeting Ethos-U55 via the ARM/TOSA lowering pipeline).
Graph: ~1200 nodes, 146 ExportPass invocations.

  lower() before:  186 s
  lower() after:   100 s
  Passes skipped:  53 of 146
  Delta:           -86 s  (-46 %)
Adds should_run() hook to ExportPass that subclasses can override to skip
execution when a pass has no work to do. ArmPass implements a default that
checks a targeted_ops class attribute against the graph's call_function nodes.

Also adds:
- _fast_copy_node path in ExportInterpreter.run_node that uses graph.node_copy
  instead of full FakeTensor dispatch for cold nodes in passes that declare
  targeted_ops. Per-node cost drops from ~0.4ms to ~0.02ms.
- _extend_faketensor_cache_builtins context manager that extends FakeTensor
  dispatch cache to cover ExecuTorch ops (quantized_decomposed, tosa, etc.)
- __init_subclass__ on ArmPass for auto-discovery of targeted_ops from
  existing _TARGET_OPS, _supported_ops, _EDGE_OPS/_ATEN_OPS attributes
- targeted_ops annotations on ~60 ARM pass subclasses

Measured on SleepNet featurizer (U55 lowering):
  lower():  185s -> 96s  = -89s (-48%)

Differential Revision: D97528110
apullin pushed a commit to apullin/executorch that referenced this pull request May 12, 2026
@apullin apullin force-pushed the export-D96489903 branch from e10f77c to c4a7945 Compare May 12, 2026 20:48
@apullin apullin requested a review from lucylq as a code owner May 12, 2026 20:48
github-actions Bot added the "ciflow/trunk" and "module: arm" (Issues related to arm backend) labels May 12, 2026
@pytorch-bot

pytorch-bot Bot commented May 12, 2026

Workflows were awaiting approval. CI has now been triggered for the ciflow labels on this PR.

…hen no ops match decomp table (pytorch#18496)

@apullin apullin force-pushed the export-D96489903 branch from c4a7945 to 12a4ff1 Compare May 12, 2026 20:56
Labels

ciflow/trunk
CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
fb-exported
meta-exported
module: arm (Issues related to arm backend)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant