WebGPU: add memory aliasing for intermediate tensor buffers by digantdesai · Pull Request #19305 · pytorch/executorch

digantdesai · 2026-05-05T19:30:47Z

Use ETVK's mem_obj_id in the WebGPU runtime to implement memory aliasing

pytorch-bot · 2026-05-05T19:30:51Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19305

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

Run pull request jobs on OSDC runners in shadow mode

❌ 1 New Failure, 1 Unclassified Failure

As of commit 53ac65d with merge base 2dfe8e4:

NEW FAILURE - The following job has failed:

pull / unittest / macos / macos-job (gh)
export/tests/test_target_recipes.py::TestTargetRecipes::test_mv3_model

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

Check Labels / Check labels (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: GraphQL query

This comment was automatically generated by Dr. CI and updates every 15 minutes.

digantdesai · 2026-05-05T19:30:59Z

@claude review this PR

github-actions · 2026-05-05T19:31:40Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

claude · 2026-05-05T19:31:47Z

Claude finished @digantdesai's task in 1m 37s

Review: WebGPU memory aliasing for intermediate tensor buffers

Read changed files and understand the implementation
Review WebGPUBackend changes
Review WebGPUGraph changes (core aliasing logic)
Review test changes
Post review feedback

Overall: Clean, well-structured implementation. The approach of using ETVK's mem_obj_id to share GPU buffers between non-overlapping intermediate tensors is sound and the test coverage is good. A few items worth addressing:

Issues

1. Global mutable state for memory stats (WebGPUBackend.cpp:41)

static WebGPUMemoryStats s_last_memory_stats;

This is a process-wide global that gets overwritten on every init() call. If two models are loaded concurrently (or sequentially), the stats from the first are silently lost. This is fine for a test-only helper, but consider documenting that intent (e.g., naming it something like s_last_memory_stats_for_testing) or gating it behind a test/debug flag so it doesn't ship in production builds. If this is intentionally test-only, could it live in the test file instead?

2. Destructor correctness — shared buffer double-release guard (WebGPUGraph.cpp:55-58)

for (size_t i = 0; i < tensors_.size(); i++) {
    if (tensors_[i].buffer &&
        (i >= tensor_mem_obj_ids_.size() || tensor_mem_obj_ids_[i] < 0)) {
      wgpuBufferRelease(tensors_[i].buffer);
    }
}

The guard i >= tensor_mem_obj_ids_.size() should never be true since both vectors are resized to num_vals in build(). If build() throws before that resize (e.g., device is null), both vectors are empty and the loop wouldn't execute anyway. The check is harmless but somewhat misleading—a simple tensor_mem_obj_ids_[i] < 0 would suffice given the invariant. Minor nit, no action needed.
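The ownership rule discussed in this item can be sketched without a WebGPU header. This is a minimal illustration, not the PR's destructor: `FakeBuffer`, `TensorSlot`, and `release_all` are hypothetical names, and a counter stands in for `wgpuBufferRelease`. It assumes, as the comment states, that both vectors always have the same length after `build()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for WGPUBuffer; counts how often it was released.
struct FakeBuffer {
  int release_count = 0;
};

// Hypothetical per-tensor record holding a (possibly shared) buffer handle.
struct TensorSlot {
  FakeBuffer* buffer = nullptr;
};

// Releases every buffer exactly once: dedicated tensor buffers
// (mem_obj_id < 0) through the tensor list, shared buffers through
// the shared list.
inline void release_all(
    const std::vector<TensorSlot>& tensors,
    const std::vector<int>& tensor_mem_obj_ids,
    const std::vector<FakeBuffer*>& shared_buffers) {
  // Assumed invariant: both vectors were resized to the same length.
  assert(tensors.size() == tensor_mem_obj_ids.size());
  for (std::size_t i = 0; i < tensors.size(); i++) {
    // Tensors that alias a shared buffer are skipped here.
    if (tensors[i].buffer && tensor_mem_obj_ids[i] < 0) {
      tensors[i].buffer->release_count++;
    }
  }
  for (FakeBuffer* buf : shared_buffers) {
    if (buf) {
      buf->release_count++;
    }
  }
}
```

With the invariant asserted up front, the bounds check inside the loop becomes unnecessary, which is the simplification the comment suggests.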

3. total_bytes() excludes tensor_buffer_bytes field (WebGPUGraph.h:44-46)

size_t total_bytes() const {
    return shared_buffer_bytes + unshared_tensor_buffer_bytes +
        staging_buffer_bytes + uniform_buffer_bytes;
}

This is correct (it equals tensor_buffer_bytes + staging + uniform), but it's mildly confusing that total_bytes() doesn't reference the tensor_buffer_bytes field directly. Rewriting as return tensor_buffer_bytes + staging_buffer_bytes + uniform_buffer_bytes; would make the semantics clearer, since tensor_buffer_bytes is already computed as shared + unshared in memory_stats().
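A sketch of the suggested rewrite, assuming the field names quoted in this comment. `MemoryStatsSketch` is a hypothetical stand-in for the PR's stats struct, and the assertion encodes the stated relationship that `tensor_buffer_bytes` is the sum of the shared and unshared parts.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the memory stats struct discussed above.
struct MemoryStatsSketch {
  std::size_t shared_buffer_bytes = 0;
  std::size_t unshared_tensor_buffer_bytes = 0;
  std::size_t tensor_buffer_bytes = 0;  // expected: shared + unshared
  std::size_t staging_buffer_bytes = 0;
  std::size_t uniform_buffer_bytes = 0;

  std::size_t total_bytes() const {
    // Guard the relationship the simplified sum relies on.
    assert(
        tensor_buffer_bytes ==
        shared_buffer_bytes + unshared_tensor_buffer_bytes);
    return tensor_buffer_bytes + staging_buffer_bytes + uniform_buffer_bytes;
  }
};
```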

Minor Suggestions

4. Test assertion could verify specific aliasing count (test_webgpu_native.cpp:143)

The test asserts stats.num_shared_objects > 0 — for a 5-op chained add with 2 inputs + 1 output + intermediates, you could assert a more specific expected value (e.g., that num_shared_objects equals a known count) to catch regressions where aliasing becomes less effective. Though I recognize this is fragile if the partitioner changes its allocation strategy.

5. export_chained_add_model not invocable from CLI (test_add.py)

The export functions are defined but only callable programmatically or via the test script's -c block. Not a blocker, just noting that adding a --export-chained CLI arg would be nice for manual testing.

What looks good

The two-pass approach (collect max sizes in pass 1, allocate shared buffers in pass 2) is clean and avoids over-allocation.
The destructor properly separates ownership: shared buffers are released once via shared_buffers_, unshared per-tensor buffers are released individually.
The memory stats struct gives good observability into the aliasing savings.
The chained-add test validates both correctness and memory savings, which is the right thing to test here.
Replacing zero-size buffer fallbacks with asserts (ET_CHECK_MSG) is a good cleanup — zero-size tensors would indicate a bug upstream.
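The two-pass sizing mentioned in the first point can be sketched on plain sizes, with no GPU calls. This is an illustration under assumptions, not the PR's code: `TensorInfo`, `plan_shared_sizes`, `aliased_bytes`, and `dedicated_bytes` are hypothetical names, and a negative `mem_obj_id` is taken to mean "dedicated buffer".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-tensor record: byte size plus planner-assigned id.
struct TensorInfo {
  std::size_t nbytes;
  int mem_obj_id;  // negative means the tensor gets a dedicated buffer
};

// Pass 1: each shared object is sized to its largest user.
inline std::vector<std::size_t> plan_shared_sizes(
    const std::vector<TensorInfo>& tensors) {
  std::vector<std::size_t> sizes;
  for (const TensorInfo& t : tensors) {
    if (t.mem_obj_id < 0) {
      continue;
    }
    const std::size_t id = static_cast<std::size_t>(t.mem_obj_id);
    if (id >= sizes.size()) {
      sizes.resize(id + 1, 0);
    }
    sizes[id] = std::max(sizes[id], t.nbytes);
  }
  return sizes;
}

// Pass 2 would create one buffer per planned size; here we only total
// the bytes that such an allocation would use.
inline std::size_t aliased_bytes(const std::vector<TensorInfo>& tensors) {
  std::size_t total = 0;
  for (std::size_t size : plan_shared_sizes(tensors)) {
    total += size;
  }
  for (const TensorInfo& t : tensors) {
    if (t.mem_obj_id < 0) {
      total += t.nbytes;
    }
  }
  return total;
}

// Baseline for comparison: one dedicated buffer per tensor.
inline std::size_t dedicated_bytes(const std::vector<TensorInfo>& tensors) {
  std::size_t total = 0;
  for (const TensorInfo& t : tensors) {
    total += t.nbytes;
  }
  return total;
}
```

Comparing `aliased_bytes` against `dedicated_bytes` for the same tensor list gives the kind of savings figure the memory stats struct is meant to expose.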

Verdict: LGTM with minor suggestions. The core aliasing logic is correct and well-tested. The global stats accessor is the only thing I'd recommend addressing before merge (even if just renaming to clarify it's test-only).

Copilot

Pull request overview

This PR adds WebGPU runtime support for tensor-buffer memory aliasing by reusing ETVK/Vulkan mem_obj_id so intermediate tensors can share underlying GPU buffers, and extends the WebGPU native test flow to validate the aliasing behavior and report memory stats.

Changes:

Implement shared WGPUBuffer allocation/assignment in WebGPUGraph based on mem_obj_id, and extend memory stats to account for shared vs unshared tensor bytes.
Add a test-only mechanism to retrieve the last graph’s memory stats and a new native test that validates aliasing + memory savings using a chained-add model.
Update WebGPU test scripts and Python export utilities to generate and run the chained-add model.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File	Description
backends/webgpu/runtime/WebGPUGraph.h	Extends memory stats and adds state for `mem_obj_id`-based shared buffers.
backends/webgpu/runtime/WebGPUGraph.cpp	Allocates shared buffers by `mem_obj_id`, adjusts destruction logic, and updates memory stats accounting.
backends/webgpu/runtime/WebGPUBackend.h	Declares a test-only accessor for last graph memory stats.
backends/webgpu/runtime/WebGPUBackend.cpp	Stores last graph memory stats at init time for tests to query.
backends/webgpu/test/test_webgpu_native.cpp	Adds a chained-add native test that checks correctness and confirms aliasing memory savings.
backends/webgpu/test/test_build_webgpu.sh	Exports both simple and chained models and runs the native test with both paths.
backends/webgpu/test/ops/add/test_add.py	Extends chained-add model and adds an export helper for the chained model.

Comments suppressed due to low confidence (1)

backends/webgpu/runtime/WebGPUGraph.cpp:230

Output staging buffer creation now ET_CHECK_MSGs on tensors_[oid].nbytes > 0, which will abort for valid models that produce empty outputs. The previous code handled this by allocating a small non-zero staging buffer while still copying 0 bytes. Consider restoring that behavior (allocate at least 4 bytes, but allow 0-byte outputs) to avoid hard process termination.

      // Create staging buffer for output readback
      WGPUBufferDescriptor staging_desc = {};
      ET_CHECK_MSG(tensors_[oid].nbytes > 0, "Output tensor has zero bytes");
      staging_desc.size = tensors_[oid].nbytes;
      staging_desc.usage = WGPUBufferUsage_MapRead | WGPUBufferUsage_CopyDst;
      staging_desc.mappedAtCreation = false;
      output_staging_buffers_.push_back(
          wgpuDeviceCreateBuffer(device_, &staging_desc));
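The behavior this comment asks to restore (allocate at least 4 bytes, but copy 0 bytes for empty outputs) might look roughly like the sketch below. The helper names `buffer_alloc_size` and `copy_tensor_bytes` are hypothetical, and `memcpy` stands in for the GPU copy so the example runs on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Size to request when creating a buffer: never zero, because buffer
// creation needs a positive size. The 4-byte minimum follows the
// suggestion in the comment above.
inline std::size_t buffer_alloc_size(std::size_t nbytes) {
  constexpr std::size_t kMinBufferBytes = 4;
  return nbytes > 0 ? nbytes : kMinBufferBytes;
}

// Copies nbytes and reports how many bytes were copied. A zero-byte
// tensor is treated as a no-op rather than an error.
inline std::size_t copy_tensor_bytes(
    void* dst, const void* src, std::size_t nbytes) {
  if (nbytes == 0) {
    return 0;
  }
  assert(dst != nullptr && src != nullptr);
  std::memcpy(dst, src, nbytes);
  return nbytes;
}
```

Keeping the allocation size and the copy size separate is what lets an empty output pass through without a hard abort.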

SS-JIA · 2026-05-11T17:55:43Z

+        // Constants always get dedicated buffers regardless of mem_obj_id
+        if (constant_id >= 0 || mem_obj_id < 0) {
+          tensor_mem_obj_ids_[i] = -1;
+          WGPUBufferDescriptor buf_desc = {};
+          ET_CHECK_MSG(tensor.nbytes > 0, "Tensor has zero bytes");
+          buf_desc.size = tensor.nbytes;
+          buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst |
+              WGPUBufferUsage_CopySrc;
+          buf_desc.mappedAtCreation = false;
+          tensor.buffer = wgpuDeviceCreateBuffer(device_, &buf_desc);


Fair point from AI reviewer, I think.

SS-JIA · 2026-05-11T17:54:55Z

+  // Allocate shared buffers and assign to tensors
+  shared_buffers_.resize(shared_buffer_sizes_.size(), nullptr);
+  for (size_t id = 0; id < shared_buffer_sizes_.size(); id++) {
+    WGPUBufferDescriptor buf_desc = {};
+    ET_CHECK_MSG(shared_buffer_sizes_[id] > 0, "Shared buffer has zero bytes");
+    buf_desc.size = shared_buffer_sizes_[id];
+    buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst |
+        WGPUBufferUsage_CopySrc;
+    buf_desc.mappedAtCreation = false;
+    shared_buffers_[id] = wgpuDeviceCreateBuffer(device_, &buf_desc);


Similar/related to the above comment.

SS-JIA · 2026-05-11T18:01:08Z


+// Test-only: returns memory stats from the most recently initialized graph.
+// Not thread-safe; only valid when a single graph is loaded at a time.
+WebGPUMemoryStats get_last_memory_stats();


Agree with this comment here, having the static global s_last_memory_stats_for_testing and exposing get_last_memory_stats is pretty awkward.

I would recommend having the test_chained_add_memory() construct a WebGPUGraph directly, and query the memory stats after construction, as opposed to loading a *.pte file which forces you to use this pattern. Would also make it so that we don't have to do

s_last_memory_stats_for_testing = graph->memory_stats();

in the backend init() (not a big deal but just a nit).

If you really want to keep it as-is, then I would gate with an #ifdef like the comment suggests.

SS-JIA

Overall LGTM, just want to surface some valid comments from AI Reviewer.

The export pipeline already runs a greedy memory planning pass that assigns mem_obj_id to tensors with non-overlapping lifetimes, but the WebGPU runtime was ignoring it and allocating a dedicated WGPUBuffer per tensor. Read mem_obj_id from the flatbuffer during graph build. Tensors sharing the same mem_obj_id now share a single WGPUBuffer sized to the largest user. Constants and tensors without a mem_obj_id still get dedicated buffers. Adds a chained-add native test (z=x+y; z=z+x; z=z+y) that verifies both correctness and that memory aliasing produces savings (~20% for this model). Co-authored with Claude.

Replace the silent `nbytes > 0 ? nbytes : 4` fallback pattern with ET_CHECK_MSG assertions. If a zero-byte tensor reaches buffer creation, we want to know immediately rather than silently creating a dummy 4-byte buffer that masks the issue. Co-authored with Claude.

Invert the condition to eliminate the empty if-body with a comment. Co-authored with Claude.

Export and run the chained-add memory aliasing test in test_build_webgpu.sh so it runs automatically instead of requiring a manual WEBGPU_TEST_CHAINED_MODEL env var. Co-authored with Claude.

Longer chain produces more intermediates, giving the memory planner more opportunity to alias buffers. Expected output: 3x + 3y. Co-authored with Claude.
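A CPU reference for the expected output can be sketched as follows. The exact op order of the 5-op chain is an assumption here: it continues the alternating pattern quoted earlier (z=x+y; z=z+x; z=z+y) with two more adds, which yields the stated 3x + 3y. `chained_add_reference` is a hypothetical helper, not part of the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed op order: z=x+y; z=z+x; z=z+y; z=z+x; z=z+y (five adds).
inline std::vector<float> chained_add_reference(
    const std::vector<float>& x, const std::vector<float>& y) {
  assert(x.size() == y.size());
  std::vector<float> z(x.size());
  for (std::size_t i = 0; i < x.size(); i++) {
    float acc = x[i] + y[i];  // op 1
    acc += x[i];              // op 2
    acc += y[i];              // op 3
    acc += x[i];              // op 4
    acc += y[i];              // op 5
    z[i] = acc;
  }
  return z;
}
```

A native test could compare the GPU result against this reference element by element; if an aliased buffer were overwritten while still live, the comparison would be expected to fail.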

Fix: if a constant tensor has mem_obj_id >= 0, force it to -1 so the dedicated buffer path and the destructor stay consistent. Previously the buffer would leak and get overwritten by the shared buffer pass. Also make the chained-add test actually fail when aliasing is absent instead of just printing informational messages. Co-authored with Claude.
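The rule described in this fix can be isolated into a small function. This is a sketch under assumptions: `effective_mem_obj_id` is a hypothetical name, and `constant_id >= 0` is taken to mean "this tensor is a constant", matching the condition in the diff quoted in the review.

```cpp
#include <cassert>

// Returns the id to record for a tensor. Constants and tensors without
// a planner-assigned id map to -1, so the allocation path and the
// destructor agree on which buffers are dedicated.
inline int effective_mem_obj_id(int constant_id, int mem_obj_id) {
  if (constant_id >= 0 || mem_obj_id < 0) {
    return -1;
  }
  return mem_obj_id;
}
```

Recording -1 for constants means the shared-buffer pass never sees them, which avoids overwriting (and leaking) the dedicated buffer.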

…tes() Rename the static to s_last_memory_stats_for_testing and document the test-only, single-graph, not-thread-safe intent in the header. Simplify total_bytes() to use tensor_buffer_bytes directly since it is already computed as shared + unshared in memory_stats(). Co-authored with Claude.

Empty tensors (nbytes == 0) are legitimate in exported graphs (e.g. from padding or dynamic shapes). Restore min buffer size of 4 bytes to satisfy WebGPU's size > 0 requirement, and treat zero-byte tensors as no-ops in copy_inputs/copy_outputs instead of aborting via ET_CHECK_MSG. Co-authored with Claude.

Drop get_last_memory_stats() and the static global from WebGPUBackend. The chained-add C++ test validates aliasing correctness implicitly: if buffer sharing corrupted data, the 5-op chain output (3x + 3y) would be wrong. Co-authored with Claude.

The .cpp gained OutputCopy, ExecuteConfig, shader/pipeline/bgl caches, and get_or_create_* methods from upstream changes that were not reflected in the header. Add the corresponding declarations and includes. Co-authored with Claude.

BinaryOp.cpp creates pipelines directly (not via pipeline_cache_), so the destructor must still release dispatch.pipeline. The cache cleanup handles cached pipelines; this handles uncached ones. Co-authored with Claude.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

  // Phase 1: Create all values
  const auto* values = graph->values();
  const int num_vals = values ? values->size() : 0;
  value_types_.resize(num_vals, ValueType::Null);
  tensors_.resize(num_vals);
+  tensor_mem_obj_ids_.resize(num_vals, -1);
  ints_.resize(num_vals, 0);
  doubles_.resize(num_vals, 0.0);
  bools_.resize(num_vals, false);


+struct ExecuteConfig {
+  size_t chunk_size = 0;
+  size_t initial_chunk_size = 0;
+};


+static bool test_chained_add(const std::string& model_path) {
+  printf("\n--- Test: chained add (1024x1024, 5 ops) ---\n");
+
+  Module module(model_path);
+  auto err = module.load_forward();
+  if (err != Error::Ok) {
+    printf("FAIL: could not load forward method (error %d)\n", (int)err);
+    return false;
+  }
+  printf("Model loaded: %s\n", model_path.c_str());
+


meta-cla Bot added the CLA Signed label May 5, 2026

digantdesai force-pushed the wgpu_memory_aliasing branch from a402f89 to 3666881 May 7, 2026 03:18

digantdesai marked this pull request as ready for review May 7, 2026 03:18

digantdesai requested review from SS-JIA and Copilot May 7, 2026 03:18

Copilot started reviewing on behalf of digantdesai May 7, 2026 03:19

Copilot AI reviewed May 7, 2026

SS-JIA requested changes May 11, 2026

digantdesai added 10 commits May 11, 2026 20:48

WebGPU: clean up empty if-branch in memory_stats()

e0eb5eb

WebGPU: add chained-add model to test script

6e91025

WebGPU: extend chained add test to 5 ops for better aliasing coverage

4038f85

digantdesai force-pushed the wgpu_memory_aliasing branch from 3666881 to 41f3c60 May 12, 2026 19:36

WebGPU: restore per-dispatch pipeline release to fix leak

53ac65d

Copilot AI review requested due to automatic review settings May 13, 2026 02:30

Copilot started reviewing on behalf of digantdesai May 13, 2026 02:31

Copilot AI reviewed May 13, 2026
