WebGPU: add memory aliasing for intermediate tensor buffers#19305
WebGPU: add memory aliasing for intermediate tensor buffers#19305digantdesai wants to merge 11 commits into
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19305
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 1 New Failure, 1 Unclassified FailureAs of commit 53ac65d with merge base 2dfe8e4 ( NEW FAILURE - The following job has failed:
UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@claude review this PR |
This PR needs a
|
|
Claude finished @digantdesai's task in 1m 37s —— View job Review: WebGPU memory aliasing for intermediate tensor buffers
Overall: Clean, well-structured implementation. The approach of using ETVK's Issues1. Global mutable state for memory stats ( static WebGPUMemoryStats s_last_memory_stats;This is a process-wide global that gets overwritten on every 2. Destructor correctness — shared buffer double-release guard ( for (size_t i = 0; i < tensors_.size(); i++) {
if (tensors_[i].buffer &&
(i >= tensor_mem_obj_ids_.size() || tensor_mem_obj_ids_[i] < 0)) {
wgpuBufferRelease(tensors_[i].buffer);
}
}The guard 3. size_t total_bytes() const {
return shared_buffer_bytes + unshared_tensor_buffer_bytes +
staging_buffer_bytes + uniform_buffer_bytes;
}This is correct (it equals Minor Suggestions4. Test assertion could verify specific aliasing count ( The test asserts 5. The export functions are defined but only callable programmatically or via the test script's What looks good
Verdict: LGTM with minor suggestions. The core aliasing logic is correct and well-tested. The global stats accessor is the only thing I'd recommend addressing before merge (even if just renaming to clarify it's test-only). |
a402f89 to
3666881
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds WebGPU runtime support for tensor-buffer memory aliasing by reusing ETVK/Vulkan mem_obj_id so intermediate tensors can share underlying GPU buffers, and extends the WebGPU native test flow to validate the aliasing behavior and report memory stats.
Changes:
- Implement shared
WGPUBufferallocation/assignment inWebGPUGraphbased onmem_obj_id, and extend memory stats to account for shared vs unshared tensor bytes. - Add a test-only mechanism to retrieve the last graph’s memory stats and a new native test that validates aliasing + memory savings using a chained-add model.
- Update WebGPU test scripts and Python export utilities to generate and run the chained-add model.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| backends/webgpu/runtime/WebGPUGraph.h | Extends memory stats and adds state for mem_obj_id-based shared buffers. |
| backends/webgpu/runtime/WebGPUGraph.cpp | Allocates shared buffers by mem_obj_id, adjusts destruction logic, and updates memory stats accounting. |
| backends/webgpu/runtime/WebGPUBackend.h | Declares a test-only accessor for last graph memory stats. |
| backends/webgpu/runtime/WebGPUBackend.cpp | Stores last graph memory stats at init time for tests to query. |
| backends/webgpu/test/test_webgpu_native.cpp | Adds a chained-add native test that checks correctness and confirms aliasing memory savings. |
| backends/webgpu/test/test_build_webgpu.sh | Exports both simple and chained models and runs the native test with both paths. |
| backends/webgpu/test/ops/add/test_add.py | Extends chained-add model and adds an export helper for the chained model. |
Comments suppressed due to low confidence (1)
backends/webgpu/runtime/WebGPUGraph.cpp:230
- Output staging buffer creation now ET_CHECK_MSGs on tensors_[oid].nbytes > 0, which will abort for valid models that produce empty outputs. The previous code handled this by allocating a small non-zero staging buffer while still copying 0 bytes. Consider restoring that behavior (allocate at least 4 bytes, but allow 0-byte outputs) to avoid hard process termination.
// Create staging buffer for output readback
WGPUBufferDescriptor staging_desc = {};
ET_CHECK_MSG(tensors_[oid].nbytes > 0, "Output tensor has zero bytes");
staging_desc.size = tensors_[oid].nbytes;
staging_desc.usage = WGPUBufferUsage_MapRead | WGPUBufferUsage_CopyDst;
staging_desc.mappedAtCreation = false;
output_staging_buffers_.push_back(
wgpuDeviceCreateBuffer(device_, &staging_desc));
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Constants always get dedicated buffers regardless of mem_obj_id | ||
| if (constant_id >= 0 || mem_obj_id < 0) { | ||
| tensor_mem_obj_ids_[i] = -1; | ||
| WGPUBufferDescriptor buf_desc = {}; | ||
| ET_CHECK_MSG(tensor.nbytes > 0, "Tensor has zero bytes"); | ||
| buf_desc.size = tensor.nbytes; | ||
| buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst | | ||
| WGPUBufferUsage_CopySrc; | ||
| buf_desc.mappedAtCreation = false; | ||
| tensor.buffer = wgpuDeviceCreateBuffer(device_, &buf_desc); |
There was a problem hiding this comment.
Fair point from AI reviewer, I think.
| // Allocate shared buffers and assign to tensors | ||
| shared_buffers_.resize(shared_buffer_sizes_.size(), nullptr); | ||
| for (size_t id = 0; id < shared_buffer_sizes_.size(); id++) { | ||
| WGPUBufferDescriptor buf_desc = {}; | ||
| ET_CHECK_MSG(shared_buffer_sizes_[id] > 0, "Shared buffer has zero bytes"); | ||
| buf_desc.size = shared_buffer_sizes_[id]; | ||
| buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst | | ||
| WGPUBufferUsage_CopySrc; | ||
| buf_desc.mappedAtCreation = false; | ||
| shared_buffers_[id] = wgpuDeviceCreateBuffer(device_, &buf_desc); |
There was a problem hiding this comment.
Similar/related to the above comment.
|
|
||
| // Test-only: returns memory stats from the most recently initialized graph. | ||
| // Not thread-safe; only valid when a single graph is loaded at a time. | ||
| WebGPUMemoryStats get_last_memory_stats(); |
There was a problem hiding this comment.
Agree with this comment here, having the static global s_last_memory_stats_for_testing and exposing get_last_memory_stats is pretty awkard.
I would recommend having the test_chained_add_memory() construct a WebGPUGraph directly, and query the memory stats after construction, as opposed to loading a *.pte file which forces you to use this pattern. Would also make it so that we don't have to do
s_last_memory_stats_for_testing = graph->memory_stats();
in the backend init() (not a big deal but just a nit).
If you really want to keep it as-is, then I would gate with an #ifdef like the comment suggests.
SS-JIA
left a comment
There was a problem hiding this comment.
Overall LGTM, just want to surface some valid comments from AI Reviewer.
| // Allocate shared buffers and assign to tensors | ||
| shared_buffers_.resize(shared_buffer_sizes_.size(), nullptr); | ||
| for (size_t id = 0; id < shared_buffer_sizes_.size(); id++) { | ||
| WGPUBufferDescriptor buf_desc = {}; | ||
| ET_CHECK_MSG(shared_buffer_sizes_[id] > 0, "Shared buffer has zero bytes"); | ||
| buf_desc.size = shared_buffer_sizes_[id]; | ||
| buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst | | ||
| WGPUBufferUsage_CopySrc; | ||
| buf_desc.mappedAtCreation = false; | ||
| shared_buffers_[id] = wgpuDeviceCreateBuffer(device_, &buf_desc); |
There was a problem hiding this comment.
Similar/related to the above comment.
| // Constants always get dedicated buffers regardless of mem_obj_id | ||
| if (constant_id >= 0 || mem_obj_id < 0) { | ||
| tensor_mem_obj_ids_[i] = -1; | ||
| WGPUBufferDescriptor buf_desc = {}; | ||
| ET_CHECK_MSG(tensor.nbytes > 0, "Tensor has zero bytes"); | ||
| buf_desc.size = tensor.nbytes; | ||
| buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst | | ||
| WGPUBufferUsage_CopySrc; | ||
| buf_desc.mappedAtCreation = false; | ||
| tensor.buffer = wgpuDeviceCreateBuffer(device_, &buf_desc); |
There was a problem hiding this comment.
Fair point from AI reviewer, I think.
|
|
||
| // Test-only: returns memory stats from the most recently initialized graph. | ||
| // Not thread-safe; only valid when a single graph is loaded at a time. | ||
| WebGPUMemoryStats get_last_memory_stats(); |
There was a problem hiding this comment.
Agree with this comment here, having the static global s_last_memory_stats_for_testing and exposing get_last_memory_stats is pretty awkard.
I would recommend having the test_chained_add_memory() construct a WebGPUGraph directly, and query the memory stats after construction, as opposed to loading a *.pte file which forces you to use this pattern. Would also make it so that we don't have to do
s_last_memory_stats_for_testing = graph->memory_stats();
in the backend init() (not a big deal but just a nit).
If you really want to keep it as-is, then I would gate with an #ifdef like the comment suggests.
The export pipeline already runs a greedy memory planning pass that assigns mem_obj_id to tensors with non-overlapping lifetimes, but the WebGPU runtime was ignoring it and allocating a dedicated WGPUBuffer per tensor. Read mem_obj_id from the flatbuffer during graph build. Tensors sharing the same mem_obj_id now share a single WGPUBuffer sized to the largest user. Constants and tensors without a mem_obj_id still get dedicated buffers. Adds a chained-add native test (z=x+y; z=z+x; z=z+y) that verifies both correctness and that memory aliasing produces savings (~20% for this model). Co-authored with Claude.
Replace the silent `nbytes > 0 ? nbytes : 4` fallback pattern with ET_CHECK_MSG assertions. If a zero-byte tensor reaches buffer creation, we want to know immediately rather than silently creating a dummy 4-byte buffer that masks the issue. Co-authored with Claude.
Invert the condition to eliminate the empty if-body with a comment. Co-authored with Claude.
Export and run the chained-add memory aliasing test in test_build_webgpu.sh so it runs automatically instead of requiring a manual WEBGPU_TEST_CHAINED_MODEL env var. Co-authored with Claude.
Longer chain produces more intermediates, giving the memory planner more opportunity to alias buffers. Expected output: 3x + 3y. Co-authored with Claude.
Fix: if a constant tensor has mem_obj_id >= 0, force it to -1 so the dedicated buffer path and the destructor stay consistent. Previously the buffer would leak and get overwritten by the shared buffer pass. Also make the chained-add test actually fail when aliasing is absent instead of just printing informational messages. Co-authored with Claude.
…tes() Rename the static to s_last_memory_stats_for_testing and document the test-only, single-graph, not-thread-safe intent in the header. Simplify total_bytes() to use tensor_buffer_bytes directly since it is already computed as shared + unshared in memory_stats(). Co-authored with Claude.
Empty tensors (nbytes == 0) are legitimate in exported graphs (e.g. from padding or dynamic shapes). Restore min buffer size of 4 bytes to satisfy WebGPU's size > 0 requirement, and treat zero-byte tensors as no-ops in copy_inputs/copy_outputs instead of aborting via ET_CHECK_MSG. Co-authored with Claude.
Drop get_last_memory_stats() and the static global from WebGPUBackend. The chained-add C++ test validates aliasing correctness implicitly: if buffer sharing corrupted data, the 5-op chain output (3x + 3y) would be wrong. Co-authored with Claude.
The .cpp gained OutputCopy, ExecuteConfig, shader/pipeline/bgl caches, and get_or_create_* methods from upstream changes that were not reflected in the header. Add the corresponding declarations and includes. Co-authored with Claude.
3666881 to
41f3c60
Compare
BinaryOp.cpp creates pipelines directly (not via pipeline_cache_), so the destructor must still release dispatch.pipeline. The cache cleanup handles cached pipelines; this handles uncached ones. Co-authored with Claude.
| // Phase 1: Create all values | ||
| const auto* values = graph->values(); | ||
| const int num_vals = values ? values->size() : 0; | ||
| value_types_.resize(num_vals, ValueType::Null); | ||
| tensors_.resize(num_vals); | ||
| tensor_mem_obj_ids_.resize(num_vals, -1); | ||
| ints_.resize(num_vals, 0); | ||
| doubles_.resize(num_vals, 0.0); | ||
| bools_.resize(num_vals, false); |
| struct ExecuteConfig { | ||
| size_t chunk_size = 0; | ||
| size_t initial_chunk_size = 0; | ||
| }; |
| static bool test_chained_add(const std::string& model_path) { | ||
| printf("\n--- Test: chained add (1024x1024, 5 ops) ---\n"); | ||
|
|
||
| Module module(model_path); | ||
| auto err = module.load_forward(); | ||
| if (err != Error::Ok) { | ||
| printf("FAIL: could not load forward method (error %d)\n", (int)err); | ||
| return false; | ||
| } | ||
| printf("Model loaded: %s\n", model_path.c_str()); | ||
|
|
USE ETVK's mem_obj_id for the WebGPU runtime to implement memory aliasing