perf(executor): copy fetched instruction directly in MinimalExecutor#2776
Open
wstran wants to merge 1 commit into
Open
perf(executor): copy fetched instruction directly in MinimalExecutor#2776wstran wants to merge 1 commit into
wstran wants to merge 1 commit into
Conversation
`MinimalExecutor<SupervisorMode>::execute_instruction` in the portable backend cloned the program `Arc` once per RISC-V cycle just to satisfy the borrow checker, costing two atomic refcount operations on every instruction. The sibling `tracing` / `estimating` / `splicing` executors already dereference the `Option<&Instruction>` returned by `fetch` and copy the 32-byte `Instruction` (it is `#[derive(Clone, Copy)]` with `#[repr(C)]`). The `UserMode` execute_instruction in this same file already uses an owned `Instruction` returned by `Self::fetch()`. Apply the same pattern in the SupervisorMode arm of the portable backend so the inner loop carries a local `Instruction` value instead of holding an `Arc<Program>` clone alive. Each `execute_xxx` callee already takes `&Instruction`; the caller now passes `&instruction` to keep their signatures unchanged. Measured on Apple M-class aarch64, Docker rust:1.91-bookworm, fibonacci program with n=200_000 (2,601,619 RISC-V cycles per run). 5 measured runs after 2 warmups, 3 independent benchmark invocations: Before: 14.61 / 14.66 / 14.62 ms => 178.07 / 177.47 / 178.00 MHz After: 11.14 / 10.77 / 10.38 ms => 233.49 / 241.66 / 250.56 MHz Mean: 14.63 ms -> 10.76 ms (-26.4%) Mean: 177.85 MHz -> 241.90 MHz (+36.0%) Cycle count is identical before and after (2,601,619), so execution semantics are unchanged. `cargo +nightly fmt --check -p sp1-core-executor` passes. `cargo check --release -p sp1-core-executor` passes.
8b9171e to
d40f7d4
Compare
Author
|
Retargeted from |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
MinimalExecutor::execute_instructionin the portable backend (crates/core/executor/src/minimal/arch/portable/mod.rs) clonedself.program(Arc<Program>) on every RISC-V cycle just to satisfy the borrow checker:That is two atomic refcount operations per instruction (one increment on
clone, one decrement when the localprogramArc is dropped at the end of the function). For a program with a few million RISC-V cycles this is a measurable cost on aarch64.The sibling executors (
tracing.rs,estimating.rs,splicing.rs) already avoid this by dereferencing theOption<&Instruction>returned byfetchand copying the 32-byteInstructionvalue (it is#[derive(Clone, Copy)]with#[repr(C)]):The portable backend is the only one that did not do this.
Change
Replace the per-cycle
Arc::clonewith a direct copy of the fetchedInstruction:Each
execute_xxxcallee already takes&Instruction, so the callers now pass&instruction(the local copy). No call site outside this function changes.I kept the safe
*…unwrap()form rather than theunsafe { *…unwrap_unchecked() }form used in the sibling executors, since this PR is focused on removing the per-cycle atomic op without changing panic behavior. Switching to the unchecked form is a separate decision.Benchmark
Apple M-class CPU, Docker
rust:1.91-bookworm(aarch64),cargo build --releasewith default profile +lto = \"thin\"+codegen-units = 1. Workload: fibonacci program fromexamples/fibonacci/programwith inputn = 200_000, run end-to-end throughMinimalExecutor::execute_chunkuntil completion.Each invocation: 2 warmup runs (discarded) + 5 measured runs, average reported. Three independent invocations of the benchmark:
Δ runtime: −26.4 %
Δ throughput: +36.0 %
Cycle count is
2,601,619in every run, before and after — execution semantics are preserved.Verification
cargo +nightly fmt --check -p sp1-core-executor— passescargo test --release -p sp1-core-executor— 16 unit tests pass, includingminimal::tests::test_chunk_stops_correctlyNotes
crates/core/executor/src/minimal/arch/x86_64/was not touched.&Instructionborrow fromself.program.fetch(...)lives only for the duration of the dereferencing expression; onceinstructionis the owned local,&mut selfis free to be reborrowed by theexecute_xxxcalls.