perf(executor): copy fetched instruction directly in MinimalExecutor #2776

Open
wstran wants to merge 1 commit into succinctlabs:main from wstran:perf/portable-executor-skip-arc-clone

@wstran wstran commented May 6, 2026

Motivation

MinimalExecutor::execute_instruction in the portable backend (crates/core/executor/src/minimal/arch/portable/mod.rs) cloned self.program (Arc<Program>) on every RISC-V cycle just to satisfy the borrow checker:

fn execute_instruction(&mut self) -> bool {
    let program = self.program.clone();
    let instruction = program.fetch(self.pc).unwrap();
    // ...
}

That is two atomic refcount operations per instruction (one increment on clone, one decrement when the local program Arc is dropped at the end of the function). For a program with a few million RISC-V cycles this is a measurable cost on aarch64.
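The refcount traffic described above can be observed with `Arc::strong_count`. This is a minimal sketch, not the sp1 code: `Program` here is a hypothetical stand-in with a single field, and `refcounts_around_clone` is a name introduced only for illustration.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the executor's program type.
struct Program {
    instructions: Vec<u32>,
}

/// Returns the strong count before the clone, while the clone is alive,
/// and after the clone has been dropped.
fn refcounts_around_clone() -> (usize, usize, usize) {
    let program = Arc::new(Program { instructions: vec![0x13; 4] });
    let before = Arc::strong_count(&program);
    let during = {
        // Arc::clone performs an atomic increment of the strong count.
        let per_cycle = Arc::clone(&program);
        let _ = per_cycle.instructions.len();
        Arc::strong_count(&program)
    }; // `per_cycle` is dropped here: an atomic decrement.
    let after = Arc::strong_count(&program);
    (before, during, after)
}

fn main() {
    let (before, during, after) = refcounts_around_clone();
    println!("strong count: {before} -> {during} -> {after}");
}
```

In an interpreter loop, this increment/decrement pair would run once per executed instruction.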

The sibling executors (tracing.rs, estimating.rs, splicing.rs) already avoid this by dereferencing the Option<&Instruction> returned by fetch and copying the 32-byte Instruction value (it is #[derive(Clone, Copy)] with #[repr(C)]):

let instruction = unsafe { *instruction.unwrap_unchecked() };

The portable backend is the only one that did not do this.

Change

Replace the per-cycle Arc::clone with a direct copy of the fetched Instruction:

let instruction = *self.program.fetch(self.pc).unwrap();

Each execute_xxx callee already takes &Instruction, so the callers now pass &instruction (the local copy). No call site outside this function changes.
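As a rough illustration of the before and after shapes, the sketch below uses simplified, hypothetical stand-ins (`Instruction` with a single `imm` field, a `Program` with a `pc_base`, and one `execute_add` callee); the real sp1 types and `fetch` implementation are assumed to differ in detail.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins; the real types carry more fields.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
struct Instruction {
    imm: u64,
}

struct Program {
    instructions: Vec<Instruction>,
    pc_base: u64,
}

impl Program {
    // Assumed shape of `fetch`: maps a pc to an optional instruction reference.
    fn fetch(&self, pc: u64) -> Option<&Instruction> {
        let idx = pc.checked_sub(self.pc_base)? / 4;
        self.instructions.get(idx as usize)
    }
}

struct Executor {
    program: Arc<Program>,
    pc: u64,
    acc: u64,
}

impl Executor {
    fn execute_add(&mut self, instruction: &Instruction) {
        self.acc = self.acc.wrapping_add(instruction.imm);
        self.pc += 4;
    }

    // Before: clone the Arc so `instruction` borrows from the local, not `self`.
    fn step_with_arc_clone(&mut self) {
        let program = self.program.clone();
        let instruction = program.fetch(self.pc).unwrap();
        self.execute_add(instruction);
    }

    // After: copy the instruction out; no refcount traffic.
    fn step_with_copy(&mut self) {
        let instruction = *self.program.fetch(self.pc).unwrap();
        self.execute_add(&instruction);
    }
}

/// Runs three steps and returns the final (pc, acc).
fn run(use_copy: bool) -> (u64, u64) {
    let program = Arc::new(Program {
        instructions: (1..=3).map(|imm| Instruction { imm }).collect(),
        pc_base: 0x1000,
    });
    let mut ex = Executor { program, pc: 0x1000, acc: 0 };
    for _ in 0..3 {
        if use_copy { ex.step_with_copy() } else { ex.step_with_arc_clone() }
    }
    (ex.pc, ex.acc)
}

fn main() {
    assert_eq!(run(false), run(true));
    println!("{:?}", run(true));
}
```

Both variants reach the same final state in this toy model; only the per-step `Arc` clone differs.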

I kept the safe *…unwrap() form rather than the unsafe { *…unwrap_unchecked() } form used in the sibling executors, since this PR is focused on removing the per-cycle atomic op without changing panic behavior. Switching to the unchecked form is a separate decision.

Benchmark

Apple M-class CPU, Docker rust:1.91-bookworm (aarch64), cargo build --release with default profile + lto = "thin" + codegen-units = 1. Workload: fibonacci program from examples/fibonacci/program with input n = 200_000, run end-to-end through MinimalExecutor::execute_chunk until completion.

Each invocation: 2 warmup runs (discarded) + 5 measured runs, average reported. Three independent invocations of the benchmark:

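| Run  | Before (ms) | Before (MHz) | After (ms) | After (MHz) |
|------|-------------|--------------|------------|-------------|
| 1    | 14.61       | 178.07       | 11.14      | 233.49      |
| 2    | 14.66       | 177.47       | 10.77      | 241.66      |
| 3    | 14.62       | 178.00       | 10.38      | 250.56      |
| Mean | 14.63       | 177.85       | 10.76      | 241.90      |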

Δ runtime: −26.4 %
Δ throughput: +36.0 %
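The two deltas follow from the per-run figures in the table. A small sketch of the arithmetic, assuming the deltas are computed from the unrounded means (`mean` and `pct_change` are helper names introduced here):

```rust
/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Relative change from `before` to `after`, in percent.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // Per-run values copied from the benchmark table.
    let before_ms = [14.61, 14.66, 14.62];
    let after_ms = [11.14, 10.77, 10.38];
    let before_mhz = [178.07, 177.47, 178.00];
    let after_mhz = [233.49, 241.66, 250.56];

    // prints "runtime: -26.4 %"
    println!("runtime: {:+.1} %", pct_change(mean(&before_ms), mean(&after_ms)));
    // prints "throughput: +36.0 %"
    println!("throughput: {:+.1} %", pct_change(mean(&before_mhz), mean(&after_mhz)));
}
```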

Cycle count is 2,601,619 in every run, before and after — execution semantics are preserved.

Verification

  • cargo +nightly fmt --check -p sp1-core-executor — passes
  • cargo test --release -p sp1-core-executor — 16 unit tests pass, including minimal::tests::test_chunk_stops_correctly

Notes

  • The change is in the portable (interpreter) backend only. The native x86_64 backend at crates/core/executor/src/minimal/arch/x86_64/ was not touched.
  • The borrow checker accepts the deref-and-copy form because the &Instruction borrow from self.program.fetch(...) lives only for the duration of the dereferencing expression; once instruction is the owned local, &mut self is free to be reborrowed by the execute_xxx calls.
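The borrow-scope argument in the last note can be made explicit by spelling out where the shared borrow begins and ends. The sketch below again uses hypothetical, simplified types; the commented-out lines show the form that would be rejected when the reference is held across the `&mut self` call.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins for illustration only.
#[derive(Clone, Copy)]
struct Instruction {
    imm: u64,
}

struct Program {
    instructions: Vec<Instruction>,
}

impl Program {
    fn fetch(&self, pc: usize) -> Option<&Instruction> {
        self.instructions.get(pc)
    }
}

struct Executor {
    program: Arc<Program>,
    pc: usize,
    acc: u64,
}

impl Executor {
    fn execute_add(&mut self, instruction: &Instruction) {
        self.acc += instruction.imm;
        self.pc += 1;
    }

    fn execute_instruction(&mut self) {
        // Rejected by the borrow checker (E0502): `fetched` would keep
        // `self.program` borrowed across the `&mut self` call.
        // let fetched = self.program.fetch(self.pc).unwrap();
        // self.execute_add(fetched);

        let instruction: Instruction = {
            // Shared borrow of `self.program` starts here...
            let fetched: &Instruction = self.program.fetch(self.pc).unwrap();
            *fetched // ...and ends once the value has been copied out.
        };
        // `self` is no longer borrowed, so the mutable reborrow is accepted.
        self.execute_add(&instruction);
    }
}

/// Executes two instructions and returns the final (pc, acc).
fn run_two_steps() -> (usize, u64) {
    let program = Arc::new(Program {
        instructions: vec![Instruction { imm: 5 }, Instruction { imm: 7 }],
    });
    let mut ex = Executor { program, pc: 0, acc: 0 };
    ex.execute_instruction();
    ex.execute_instruction();
    (ex.pc, ex.acc)
}

fn main() {
    println!("{:?}", run_two_steps());
}
```

The one-line form `*self.program.fetch(self.pc).unwrap()` is the same thing with the block collapsed into a single expression.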

@wstran wstran changed the base branch from dev to main May 11, 2026 10:37
`MinimalExecutor<SupervisorMode>::execute_instruction` in the portable
backend cloned the program `Arc` once per RISC-V cycle just to satisfy
the borrow checker, costing two atomic refcount operations on every
instruction.

The sibling `tracing` / `estimating` / `splicing` executors already
dereference the `Option<&Instruction>` returned by `fetch` and copy the
32-byte `Instruction` (it is `#[derive(Clone, Copy)]` with `#[repr(C)]`).
The `UserMode` execute_instruction in this same file already uses an
owned `Instruction` returned by `Self::fetch()`.

Apply the same pattern in the SupervisorMode arm of the portable
backend so the inner loop carries a local `Instruction` value instead
of holding an `Arc<Program>` clone alive. Each `execute_xxx` callee
already takes `&Instruction`; the caller now passes `&instruction` to
keep their signatures unchanged.

Measured on Apple M-class aarch64, Docker rust:1.91-bookworm, fibonacci
program with n=200_000 (2,601,619 RISC-V cycles per run). 5 measured
runs after 2 warmups, 3 independent benchmark invocations:

  Before: 14.61 / 14.66 / 14.62 ms => 178.07 / 177.47 / 178.00 MHz
   After: 11.14 / 10.77 / 10.38 ms => 233.49 / 241.66 / 250.56 MHz

  Mean: 14.63 ms -> 10.76 ms (-26.4%)
  Mean: 177.85 MHz -> 241.90 MHz (+36.0%)

Cycle count is identical before and after (2,601,619), so execution
semantics are unchanged.

`cargo +nightly fmt --check -p sp1-core-executor` passes.
`cargo check --release -p sp1-core-executor` passes.
@wstran wstran force-pushed the perf/portable-executor-skip-arc-clone branch from 8b9171e to d40f7d4 Compare May 11, 2026 10:42

wstran commented May 11, 2026

Retargeted from dev to main and rebased the branch onto the current main HEAD. The dev and main branches have diverged histories and dev has been frozen for ~9 days, so the PR was inert there. The diff is functionally identical: the same per-cycle Arc::clone removal in MinimalExecutor<SupervisorMode>::execute_instruction (post-mprotect refactor). All other fields stay the same.
