perf(executor): copy fetched instruction directly in MinimalExecutor by wstran · Pull Request #2776 · succinctlabs/sp1

wstran · 2026-05-06T18:32:25Z

Motivation

MinimalExecutor::execute_instruction in the portable backend (crates/core/executor/src/minimal/arch/portable/mod.rs) cloned self.program (Arc<Program>) on every RISC-V cycle just to satisfy the borrow checker:

fn execute_instruction(&mut self) -> bool {
    let program = self.program.clone();
    let instruction = program.fetch(self.pc).unwrap();
    // ...
}

That is two atomic refcount operations per instruction (one increment on clone, one decrement when the local program Arc is dropped at the end of the function). For a program with a few million RISC-V cycles this is a measurable cost on aarch64.
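The refcount traffic can be observed directly with `Arc::strong_count`. The following is a minimal sketch, not SP1 code: `Program` and `count_during_clone` are hypothetical stand-ins used only to show the increment on clone and the decrement on drop.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the executor's program; not the SP1 type.
struct Program {
    instructions: Vec<u32>,
}

/// Clones the Arc the way the old per-cycle code did and reports the
/// strong count seen while the clone is alive.
fn count_during_clone(program: &Arc<Program>) -> usize {
    let local = Arc::clone(program); // atomic increment
    let count = Arc::strong_count(&local);
    count
    // `local` is dropped here: atomic decrement
}

fn main() {
    let program = Arc::new(Program { instructions: vec![0x13; 4] });
    assert_eq!(Arc::strong_count(&program), 1);
    assert_eq!(count_during_clone(&program), 2);
    // The count is back to 1 once the per-call clone has been dropped.
    assert_eq!(Arc::strong_count(&program), 1);
    println!("instructions: {}", program.instructions.len());
}
```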

The sibling executors (tracing.rs, estimating.rs, splicing.rs) already avoid this by dereferencing the Option<&Instruction> returned by fetch and copying the 32-byte Instruction value (it is #[derive(Clone, Copy)] with #[repr(C)]):

let instruction = unsafe { *instruction.unwrap_unchecked() };

The portable backend is the only one that did not do this.

Change

Replace the per-cycle Arc::clone with a direct copy of the fetched Instruction:

let instruction = *self.program.fetch(self.pc).unwrap();

Each execute_xxx callee already takes &Instruction, so the callers now pass &instruction (the local copy). No call site outside this function changes.

I kept the safe *…unwrap() form rather than the unsafe { *…unwrap_unchecked() } form used in the sibling executors, since this PR is focused on removing the per-cycle atomic op without changing panic behavior. Switching to the unchecked form is a separate decision.
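For comparison, the two forms can be sketched side by side. This is an illustration under assumptions: the `Instruction` field layout and the free functions `fetch`, `fetch_checked`, and `fetch_unchecked` are hypothetical, and only the `*…unwrap()` and `*…unwrap_unchecked()` expressions mirror the PR text. Both return the same copy for a valid `pc`; they differ only when `fetch` returns `None`, where the safe form panics and the unchecked form is undefined behavior.

```rust
/// Hypothetical 32-byte instruction; the field layout is an assumption.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
struct Instruction {
    opcode: u64,
    op_a: u64,
    op_b: u64,
    op_c: u64,
}

/// Hypothetical fetch: returns a reference into the program, if any.
fn fetch(program: &[Instruction], pc: usize) -> Option<&Instruction> {
    program.get(pc)
}

/// Safe form: panics if `pc` is out of range.
fn fetch_checked(program: &[Instruction], pc: usize) -> Instruction {
    *fetch(program, pc).unwrap()
}

/// Unchecked form.
///
/// # Safety
/// `pc` must be in bounds; otherwise this is undefined behavior.
unsafe fn fetch_unchecked(program: &[Instruction], pc: usize) -> Instruction {
    unsafe { *fetch(program, pc).unwrap_unchecked() }
}

fn main() {
    let program = vec![Instruction { opcode: 1, op_a: 2, op_b: 3, op_c: 4 }];
    let safe = fetch_checked(&program, 0);
    // SAFETY: index 0 is in bounds for a one-element program.
    let fast = unsafe { fetch_unchecked(&program, 0) };
    assert_eq!(safe, fast);
    assert_eq!(std::mem::size_of::<Instruction>(), 32);
}
```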

Benchmark

Apple M-class CPU, Docker rust:1.91-bookworm (aarch64), cargo build --release with default profile + lto = "thin" + codegen-units = 1. Workload: fibonacci program from examples/fibonacci/program with input n = 200_000, run end-to-end through MinimalExecutor::execute_chunk until completion.

Each invocation: 2 warmup runs (discarded) + 5 measured runs, average reported. Three independent invocations of the benchmark:

Run	Before (ms)	Before (MHz)	After (ms)	After (MHz)
1	14.61	178.07	11.14	233.49
2	14.66	177.47	10.77	241.66
3	14.62	178.00	10.38	250.56
Mean	14.63	177.85	10.76	241.90
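The measurement protocol (warmup runs discarded, measured runs averaged) could be sketched roughly as below. This is not the harness used for the numbers above; `bench` and the dummy workload are hypothetical, and a real run would call into the executor instead.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `workload` `warmups` times without timing, then `measured` times
/// with timing, and returns the average measured duration.
/// `measured` must be non-zero (Duration division by zero panics).
fn bench<F: FnMut()>(mut workload: F, warmups: u32, measured: u32) -> Duration {
    for _ in 0..warmups {
        workload(); // discarded
    }
    let mut total = Duration::ZERO;
    for _ in 0..measured {
        let start = Instant::now();
        workload();
        total += start.elapsed();
    }
    total / measured
}

fn main() {
    // Dummy workload standing in for "execute the program to completion".
    let avg = bench(
        || {
            let sum: u64 = (0..1_000_000u64).sum();
            black_box(sum);
        },
        2,
        5,
    );
    println!("average: {:.3} ms", avg.as_secs_f64() * 1_000.0);
}
```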

Δ runtime: −26.4 %
Δ throughput: +36.0 %

Cycle count is 2,601,619 in every run, before and after — execution semantics are preserved.
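The derived figures can be rechecked from the table. The sketch below recomputes MHz from the cycle count and the per-run milliseconds, then the two deltas; the helper names are illustrative. The −26.4 % figure follows from the unrounded means (rounding the after-mean to 10.76 first gives about −26.5 %).

```rust
const CYCLES: f64 = 2_601_619.0;

/// Throughput in MHz: cycles per microsecond.
fn mhz(cycles: f64, ms: f64) -> f64 {
    cycles / (ms * 1_000.0)
}

fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Relative change from `before` to `after`, in percent.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    let before_ms = [14.61, 14.66, 14.62];
    let after_ms = [11.14, 10.77, 10.38];
    let before_mhz: Vec<f64> = before_ms.iter().map(|&ms| mhz(CYCLES, ms)).collect();
    let after_mhz: Vec<f64> = after_ms.iter().map(|&ms| mhz(CYCLES, ms)).collect();

    let runtime = pct_change(mean(&before_ms), mean(&after_ms));
    let throughput = pct_change(mean(&before_mhz), mean(&after_mhz));
    println!("runtime change:    {runtime:.1} %");
    println!("throughput change: {throughput:.1} %");
}
```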

Verification

cargo +nightly fmt --check -p sp1-core-executor — passes
cargo test --release -p sp1-core-executor — 16 unit tests pass, including minimal::tests::test_chunk_stops_correctly

Notes

The change is in the portable (interpreter) backend only. The native x86_64 backend at crates/core/executor/src/minimal/arch/x86_64/ was not touched.
The borrow checker accepts the deref-and-copy form because the &Instruction borrow from self.program.fetch(...) lives only for the duration of the dereferencing expression; once instruction is the owned local, &mut self is free to be reborrowed by the execute_xxx calls.
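The borrow-checker point can be reproduced in isolation. The sketch below uses hypothetical `Instruction`, `Program`, and `Executor` types (not the SP1 definitions) and an assumed meaning for the `bool` return value; only the deref-and-copy line mirrors the change. Holding the `&Instruction` from `fetch` across the `&mut self` call instead of copying it would be rejected with E0502.

```rust
use std::sync::Arc;

/// Hypothetical Copy instruction; the real type has more fields.
#[derive(Clone, Copy)]
#[repr(C)]
struct Instruction {
    imm: u64,
}

struct Program {
    instructions: Vec<Instruction>,
}

impl Program {
    fn fetch(&self, pc: u64) -> Option<&Instruction> {
        self.instructions.get(pc as usize)
    }
}

struct Executor {
    program: Arc<Program>,
    pc: u64,
    acc: u64,
}

impl Executor {
    /// Stand-in for an `execute_xxx` callee: needs `&mut self` and `&Instruction`.
    fn execute_add(&mut self, instruction: &Instruction) {
        self.acc += instruction.imm;
        self.pc += 1;
    }

    /// Returns true once the program counter runs past the last instruction
    /// (an assumption made for this sketch).
    fn execute_instruction(&mut self) -> bool {
        // The shared borrow of `self.program` ends as soon as the value is copied.
        let instruction = *self.program.fetch(self.pc).unwrap();
        // `self` is free again, so a `&mut self` call is accepted here.
        self.execute_add(&instruction);
        self.pc as usize >= self.program.instructions.len()
    }
}

fn main() {
    let instructions = (1..=4).map(|imm| Instruction { imm }).collect();
    let program = Arc::new(Program { instructions });
    let mut executor = Executor { program, pc: 0, acc: 0 };
    while !executor.execute_instruction() {}
    assert_eq!(executor.acc, 10);
}
```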

`MinimalExecutor<SupervisorMode>::execute_instruction` in the portable backend cloned the program `Arc` once per RISC-V cycle just to satisfy the borrow checker, costing two atomic refcount operations on every instruction.

The sibling `tracing` / `estimating` / `splicing` executors already dereference the `Option<&Instruction>` returned by `fetch` and copy the 32-byte `Instruction` (it is `#[derive(Clone, Copy)]` with `#[repr(C)]`). The `UserMode` `execute_instruction` in this same file already uses an owned `Instruction` returned by `Self::fetch()`.

Apply the same pattern in the `SupervisorMode` arm of the portable backend so the inner loop carries a local `Instruction` value instead of holding an `Arc<Program>` clone alive. Each `execute_xxx` callee already takes `&Instruction`; the caller now passes `&instruction` to keep their signatures unchanged.

Measured on Apple M-class aarch64, Docker rust:1.91-bookworm, fibonacci program with n=200_000 (2,601,619 RISC-V cycles per run). 5 measured runs after 2 warmups, 3 independent benchmark invocations:

Before: 14.61 / 14.66 / 14.62 ms => 178.07 / 177.47 / 178.00 MHz
After: 11.14 / 10.77 / 10.38 ms => 233.49 / 241.66 / 250.56 MHz
Mean: 14.63 ms -> 10.76 ms (-26.4%)
Mean: 177.85 MHz -> 241.90 MHz (+36.0%)

Cycle count is identical before and after (2,601,619), so execution semantics are unchanged.

`cargo +nightly fmt --check -p sp1-core-executor` passes.
`cargo check --release -p sp1-core-executor` passes.

wstran · 2026-05-11T10:45:17Z

Retargeted from dev to main and rebased the branch onto the current main HEAD — dev and main have separated histories and dev has been frozen for ~9 days, so the PR was inert there. The diff is functionally identical: the same per-cycle Arc::clone removal in MinimalExecutor<SupervisorMode>::execute_instruction (post-mprotect refactor). All other fields stay the same.

wstran changed the base branch from dev to main May 11, 2026 10:37

wstran force-pushed the perf/portable-executor-skip-arc-clone branch from 8b9171e to d40f7d4 Compare May 11, 2026 10:42

wstran wants to merge 1 commit into succinctlabs:main from wstran:perf/portable-executor-skip-arc-clone
