perf(prover): default rayon to available_parallelism #2727

Open
tamirhemo wants to merge 2 commits into tamir/bench-infra from tamir/perf-physical-cores

Conversation

@tamirhemo
Contributor

@tamirhemo tamirhemo commented Apr 20, 2026

Summary

  • Default rayon global thread pool to min(available_parallelism, physical_cores) instead of logical cores
  • SMT siblings cause excessive crossbeam work-stealing contention (~30% of prove time at 64 threads, ~12% at 32 threads)
  • Users can still override via RAYON_NUM_THREADS environment variable

Benchmark results (AMD Threadripper 7970X, 32 physical / 64 logical)

| Workload | Before (64t) | After (32t) | Delta |
|----------|--------------|-------------|-------|
| fib | 35,380 ms | 28,026 ms | -20.8% |
| keccak | 42,847 ms | 35,936 ms | -16.1% |
| big | 47,697 ms | 38,664 ms | -18.9% |
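
For anyone re-checking the Delta column: it is the relative change in wall-clock time, (after - before) / before. A minimal sketch that recomputes it from the numbers above (the function names `percent_change` and `delta_rows` are illustrative, not part of the PR):

```rust
/// Relative change from `before_ms` to `after_ms`, in percent.
/// Negative values mean the run got faster.
fn percent_change(before_ms: f64, after_ms: f64) -> f64 {
    (after_ms - before_ms) / before_ms * 100.0
}

/// Recompute the Delta column for the three workloads in the table.
fn delta_rows() -> Vec<(&'static str, String)> {
    // (workload, before at 64 threads, after at 32 threads), in milliseconds
    let rows = [
        ("fib", 35_380.0, 28_026.0),
        ("keccak", 42_847.0, 35_936.0),
        ("big", 47_697.0, 38_664.0),
    ];
    rows.iter()
        .map(|&(name, before, after)| {
            // One decimal place with an explicit sign, as in the table.
            (name, format!("{:+.1}%", percent_change(before, after)))
        })
        .collect()
}
```

The same helper can be pointed at results from another SMT machine when reproducing the benchmark.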

Profile diff (fib)

  • crossbeam_epoch::with_handle: 17.4% → 7.9% (down 55%)
  • crossbeam_epoch::try_advance: 7.6% → 1.5% (down 80%)
  • crossbeam_deque::steal: 5.3% → 2.9% (down 45%)
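
The ~30% and ~12% contention figures in the Summary are consistent with summing these three profile entries (that reading is an assumption; the PR does not state how the totals were derived). A small sketch that sums the shares and recomputes the relative drops, with illustrative names:

```rust
/// (symbol, share of prove time before in %, share after in %), from the profile diff.
const PROFILE: [(&str, f64, f64); 3] = [
    ("crossbeam_epoch::with_handle", 17.4, 7.9),
    ("crossbeam_epoch::try_advance", 7.6, 1.5),
    ("crossbeam_deque::steal", 5.3, 2.9),
];

/// Total share of the three symbols before and after the change.
fn totals() -> (f64, f64) {
    PROFILE
        .iter()
        .fold((0.0, 0.0), |(b, a), &(_, before, after)| (b + before, a + after))
}

/// Relative reduction of one symbol's share, in percent.
fn relative_drop(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}
```

Under that assumption the totals come to roughly 30.3% at 64 threads and 12.3% at 32 threads.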

Changes

  • slop/crates/futures/src/rayon.rs: init_global_pool() is now public and defaults to min(available_parallelism, physical cores)
  • crates/prover/src/worker/builder.rs: calls init_global_pool() at the start of cpu_worker_builder()
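
A minimal, std-only sketch of the thread-count policy described here. This is not the code in rayon.rs: `default_thread_count` and `detect_thread_count` are hypothetical names, and the physical core count is passed in as a parameter (the PR obtains it from num_cpus::get_physical(), which is omitted so the sketch has no external dependencies). The override handling mirrors rayon's documented behaviour of honouring RAYON_NUM_THREADS only when it is a positive integer.

```rust
use std::thread;

/// Pick the default worker count.
/// An explicit positive override wins; otherwise the schedulable CPU count
/// is capped by the physical core count to avoid SMT oversubscription.
fn default_thread_count(env_override: Option<&str>, available: usize, physical: usize) -> usize {
    if let Some(n) = env_override.and_then(|s| s.trim().parse::<usize>().ok()) {
        if n > 0 {
            return n;
        }
    }
    // Never return 0, even if a probe reported 0 cores.
    available.min(physical).max(1)
}

/// Hypothetical wiring: read the environment and the std parallelism probe.
/// `physical` is an assumed input standing in for a physical-core query.
fn detect_thread_count(physical: usize) -> usize {
    let env = std::env::var("RAYON_NUM_THREADS").ok();
    // On Linux, available_parallelism accounts for cgroup CPU quotas.
    let available = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    default_thread_count(env.as_deref(), available, physical)
}
```

On the benchmark machine (64 logical, 32 physical) with no override this yields 32; inside a container limited to 4 CPUs on the same host it would yield 4.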

Stack

  1. feat(bench): add CPU prover benchmark harness and baseline profiling #2726 — bench infra + baselines
  2. This PR — E1: rayon defaults to physical cores
  3. tamir/perf-mimalloc — E2: optional mimalloc allocator

Test plan

  • RAYON_NUM_THREADS unset → prover uses min(available_parallelism, physical core count)
  • RAYON_NUM_THREADS=64 → prover uses 64 threads (override works)
  • Bench results reproducible on another machine with SMT

🤖 Generated with Claude Code

tamirhemo and others added 2 commits April 21, 2026 16:31
SMT (hyper-threading) siblings cause excessive crossbeam work-stealing
contention that dominates ~30% of CPU prove time. Defaulting the rayon
global thread pool to num_cpus::get_physical() instead of logical
cores reduces this to ~12%.

Benchmark results on AMD Threadripper 7970X (32 physical / 64 logical):
  fib:    35,380ms → 28,026ms  (-20.8%)
  keccak: 42,847ms → 35,936ms  (-16.1%)
  big:    47,697ms → 38,664ms  (-18.9%)

The change is in slop_futures::rayon::init_global_pool(), called
eagerly from cpu_worker_builder(). Users can still override via
RAYON_NUM_THREADS environment variable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

num_cpus::get_physical() reads /proc/cpuinfo which is not namespaced
in containers — it always returns the host's physical core count,
ignoring cgroup CPU quotas (docker --cpus, K8s resources.limits.cpu).

Use min(available_parallelism, get_physical()) instead:
- available_parallelism reads cgroup v1/v2 quota files, so it
  respects container CPU limits
- get_physical caps to avoid SMT oversubscription on bare metal
- RAYON_NUM_THREADS still overrides everything

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tamirhemo tamirhemo force-pushed the tamir/perf-physical-cores branch from b2a6833 to 31ac2a3 Compare April 21, 2026 16:34
@tamirhemo tamirhemo changed the title from "perf(prover): default rayon to physical cores (-19.8% on big)" to "perf(prover): default rayon to available_parallelism" Apr 22, 2026
@tamirhemo tamirhemo changed the title from "perf(prover): default rayon to available_parallelism" to "perf(prover): default rayon to available_parallelism" Apr 22, 2026