perf(ecstore): replace mmap+spawn_blocking shard reads with pread#27
Open
rdiperri-wasabi wants to merge 3 commits into
Open
perf(ecstore): replace mmap+spawn_blocking shard reads with pread#27rdiperri-wasabi wants to merge 3 commits into
rdiperri-wasabi wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR optimizes EC shard reads on the GET hot path by replacing mmap-based reads with pread-style reads and reducing spawn_blocking round-trips, while also extending the perf harness to support more workload mixes and richer reporting.
Changes:
- Switch local “zero-copy” shard reads from mmap to
FileExt::read_atand add batched shard reads (batch_shard_pread) to reduce per-shard scheduling overhead. - Add detailed GET-path tracing spans around fanout, locks, decode, and pipe stages.
- Improve perf harness ergonomics (env overrides for op mix/object size, summary detection) and enhance analysis output (per-interface NIC, disk read stats).
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/perf/run-perf-test.sh | Makes loadgen completion detection op-agnostic; enables GET tracing recipe in TRACE mode. |
| scripts/perf/lib/loadgen-run.sh | Adds env-driven workload controls (op mix, object size, resume/save/preload flags). |
| scripts/perf/analyze/analyze.py | Generalizes loadgen parsing across op types; adds disk read metrics and updates report tables. |
| rustfs/src/app/object_usecase.rs | Adds a pre-dispatch GET tracing span around request-context preparation. |
| crates/ecstore/src/set_disk/read.rs | Introduces a batched local-shard read fast-path and instruments shard setup/fanout timing. |
| crates/ecstore/src/set_disk.rs | Adds GET-path spans around lock acquisition, xlmeta read, and spawned read task. |
| crates/ecstore/src/erasure_coding/decode.rs | Instruments shard read, EC decode, and pipe write stages for GET tracing. |
| crates/ecstore/src/disk/mod.rs | Adds a helper to obtain local filesystem paths only for local disks. |
| crates/ecstore/src/disk/disk_store.rs | Exposes local object-path retrieval on the local disk wrapper. |
| crates/ecstore/src/disk/local.rs | Replaces mmap with pread in read_file_zero_copy, consolidates spawn_blocking, adds batch_shard_pread, and adds tests. |
| crates/ecstore/src/bitrot.rs | Extracts shard offset/length adjustment into a helper and updates docs/tests. |
| crates/ecstore/Cargo.toml | Removes memmap2 dependency from ecstore. |
| Cargo.lock | Drops memmap2 from the lockfile for ecstore. |
b3e8d49 to
e9b8f40
Compare
rdiperri-wasabi
added a commit
that referenced
this pull request
Jun 22, 2026
- set_disk/read.rs: wrap unix batch fast-path in #[cfg(unix)] and use early return so non-unix targets (Windows) compile without batch_shard_pread and adjust_shard_read_params, which are unix-only. Sequential fallback is now unconditional. - disk/local.rs: move should_reclaim_file_cache_after_read() call inside the #[cfg(target_os = "linux")] fadvise block, eliminating the unused- variable warning on macOS/non-linux unix targets. - bitrot.rs: tighten adjust_shard_read_params visibility from pub to pub(crate); it is only used within this crate. - analyze.py: fix empty-device placeholder row from 9 columns to 8 to match the disk table header, preventing zip-truncation in _md_table(). Co-authored-by: Cursor <cursoragent@cursor.com>
rdiperri-wasabi
added a commit
that referenced
this pull request
Jun 22, 2026
- set_disk/read.rs: wrap unix batch fast-path in #[cfg(unix)] and use early return so non-unix targets (Windows) compile without batch_shard_pread and adjust_shard_read_params, which are unix-only. Sequential fallback is now unconditional. - disk/local.rs: move should_reclaim_file_cache_after_read() call inside the #[cfg(target_os = "linux")] fadvise block, eliminating the unused- variable warning on macOS/non-linux unix targets. - bitrot.rs: tighten adjust_shard_read_params visibility from pub to pub(crate); it is only used within this crate. - analyze.py: fix empty-device placeholder row from 9 columns to 8 to match the disk table header, preventing zip-truncation in _md_table(). Co-authored-by: Cursor <cursoragent@cursor.com>
3e5c5f4 to
e4b135b
Compare
auto-merge was automatically disabled
June 22, 2026 22:53
Pull request was closed
Fix 1 — consolidate spawn_blocking in read_all_data_with_dmtime: Previously each EC shard read dispatched two thread-pool round-trips (one for the pread/mmap setup, one for the actual read). This patch collapses both into a single blocking closure, eliminating one cross-thread handoff per shard per GET request. Fix 2 — replace mmap with pread in batch_shard_pread: EC shard reads for object data were using mmap, which causes kernel page-cache pressure for large objects and adds VM fault overhead on each access. Replaced with O_DIRECT-friendly pread calls that read directly into caller-supplied buffers, removing the mmap lifecycle (map, access, unmap) from the hot GET path. Measured on smc1 (single-node, SAS SSDs, EC:1) vs wasabi baseline at 50 threads, 5-minute warm-cache GET runs: Small objects (0–1500K, loopback): 4d: 1,562 → 2,109 MB/s (+35%), 22 → 16ms 6d: 906 → 2,437 MB/s (+169%), 40 → 14ms 8d: 662 → 3,072 MB/s (+364%), 54 → 12ms 10d: 472 → 2,458 MB/s (+420%), 76 → 14ms Large objects (1000–1500K, 100 GbE from smc2): 4d: 1,987 → 2,949 MB/s (+48%), 30 → 19ms 6d: 1,577 → 2,949 MB/s (+87%), 38 → 19ms 8d: 1,270 → 2,867 MB/s (+126%), 48 → 20ms 10d: 824 → 2,785 MB/s (+238%), 73 → 21ms Also includes perf harness improvements: analyze.py adds per-interface NIC reporting; loadgen-run.sh and run-perf-test.sh add LG_OBJ_SIZE, remote loadgen support (LOADGEN_HOST), and related flags. Co-authored-by: Cursor <cursoragent@cursor.com>
- set_disk/read.rs: wrap unix batch fast-path in #[cfg(unix)] and use early return so non-unix targets (Windows) compile without batch_shard_pread and adjust_shard_read_params, which are unix-only. Sequential fallback is now unconditional. - disk/local.rs: move should_reclaim_file_cache_after_read() call inside the #[cfg(target_os = "linux")] fadvise block, eliminating the unused- variable warning on macOS/non-linux unix targets. - bitrot.rs: tighten adjust_shard_read_params visibility from pub to pub(crate); it is only used within this crate. - analyze.py: fix empty-device placeholder row from 9 columns to 8 to match the disk table header, preventing zip-truncation in _md_table(). Co-authored-by: Cursor <cursoragent@cursor.com>
e4b135b to
b044968
Compare
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix 1 — consolidate spawn_blocking in read_all_data_with_dmtime: Previously each EC shard read dispatched two thread-pool round-trips (one for the pread/mmap setup, one for the actual read). This patch collapses both into a single blocking closure, eliminating one cross-thread handoff per shard per GET request.
Fix 2 — replace mmap with pread in batch_shard_pread: EC shard reads for object data were using mmap, which causes kernel page-cache pressure for large objects and adds VM fault overhead on each access. Replaced with O_DIRECT-friendly pread calls that read directly into caller-supplied buffers, removing the mmap lifecycle (map, access, unmap) from the hot GET path.
Measured on smc1 (single-node, SAS SSDs, EC:1) vs wasabi baseline at 50 threads, 5-minute warm-cache GET runs:
Small objects (0–1500K, loopback):
4d: 1,562 → 2,109 MB/s (+35%), 22 → 16ms
6d: 906 → 2,437 MB/s (+169%), 40 → 14ms
8d: 662 → 3,072 MB/s (+364%), 54 → 12ms
10d: 472 → 2,458 MB/s (+420%), 76 → 14ms
Large objects (1000–1500K, 100 GbE from smc2):
4d: 1,987 → 2,949 MB/s (+48%), 30 → 19ms
6d: 1,577 → 2,949 MB/s (+87%), 38 → 19ms
8d: 1,270 → 2,867 MB/s (+126%), 48 → 20ms
10d: 824 → 2,785 MB/s (+238%), 73 → 21ms
Also includes perf harness improvements: analyze.py adds per-interface NIC reporting; loadgen-run.sh and run-perf-test.sh add LG_OBJ_SIZE, remote loadgen support (LOADGEN_HOST), and related flags.
Related Issues
Summary of Changes
Verification
Impact
Additional Notes
Thank you for your contribution! Please ensure your PR follows the community standards (CODE_OF_CONDUCT.md). If this is your first contribution, review the CLA document and sign it by commenting
I have read and agree to the CLA.on the PR.