Add experimental balanced placement strategy #4814
Test Results 8 files 8 suites 4m 45s ⏱️ Results for commit a99dea6. ♻️ This comment has been updated with latest results. |
Real-cluster validation: balanced-v2 removes a leader-imbalance throughput binder

Tested on a live 3-node BYOC benchmarking cluster (8 vCPU / 16 GiB nodes, replication 2) at 96 partitions, driving a write-heavy workload: a handler that does 100 sequential

Before (legacy placement), 96 partitions: leaders landed 34 / 21 / 41 across the three nodes. Under load this made a single node the binder: restate-2 (41 leaders) saturated at ~5.1 of 8 cores while the other two sat at ~2.2-2.6 cores, capping aggregate throughput at ~23k journal records/s (actually below the same cluster's 48-partition number, because the imbalance grows with partitions-per-node).

After (

Mechanism confirmed end-to-end: legacy HRW left one node carrying disproportionate leadership at higher partition-to-node ratios, and that node became the throughput wall; balanced-v2 spreads leaders/replicas/nodesets evenly so all nodes contribute. (At 48 partitions the legacy imbalance was milder, 18/12/18, so the gain there is smaller; the benefit scales with partitions-per-node.)

This matches the simulator's predicted leader/replica/nodeset balance improvement and confirms it translates into real throughput under sustained load, applied online to an existing cluster.
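To make the skew in the quoted leader counts concrete, here is a minimal sketch that summarizes a per-node leader distribution. The `LeaderSkew` struct and `leader_skew` function are hypothetical helpers written for this comment, not part of Restate; only the counts (34/21/41 and 18/12/18) come from the measurements above.

```rust
/// Summary of how unevenly partition leaders are spread across nodes.
/// Hypothetical helper type for illustration only.
#[derive(Debug, PartialEq)]
struct LeaderSkew {
    /// Difference between the busiest and the least busy node.
    spread: u32,
    /// Leaders per node under a perfectly even split.
    ideal_per_node: f64,
    /// Leaders on the hottest node divided by the ideal share.
    hottest_overload: f64,
}

/// Returns `None` for an empty slice or when there are no leaders at all.
fn leader_skew(counts: &[u32]) -> Option<LeaderSkew> {
    let max = *counts.iter().max()?;
    let min = *counts.iter().min()?;
    let total: u32 = counts.iter().sum();
    if total == 0 {
        return None;
    }
    let ideal = f64::from(total) / counts.len() as f64;
    Some(LeaderSkew {
        spread: max - min,
        ideal_per_node: ideal,
        hottest_overload: f64::from(max) / ideal,
    })
}

fn main() {
    // Leader counts reported for legacy placement on the 3-node cluster.
    let observed = [
        ("legacy, 96 partitions", [34u32, 21, 41]),
        ("legacy, 48 partitions", [18, 12, 18]),
    ];
    for (label, counts) in observed {
        let skew = leader_skew(&counts).expect("non-empty counts");
        println!(
            "{label}: spread={} ideal={:.1} hottest={:.3}x ideal",
            skew.spread, skew.ideal_per_node, skew.hottest_overload
        );
    }
}
```

By this measure the 96-partition layout has a spread of 20 leaders with the hottest node at about 1.28x its even share, while the 48-partition layout has a spread of 6 and a hottest node at 1.125x, which is consistent with the smaller gain noted for 48 partitions.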
Follow-up: rebalancer thrash fix validated under load + 3n vs 5n scaling

Update from today's 3n vs 5n cell-sizing matrix on

Rebalancer fix holds under load

balanced-v2 scaling 3n -> 5n

3n "failures" are all k6 client-side 120 s timeouts (
Non-finding worth noting

Profile C (20 × 4KiB + 200 ms sync between steps) hits an identical ~145 inv/s ceiling on BOTH 3n and 5n. Combined with profile D's 287 inv/s × 10 steps ≈ 2878 step-ops/s and profile C's 144.7 inv/s × 20 steps = 2894 step-ops/s, this points to a per-step round-trip cap downstream of cluster scale (likely tunnel-client or SDK protocol round-trip rate). Profile A escapes the cap because, with no delay between sets, the SDK can pipeline them across a single runtime round-trip. This is not a regression vs legacy placement; flagging for separate investigation. Will follow up with a step-count sweep result.
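The step-ops arithmetic behind this observation can be checked with a short sketch. The function names are hypothetical; the rates and step counts are the ones quoted for profiles C and D. Note that 287 × 10 is 2870, so the quoted 2878 presumably reflects an unrounded invocation rate.

```rust
/// Step operations per second implied by an invocation rate and a
/// fixed number of steps per invocation.
fn step_ops_per_sec(invocations_per_sec: f64, steps_per_invocation: u32) -> f64 {
    invocations_per_sec * f64::from(steps_per_invocation)
}

/// Relative difference between two rates, as a fraction of the larger one.
fn relative_gap(a: f64, b: f64) -> f64 {
    (a - b).abs() / a.max(b)
}

fn main() {
    // Profile D: ~287 inv/s at 10 steps per invocation.
    let profile_d = step_ops_per_sec(287.0, 10);
    // Profile C: 144.7 inv/s at 20 steps per invocation.
    let profile_c = step_ops_per_sec(144.7, 20);
    println!("profile D: {profile_d:.0} step-ops/s");
    println!("profile C: {profile_c:.0} step-ops/s");
    // Gap between the two quoted step-op rates (2878 vs 2894).
    println!("gap between quoted rates: {:.2}%", relative_gap(2878.0, 2894.0) * 100.0);
}
```

The two profiles land within about 1% of each other in step-ops/s despite a 2x difference in invocations per second, which is what suggests a per-step cap rather than a per-invocation one.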
Real-cluster like-for-like: balanced-v2 vs legacy placement = +31% ceiling, better tail latency

Benchmarked on a live BYOC cluster, isolating only the placement strategy (every other variable held constant).

Setup (identical both runs)
Images (exact)
Leader distribution (96 partitions / 5 nodes)
Ceiling sweep (step-ops/s = state-writes/s; p95 latency)
Legacy plateaus at ~17,800 step-ops/s; balanced-v2 reaches ~23,400 = +31%, with markedly better tail latency (p95 16.0s -> 8.8s at 6000 VUs).

Why

The workload is compaction-bound at the ceiling. Under legacy placement the hottest node (26 leaders, ~27% of all leaders) saturates its storage first and caps the whole cluster, while the coldest node (9 leaders) sits ~2x underutilized. Per-node CPU at 4000 VUs tracked the skew exactly:

balanced-v2 evens the leaders, so all nodes' storage saturates together -> the cluster reaches its true ceiling. At light load (2000 VUs) the two are identical (neither saturated); the gap only appears under load, exactly where it matters.

Net: on a hotspot-prone, storage-bound workload, balanced-v2 reclaimed ~31% throughput and nearly halved p95 (16.0s -> 8.8s) with no downside observed.
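As a quick check of the headline numbers, the sketch below recomputes the relative changes from the quoted ceilings and p95 latencies. `relative_change` is a hypothetical helper; the inputs are the figures stated above.

```rust
/// Relative change from `before` to `after`, as a fraction of `before`.
/// Positive means an increase, negative a decrease.
fn relative_change(before: f64, after: f64) -> f64 {
    (after - before) / before
}

fn main() {
    // Throughput ceiling in step-ops/s: legacy vs balanced-v2.
    let ceiling = relative_change(17_800.0, 23_400.0);
    // p95 latency in seconds at 6000 VUs: legacy vs balanced-v2.
    let p95 = relative_change(16.0, 8.8);
    println!("ceiling change: {:+.1}%", ceiling * 100.0);
    println!("p95 change:     {:+.1}%", p95 * 100.0);
}
```

The ceiling works out to roughly +31% and the p95 latency to roughly -45%, matching the summary.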
@tillrohrmann Can you take a look at this one? If you approve the approach, perhaps we can include it in the v1.7 cut.

I would make this a stretch goal for the v1.7 release. I will try to take a look at it as soon as possible.
Closes #4808
Summary
- experimental-placement-strategy = "balanced-v2" and experimental-placement-rebalance-mode config
- restate.log_server.nodeset_memberships gauge and keep the placement simulator under tools/placement-sim

Simulator signal
For the exact Restate salt across 3/5 node and 24/48/96/128 partition scenarios, combined top-3 reduced average initial ranges to leader=1.25, replica=0.75, nodeset=0.50 versus current leader=12.12, replica=15.25, nodeset=8.62. In the 256-salt sweep, combined top-3 averaged leader=0.93, replica=0.69, nodeset=0.52 versus current leader=8.02, replica=9.32, nodeset=5.32.
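For readers unfamiliar with this kind of measurement, the following is a minimal sketch of a salt sweep over plain rendezvous (HRW) leader assignment. It is not the code in tools/placement-sim: the scoring function, the use of the standard library's `DefaultHasher`, the way the salt is mixed in, and the definition of "range" as max minus min of per-node leader counts are all assumptions made for illustration, so its output will not reproduce the numbers above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rendezvous (HRW) score for a (salt, node, partition) triple.
/// Assumption: a generic hash stands in for the real scoring function.
fn score(salt: u64, node: u32, partition: u32) -> u64 {
    let mut hasher = DefaultHasher::new();
    (salt, node, partition).hash(&mut hasher);
    hasher.finish()
}

/// Assigns each partition's leader to the node with the highest score
/// and returns the number of leaders per node. Panics if `nodes` is 0
/// while `partitions` is non-zero.
fn hrw_leader_counts(salt: u64, nodes: u32, partitions: u32) -> Vec<u32> {
    let mut counts = vec![0u32; nodes as usize];
    for partition in 0..partitions {
        let winner = (0..nodes)
            .max_by_key(|&node| score(salt, node, partition))
            .expect("at least one node");
        counts[winner as usize] += 1;
    }
    counts
}

/// Range of a distribution: busiest node minus least busy node.
fn range(counts: &[u32]) -> u32 {
    let max = counts.iter().copied().max().unwrap_or(0);
    let min = counts.iter().copied().min().unwrap_or(0);
    max - min
}

/// Average leader range over salts 0..salts for one cluster shape.
fn average_range(salts: u64, nodes: u32, partitions: u32) -> f64 {
    let total: u32 = (0..salts)
        .map(|salt| range(&hrw_leader_counts(salt, nodes, partitions)))
        .sum();
    f64::from(total) / salts as f64
}

fn main() {
    for (nodes, partitions) in [(3, 96), (5, 96)] {
        println!(
            "{nodes} nodes / {partitions} partitions: avg leader range over 256 salts = {:.2}",
            average_range(256, nodes, partitions)
        );
    }
}
```

A perfectly even assignment would give a range of 0 for 3 nodes and 1 for 5 nodes at 96 partitions, which gives a baseline for judging whatever averages a sweep like this produces.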
Validation
- cargo check
- cargo fmt --all -- --check
- cargo fmt --manifest-path tools/placement-sim/Cargo.toml -- --check
- cargo nextest run -p restate-types load_balanced_selector
- cargo run --manifest-path tools/placement-sim/Cargo.toml
- cargo deny --all-features check
- env -u RUSTC_WRAPPER cargo clippy --all-features --all-targets --workspace -- -D warnings
- cargo nextest run --all-features