Add experimental balanced placement strategy by pcholakov · Pull Request #4814 · restatedev/restate

pcholakov · 2026-05-26T13:39:36Z

Summary

add opt-in experimental-placement-strategy = "balanced-v2" and experimental-placement-rebalance-mode config
balance partition replicas, partition leaders, and replicated-loglet nodeset members with deterministic top-3/load-aware selection for flat placements
add restate.log_server.nodeset_memberships gauge and keep the placement simulator under tools/placement-sim

Simulator signal

For the exact Restate salt across 3/5 node and 24/48/96/128 partition scenarios, combined top-3 reduced average initial ranges to leader=1.25, replica=0.75, nodeset=0.50 versus current leader=12.12, replica=15.25, nodeset=8.62. In the 256-salt sweep, combined top-3 averaged leader=0.93, replica=0.69, nodeset=0.52 versus current leader=8.02, replica=9.32, nodeset=5.32.

Validation

cargo check
cargo fmt --all -- --check
cargo fmt --manifest-path tools/placement-sim/Cargo.toml -- --check
cargo nextest run -p restate-types load_balanced_selector
cargo run --manifest-path tools/placement-sim/Cargo.toml
cargo deny --all-features check
env -u RUSTC_WRAPPER cargo clippy --all-features --all-targets --workspace -- -D warnings
cargo nextest run --all-features

github-actions · 2026-05-26T14:12:15Z

Test Results

8 files 8 suites 4m 45s ⏱️
60 tests 60 ✅ 0 💤 0 ❌
267 runs 267 ✅ 0 💤 0 ❌

Results for commit a99dea6.

♻️ This comment has been updated with latest results.

pcholakov · 2026-05-27T13:10:52Z

Real-cluster validation: balanced-v2 removes a leader-imbalance throughput binder

Tested on a live 3-node BYOC benchmarking cluster (8 vCPU / 16 GiB nodes, replication 2) at 96 partitions, driving a write-heavy workload — a handler that does 100 sequential ctx.set of server-generated blobs per invocation (so the journal-write path is exercised with negligible ingress IO), sync, ramped to 600 concurrent. Clear win at flat scale.

Before (legacy placement), 96 partitions: leaders landed 34 / 21 / 41 across the three nodes. Under load this made a single node the binder — restate-2 (41 leaders) saturated at ~5.1 of 8 cores while the other two sat at ~2.2-2.6 cores — capping aggregate throughput at ~23k journal records/s (actually below the same cluster's 48-partition number, because the imbalance grows with partitions-per-node).

After (experimental-placement-strategy = balanced-v2, experimental-placement-rebalance-mode = rebalance): leadership rebalanced online to a perfect 32 / 32 / 32 — no wipe, just a rolling restart with partitions + data preserved (node generations bumped N1:3/N2:2/N3:2). Under the identical workload all three nodes were evenly utilized at ~6.7-6.8 of 8 cores, and aggregate throughput rose to ~30k records/s — +30%. The logs also showed it reconfiguring loglet nodesets online ([Auto Improvement] Bifrost will reconfigure logs to improve placement), so it balanced more than just leaders.

Mechanism confirmed end-to-end: legacy HRW left one node carrying disproportionate leadership at higher partition-to-node ratios, and that node became the throughput wall; balanced-v2 spreads leaders/replicas/nodesets evenly so all nodes contribute. (At 48 partitions the legacy imbalance was milder — 18/12/18 — so the gain there is smaller; the benefit scales with partitions-per-node.) This matches the simulator's predicted leader/replica/nodeset balance improvement and confirms it translates into real throughput under sustained load, applied online to an existing cluster.

pcholakov · 2026-05-28T12:05:25Z

Follow-up: rebalancer thrash fix validated under load + 3n vs 5n scaling

Update from today's 3n vs 5n cell-sizing matrix on pavel-benchmarking (vqueues on, action throttle off, balanced-v2 placement in rebalance mode, fixed image with the bifrost auto-improvement fix). 5 workload profiles × 2 VU points each, 3 min steady-state, 60 s settle between runs.

Rebalancer fix holds under load

[Auto Improvement] stayed at 0 firings / 20 s for the entire matrix on 5n (~45 min of steady-state load, profiles A through E, VUs up to 6000). Old image collapsed throughput to 0 in this exact scenario under load.
Partition leaders held exactly 19/19/19/19/20 throughout on 5n; 32/32/32 on 3n. No churn.
After 3n -> 5n scale-up via replicas=5 patch + reconcile, balanced-v2 redistributed cleanly: partition placement converged from 32/32/32 to 19/19/19/19/20 leaders (38/38/38/39/39 replicas) within 60 s, autoimp dropped to 0 after the first 20 s window.

balanced-v2 scaling 3n -> 5n

prof shape                       VUs |  3n inv/s   5n inv/s    5n/3n |   3n p95    5n p95 |  3n fail  5n fail
-------------------------------------------------------------------------------------------------------------
A    100 x 256B, no delays       600 |    1391.8     1967.3    1.41x |    1.41s     1.34s |    0.00%    0.00%
A    100 x 256B, no delays      1500 |    1162.8     1785.2    1.54x |    2.32s     2.05s |    0.00%    0.00%
B    20 x 4KiB + 50ms sync       500 |     456.9      464.0    1.02x |    1.16s     1.13s |    0.00%    0.00%
B    20 x 4KiB + 50ms sync      1500 |     519.6      557.6    1.07x |    3.88s     2.87s |    0.00%    0.00%
C    20 x 4KiB + 200ms sync     2000 |     144.7      144.7    1.00x |   15.89s    15.89s |    0.00%    0.00%
C    20 x 4KiB + 200ms sync     4000 |     144.8      145.6    1.01x |   28.43s    28.34s |    0.00%    0.00%
D    10 x 16KiB + 200ms sync    1000 |     104.3      287.8    2.76x |    7.16s     3.99s |    7.63%    0.00%
D    10 x 16KiB + 200ms sync    3000 |      14.3      287.2   20.08x |    1m59s    10.76s |  100.00%    0.00%
E    10 x 4KiB + 1s suspend     1500 |       7.1       57.9    8.15x |    1m59s    30.04s |  100.00%    0.00%
E    10 x 4KiB + 1s suspend     5000 |      23.8       57.1    2.40x |    1m59s     1m30s |  100.00%    0.00%

3n "failures" are all k6 client-side 120 s timeouts (dropped_iterations=0 everywhere; the server kept accepting). 5n absorbs all backpressure cleanly.

Profile A (pure write): clean 1.4-1.5x scaling 3n -> 5n.
Profile D (chatty + 16 KiB): 5n unblocks 3n's compaction wall (3n hits 7.6% -> 100% timeouts; 5n clean at 287 inv/s).
Profile E (suspend path): 5n unblocks 3n's 100% timeouts; ceiling moves from ~7 to ~58 inv/s.

Non-finding worth noting

Profile C (20 × 4KiB + 200 ms sync between steps) hits an identical ~145 inv/s ceiling on BOTH 3n and 5n. Combined with profile D's 287 inv/s × 10 steps == 2878 step-ops/s and profile C's 144.7 × 20 == 2894 step-ops/s, this points to a per-step round-trip cap downstream of cluster scale (likely tunnel-client or SDK protocol round-trip rate). Profile A escapes the cap because with no delay between sets the SDK can pipeline them across a single runtime round-trip. Not a regression vs legacy placement; flagging for separate investigation. Will follow up with a step-count sweep result.

pcholakov · 2026-05-29T13:57:07Z

Real-cluster like-for-like: balanced-v2 vs legacy placement = +31% ceiling, better tail latency

Benchmarked on a live BYOC cluster, isolating only the placement strategy (every other variable held constant).

Setup (identical both runs)

5 nodes, 8 vCPU / 16 GiB each, GKE; 96 partitions; partition + log replication {node:2}.
vqueues on in both (main supports it), action throttle off, WAL-fsync off on log-server.
Service-mesh outbound interception disabled for the ingress->worker hop (so the mesh isn't a confound).
Workload: synchronous invocations doing 20 sequential ctx.set of 4 KiB blobs with a 200 ms non-suspending pause between each (~4 s/invocation), closed-loop, k6.
Only variable: RESTATE_EXPERIMENTAL_PLACEMENT_STRATEGY unset (legacy) vs balanced-v2 + rebalance.

Images (exact)

main: ghcr.io/restatedev/restate@sha256:287ef567935d90e49a5211ca0a724ef92a3ba0083e85ebb737decdf012915206
experimental (this PR branch): ghcr.io/restatedev/restate:pavel-experimental-balanced-placement@sha256:4c16da5acda43bbee32984aaba710c3dd24550793258ef481273faf2890acef8
- (branch also carries an unrelated ingress.http2-max-concurrent-streams config commit, which is unset here and has no effect on this comparison.)

Leader distribution (96 partitions / 5 nodes)

legacy (main): 21 / 9 / 23 / 26 / 17 — 2.9x skew (hottest node 26 leaders, coldest 9)
balanced-v2: 19 / 19 / 19 / 19 / 20 — even

Ceiling sweep (step-ops/s = state-writes/s; p95 latency)

offered (VUs)	legacy/main	balanced-v2
2000	9,594 (p95 4.09s)	9,681 (p95 4.17s)
4000	17,756 (p95 6.06s)	18,993 (p95 4.29s)
6000	17,879 (p95 16.02s)	23,391 (p95 8.77s)

Legacy plateaus at ~17,800 step-ops/s; balanced-v2 reaches ~23,400 = +31%, with markedly better tail latency (p95 16.0s -> 8.8s at 6000 VUs).

Why

The workload is compaction-bound at the ceiling. Under legacy placement the hottest node (26 leaders, ~27% of all leaders) saturates its storage first and caps the whole cluster, while the coldest node (9 leaders) sits ~2x underutilized. Per-node CPU at 4000 VUs tracked the skew exactly:

restate-3 (26 leaders): 3,289m  (hottest)
restate-1 ( 9 leaders): 1,676m  (~2x idle)

balanced-v2 evens the leaders, so all nodes' storage saturates together -> the cluster reaches its true ceiling. At light load (2000 VUs) the two are identical (neither saturated); the gap only appears under load, exactly where it matters.

Net: on a hotspot-prone, storage-bound workload, balanced-v2 reclaimed ~31% throughput and halved p95 with no downside observed.

AhmedSoliman · 2026-06-03T21:02:07Z

@tillrohrmann Can you take a look at this one? If you approve the approach, perhaps we can include it in v1.7 cut

tillrohrmann · 2026-06-04T09:28:52Z

I would make this a stretch goal for the v1.7 release. Will try to take look at it as soon as possible.

pcholakov force-pushed the pavel/experimental-balanced-placement branch 2 times, most recently from a9e2a39 to 531d9fc Compare May 26, 2026 14:59

pcholakov force-pushed the pavel/experimental-balanced-placement branch from 531d9fc to c0a12b5 Compare May 27, 2026 19:29

pcholakov force-pushed the pavel/experimental-balanced-placement branch 2 times, most recently from 7f055a5 to 3c21417 Compare May 29, 2026 09:14

pcholakov requested a review from tillrohrmann May 29, 2026 14:11

pcholakov force-pushed the pavel/experimental-balanced-placement branch 2 times, most recently from dd682e0 to 591403b Compare May 29, 2026 14:29

Add experimental balanced placement strategy

a99dea6

pcholakov force-pushed the pavel/experimental-balanced-placement branch from 591403b to a99dea6 Compare May 29, 2026 14:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add experimental balanced placement strategy#4814

Add experimental balanced placement strategy#4814
pcholakov wants to merge 1 commit into
mainfrom
pavel/experimental-balanced-placement

pcholakov commented May 26, 2026

Uh oh!

github-actions Bot commented May 26, 2026 •

edited

Loading

Uh oh!

pcholakov commented May 27, 2026

Uh oh!

pcholakov commented May 28, 2026

Uh oh!

pcholakov commented May 29, 2026

Uh oh!

AhmedSoliman commented Jun 3, 2026

Uh oh!

tillrohrmann commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pcholakov commented May 26, 2026

Summary

Simulator signal

Validation

Uh oh!

github-actions Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

pcholakov commented May 27, 2026

Real-cluster validation: balanced-v2 removes a leader-imbalance throughput binder

Uh oh!

pcholakov commented May 28, 2026

Follow-up: rebalancer thrash fix validated under load + 3n vs 5n scaling

Rebalancer fix holds under load

balanced-v2 scaling 3n -> 5n

Non-finding worth noting

Uh oh!

pcholakov commented May 29, 2026

Real-cluster like-for-like: balanced-v2 vs legacy placement = +31% ceiling, better tail latency

Setup (identical both runs)

Images (exact)

Leader distribution (96 partitions / 5 nodes)

Ceiling sweep (step-ops/s = state-writes/s; p95 latency)

Why

Uh oh!

AhmedSoliman commented Jun 3, 2026

Uh oh!

tillrohrmann commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented May 26, 2026 •

edited

Loading