Skip to content

Add experimental balanced placement strategy#4814

Draft
pcholakov wants to merge 1 commit into
mainfrom
pavel/experimental-balanced-placement
Draft

Add experimental balanced placement strategy#4814
pcholakov wants to merge 1 commit into
mainfrom
pavel/experimental-balanced-placement

Conversation

@pcholakov
Copy link
Copy Markdown
Contributor

Closes #4808

Summary

  • add opt-in experimental-placement-strategy = "balanced-v2" and experimental-placement-rebalance-mode config
  • balance partition replicas, partition leaders, and replicated-loglet nodeset members with deterministic top-3/load-aware selection for flat placements
  • add restate.log_server.nodeset_memberships gauge and keep the placement simulator under tools/placement-sim

Simulator signal

For the exact Restate salt across 3/5 node and 24/48/96/128 partition scenarios, combined top-3 reduced average initial ranges to leader=1.25, replica=0.75, nodeset=0.50 versus current leader=12.12, replica=15.25, nodeset=8.62. In the 256-salt sweep, combined top-3 averaged leader=0.93, replica=0.69, nodeset=0.52 versus current leader=8.02, replica=9.32, nodeset=5.32.

Validation

  • cargo check
  • cargo fmt --all -- --check
  • cargo fmt --manifest-path tools/placement-sim/Cargo.toml -- --check
  • cargo nextest run -p restate-types load_balanced_selector
  • cargo run --manifest-path tools/placement-sim/Cargo.toml
  • cargo deny --all-features check
  • env -u RUSTC_WRAPPER cargo clippy --all-features --all-targets --workspace -- -D warnings
  • cargo nextest run --all-features

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 26, 2026

Test Results

  8 files    8 suites   4m 45s ⏱️
 60 tests  60 ✅ 0 💤 0 ❌
267 runs  267 ✅ 0 💤 0 ❌

Results for commit a99dea6.

♻️ This comment has been updated with latest results.

@pcholakov pcholakov force-pushed the pavel/experimental-balanced-placement branch 2 times, most recently from a9e2a39 to 531d9fc Compare May 26, 2026 14:59
@pcholakov
Copy link
Copy Markdown
Contributor Author

Real-cluster validation: balanced-v2 removes a leader-imbalance throughput binder

Tested on a live 3-node BYOC benchmarking cluster (8 vCPU / 16 GiB nodes, replication 2) at 96 partitions, driving a write-heavy workload — a handler that does 100 sequential ctx.set of server-generated blobs per invocation (so the journal-write path is exercised with negligible ingress IO), sync, ramped to 600 concurrent. Clear win at flat scale.

Before (legacy placement), 96 partitions: leaders landed 34 / 21 / 41 across the three nodes. Under load this made a single node the binder — restate-2 (41 leaders) saturated at ~5.1 of 8 cores while the other two sat at ~2.2-2.6 cores — capping aggregate throughput at ~23k journal records/s (actually below the same cluster's 48-partition number, because the imbalance grows with partitions-per-node).

After (experimental-placement-strategy = balanced-v2, experimental-placement-rebalance-mode = rebalance): leadership rebalanced online to a perfect 32 / 32 / 32 — no wipe, just a rolling restart with partitions + data preserved (node generations bumped N1:3/N2:2/N3:2). Under the identical workload all three nodes were evenly utilized at ~6.7-6.8 of 8 cores, and aggregate throughput rose to ~30k records/s — +30%. The logs also showed it reconfiguring loglet nodesets online ([Auto Improvement] Bifrost will reconfigure logs to improve placement), so it balanced more than just leaders.

Mechanism confirmed end-to-end: legacy HRW left one node carrying disproportionate leadership at higher partition-to-node ratios, and that node became the throughput wall; balanced-v2 spreads leaders/replicas/nodesets evenly so all nodes contribute. (At 48 partitions the legacy imbalance was milder — 18/12/18 — so the gain there is smaller; the benefit scales with partitions-per-node.) This matches the simulator's predicted leader/replica/nodeset balance improvement and confirms it translates into real throughput under sustained load, applied online to an existing cluster.

@pcholakov pcholakov force-pushed the pavel/experimental-balanced-placement branch from 531d9fc to c0a12b5 Compare May 27, 2026 19:29
@pcholakov
Copy link
Copy Markdown
Contributor Author

Follow-up: rebalancer thrash fix validated under load + 3n vs 5n scaling

Update from today's 3n vs 5n cell-sizing matrix on pavel-benchmarking (vqueues on, action throttle off, balanced-v2 placement in rebalance mode, fixed image with the bifrost auto-improvement fix). 5 workload profiles × 2 VU points each, 3 min steady-state, 60 s settle between runs.

Rebalancer fix holds under load

  • [Auto Improvement] stayed at 0 firings / 20 s for the entire matrix on 5n (~45 min of steady-state load, profiles A through E, VUs up to 6000). Old image collapsed throughput to 0 in this exact scenario under load.
  • Partition leaders held exactly 19/19/19/19/20 throughout on 5n; 32/32/32 on 3n. No churn.
  • After 3n -> 5n scale-up via replicas=5 patch + reconcile, balanced-v2 redistributed cleanly: partition placement converged from 32/32/32 to 19/19/19/19/20 leaders (38/38/38/39/39 replicas) within 60 s, autoimp dropped to 0 after the first 20 s window.

balanced-v2 scaling 3n -> 5n

prof shape                       VUs |  3n inv/s   5n inv/s    5n/3n |   3n p95    5n p95 |  3n fail  5n fail
-------------------------------------------------------------------------------------------------------------
A    100 x 256B, no delays       600 |    1391.8     1967.3    1.41x |    1.41s     1.34s |    0.00%    0.00%
A    100 x 256B, no delays      1500 |    1162.8     1785.2    1.54x |    2.32s     2.05s |    0.00%    0.00%
B    20 x 4KiB + 50ms sync       500 |     456.9      464.0    1.02x |    1.16s     1.13s |    0.00%    0.00%
B    20 x 4KiB + 50ms sync      1500 |     519.6      557.6    1.07x |    3.88s     2.87s |    0.00%    0.00%
C    20 x 4KiB + 200ms sync     2000 |     144.7      144.7    1.00x |   15.89s    15.89s |    0.00%    0.00%
C    20 x 4KiB + 200ms sync     4000 |     144.8      145.6    1.01x |   28.43s    28.34s |    0.00%    0.00%
D    10 x 16KiB + 200ms sync    1000 |     104.3      287.8    2.76x |    7.16s     3.99s |    7.63%    0.00%
D    10 x 16KiB + 200ms sync    3000 |      14.3      287.2   20.08x |    1m59s    10.76s |  100.00%    0.00%
E    10 x 4KiB + 1s suspend     1500 |       7.1       57.9    8.15x |    1m59s    30.04s |  100.00%    0.00%
E    10 x 4KiB + 1s suspend     5000 |      23.8       57.1    2.40x |    1m59s     1m30s |  100.00%    0.00%

3n "failures" are all k6 client-side 120 s timeouts (dropped_iterations=0 everywhere; the server kept accepting). 5n absorbs all backpressure cleanly.

  • Profile A (pure write): clean 1.4-1.5x scaling 3n -> 5n.
  • Profile D (chatty + 16 KiB): 5n unblocks 3n's compaction wall (3n hits 7.6% -> 100% timeouts; 5n clean at 287 inv/s).
  • Profile E (suspend path): 5n unblocks 3n's 100% timeouts; ceiling moves from ~7 to ~58 inv/s.

Non-finding worth noting

Profile C (20 × 4KiB + 200 ms sync between steps) hits an identical ~145 inv/s ceiling on BOTH 3n and 5n. Combined with profile D's 287 inv/s × 10 steps == 2878 step-ops/s and profile C's 144.7 × 20 == 2894 step-ops/s, this points to a per-step round-trip cap downstream of cluster scale (likely tunnel-client or SDK protocol round-trip rate). Profile A escapes the cap because with no delay between sets the SDK can pipeline them across a single runtime round-trip. Not a regression vs legacy placement; flagging for separate investigation. Will follow up with a step-count sweep result.

@pcholakov pcholakov force-pushed the pavel/experimental-balanced-placement branch 2 times, most recently from 7f055a5 to 3c21417 Compare May 29, 2026 09:14
@pcholakov
Copy link
Copy Markdown
Contributor Author

Real-cluster like-for-like: balanced-v2 vs legacy placement = +31% ceiling, better tail latency

Benchmarked on a live BYOC cluster, isolating only the placement strategy (every other variable held constant).

Setup (identical both runs)

  • 5 nodes, 8 vCPU / 16 GiB each, GKE; 96 partitions; partition + log replication {node:2}.
  • vqueues on in both (main supports it), action throttle off, WAL-fsync off on log-server.
  • Service-mesh outbound interception disabled for the ingress->worker hop (so the mesh isn't a confound).
  • Workload: synchronous invocations doing 20 sequential ctx.set of 4 KiB blobs with a 200 ms non-suspending pause between each (~4 s/invocation), closed-loop, k6.
  • Only variable: RESTATE_EXPERIMENTAL_PLACEMENT_STRATEGY unset (legacy) vs balanced-v2 + rebalance.

Images (exact)

  • main: ghcr.io/restatedev/restate@sha256:287ef567935d90e49a5211ca0a724ef92a3ba0083e85ebb737decdf012915206
  • experimental (this PR branch): ghcr.io/restatedev/restate:pavel-experimental-balanced-placement@sha256:4c16da5acda43bbee32984aaba710c3dd24550793258ef481273faf2890acef8
    • (branch also carries an unrelated ingress.http2-max-concurrent-streams config commit, which is unset here and has no effect on this comparison.)

Leader distribution (96 partitions / 5 nodes)

  • legacy (main): 21 / 9 / 23 / 26 / 17 — 2.9x skew (hottest node 26 leaders, coldest 9)
  • balanced-v2: 19 / 19 / 19 / 19 / 20 — even

Ceiling sweep (step-ops/s = state-writes/s; p95 latency)

offered (VUs) legacy/main balanced-v2
2000 9,594 (p95 4.09s) 9,681 (p95 4.17s)
4000 17,756 (p95 6.06s) 18,993 (p95 4.29s)
6000 17,879 (p95 16.02s) 23,391 (p95 8.77s)

Legacy plateaus at ~17,800 step-ops/s; balanced-v2 reaches ~23,400 = +31%, with markedly better tail latency (p95 16.0s -> 8.8s at 6000 VUs).

Why

The workload is compaction-bound at the ceiling. Under legacy placement the hottest node (26 leaders, ~27% of all leaders) saturates its storage first and caps the whole cluster, while the coldest node (9 leaders) sits ~2x underutilized. Per-node CPU at 4000 VUs tracked the skew exactly:

restate-3 (26 leaders): 3,289m  (hottest)
restate-1 ( 9 leaders): 1,676m  (~2x idle)

balanced-v2 evens the leaders, so all nodes' storage saturates together -> the cluster reaches its true ceiling. At light load (2000 VUs) the two are identical (neither saturated); the gap only appears under load, exactly where it matters.

Net: on a hotspot-prone, storage-bound workload, balanced-v2 reclaimed ~31% throughput and halved p95 with no downside observed.

@pcholakov pcholakov requested a review from tillrohrmann May 29, 2026 14:11
@pcholakov pcholakov force-pushed the pavel/experimental-balanced-placement branch 2 times, most recently from dd682e0 to 591403b Compare May 29, 2026 14:29
@pcholakov pcholakov force-pushed the pavel/experimental-balanced-placement branch from 591403b to a99dea6 Compare May 29, 2026 14:38
@AhmedSoliman
Copy link
Copy Markdown
Member

@tillrohrmann Can you take a look at this one? If you approve the approach, perhaps we can include it in v1.7 cut

@tillrohrmann
Copy link
Copy Markdown
Contributor

I would make this a stretch goal for the v1.7 release. Will try to take look at it as soon as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Partition and log-server nodeset placement can cause severe hotspots

3 participants