feat(warmer): herd mitigation for rolling deploys (#59) by jensens · Pull Request #69 · bluedynamics/zodb-pgjsonb

jensens · 2026-05-29T13:32:53Z

Summary

Fixes #59 — cluster-wide cache-warmer thundering herd on rolling Kubernetes deploys.

Four composable behaviors added to CacheWarmer.warm():

- A2 baseline startup delay (cache-warm-delay, default 15s): moves the warmer out of the pod's own cold-start window.
- A1 uniform-random jitter (cache-warm-jitter, default 30s): spreads arrival times across pods.
- A3 paced batched warmup (cache-warm-batch-size 500, cache-warm-batch-pause 0.5s): lowers per-pod peak qps.
- B2b N-slot PG advisory-lock semaphore (cache-warm-concurrency 2, cache-warm-wait-max 300s) with wait-on-miss + jittered retry: caps cluster-wide concurrent warmers.
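
To illustrate how A2, A1 and A3 compose, here is a minimal sketch. It is not the code from this PR: the function names (startup_sleep_seconds, iter_batches, paced_warm) and the load_batch callback are hypothetical, and only the default values (15s delay, 30s jitter, batch size 500, pause 0.5s) come from the description above. With those defaults the initial sleep falls between 15s and 45s.

```python
import random
import time
from typing import Callable, Iterator, Optional, Sequence


def startup_sleep_seconds(delay: float, jitter: float,
                          rng: Optional[random.Random] = None) -> float:
    """A2 + A1: fixed baseline delay plus uniform-random jitter in [0, jitter]."""
    source = rng if rng is not None else random
    extra = source.uniform(0.0, jitter) if jitter > 0 else 0.0
    return max(0.0, delay) + extra


def iter_batches(items: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most batch_size items."""
    if not items:
        return
    if batch_size <= 0:
        # Assumption: a non-positive batch size means "no batching".
        yield items
        return
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def paced_warm(zoids: Sequence[int],
               load_batch: Callable[[Sequence[int]], None],
               batch_size: int = 500,
               batch_pause: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """A3: load zoids batch by batch, pausing between batches (not after the last)."""
    loaded = 0
    batches = list(iter_batches(zoids, batch_size))
    for index, batch in enumerate(batches):
        load_batch(batch)  # hypothetical callback that runs one warm query
        loaded += len(batch)
        if batch_pause > 0 and index < len(batches) - 1:
            sleep(batch_pause)
    return loaded
```

A caller would sleep for startup_sleep_seconds(15, 30) once, then call paced_warm with the list of object ids to warm. Injecting sleep keeps the pacing testable without real waits.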

Six new ZConfig keys, all defaulting to conservative production values. Reuses the session-level pg_advisory_lock pattern from src/zodb_pgjsonb/startup_locks.py (introduced in #57), with a dedicated psycopg.connect so a pod crash auto-releases the slot.
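
The slot semaphore can be pictured roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: the helper names mirror the commit titles without the leading underscore, but their signatures, the LOCK_NAMESPACE key, the retry_base parameter and the "return None on timeout" contract are all hypothetical. The connect argument is assumed to return a psycopg-style connection in autocommit mode, for example `lambda: psycopg.connect(dsn, autocommit=True)`. The only PostgreSQL calls used are pg_try_advisory_lock(int, int) and pg_advisory_unlock(int, int).

```python
import random
import time
from typing import Any, Callable, Optional, Tuple

LOCK_NAMESPACE = 5907  # hypothetical first key; slots 0..N-1 form the second key


def try_acquire_slot_once(conn: Any, concurrency: int) -> Optional[int]:
    """Scan slots 0..concurrency-1 once; return the first slot locked, else None."""
    for slot in range(concurrency):
        try:
            row = conn.execute(
                "SELECT pg_try_advisory_lock(%s::int, %s::int)",
                (LOCK_NAMESPACE, slot),
            ).fetchone()
        except Exception:
            continue  # one failing slot must not stop the scan
        if row and row[0]:
            return slot
    return None


def acquire_slot(connect: Callable[[], Any], concurrency: int, wait_max: float,
                 retry_base: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic,
                 ) -> Optional[Tuple[Any, int]]:
    """Wait-on-miss with jittered retry; gives up after wait_max seconds."""
    conn = connect()  # dedicated session: a pod crash drops it and frees the slot
    deadline = clock() + wait_max
    while True:
        slot = try_acquire_slot_once(conn, concurrency)
        if slot is not None:
            return conn, slot
        if clock() >= deadline:
            conn.close()  # do not leak the lock connection on timeout
            return None
        sleep(retry_base + random.uniform(0.0, retry_base))


def release_slot(conn: Any, slot: int) -> None:
    """Unlock explicitly, then close; closing alone also releases session locks."""
    try:
        conn.execute(
            "SELECT pg_advisory_unlock(%s::int, %s::int)", (LOCK_NAMESPACE, slot)
        )
    finally:
        conn.close()
```

Because the lock is session-level, it lives exactly as long as the dedicated connection, which is what makes the slot crash-safe. What the warmer does when acquire_slot returns None (skip warming or proceed anyway) is left to the caller in this sketch.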

Observable behavior change: warmer queries are now delayed by ~15-45s post-startup, not immediate. Opt-out documented in CHANGES.

Design: docs/superpowers/specs/2026-05-29-cache-warmer-herd-mitigation-design.md
Plan: docs/superpowers/plans/2026-05-29-cache-warmer-herd-mitigation.md

Issue lifecycle

Issue #59 stays open after merge — the bigger structural plays (materialized warm-set table, dump-export, replica routing) remain tracked there.

Test plan

- uv run pytest: full suite green (530 passed locally)
- uv run pytest tests/test_cache_warmer.py -v: 43 warmer tests pass (unit + DB integration)
- uv run pytest tests/test_cache_warmer.py::TestCacheWarmerDB -v -k 'concurrency or lock_released': three new DB-marked semaphore tests pass against localhost:5433
- uv run ruff check . && uv run ruff format --check .: clean
- Validate on aaf-prod: next rolling deploy should replace the ~90% primary CPU spike with sustained ~30% over ~45s

🤖 Generated with Claude Code

A1 (jitter) + A2 (baseline delay) + A3 (paced batched warmup) + B2b (N-slot PG advisory-lock semaphore with wait-on-miss) for the cluster-wide inter-pod warmer stampede observed on rolling deploys. Six new config knobs (cache-warm-delay/-jitter/-concurrency/-wait-max/-batch-size/-batch-pause) with conservative defaults. Reuses the session-level advisory lock pattern from startup_locks.py for slot allocation. Last-pod latency analysis included.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bite-sized TDD-ordered tasks (14 total) covering: kwarg extension, A2+A1 delay/jitter, A3 batched warmup, B2b advisory-lock semaphore helpers (try-once, retry-loop, release), warm() integration, storage wiring, ZConfig keys, three DB integration tests, CHANGES entry, full test + lint pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- close lock connection on slot acquisition timeout (was leaking)
- continue scanning remaining slots when single-slot try_lock raises (was halting prematurely, deviating from spec)
- use fresh psycopg connection per warmer thread in DB integration tests (psycopg connections are not thread-safe)
- assert lock_conn.close() called on timeout path in unit test

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
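
The "fresh connection per warmer thread" fix can be sketched like this. The helper name run_in_threads and its arguments are hypothetical, not taken from the test suite. Beyond the thread-safety reason given in the commit message, a second consideration applies to semaphore tests in general: session-level advisory locks belong to the session, and the same session may take the same lock again, so warmers sharing one connection would never block each other.

```python
import threading
from typing import Any, Callable, List


def run_in_threads(n: int,
                   connect: Callable[[], Any],
                   warm: Callable[[Any], None]) -> List[BaseException]:
    """Run n warmers in parallel; each opens and closes its own connection."""
    errors: List[BaseException] = []

    def worker() -> None:
        conn = connect()  # opened inside the thread, never shared
        try:
            warm(conn)
        except Exception as exc:
            errors.append(exc)  # collected so the test can assert on failures
        finally:
            conn.close()

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

In a DB-marked test, connect would be something like `lambda: psycopg.connect(dsn)` and warm would call the warmer under test; both are assumptions here.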

jensens and others added 15 commits May 29, 2026 13:43

feat(warmer): add six herd-mitigation kwargs (off defaults) (#59)

3892ba3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(warmer): A2+A1 — startup delay + jitter sleep (#59)

d335b3e

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(warmer): A3 — paced batched warmup (#59)

5bd0843

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(warmer): B2b — _try_acquire_slot_once helper (#59)

2551a89

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(warmer): B2b — _acquire_slot with retry loop and wait cap (#59)

1c06aa5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(warmer): B2b — _release_slot helper (#59)

f712c80

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(warmer): B2b — integrate slot lifecycle into warm() (#59)

5b8ab0d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(storage): wire herd-mitigation knobs into CacheWarmer (#59)

2f849a7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(config): six ZConfig keys for cache-warm herd mitigation (#59)

c756d89

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(warmer): DB integration tests for B2b semaphore (#59)

574277e

- concurrency=1 serializes two warmers
- concurrency=2 allows two parallel, third waits
- crash-safe slot release on connection close

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs: changelog entry for cache warmer herd mitigation (#59)

5b34d58

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

style: ruff format + lint after herd-mitigation changes

215361a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jensens merged commit 110230d into main May 29, 2026
5 checks passed

jensens deleted the 59-warmer-herd-mitigation branch May 29, 2026 13:39

jensens mentioned this pull request May 29, 2026

Cache warmer thundering herd on rolling deploys (N pods warming in parallel) #59

Open
