feat(warmer): herd mitigation for rolling deploys (#59) #69
Merged
A1 (jitter) + A2 (baseline delay) + A3 (paced batched warmup) + B2b (N-slot PG advisory-lock semaphore with wait-on-miss) for the cluster-wide inter-pod warmer stampede observed on rolling deploys. Six new config knobs (cache-warm-delay/-jitter/-concurrency/-wait-max/-batch-size/-batch-pause) with conservative defaults. Reuses the session-level advisory lock pattern from startup_locks.py for slot allocation. Last-pod latency analysis included.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bite-sized TDD-ordered tasks (14 total) covering: kwarg extension, A2+A1 delay/jitter, A3 batched warmup, B2b advisory-lock semaphore helpers (try-once, retry-loop, release), warm() integration, storage wiring, ZConfig keys, three DB integration tests, CHANGES entry, full test + lint pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- concurrency=1 serializes two warmers
- concurrency=2 allows two parallel, third waits
- crash-safe slot release on connection close
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- close lock connection on slot acquisition timeout (was leaking)
- continue scanning remaining slots when single-slot try_lock raises (was halting prematurely, deviating from spec)
- use fresh psycopg connection per warmer thread in DB integration tests (psycopg connections are not thread-safe)
- assert lock_conn.close() called on timeout path in unit test
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
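The first two fixes in this commit (keep scanning when one slot probe raises, close the lock connection on timeout) can be illustrated with a minimal sketch. This is not the code from the PR: `try_acquire_slot`, `acquire_slot`, `retry_base`, and the injected `try_lock`, `clock`, and `sleep` callables are hypothetical names chosen so the logic runs without a database. In a real warmer, `try_lock(slot)` would presumably issue something like `SELECT pg_try_advisory_lock(key)` on the dedicated lock connection; that mapping is an assumption here.

```python
import random
import time
from typing import Callable, Optional


def try_acquire_slot(try_lock: Callable[[int], bool], slots: int) -> Optional[int]:
    """Probe each slot once; return the first slot locked, or None on a full miss."""
    for slot in range(slots):
        try:
            if try_lock(slot):
                return slot
        except Exception:
            # A failing probe on one slot must not end the scan early;
            # fall through to the remaining slots.
            continue
    return None


def acquire_slot(
    conn,
    try_lock: Callable[[int], bool],
    slots: int = 2,           # mirrors cache-warm-concurrency
    wait_max: float = 300.0,  # mirrors cache-warm-wait-max, in seconds
    retry_base: float = 1.0,  # hypothetical retry interval, not from the PR
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Wait-on-miss loop: retry with jitter until a slot is free or wait_max elapses."""
    deadline = clock() + wait_max
    while True:
        slot = try_acquire_slot(try_lock, slots)
        if slot is not None:
            return slot
        if clock() >= deadline:
            # Timeout path: close the dedicated lock connection so it does not leak.
            conn.close()
            return None
        # Jittered retry so waiting pods do not re-probe in lockstep.
        sleep(retry_base + random.uniform(0.0, retry_base))
```

Injecting `clock` and `sleep` keeps the timeout path testable without real waiting, which is one way to assert that `close()` is called on timeout, as the unit test in this commit does.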
Summary
Fixes #59 — cluster-wide cache-warmer thundering herd on rolling Kubernetes deploys.
Four composable behaviors added to CacheWarmer.warm():
- A2 baseline delay (cache-warm-delay, default 15s): moves the warmer out of the pod's own cold-start window.
- A1 jitter (cache-warm-jitter, default 30s): spreads arrival times across pods.
- A3 paced batched warmup (cache-warm-batch-size 500, cache-warm-batch-pause 0.5s): lowers per-pod peak qps.
- B2b N-slot PG advisory-lock semaphore (cache-warm-concurrency 2, cache-warm-wait-max 300s) with wait-on-miss + jittered retry: caps cluster-wide concurrent warmers.

Six new ZConfig keys, all defaulting to conservative production values. Reuses the session-level pg_advisory_lock pattern from src/zodb_pgjsonb/startup_locks.py (introduced in #57), with a dedicated psycopg.connect so a pod crash auto-releases the slot.

Observable behavior change: warmer queries are now delayed by ~15-45s post-startup, not immediate. Opt-out documented in CHANGES.
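As a rough illustration of how the delay, jitter, and batch knobs combine, here is a minimal sketch. It is not the PR's implementation: `startup_delay`, `batches`, `warm_paced`, and `load_batch` are hypothetical names, and uniform jitter is an assumption (it matches the ~15-45s window quoted above for a 15s delay and 30s jitter).

```python
import random
import time
from typing import Callable, Iterator, Optional, Sequence


def startup_delay(
    base: float = 15.0,    # mirrors cache-warm-delay
    jitter: float = 30.0,  # mirrors cache-warm-jitter
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before warming: baseline plus uniform jitter (assumed)."""
    rng = rng or random.Random()
    return base + rng.uniform(0.0, jitter)


def batches(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def warm_paced(
    zoids: Sequence[int],
    load_batch: Callable[[Sequence[int]], None],  # hypothetical per-batch loader
    batch_size: int = 500,     # mirrors cache-warm-batch-size
    batch_pause: float = 0.5,  # mirrors cache-warm-batch-pause
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Load ids batch by batch, pausing between batches to flatten peak qps."""
    chunks = list(batches(zoids, batch_size))
    loaded = 0
    for index, chunk in enumerate(chunks):
        load_batch(chunk)
        loaded += len(chunk)
        if index < len(chunks) - 1:
            # Pause only between batches, not after the last one.
            sleep(batch_pause)
    return loaded
```

With these defaults, 1200 ids would be loaded as batches of 500, 500, and 200 with two 0.5s pauses in between, after an initial wait somewhere between 15s and 45s.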
Design: docs/superpowers/specs/2026-05-29-cache-warmer-herd-mitigation-design.md
Plan: docs/superpowers/plans/2026-05-29-cache-warmer-herd-mitigation.md
Issue lifecycle
Issue #59 stays open after merge — the bigger structural plays (materialized warm-set table, dump-export, replica routing) remain tracked there.
Test plan
- uv run pytest: full suite green (530 passed locally)
- uv run pytest tests/test_cache_warmer.py -v: 43 warmer tests pass (unit + DB integration)
- uv run pytest tests/test_cache_warmer.py::TestCacheWarmerDB -v -k 'concurrency or lock_released': three new DB-marked semaphore tests pass against localhost:5433
- uv run ruff check . && uv run ruff format --check .: clean

🤖 Generated with Claude Code