
feat(warmer): herd mitigation for rolling deploys (#59) #69

Merged
jensens merged 15 commits into main from 59-warmer-herd-mitigation
May 29, 2026
@jensens jensens commented May 29, 2026

Member

Summary

Fixes #59 — cluster-wide cache-warmer thundering herd on rolling Kubernetes deploys.

Four composable behaviors added to CacheWarmer.warm():

  • A2 baseline startup delay (cache-warm-delay, default 15s) — moves the warmer out of the pod's own cold-start window.
  • A1 uniform-random jitter (cache-warm-jitter, default 30s) — spreads arrival times across pods.
  • A3 paced batched warmup (cache-warm-batch-size 500, cache-warm-batch-pause 0.5s) — lowers per-pod peak qps.
  • B2b N-slot PG advisory-lock semaphore (cache-warm-concurrency 2, cache-warm-wait-max 300s) with wait-on-miss + jittered retry — caps cluster-wide concurrent warmers.
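
The first three behaviors (A2, A1, A3) can be illustrated with a minimal sketch. It is not the code in CacheWarmer.warm(); the function names `startup_wait` and `paced_warm` and the `load_batch` callback are hypothetical, and only the default values are taken from the list above.

```python
import random
import time
from typing import Callable, Iterator, List, Sequence


def startup_wait(delay: float = 15.0, jitter: float = 30.0,
                 rng: Callable[[float, float], float] = random.uniform) -> float:
    """A2 + A1: fixed baseline delay plus uniform-random jitter, in seconds."""
    return delay + rng(0.0, jitter)


def batches(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def paced_warm(zoids: Sequence[int],
               load_batch: Callable[[Sequence[int]], None],
               batch_size: int = 500,
               batch_pause: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """A3: load objects batch by batch, pausing between batches.

    `load_batch` is a hypothetical stand-in for whatever query the warmer
    runs per batch. No pause is taken after the final batch.
    """
    chunks: List[Sequence[int]] = list(batches(zoids, batch_size))
    loaded = 0
    for index, chunk in enumerate(chunks):
        load_batch(chunk)
        loaded += len(chunk)
        if index < len(chunks) - 1:
            sleep(batch_pause)
    return loaded
```

With the defaults, `startup_wait()` returns a value between 15 and 45 seconds, which matches the "~15-45s post-startup" figure given under the observable behavior change. Injecting `sleep` and `rng` keeps the sketch testable without real waiting.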

Six new ZConfig keys, all defaulting to conservative production values. Reuses the session-level pg_advisory_lock pattern from src/zodb_pgjsonb/startup_locks.py (introduced in #57), with a dedicated psycopg.connect so a pod crash auto-releases the slot.
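
The N-slot semaphore (B2b) could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of startup_locks.py: `LOCK_NAMESPACE`, `try_acquire_slot`, `acquire_slot`, and `retry_base` are hypothetical names, and the lock key layout is invented. The only PostgreSQL call used is `pg_try_advisory_lock(int, int)`, which takes a session-level lock that the server drops when the connection closes.

```python
import random
import time
from typing import Any, Callable, Optional, Tuple

LOCK_NAMESPACE = 0x5A57  # hypothetical key namespace, not the project's value


def try_acquire_slot(conn: Any, slots: int) -> Optional[int]:
    """Scan all slots once; return the first one we could lock, else None."""
    for slot in range(slots):
        try:
            row = conn.execute(
                "SELECT pg_try_advisory_lock(%s, %s)", (LOCK_NAMESPACE, slot)
            ).fetchone()
        except Exception:
            # Assumes an autocommit connection, so one failed statement
            # does not poison the session; keep scanning remaining slots.
            continue
        if row and row[0]:
            return slot
    return None


def acquire_slot(connect: Callable[[], Any],
                 slots: int = 2,
                 wait_max: float = 300.0,
                 retry_base: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep,
                 ) -> Optional[Tuple[Any, int]]:
    """Wait-on-miss with jittered retry, bounded by `wait_max` seconds.

    Returns (connection, slot) on success; the caller keeps the connection
    open while warming and closes it to release the slot. Returns None on
    timeout, after closing the dedicated connection so it does not leak.
    """
    conn = connect()
    deadline = clock() + wait_max
    try:
        while True:
            slot = try_acquire_slot(conn, slots)
            if slot is not None:
                return conn, slot
            if clock() >= deadline:
                break
            sleep(retry_base + random.uniform(0.0, retry_base))
    except BaseException:
        conn.close()
        raise
    conn.close()
    return None
```

In this sketch `connect` would be something like `functools.partial(psycopg.connect, dsn, autocommit=True)`. Because the lock lives on its own connection, a crashed pod's TCP session ends and PostgreSQL frees the slot without any cleanup code. What the warmer does when `acquire_slot` returns None is left to the caller here; the sketch does not claim to match the PR's timeout behavior.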

Observable behavior change: warmer queries are now delayed by ~15-45s post-startup, not immediate. Opt-out documented in CHANGES.

Design: docs/superpowers/specs/2026-05-29-cache-warmer-herd-mitigation-design.md
Plan: docs/superpowers/plans/2026-05-29-cache-warmer-herd-mitigation.md

Issue lifecycle

Issue #59 stays open after merge — the bigger structural plays (materialized warm-set table, dump-export, replica routing) remain tracked there.

Test plan

  • uv run pytest — full suite green (530 passed locally)
  • uv run pytest tests/test_cache_warmer.py -v — 43 warmer tests pass (unit + DB integration)
  • uv run pytest tests/test_cache_warmer.py::TestCacheWarmerDB -v -k 'concurrency or lock_released' — three new DB-marked semaphore tests pass against localhost:5433
  • uv run ruff check . && uv run ruff format --check . — clean
  • Validate on aaf-prod: next rolling deploy should replace the ~90% primary CPU spike with sustained ~30% over ~45s

🤖 Generated with Claude Code

jensens and others added 15 commits May 29, 2026 13:43
A1 (jitter) + A2 (baseline delay) + A3 (paced batched warmup) + B2b
(N-slot PG advisory-lock semaphore with wait-on-miss) for the cluster-
wide inter-pod warmer stampede observed on rolling deploys.

Six new config knobs (cache-warm-delay/-jitter/-concurrency/-wait-max/
-batch-size/-batch-pause) with conservative defaults. Reuses the
session-level advisory lock pattern from startup_locks.py for slot
allocation. Last-pod latency analysis included.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bite-sized TDD-ordered tasks (14 total) covering: kwarg extension,
A2+A1 delay/jitter, A3 batched warmup, B2b advisory-lock semaphore
helpers (try-once, retry-loop, release), warm() integration, storage
wiring, ZConfig keys, three DB integration tests, CHANGES entry, full
test + lint pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- concurrency=1 serializes two warmers
- concurrency=2 allows two parallel, third waits
- crash-safe slot release on connection close

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- close lock connection on slot acquisition timeout (was leaking)
- continue scanning remaining slots when single-slot try_lock raises
  (was halting prematurely, deviating from spec)
- use fresh psycopg connection per warmer thread in DB integration
  tests (psycopg connections are not thread-safe)
- assert lock_conn.close() called on timeout path in unit test

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jensens jensens merged commit 110230d into main May 29, 2026
5 checks passed
@jensens jensens deleted the 59-warmer-herd-mitigation branch May 29, 2026 13:39
Development

Successfully merging this pull request may close these issues.

Cache warmer thundering herd on rolling deploys (N pods warming in parallel)
