test(opal-server): git leak/resilience test environment (PR1)#922
test(opal-server): git leak/resilience test environment (PR1)#922dshoen619 wants to merge 15 commits into
Conversation
Add an off-by-default diagnostics endpoint so tests can observe the in-memory GitPolicyFetcher cache sizes (repo_locks/repos/repos_last_fetched) and process RSS that the upcoming memory-leak fix eliminates. - debug_stats.py: read-only git_fetcher_cache_stats() helper + a register_internal_stats_route() registrar that mounts GET /internal/git-fetcher-cache-stats only when enabled. - config.py: new OPAL_DEBUG_INTERNAL_STATS flag, default False. - server.py: register the route, gated by the flag, beside /healthcheck. No production behavior change when the flag is off (the default). Also ignore .claude/ so private planning artifacts are never committed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A self-contained docker-compose stack (opal-server x2 workers + Redis + Postgres broadcaster + Gitea) plus a pytest harness that reproduces, as tests that fail on master, the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect gap. These become the regression gates for the follow-up fixes. - seed/: idempotent Gitea seeding sidecar (N policy repos) + Dockerfile. - docker-compose.yml: 4-service stack, opal-server built from the repo's own docker/Dockerfile (server target), scopes on, Postgres broadcaster. - helpers.py / conftest.py: HTTP + infra helpers and stack fixtures. - test_leak.py / test_resilience.py / test_boot.py: the flagship tests. - README.md: how to run and expected fail-on-master behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
✅ Deploy Preview for opal-docs canceled.
|
Complete the helpers.py surface promised in the plan's file-structure table. Both are now functional and used, not dead code: - make_repo_unreachable(name): returns a git URL on a routable-but-dead TEST-NET-1 host (RFC 5737). test_offline_repo now uses it instead of an inlined literal. - GiteaAdmin: host-side Gitea admin client (list_repos / repo_exists / create_repo / delete_repo), exposed as the `gitea_admin` pytest fixture for tests that need to inspect or stage repos beyond the seed sidecar. Gitea is published on host port 13000 (uncommon, to avoid the usual :3000 clash) so GiteaAdmin can reach it; opal_server and the seed sidecar still use the internal http://gitea:3000. README updated with the helper and port notes. Verified live: GiteaAdmin lists the seeded repos and round-trips create/exists/delete against Gitea over the published port. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified test_server_recovers_after_postgres_bounce against the stack: it PASSES on master (~14-19s). On a broadcaster drop the affected worker triggers a graceful shutdown, gunicorn respawns it, and the sibling worker keeps serving HTTP, so the surface recovers within the window — recovery happens via gunicorn's in-container worker supervision, not an external supervisor and not an in-process reconnect. Reframe #5 as a recovery guard (not a known-broken case) in the docstring and README; the prior "FAILS on master / needs external supervisor" wording was wrong. PER-15065's in-process reconnect would avoid the worker churn but recovery already holds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run the repo's pinned pre-commit formatters (black 23.1.0, isort 5.12.0, docformatter 1.7.5) over the PR1 files to satisfy the pre-commit CI check. Formatting only — no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds an off-by-default OPAL server internal diagnostics endpoint for git-fetcher cache/RSS stats, and introduces a docker-compose-based “git leak/resilience” integration testbed intended to reproduce/regress several production issues (leak, offline repo hang, boot slowness, broadcaster disconnect recovery).
Changes:
- Add
/internal/git-fetcher-cache-statsendpoint gated byOPAL_DEBUG_INTERNAL_STATS(default off), plus unit tests. - Add
app-tests/git-leak/docker-compose stack (opal-server + redis/postgres/gitea) and pytest harness (boot/leak/resilience tests). - Add
.claude/to.gitignore.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/opal-server/opal_server/debug_stats.py | New helper to read git-fetcher cache sizes + RSS and register a gated internal route. |
| packages/opal-server/opal_server/config.py | Adds DEBUG_INTERNAL_STATS config flag (default False). |
| packages/opal-server/opal_server/server.py | Mounts the internal stats route when enabled. |
| packages/opal-server/opal_server/tests/debug_stats_test.py | Unit tests for stats dict sizing + flag default. |
| packages/opal-server/opal_server/tests/debug_stats_endpoint_test.py | Unit tests for endpoint presence/absence when gated. |
| app-tests/git-leak/docker-compose.yml | Compose stack for OPAL + dependencies + seeding sidecars. |
| app-tests/git-leak/helpers.py | Host-side HTTP/compose helpers for the test harness. |
| app-tests/git-leak/conftest.py | Pytest fixtures to boot/teardown the stack and provide clients. |
| app-tests/git-leak/test_leak.py | Leak regression tests (churn + repeat sync). |
| app-tests/git-leak/test_resilience.py | Offline-repo hang test + Postgres bounce recovery guard. |
| app-tests/git-leak/test_boot.py | Boot timing/baseline test. |
| app-tests/git-leak/seed/seed_gitea.py | Idempotent Gitea seeding script for N repos. |
| app-tests/git-leak/seed/Dockerfile | Container image for the seeding sidecar. |
| app-tests/git-leak/README.md | Documentation for running the new testbed locally. |
| .gitignore | Ignores .claude/ working artifacts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The CI `build` job runs `pytest` from the repo root with no path, which recursed into app-tests/git-leak/ and ran the flagship tests — these are designed to FAIL on master (they are the regression gates for PR2-PR5), so they broke the build job. Set `testpaths = packages` so the rootdir run collects only the unit tests under packages/ (matching master's effective behavior, since app-tests/ had no pytest files before). testpaths only applies when pytest is invoked from the rootdir with no args, so `cd app-tests/git-leak && pytest` still collects and runs the test bed. Verified both: root run -> packages only; subdir run -> all 5 flagship tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- stats(): sample the /internal endpoint several times and merge per key with max(). The caches are per-process on a multi-worker server, so a single read can hit an empty (non-leader) worker — max-merge avoids both false negatives (missing a populated leader) and false positives (an `== 0` drain assertion passing only because it hit an empty worker). - test_leak: assert the initial-load `_wait_until` succeeded before deleting / before taking baseline, so the tests can't pass vacuously when load never completed. - refresh_all(): correct the misleading comment — a 404 is a no-op, there is no client-side fallback. - conftest: skip the suite cleanly if docker is unavailable (defense in depth; it's already excluded from the default pytest run via testpaths). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two review findings on the regression gates: - Cross-test contamination: clone paths are keyed by repo URL (source_id = sha256(url)+branch-shard), not scope_id, so scopes from different tests that point at the same seeded repo share one GitPolicyFetcher cache entry. With a session-scoped stack and no teardown, a leftover boot-*/stable-* scope kept those entries alive and would make test_churn's `repos == 0` drain assertion fail on fixed code. OpalServerClient now tracks created scopes and the opal fixture deletes them on teardown (best-effort drain wait, swallows errors so master — where delete never purges — doesn't fail the passing test). - False gate: test_repeat_sync_does_not_grow re-syncs identical scopes, which a path-keyed cache can't grow even on master, so it could never be the leak gate it claimed. Reframed as an honest idempotency guard (passes on master) that points at test_churn_releases_caches as the real leak gate; README's "Expected on master" reclassifies it alongside the postgres-bounce guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
zeevmoney
left a comment
There was a problem hiding this comment.
Requesting changes — the regression gates do not function as gates yet
Scope check first, because it's the reassuring part: the production surface is safe. The only always-on change is a benign import plus a default-off OPAL_DEBUG_INTERNAL_STATS config key; with the flag off (the default) there is zero behavior change, and when on the handler can't crash (the class dicts always exist, len() is GIL-atomic against the event-loop mutations, and _read_rss_kb swallows errors). testpaths = packages drops no existing tests. Git hygiene is clean (nothing private leaked into the diff, no rebase needed). This PR will not crash production.
The problem is the actual deliverable — the five flagship tests meant to gate PR2–PR5.
What's blocking (must fix before this can be trusted as the gate)
-
The load/drain gates assert on
GitPolicyFetcher.repos, which the first-sync clone path never populates (see the CRITICAL inline comment ontest_leak.py:26). On a fresh scope,fetch_and_notify_on_changestakes the_clone()branch, which only setsrepo_locks—reposis filled solely by_get_repo()on a second sync or the bundle path, and periodic re-sync is disabled in the stack (OPAL_POLICY_REFRESH_INTERVAL: "0"). As a resulttest_churn_releases_caches,test_repeat_sync_does_not_grow, andtest_offline_repo_does_not_block_healthy_scopeshang at their_wait_until(repos >= n)load gate for the full timeout and then fail at the load stage — they never reach the leak/offline logic they exist to test, and they would not flip green after PR2/PR3 land. Onlytest_boot_loads_all_scopesworks, and only incidentally (compose restart→preload_scopesre-discovers the on-disk clones). This is the same root cause the PR description flags as "Known caveat #2" — it needs to be fixed, not just noted. -
Even once load is fixed, the cache assertions are not reliable on a 2-worker stack.
stats()'s max-merge cannot prevent a== 0drain assertion from falsely passing when the samples miss the leader worker (HIGH onhelpers.py:50). The cache tests should run single-worker, or the endpoint should aggregate across workers. -
Failure modes are invisible.
compose()swallows stdout/stderr on any compose failure (HIGH onhelpers.py:218), and partial seeding is never detected before the suite runs (HIGH onconftest.py:40) — so a broken stack or a half-seeded Gitea looks like a test failure for the wrong reason.
Secondary correctness (should fix)
test_repeat_sync_does_not_growis tautological —len(repos)can't grow on repeat URL-keyed sync (test_leak.py:64).- The offline test's TEST-NET-1 address likely fails fast instead of hanging, so it may not exercise starvation (
helpers.py:205). - The
opalfixture clears_created_scopesat setup, orphaning scopes from a failed prior test and contaminating the next (conftest.py:55). - The postgres-bounce test asserts only HTTP liveness, not broadcaster recovery, and doesn't
--waitfor Postgres (test_resilience.py:51). - README/docstrings say "fails on master," but the suite requires this PR's
/internalendpoint (README.md:30). - Dead/misleading 404 branch in
refresh_all()(helpers.py:128).
Minor
- Enabling the flag exposes an unauthenticated
/internalroute (non-sensitive payload, off by default) (debug_stats.py:39). || truemasks genuinegitea-adminfailures (docker-compose.yml:58).- Boot-timing clock starts after
wait_healthy, undercounting boot-sync time (test_boot.py:24).
Bottom line
All 11 plan tasks are implemented and the production change is safe, but the suite needs to (1) re-key the load/drain assertions to a cache the sync path actually populates (or force a policy fetch), (2) make the cache reads worker-deterministic, (3) surface compose/seed failures, and then (4) be validated end-to-end against a live stack to confirm each test fails/passes for the right reason. Until then it can't be relied on as the regression gate for PR2–PR5.
…iew) Addresses the CHANGES_REQUESTED review on PR #922. Root cause behind most findings: a fresh scope's first sync takes the _clone() branch, which only fills GitPolicyFetcher.repo_locks; repos/repos_last_fetched are filled on a *second* sync. So the load gates on `repos` hung, and the 2-worker per-process caches made `== 0` drain assertions unsafe. Blocking: - Single worker (UVICORN_NUM_WORKERS=1): deterministic per-process cache reads; removes the false `== 0` drain class. - Load gate (CRITICAL + Zivxx HIGH): _load_scopes gates on repo_locks then refresh_all() to force the second sync, so repos/repos_last_fetched are actually populated before any drain/purge assertion. - compose() surfaces captured stdout/stderr on failure. - Seed completeness asserted in conftest; seed script isolates per-repo failures and exits non-zero with a count. Secondary: - test_repeat_sync asserts an RSS bound (count can't grow for any impl); churn asserts all three caches drain + a loose RSS backstop. - blackhole socat sidecar replaces TEST-NET-1 (deterministic hang); offline test saturates the fetch executor with 40 hung clones and recovers via OpalServerClient.hard_reset() (stop -> redis FLUSHALL -> start). - Per-test clean slate deletes all server scopes (fixes orphan-scope leak). - Postgres-bounce proves broadcast recovery (PUT post-bounce, assert sync) and uses `up -d --wait`. - Remove dead 404 branch in refresh_all; boot clock starts at restart; gitea-admin `|| true` -> "already exists"-only guard; README reworded. opal-server: - /internal stats route now takes the JWTAuthenticator dependency (protected when JWT on, no-op in the test bed); unit test asserts enforcement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review addressed —
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- snapshot a single /internal stats read per poll in the churn-drain and boot gates (consistent multi-key observation; fewer HTTP round-trips) - document why a 200 from the healthy scope can't be a masked default bundle, and why the stats route is intentionally a sync def - pin alpine/socat and the seed image's pip deps for reproducibility Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t-environment-one-big-pr
…repo Ziv (PR review, round 2) caught that the offline-resilience gate can false-PASS in the full-suite run. The "healthy" scope pointed at policy-repo-0000 (list_seeded_repos(1)[0]), but on-disk clones are keyed by URL-hash and survive compose restart/stop/start (opal_server mounts no volume at /opal; only `down -v` wipes them). test_boot/test_leak run first (alphabetical) and already clone every seeded repo, so the healthy scope hit the existing clone via _discover_repository, skipped _clone(), and served 200 without ever touching the saturated fetch executor — the gate that must FAIL on this branch (no PR3 timeout) passed. Fix: seed a reserved repo (policy-repo-healthy-probe) outside the numeric policy-repo-NNNN range that no boot/leak test enumerates, and point the healthy probe at it. A never-cloned repo forces a genuine fresh clone through the starved pool, so the gate fails correctly. The seed- completeness check in conftest now covers the reserved repo too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…view) Follow-up to PR review of the git-leak/resilience test bed: - hard_reset: always restart opal_server + wait_healthy via a finally, so a failed redis FLUSHALL can't leave the server stopped (which would fail every later session-scoped test and, running in a test finally, mask the result). - delete_all_scopes drain: a transient /internal read error no longer counts as a successful drain (was `except: return`); keep polling to the deadline so a not-yet-drained cache can't leak into the next test once PR2 lands. - use stats(samples=1) for the zero-waiting drain/empty polls (the peak-merge only matters for load assertions; this also drops 3x HTTP per poll). - resilience: narrow broad `except Exception` to requests.RequestException (and RuntimeError for wait_healthy timeout) so harness bugs surface instead of masquerading as "never served"/"never recovered". - resilience: collapse `assert opal.stats()` + redundant re-read into one read. - debug_stats_test: assert rss_kb > 0 on Linux (was `>= 0`, which passed for the wrong reason where /proc is absent and RSS reads fall back to 0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t-environment-one-big-pr
|
@/private/tmp/claude-501/-Users-zeevmanilovich-NewDev-opal/deddbdb3-9e66-4bdc-86ca-0a2e991c021b/scratchpad/corrected.md |
zeevmoney
left a comment
There was a problem hiding this comment.
Approving — the test bed is sound and this PR has already been through several review rounds (Copilot / @Zivxx / @zeevmoney) with the substantive issues raised and resolved. A fresh python-pro pass (planned against the opal-development references) surfaced only non-blocking items; none are CRITICAL/HIGH, so they don't gate landing the test bed.
Issues to fix (tracked under PER-15155) — inline comments above:
Worth fixing before/with merge
- MEDIUM
pytest.ini:10—testpaths = packagesmakes the README's documentedcd app-tests/git-leak && pytest --boot-scopes=50error withunrecognized arguments: --boot-scopes. Self-root the suite + use a targeted ignore. - MEDIUM
test_resilience.py:123— the bounce test'srepo_locks > baselinesignal is order-dependent across files (sharedpolicy-repo-0000+ non-purging caches on master); assertGET /scopes/post-bounce/policy == 200instead.
Follow-ups (LOW)
helpers.py:274— add atimeout=tocompose()so a wedged setup fails fast instead of hitting the CI job limit.helpers.py:148—delete_all_scopesburns ~40s/test waiting on a drain that can't happen on master; short-circuit it.seed_gitea.py:159— remove the unused/seed-output/tokenartifact + volume.seed_gitea.py:108— makepush_urlcredential injection scheme-agnostic (urllib.parse).
Doc nit (not posted inline): the bounce-test docstring (test_resilience.py:85-93) still describes pre-PER-15065 "graceful shutdown + gunicorn respawn over the recovered broadcaster"; reconnect is now in-place — worth a 3-line update.
Recommend merging PR1 first so PR2 (#923) / PR3 (#924) can build on the test bed + /internal endpoint, then addressing the two MEDIUMs.
| # not be collected by the repo's CI `pytest` run. testpaths only applies when | ||
| # pytest is invoked from the rootdir with no args, so running it explicitly | ||
| # (e.g. `cd app-tests/git-leak && pytest`) still works. | ||
| testpaths = packages |
There was a problem hiding this comment.
[MEDIUM] testpaths = packages breaks the documented way to run this suite
Problem: testpaths resolves against the rootdir (the repo root, where this pytest.ini lives) and applies whenever no positional path is given — independent of cwd. So the README's documented cd app-tests/git-leak && python -m pytest --boot-scopes=50 collects packages/ instead of this directory, never loads app-tests/git-leak/conftest.py (where --boot-scopes is registered), and errors with unrecognized arguments: --boot-scopes. It also silently drops any current/future test outside packages/ from the CI pytest run.
Suggestion: Add a pytest.ini (or a rootdir marker) under app-tests/git-leak/ so the suite is self-rooted and runnable in place, and replace the global testpaths = packages with a targeted --ignore=app-tests / norecursedirs. Please confirm by running the exact documented command.
| synced = False | ||
| deadline = time.time() + 120 | ||
| while time.time() < deadline: | ||
| if opal.stats()["repo_locks"] > baseline_locks: |
There was a problem hiding this comment.
[MEDIUM] Bounce-test signal is order-dependent across test files
Problem: The success check keys on opal.stats()["repo_locks"] > baseline_locks — a process-global, source_id(URL)-keyed counter shared with test_leak.py (both use policy-repo-0000). On master delete_scope never purges the caches (the leak this suite gates), so the opal fixture's delete_all_scopes can't drain repo_locks. If policy-repo-0000 is already cached, the post-bounce PUT adds no new key → the delta check never trips → false "broadcaster did not recover" failure. It passes today only because the preceding offline test's hard_reset() restarts the process and flushes Redis; reorder (e.g. run after test_leak without the offline test in between) and it false-fails.
Suggestion: Point the post-bounce PUT at a dedicated never-synced repo, or assert on the specific scope being served (GET /scopes/post-bounce/policy == 200) instead of a delta on a shared process-global counter.
| re-raise with the captured stdout/stderr embedded, otherwise a | ||
| broken build/seed/ restart is opaque to debug. | ||
| """ | ||
| proc = subprocess.run( |
There was a problem hiding this comment.
[LOW] compose() has no subprocess timeout — fixture setup can hang to the CI job limit
Problem: subprocess.run(["docker", "compose", *args], capture_output=True, text=True) is called with no timeout=. @pytest.mark.timeout(...) on tests does not cover session-scoped fixture setup, so a wedged docker compose up -d --build or a compose("wait", "seed") on a stuck seed container blocks until the workflow's 60-minute job limit rather than failing fast. (Shell-injection is not a concern — args are a list, no shell=True.)
Suggestion: Pass a generous timeout= to compose() (at least for up/wait/stop/start) and raise a clear error on expiry. Optionally probe docker compose version for the availability skip instead of only shutil.which("docker").
| except Exception: | ||
| pass | ||
| self._created_scopes.clear() | ||
| deadline = time.time() + drain_timeout |
There was a problem hiding this comment.
[LOW] delete_all_scopes burns the full drain timeout twice per test
Problem: On master the caches never purge on delete (the leak this suite gates), so the while time.time() < deadline loop waiting for repo_locks == 0 always runs its full drain_timeout=20s. This runs in the opal fixture's setup and teardown → roughly 40s of guaranteed dead wait per test, for every test, for the entire expected-failing lifetime of the suite.
Suggestion: Short-circuit the drain on the known-not-purged path (cap it at a couple of seconds, or skip the drain in teardown), since waiting the full timeout for a state that can't occur on master is pure overhead.
| return 1 | ||
|
|
||
| # write the token where the test harness can read it | ||
| Path("/seed-output/token").write_text(token) |
There was a problem hiding this comment.
[LOW] Dead artifact — seed token written for a harness that never reads it
Problem: Path("/seed-output/token").write_text(token) (plus the seed-output volume in docker-compose.yml) is unused. The host-side harness (GiteaAdmin / OpalServerClient in helpers.py) authenticates with HTTP basic auth (opaladmin/opaladmin), never the token, and can't read a path inside the seed container's volume anyway.
Suggestion: Remove the token write and the seed-output volume, or actually consume the token from the host. Dead test-bed plumbing rots and misleads later readers.
| repo.index.commit("seed policy", author=author, committer=author) | ||
|
|
||
| push_url = ( | ||
| base_url.replace("http://", f"http://{user}:{token}@") + f"/{user}/{name}.git" |
There was a problem hiding this comment.
[LOW] push_url credential injection assumes http://
Problem: push_url = base_url.replace("http://", f"http://{user}:{token}@") + f"/{user}/{name}.git" is a no-op if GITEA_URL is ever https://, silently dropping the credentials and producing an opaque auth failure that the per-repo except records as a generic seed error.
Suggestion: Inject credentials scheme-agnostically via urllib.parse (split, set netloc = f"{user}:{token}@{host}", unsplit) so it works for both http/https.
PR1 — Git Leak/Resilience Test Environment
Builds the test bed and the single diagnostics hook needed to exercise the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect path. The leak/offline tests fail on
masterand become the regression gates for the follow-up fix PRs; the boot and bounce tests are tunable/recovery guards (see below).This is intentionally one large PR — it's the foundation every later PR's gate depends on. The actual fixes land in the PRs this one blocks:
Linear: PER-15155
opal-server: an off-by-default stats endpoint
opal_server/debug_stats.py— read-onlygit_fetcher_cache_stats()(sizes of the three process-globalGitPolicyFetchercaches + process RSS) and aregister_internal_stats_route()registrar.opal_server/config.py— newOPAL_DEBUG_INTERNAL_STATSflag, defaultFalse.opal_server/server.py— mountsGET /internal/git-fetcher-cache-statsbeside/healthcheck, gated by the flag.No production behavior change when the flag is off (the default).
app-tests/git-leak/: a self-contained stack + pytest harnessdocker-compose.yml— opal-server (single worker,OPAL_SCOPES=1, Postgres broadcaster) built from the repo's owndocker/Dockerfileservertarget, plus Redis, Postgres, Gitea, and one-shotgitea-admin+seedsidecars.seed/— idempotent Gitea seeding sidecar (N policy repos) + Dockerfile.helpers.py—OpalServerClient,GiteaAdmin(host-side admin client),make_repo_unreachable,bounce_postgres,compose.conftest.py— session-scopedstack+opal+gitea_adminfixtures;--boot-scopes=N,--keep-stack.test_leak.py(WIP: Opal refactor (policy fetcher, tree structure) #1 churn, Data fetcher #2 repeat-sync),test_resilience.py(Update and fetch #4 offline repo, remove singletons from server #5 postgres bounce),test_boot.py(Refactor: Opal Server, Repo Watcher, Cloner, Tracker, LeaderElection, BundleMaker, and bundle diffs #3 boot timing, tunableBOOT_TARGET_SECONDS).README.md.The five flagship tests, and their baseline behavior on master
test_churn_releases_cachestest_repeat_sync_does_not_growtest_boot_loads_all_scopesBOOT_TARGET_SECONDSlowtest_offline_repo_does_not_block_healthy_scopestest_server_recovers_after_postgres_bounceTest #5 — verified. Ran it against this branch's stack: it passes in ~14–19s. On a bounce the broadcast channel drops, the affected worker triggers a graceful shutdown, gunicorn (in-container) respawns it, and the sibling worker keeps serving HTTP — so the surface recovers within the window. It is a recovery guard against regression of that property, not a reproduction of a current failure. PER-15065's in-process reconnect would make recovery cleaner by avoiding the worker churn.
How to run
Requires Docker + compose v2 and host Python with
pytest pytest-timeout requests GitPython.Verification done
serverimage;import opal_server.serverclean; flag defaultsFalsewith a description.docker compose config -qvalid; all 5 flagship tests collect./internal/git-fetcher-cache-statslive,PUT /scopesreturns 201 with auth disabled.GiteaAdminexercised live (list/exists/create/delete over the published port);test_server_recovers_after_postgres_bouncerun to a green pass.Notes for reviewers
OPAL_AUTH_PUBLIC_KEYleft unset → JWT verifier disabled so the harness can drive scope routes without minting JWTs (require_peer_typebecomes a no-op). Test bed only.auth: {auth_type: "none"}(required pydantic-v1 discriminated union).opal_server:7002andgitea:13000(uncommon, to avoid the:3000clash; used byGiteaAdmin); Postgres is internal-only.Cache-read determinism (how the gates stay trustworthy)
Two properties of
masterwould make naive cache reads non-deterministic. Both are handled in this harness — they are not open issues:GitPolicyFetchercaches are per-process, so with >1 worker a round-robin/internalread can miss the worker that fetched and report0. Resolved: the stack runs a single uvicorn worker (UVICORN_NUM_WORKERS=1), so every cache read is deterministic.repospopulates on the second sync, not the first. The first-sync clone path fills onlyrepo_locks;repos/repos_last_fetchedare filled by the discover/fetch path on a subsequent sync. Resolved: the load helpers issuerefresh_all()and wait before asserting onrepos, so the gates assert against a cache the sync path has actually filled.These were called out as caveats in the original design doc; the implementation above closes both, so the leak/drain assertions are deterministic single-worker.
🤖 Generated with Claude Code