test(opal-server): git leak/resilience test environment (PR1) by dshoen619 · Pull Request #922 · permitio/opal

dshoen619 · 2026-06-23T09:26:22Z

PR1 — Git Leak/Resilience Test Environment

Builds the test bed and the single diagnostics hook needed to exercise the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect path. The leak/offline tests fail on master and become the regression gates for the follow-up fix PRs; the boot and bounce tests are tunable/recovery guards (see below).

This is intentionally one large PR — it's the foundation every later PR's gate depends on. The actual fixes land in the PRs this one blocks:

PER-15156 (PR2) — memory-leak fix → gated by tests WIP: Opal refactor (policy fetcher, tree structure) #1, Data fetcher #2
PER-15157 (PR3) — offline-repo resilience → gated by test Update and fetch #4
PER-15065 — broadcaster in-process reconnect → relates to test remove singletons from server #5
PR4 (boot parallelism) → gated by test Refactor: Opal Server, Repo Watcher, Cloner, Tracker, LeaderElection, BundleMaker, and bundle diffs #3

Linear: PER-15155

opal-server: an off-by-default stats endpoint

opal_server/debug_stats.py — read-only git_fetcher_cache_stats() (sizes of the three process-global GitPolicyFetcher caches + process RSS) and a register_internal_stats_route() registrar.
opal_server/config.py — new OPAL_DEBUG_INTERNAL_STATS flag, default False.
opal_server/server.py — mounts GET /internal/git-fetcher-cache-stats beside /healthcheck, gated by the flag.
Unit tests for the helper, the flag default, and the gating.

No production behavior change when the flag is off (the default).

`app-tests/git-leak/`: a self-contained stack + pytest harness

docker-compose.yml — opal-server (single worker, OPAL_SCOPES=1, Postgres broadcaster) built from the repo's own docker/Dockerfile server target, plus Redis, Postgres, Gitea, and one-shot gitea-admin + seed sidecars.
seed/ — idempotent Gitea seeding sidecar (N policy repos) + Dockerfile.
helpers.py — OpalServerClient, GiteaAdmin (host-side admin client), make_repo_unreachable, bounce_postgres, compose.
conftest.py — session-scoped stack + opal + gitea_admin fixtures; --boot-scopes=N, --keep-stack.
test_leak.py (WIP: Opal refactor (policy fetcher, tree structure) #1 churn, Data fetcher #2 repeat-sync), test_resilience.py (Update and fetch #4 offline repo, remove singletons from server #5 postgres bounce), test_boot.py (Refactor: Opal Server, Repo Watcher, Cloner, Tracker, LeaderElection, BundleMaker, and bundle diffs #3 boot timing, tunable BOOT_TARGET_SECONDS).
README.md.

The five flagship tests, and their baseline behavior on master

#	Test	On master	Gate
1	`test_churn_releases_caches`	FAILS (caches never drain)	PR2
2	`test_repeat_sync_does_not_grow`	FAILS (cache only grows)	PR2
3	`test_boot_loads_all_scopes`	passes; FAILS with `BOOT_TARGET_SECONDS` low	PR4
4	`test_offline_repo_does_not_block_healthy_scopes`	FAILS (no fetch timeout)	PR3
5	`test_server_recovers_after_postgres_bounce`	PASSES (recovery guard)	PER-15065

Test #5 — verified. Ran it against this branch's stack: it passes in ~14–19s. On a bounce the broadcast channel drops, the affected worker triggers a graceful shutdown, gunicorn (in-container) respawns it, and the sibling worker keeps serving HTTP — so the surface recovers within the window. It is a recovery guard against regression of that property, not a reproduction of a current failure. PER-15065's in-process reconnect would make recovery cleaner by avoiding the worker churn.

How to run

cd app-tests/git-leak
python -m pytest -v --boot-scopes=50              # full set
python -m pytest test_leak.py -v --boot-scopes=20 # just the leak gates

Requires Docker + compose v2 and host Python with pytest pytest-timeout requests GitPython.

Verification done

All 4 opal-server unit tests pass inside the repo's Docker server image; import opal_server.server clean; flag defaults False with a description.
docker compose config -q valid; all 5 flagship tests collect.
Full-stack smoke: stack boots, Gitea seeds repos, opal healthy, /internal/git-fetcher-cache-stats live, PUT /scopes returns 201 with auth disabled.
GiteaAdmin exercised live (list/exists/create/delete over the published port); test_server_recovers_after_postgres_bounce run to a green pass.

Notes for reviewers

Auth disabled deliberately: OPAL_AUTH_PUBLIC_KEY left unset → JWT verifier disabled so the harness can drive scope routes without minting JWTs (require_peer_type becomes a no-op). Test bed only.
Scope create body sets auth: {auth_type: "none"} (required pydantic-v1 discriminated union).
Host ports: opal_server:7002 and gitea:13000 (uncommon, to avoid the :3000 clash; used by GiteaAdmin); Postgres is internal-only.

Cache-read determinism (how the gates stay trustworthy)

Two properties of master would make naive cache reads non-deterministic. Both are handled in this harness — they are not open issues:

Per-worker caches + round-robin reads. The GitPolicyFetcher caches are per-process, so with >1 worker a round-robin /internal read can miss the worker that fetched and report 0. Resolved: the stack runs a single uvicorn worker (UVICORN_NUM_WORKERS=1), so every cache read is deterministic.
repos populates on the second sync, not the first. The first-sync clone path fills only repo_locks; repos / repos_last_fetched are filled by the discover/fetch path on a subsequent sync. Resolved: the load helpers issue refresh_all() and wait before asserting on repos, so the gates assert against a cache the sync path has actually filled.

These were called out as caveats in the original design doc; the implementation above closes both, so the leak/drain assertions are deterministic single-worker.

🤖 Generated with Claude Code

Add an off-by-default diagnostics endpoint so tests can observe the in-memory GitPolicyFetcher cache sizes (repo_locks/repos/repos_last_fetched) and process RSS that the upcoming memory-leak fix eliminates. - debug_stats.py: read-only git_fetcher_cache_stats() helper + a register_internal_stats_route() registrar that mounts GET /internal/git-fetcher-cache-stats only when enabled. - config.py: new OPAL_DEBUG_INTERNAL_STATS flag, default False. - server.py: register the route, gated by the flag, beside /healthcheck. No production behavior change when the flag is off (the default). Also ignore .claude/ so private planning artifacts are never committed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A self-contained docker-compose stack (opal-server x2 workers + Redis + Postgres broadcaster + Gitea) plus a pytest harness that reproduces, as tests that fail on master, the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect gap. These become the regression gates for the follow-up fixes. - seed/: idempotent Gitea seeding sidecar (N policy repos) + Dockerfile. - docker-compose.yml: 4-service stack, opal-server built from the repo's own docker/Dockerfile (server target), scopes on, Postgres broadcaster. - helpers.py / conftest.py: HTTP + infra helpers and stack fixtures. - test_leak.py / test_resilience.py / test_boot.py: the flagship tests. - README.md: how to run and expected fail-on-master behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

linear-code · 2026-06-23T09:26:25Z

PER-15155

netlify · 2026-06-23T09:26:26Z

✅ Deploy Preview for opal-docs canceled.

Name	Link
🔨 Latest commit	`a502f2e`
🔍 Latest deploy log	https://app.netlify.com/projects/opal-docs/deploys/6a437e44dd08cb0008a0d982

Complete the helpers.py surface promised in the plan's file-structure table. Both are now functional and used, not dead code: - make_repo_unreachable(name): returns a git URL on a routable-but-dead TEST-NET-1 host (RFC 5737). test_offline_repo now uses it instead of an inlined literal. - GiteaAdmin: host-side Gitea admin client (list_repos / repo_exists / create_repo / delete_repo), exposed as the `gitea_admin` pytest fixture for tests that need to inspect or stage repos beyond the seed sidecar. Gitea is published on host port 13000 (uncommon, to avoid the usual :3000 clash) so GiteaAdmin can reach it; opal_server and the seed sidecar still use the internal http://gitea:3000. README updated with the helper and port notes. Verified live: GiteaAdmin lists the seeded repos and round-trips create/exists/delete against Gitea over the published port. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Verified test_server_recovers_after_postgres_bounce against the stack: it PASSES on master (~14-19s). On a broadcaster drop the affected worker triggers a graceful shutdown, gunicorn respawns it, and the sibling worker keeps serving HTTP, so the surface recovers within the window — recovery happens via gunicorn's in-container worker supervision, not an external supervisor and not an in-process reconnect. Reframe #5 as a recovery guard (not a known-broken case) in the docstring and README; the prior "FAILS on master / needs external supervisor" wording was wrong. PER-15065's in-process reconnect would avoid the worker churn but recovery already holds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Run the repo's pinned pre-commit formatters (black 23.1.0, isort 5.12.0, docformatter 1.7.5) over the PR1 files to satisfy the pre-commit CI check. Formatting only — no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds an off-by-default OPAL server internal diagnostics endpoint for git-fetcher cache/RSS stats, and introduces a docker-compose-based “git leak/resilience” integration testbed intended to reproduce/regress several production issues (leak, offline repo hang, boot slowness, broadcaster disconnect recovery).

Changes:

Add /internal/git-fetcher-cache-stats endpoint gated by OPAL_DEBUG_INTERNAL_STATS (default off), plus unit tests.
Add app-tests/git-leak/ docker-compose stack (opal-server + redis/postgres/gitea) and pytest harness (boot/leak/resilience tests).
Add .claude/ to .gitignore.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
packages/opal-server/opal_server/debug_stats.py	New helper to read git-fetcher cache sizes + RSS and register a gated internal route.
packages/opal-server/opal_server/config.py	Adds `DEBUG_INTERNAL_STATS` config flag (default `False`).
packages/opal-server/opal_server/server.py	Mounts the internal stats route when enabled.
packages/opal-server/opal_server/tests/debug_stats_test.py	Unit tests for stats dict sizing + flag default.
packages/opal-server/opal_server/tests/debug_stats_endpoint_test.py	Unit tests for endpoint presence/absence when gated.
app-tests/git-leak/docker-compose.yml	Compose stack for OPAL + dependencies + seeding sidecars.
app-tests/git-leak/helpers.py	Host-side HTTP/compose helpers for the test harness.
app-tests/git-leak/conftest.py	Pytest fixtures to boot/teardown the stack and provide clients.
app-tests/git-leak/test_leak.py	Leak regression tests (churn + repeat sync).
app-tests/git-leak/test_resilience.py	Offline-repo hang test + Postgres bounce recovery guard.
app-tests/git-leak/test_boot.py	Boot timing/baseline test.
app-tests/git-leak/seed/seed_gitea.py	Idempotent Gitea seeding script for N repos.
app-tests/git-leak/seed/Dockerfile	Container image for the seeding sidecar.
app-tests/git-leak/README.md	Documentation for running the new testbed locally.
.gitignore	Ignores `.claude/` working artifacts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The CI `build` job runs `pytest` from the repo root with no path, which recursed into app-tests/git-leak/ and ran the flagship tests — these are designed to FAIL on master (they are the regression gates for PR2-PR5), so they broke the build job. Set `testpaths = packages` so the rootdir run collects only the unit tests under packages/ (matching master's effective behavior, since app-tests/ had no pytest files before). testpaths only applies when pytest is invoked from the rootdir with no args, so `cd app-tests/git-leak && pytest` still collects and runs the test bed. Verified both: root run -> packages only; subdir run -> all 5 flagship tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- stats(): sample the /internal endpoint several times and merge per key with max(). The caches are per-process on a multi-worker server, so a single read can hit an empty (non-leader) worker — max-merge avoids both false negatives (missing a populated leader) and false positives (an `== 0` drain assertion passing only because it hit an empty worker). - test_leak: assert the initial-load `_wait_until` succeeded before deleting / before taking baseline, so the tests can't pass vacuously when load never completed. - refresh_all(): correct the misleading comment — a 404 is a no-op, there is no client-side fallback. - conftest: skip the suite cleanly if docker is unavailable (defense in depth; it's already excluded from the default pytest run via testpaths). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two review findings on the regression gates: - Cross-test contamination: clone paths are keyed by repo URL (source_id = sha256(url)+branch-shard), not scope_id, so scopes from different tests that point at the same seeded repo share one GitPolicyFetcher cache entry. With a session-scoped stack and no teardown, a leftover boot-*/stable-* scope kept those entries alive and would make test_churn's `repos == 0` drain assertion fail on fixed code. OpalServerClient now tracks created scopes and the opal fixture deletes them on teardown (best-effort drain wait, swallows errors so master — where delete never purges — doesn't fail the passing test). - False gate: test_repeat_sync_does_not_grow re-syncs identical scopes, which a path-keyed cache can't grow even on master, so it could never be the leak gate it claimed. Reframed as an honest idempotency guard (passes on master) that points at test_churn_releases_caches as the real leak gate; README's "Expected on master" reclassifies it alongside the postgres-bounce guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

zeevmoney

Requesting changes — the regression gates do not function as gates yet

Scope check first, because it's the reassuring part: the production surface is safe. The only always-on change is a benign import plus a default-off OPAL_DEBUG_INTERNAL_STATS config key; with the flag off (the default) there is zero behavior change, and when on the handler can't crash (the class dicts always exist, len() is GIL-atomic against the event-loop mutations, and _read_rss_kb swallows errors). testpaths = packages drops no existing tests. Git hygiene is clean (nothing private leaked into the diff, no rebase needed). This PR will not crash production.

The problem is the actual deliverable — the five flagship tests meant to gate PR2–PR5.

What's blocking (must fix before this can be trusted as the gate)

The load/drain gates assert on GitPolicyFetcher.repos, which the first-sync clone path never populates (see the CRITICAL inline comment on test_leak.py:26). On a fresh scope, fetch_and_notify_on_changes takes the _clone() branch, which only sets repo_locks — repos is filled solely by _get_repo() on a second sync or the bundle path, and periodic re-sync is disabled in the stack (OPAL_POLICY_REFRESH_INTERVAL: "0"). As a result test_churn_releases_caches, test_repeat_sync_does_not_grow, and test_offline_repo_does_not_block_healthy_scopes hang at their _wait_until(repos >= n) load gate for the full timeout and then fail at the load stage — they never reach the leak/offline logic they exist to test, and they would not flip green after PR2/PR3 land. Only test_boot_loads_all_scopes works, and only incidentally (compose restart → preload_scopes re-discovers the on-disk clones). This is the same root cause the PR description flags as "Known caveat #2" — it needs to be fixed, not just noted.
Even once load is fixed, the cache assertions are not reliable on a 2-worker stack. stats()'s max-merge cannot prevent a == 0 drain assertion from falsely passing when the samples miss the leader worker (HIGH on helpers.py:50). The cache tests should run single-worker, or the endpoint should aggregate across workers.
Failure modes are invisible. compose() swallows stdout/stderr on any compose failure (HIGH on helpers.py:218), and partial seeding is never detected before the suite runs (HIGH on conftest.py:40) — so a broken stack or a half-seeded Gitea looks like a test failure for the wrong reason.

Secondary correctness (should fix)

test_repeat_sync_does_not_grow is tautological — len(repos) can't grow on repeat URL-keyed sync (test_leak.py:64).
The offline test's TEST-NET-1 address likely fails fast instead of hanging, so it may not exercise starvation (helpers.py:205).
The opal fixture clears _created_scopes at setup, orphaning scopes from a failed prior test and contaminating the next (conftest.py:55).
The postgres-bounce test asserts only HTTP liveness, not broadcaster recovery, and doesn't --wait for Postgres (test_resilience.py:51).
README/docstrings say "fails on master," but the suite requires this PR's /internal endpoint (README.md:30).
Dead/misleading 404 branch in refresh_all() (helpers.py:128).

Minor

Enabling the flag exposes an unauthenticated /internal route (non-sensitive payload, off by default) (debug_stats.py:39).
|| true masks genuine gitea-admin failures (docker-compose.yml:58).
Boot-timing clock starts after wait_healthy, undercounting boot-sync time (test_boot.py:24).

Bottom line

All 11 plan tasks are implemented and the production change is safe, but the suite needs to (1) re-key the load/drain assertions to a cache the sync path actually populates (or force a policy fetch), (2) make the cache reads worker-deterministic, (3) surface compose/seed failures, and then (4) be validated end-to-end against a live stack to confirm each test fails/passes for the right reason. Until then it can't be relied on as the regression gate for PR2–PR5.

Copilot

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.

…iew) Addresses the CHANGES_REQUESTED review on PR #922. Root cause behind most findings: a fresh scope's first sync takes the _clone() branch, which only fills GitPolicyFetcher.repo_locks; repos/repos_last_fetched are filled on a *second* sync. So the load gates on `repos` hung, and the 2-worker per-process caches made `== 0` drain assertions unsafe. Blocking: - Single worker (UVICORN_NUM_WORKERS=1): deterministic per-process cache reads; removes the false `== 0` drain class. - Load gate (CRITICAL + Zivxx HIGH): _load_scopes gates on repo_locks then refresh_all() to force the second sync, so repos/repos_last_fetched are actually populated before any drain/purge assertion. - compose() surfaces captured stdout/stderr on failure. - Seed completeness asserted in conftest; seed script isolates per-repo failures and exits non-zero with a count. Secondary: - test_repeat_sync asserts an RSS bound (count can't grow for any impl); churn asserts all three caches drain + a loose RSS backstop. - blackhole socat sidecar replaces TEST-NET-1 (deterministic hang); offline test saturates the fetch executor with 40 hung clones and recovers via OpalServerClient.hard_reset() (stop -> redis FLUSHALL -> start). - Per-test clean slate deletes all server scopes (fixes orphan-scope leak). - Postgres-bounce proves broadcast recovery (PUT post-bounce, assert sync) and uses `up -d --wait`. - Remove dead 404 branch in refresh_all; boot clock starts at restart; gitea-admin `|| true` -> "already exists"-only guard; README reworded. opal-server: - /internal stats route now takes the JWTAuthenticator dependency (protected when JWT on, no-op in the test bed); unit test asserts enforcement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dshoen619 · 2026-06-24T13:08:11Z

Review addressed — `6046a10`

Thanks for the thorough pass. All blocking items, both HIGHs from @Zivxx / @zeevmoney, and the secondary/minor items are addressed in 6046a10. Per-comment replies are inline; summary below.

Root cause behind most findings (and why the gates didn't gate): a fresh scope's first sync takes the _clone() branch, which only fills repo_locks; repos/repos_last_fetched are filled on a second sync. The load gates on repos therefore hung, and the 2-worker per-process caches made == 0 drains unsafe.

Blocking

Load gate (CRITICAL + @Zivxx HIGH) — _load_scopes gates on repo_locks (first-sync signal), then refresh_all() forces the second sync so repos/repos_last_fetched are populated before any drain/purge assertion. Churn now asserts all three caches drain to 0 (so a broken repos_last_fetched purge fails instead of passing vacuously).
Worker determinism — stack is now UVICORN_NUM_WORKERS=1; removes the false-== 0-drain class outright (no more reliance on max-merge to paper over it).
Failure visibility — compose() re-raises with captured stdout/stderr; seed completeness is asserted in conftest; seed_gitea.py isolates per-repo failures and exits non-zero with a count.

Secondary

test_repeat_sync now asserts an rss_kb bound (count can't grow for a URL-keyed set).
Offline test uses a blackhole socat sidecar (deterministic hang, verified) and saturates the fetch executor with 40 hung clones; recovers the session stack via hard_reset() (stop → redis FLUSHALL → start).
Per-test clean slate deletes all server scopes (fixes the orphan-scope leak).
Postgres-bounce proves broadcast recovery (PUT post-bounce → assert sync), uses up -d --wait.
Dead 404 branch removed; boot clock starts at restart; gitea-admin || true → "already exists"-only guard; README/docstrings reworded to "fails on this branch without PR2/PR3", noting the suite needs this PR's /internal endpoint.

Minor

/internal route now carries the JWTAuthenticator dependency (protected when JWT on, no-op in the test bed); added a unit test asserting a rejecting dependency → 401.

Validation done: 5/5 opal-server unit tests pass in the Docker server image (incl. the new auth test); docker compose config valid; blackhole hang and the gitea-admin guard verified directly.

Still outstanding (flagging honestly): I have not yet re-run the full live integration suite end-to-end to confirm each flagship test fails/passes for the right reason against a live stack — that's the "validate end-to-end" item from your bottom line. I can kick that off next; it's a ~30 min run and several tests are expected-fail gates on this branch until PR2/PR3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- snapshot a single /internal stats read per poll in the churn-drain and boot gates (consistent multi-key observation; fewer HTTP round-trips) - document why a 200 from the healthy scope can't be a masked default bundle, and why the stats route is intentionally a sync def - pin alpine/socat and the seed image's pip deps for reproducibility Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…t-environment-one-big-pr

…repo Ziv (PR review, round 2) caught that the offline-resilience gate can false-PASS in the full-suite run. The "healthy" scope pointed at policy-repo-0000 (list_seeded_repos(1)[0]), but on-disk clones are keyed by URL-hash and survive compose restart/stop/start (opal_server mounts no volume at /opal; only `down -v` wipes them). test_boot/test_leak run first (alphabetical) and already clone every seeded repo, so the healthy scope hit the existing clone via _discover_repository, skipped _clone(), and served 200 without ever touching the saturated fetch executor — the gate that must FAIL on this branch (no PR3 timeout) passed. Fix: seed a reserved repo (policy-repo-healthy-probe) outside the numeric policy-repo-NNNN range that no boot/leak test enumerates, and point the healthy probe at it. A never-cloned repo forces a genuine fresh clone through the starved pool, so the gate fails correctly. The seed- completeness check in conftest now covers the reserved repo too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…view) Follow-up to PR review of the git-leak/resilience test bed: - hard_reset: always restart opal_server + wait_healthy via a finally, so a failed redis FLUSHALL can't leave the server stopped (which would fail every later session-scoped test and, running in a test finally, mask the result). - delete_all_scopes drain: a transient /internal read error no longer counts as a successful drain (was `except: return`); keep polling to the deadline so a not-yet-drained cache can't leak into the next test once PR2 lands. - use stats(samples=1) for the zero-waiting drain/empty polls (the peak-merge only matters for load assertions; this also drops 3x HTTP per poll). - resilience: narrow broad `except Exception` to requests.RequestException (and RuntimeError for wait_healthy timeout) so harness bugs surface instead of masquerading as "never served"/"never recovered". - resilience: collapse `assert opal.stats()` + redundant re-read into one read. - debug_stats_test: assert rss_kb > 0 on Linux (was `>= 0`, which passed for the wrong reason where /proc is absent and RSS reads fall back to 0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…t-environment-one-big-pr

approvded

zeevmoney · 2026-06-30T14:43:49Z

@/private/tmp/claude-501/-Users-zeevmanilovich-NewDev-opal/deddbdb3-9e66-4bdc-86ca-0a2e991c021b/scratchpad/corrected.md

zeevmoney

Approving — the test bed is sound and this PR has already been through several review rounds (Copilot / @Zivxx / @zeevmoney) with the substantive issues raised and resolved. A fresh python-pro pass (planned against the opal-development references) surfaced only non-blocking items; none are CRITICAL/HIGH, so they don't gate landing the test bed.

Issues to fix (tracked under PER-15155) — inline comments above:

Worth fixing before/with merge

MEDIUM pytest.ini:10 — testpaths = packages makes the README's documented cd app-tests/git-leak && pytest --boot-scopes=50 error with unrecognized arguments: --boot-scopes. Self-root the suite + use a targeted ignore.
MEDIUM test_resilience.py:123 — the bounce test's repo_locks > baseline signal is order-dependent across files (shared policy-repo-0000 + non-purging caches on master); assert GET /scopes/post-bounce/policy == 200 instead.

Follow-ups (LOW)

helpers.py:274 — add a timeout= to compose() so a wedged setup fails fast instead of hitting the CI job limit.
helpers.py:148 — delete_all_scopes burns ~40s/test waiting on a drain that can't happen on master; short-circuit it.
seed_gitea.py:159 — remove the unused /seed-output/token artifact + volume.
seed_gitea.py:108 — make push_url credential injection scheme-agnostic (urllib.parse).

Doc nit (not posted inline): the bounce-test docstring (test_resilience.py:85-93) still describes pre-PER-15065 "graceful shutdown + gunicorn respawn over the recovered broadcaster"; reconnect is now in-place — worth a 3-line update.

Recommend merging PR1 first so PR2 (#923) / PR3 (#924) can build on the test bed + /internal endpoint, then addressing the two MEDIUMs.

zeevmoney · 2026-06-30T15:28:52Z

+# not be collected by the repo's CI `pytest` run. testpaths only applies when
+# pytest is invoked from the rootdir with no args, so running it explicitly
+# (e.g. `cd app-tests/git-leak && pytest`) still works.
+testpaths = packages


[MEDIUM] testpaths = packages breaks the documented way to run this suite

Problem: testpaths resolves against the rootdir (the repo root, where this pytest.ini lives) and applies whenever no positional path is given — independent of cwd. So the README's documented cd app-tests/git-leak && python -m pytest --boot-scopes=50 collects packages/ instead of this directory, never loads app-tests/git-leak/conftest.py (where --boot-scopes is registered), and errors with unrecognized arguments: --boot-scopes. It also silently drops any current/future test outside packages/ from the CI pytest run.

Suggestion: Add a pytest.ini (or a rootdir marker) under app-tests/git-leak/ so the suite is self-rooted and runnable in place, and replace the global testpaths = packages with a targeted --ignore=app-tests / norecursedirs. Please confirm by running the exact documented command.

zeevmoney · 2026-06-30T15:28:53Z

+    synced = False
+    deadline = time.time() + 120
+    while time.time() < deadline:
+        if opal.stats()["repo_locks"] > baseline_locks:


[MEDIUM] Bounce-test signal is order-dependent across test files

Problem: The success check keys on opal.stats()["repo_locks"] > baseline_locks — a process-global, source_id(URL)-keyed counter shared with test_leak.py (both use policy-repo-0000). On master delete_scope never purges the caches (the leak this suite gates), so the opal fixture's delete_all_scopes can't drain repo_locks. If policy-repo-0000 is already cached, the post-bounce PUT adds no new key → the delta check never trips → false "broadcaster did not recover" failure. It passes today only because the preceding offline test's hard_reset() restarts the process and flushes Redis; reorder (e.g. run after test_leak without the offline test in between) and it false-fails.

Suggestion: Point the post-bounce PUT at a dedicated never-synced repo, or assert on the specific scope being served (GET /scopes/post-bounce/policy == 200) instead of a delta on a shared process-global counter.

zeevmoney · 2026-06-30T15:28:54Z

+    re-raise with the captured stdout/stderr embedded, otherwise a
+    broken build/seed/ restart is opaque to debug.
+    """
+    proc = subprocess.run(


[LOW] compose() has no subprocess timeout — fixture setup can hang to the CI job limit

Problem: subprocess.run(["docker", "compose", *args], capture_output=True, text=True) is called with no timeout=. @pytest.mark.timeout(...) on tests does not cover session-scoped fixture setup, so a wedged docker compose up -d --build or a compose("wait", "seed") on a stuck seed container blocks until the workflow's 60-minute job limit rather than failing fast. (Shell-injection is not a concern — args are a list, no shell=True.)

Suggestion: Pass a generous timeout= to compose() (at least for up/wait/stop/start) and raise a clear error on expiry. Optionally probe docker compose version for the availability skip instead of only shutil.which("docker").

zeevmoney · 2026-06-30T15:28:54Z

+        except Exception:
+            pass
+        self._created_scopes.clear()
+        deadline = time.time() + drain_timeout


[LOW] delete_all_scopes burns the full drain timeout twice per test

Problem: On master the caches never purge on delete (the leak this suite gates), so the while time.time() < deadline loop waiting for repo_locks == 0 always runs its full drain_timeout=20s. This runs in the opal fixture's setup and teardown → roughly 40s of guaranteed dead wait per test, for every test, for the entire expected-failing lifetime of the suite.

Suggestion: Short-circuit the drain on the known-not-purged path (cap it at a couple of seconds, or skip the drain in teardown), since waiting the full timeout for a state that can't occur on master is pure overhead.

zeevmoney · 2026-06-30T15:28:55Z

+        return 1
+
+    # write the token where the test harness can read it
+    Path("/seed-output/token").write_text(token)


[LOW] Dead artifact — seed token written for a harness that never reads it

Problem: Path("/seed-output/token").write_text(token) (plus the seed-output volume in docker-compose.yml) is unused. The host-side harness (GiteaAdmin / OpalServerClient in helpers.py) authenticates with HTTP basic auth (opaladmin/opaladmin), never the token, and can't read a path inside the seed container's volume anyway.

Suggestion: Remove the token write and the seed-output volume, or actually consume the token from the host. Dead test-bed plumbing rots and misleads later readers.

zeevmoney · 2026-06-30T15:28:56Z

+    repo.index.commit("seed policy", author=author, committer=author)
+
+    push_url = (
+        base_url.replace("http://", f"http://{user}:{token}@") + f"/{user}/{name}.git"


[LOW] push_url credential injection assumes http://

Problem: push_url = base_url.replace("http://", f"http://{user}:{token}@") + f"/{user}/{name}.git" is a no-op if GITEA_URL is ever https://, silently dropping the credentials and producing an opaque auth failure that the per-repo except records as a generic seed error.

Suggestion: Inject credentials scheme-agnostically via urllib.parse (split, set netloc = f"{user}:{token}@{host}", unsplit) so it works for both http/https.

dshoen619 and others added 2 commits June 23, 2026 12:24

dshoen619 and others added 3 commits June 23, 2026 12:37

dshoen619 marked this pull request as ready for review June 23, 2026 10:05

dshoen619 requested a review from Copilot June 23, 2026 10:05

dshoen619 self-assigned this Jun 23, 2026

Copilot started reviewing on behalf of dshoen619 June 23, 2026 10:05 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread app-tests/git-leak/conftest.py

Comment thread app-tests/git-leak/helpers.py Outdated

Comment thread app-tests/git-leak/helpers.py Outdated

Comment thread app-tests/git-leak/test_leak.py Outdated

Comment thread app-tests/git-leak/test_leak.py Outdated

dshoen619 and others added 3 commits June 23, 2026 13:13

dshoen619 requested review from Zivxx and zeevmoney June 23, 2026 10:56

dshoen619 marked this pull request as draft June 23, 2026 11:37

dshoen619 marked this pull request as ready for review June 23, 2026 11:49

zeevmoney previously requested changes Jun 23, 2026

View reviewed changes

zeevmoney requested a review from Copilot June 23, 2026 19:08

Copilot started reviewing on behalf of zeevmoney June 23, 2026 19:09 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread app-tests/git-leak/helpers.py

Comment thread packages/opal-server/opal_server/debug_stats.py Outdated

Comment thread packages/opal-server/opal_server/server.py

Zivxx reviewed Jun 24, 2026

View reviewed changes

Comment thread app-tests/git-leak/test_leak.py Outdated

dshoen619 and others added 2 commits June 24, 2026 16:15

style(git-leak): apply black/isort/docformatter (pre-commit)

f810db8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dshoen619 requested review from Zivxx and zeevmoney June 24, 2026 17:06

Zivxx reviewed Jun 28, 2026

View reviewed changes

Comment thread app-tests/git-leak/test_resilience.py Outdated

dshoen619 and others added 3 commits June 28, 2026 18:20

Merge branch 'master' into david/per-15155-pr1-git-leakresilience-tes…

d719c34

…t-environment-one-big-pr

Zivxx approved these changes Jun 29, 2026

View reviewed changes

Merge branch 'master' into david/per-15155-pr1-git-leakresilience-tes…

a502f2e

…t-environment-one-big-pr

dshoen619 removed the request for review from zeevmoney June 30, 2026 08:30

dshoen619 requested a review from zeevmoney June 30, 2026 09:11

This was referenced Jun 30, 2026

fix(opal-server): git resilience — never stuck on an offline repo (PR3) #924

Open

fix(opal-server): memory leak — purge GitPolicyFetcher caches on scope delete + webhook task cleanup (PR2) #923

Open

zeevmoney changed the title ~~PR1: Git leak/resilience test environment~~ test(opal-server): git leak/resilience test environment (PR1) Jun 30, 2026

zeevmoney approved these changes Jun 30, 2026

View reviewed changes

Uh oh!

Conversation

dshoen619 commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR1 — Git Leak/Resilience Test Environment

opal-server: an off-by-default stats endpoint

app-tests/git-leak/: a self-contained stack + pytest harness

The five flagship tests, and their baseline behavior on master

How to run

Verification done

Notes for reviewers

Cache-read determinism (how the gates stay trustworthy)

Uh oh!

linear-code Bot commented Jun 23, 2026

Uh oh!

netlify Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for opal-docs canceled.

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zeevmoney left a comment

Choose a reason for hiding this comment

Requesting changes — the regression gates do not function as gates yet

What's blocking (must fix before this can be trusted as the gate)

Secondary correctness (should fix)

Minor

Bottom line

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dshoen619 commented Jun 24, 2026

Review addressed — 6046a10

Uh oh!

Uh oh!

zeevmoney commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zeevmoney left a comment

Choose a reason for hiding this comment

Uh oh!

zeevmoney Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

zeevmoney Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

zeevmoney Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

zeevmoney Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

zeevmoney Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

dshoen619 commented Jun 23, 2026 •

edited

Loading

`app-tests/git-leak/`: a self-contained stack + pytest harness

netlify Bot commented Jun 23, 2026 •

edited

Loading

Review addressed — `6046a10`

zeevmoney commented Jun 30, 2026 •

edited

Loading