Skip to content

test(opal-server): git leak/resilience test environment (PR1)#922

Open
dshoen619 wants to merge 15 commits into
masterfrom
david/per-15155-pr1-git-leakresilience-test-environment-one-big-pr
Open

test(opal-server): git leak/resilience test environment (PR1)#922
dshoen619 wants to merge 15 commits into
masterfrom
david/per-15155-pr1-git-leakresilience-test-environment-one-big-pr

Conversation

@dshoen619

@dshoen619 dshoen619 commented Jun 23, 2026

Copy link
Copy Markdown

PR1 — Git Leak/Resilience Test Environment

Builds the test bed and the single diagnostics hook needed to exercise the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect path. The leak/offline tests fail on master and become the regression gates for the follow-up fix PRs; the boot and bounce tests are tunable/recovery guards (see below).

This is intentionally one large PR — it's the foundation every later PR's gate depends on. The actual fixes land in the PRs this one blocks:

Linear: PER-15155

opal-server: an off-by-default stats endpoint

  • opal_server/debug_stats.py — read-only git_fetcher_cache_stats() (sizes of the three process-global GitPolicyFetcher caches + process RSS) and a register_internal_stats_route() registrar.
  • opal_server/config.py — new OPAL_DEBUG_INTERNAL_STATS flag, default False.
  • opal_server/server.py — mounts GET /internal/git-fetcher-cache-stats beside /healthcheck, gated by the flag.
  • Unit tests for the helper, the flag default, and the gating.

No production behavior change when the flag is off (the default).

app-tests/git-leak/: a self-contained stack + pytest harness

The five flagship tests, and their baseline behavior on master

# Test On master Gate
1 test_churn_releases_caches FAILS (caches never drain) PR2
2 test_repeat_sync_does_not_grow FAILS (cache only grows) PR2
3 test_boot_loads_all_scopes passes; FAILS with BOOT_TARGET_SECONDS low PR4
4 test_offline_repo_does_not_block_healthy_scopes FAILS (no fetch timeout) PR3
5 test_server_recovers_after_postgres_bounce PASSES (recovery guard) PER-15065

Test #5 — verified. Ran it against this branch's stack: it passes in ~14–19s. On a bounce the broadcast channel drops, the affected worker triggers a graceful shutdown, gunicorn (in-container) respawns it, and the sibling worker keeps serving HTTP — so the surface recovers within the window. It is a recovery guard against regression of that property, not a reproduction of a current failure. PER-15065's in-process reconnect would make recovery cleaner by avoiding the worker churn.

How to run

cd app-tests/git-leak
python -m pytest -v --boot-scopes=50              # full set
python -m pytest test_leak.py -v --boot-scopes=20 # just the leak gates

Requires Docker + compose v2 and host Python with pytest pytest-timeout requests GitPython.

Verification done

  • All 4 opal-server unit tests pass inside the repo's Docker server image; import opal_server.server clean; flag defaults False with a description.
  • docker compose config -q valid; all 5 flagship tests collect.
  • Full-stack smoke: stack boots, Gitea seeds repos, opal healthy, /internal/git-fetcher-cache-stats live, PUT /scopes returns 201 with auth disabled.
  • GiteaAdmin exercised live (list/exists/create/delete over the published port); test_server_recovers_after_postgres_bounce run to a green pass.

Notes for reviewers

  • Auth disabled deliberately: OPAL_AUTH_PUBLIC_KEY left unset → JWT verifier disabled so the harness can drive scope routes without minting JWTs (require_peer_type becomes a no-op). Test bed only.
  • Scope create body sets auth: {auth_type: "none"} (required pydantic-v1 discriminated union).
  • Host ports: opal_server:7002 and gitea:13000 (uncommon, to avoid the :3000 clash; used by GiteaAdmin); Postgres is internal-only.

Cache-read determinism (how the gates stay trustworthy)

Two properties of master would make naive cache reads non-deterministic. Both are handled in this harness — they are not open issues:

  1. Per-worker caches + round-robin reads. The GitPolicyFetcher caches are per-process, so with >1 worker a round-robin /internal read can miss the worker that fetched and report 0. Resolved: the stack runs a single uvicorn worker (UVICORN_NUM_WORKERS=1), so every cache read is deterministic.
  2. repos populates on the second sync, not the first. The first-sync clone path fills only repo_locks; repos / repos_last_fetched are filled by the discover/fetch path on a subsequent sync. Resolved: the load helpers issue refresh_all() and wait before asserting on repos, so the gates assert against a cache the sync path has actually filled.

These were called out as caveats in the original design doc; the implementation above closes both, so the leak/drain assertions are deterministic single-worker.

🤖 Generated with Claude Code

dshoen619 and others added 2 commits June 23, 2026 12:24
Add an off-by-default diagnostics endpoint so tests can observe the
in-memory GitPolicyFetcher cache sizes (repo_locks/repos/repos_last_fetched)
and process RSS that the upcoming memory-leak fix eliminates.

- debug_stats.py: read-only git_fetcher_cache_stats() helper + a
  register_internal_stats_route() registrar that mounts
  GET /internal/git-fetcher-cache-stats only when enabled.
- config.py: new OPAL_DEBUG_INTERNAL_STATS flag, default False.
- server.py: register the route, gated by the flag, beside /healthcheck.

No production behavior change when the flag is off (the default). Also
ignore .claude/ so private planning artifacts are never committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A self-contained docker-compose stack (opal-server x2 workers + Redis +
Postgres broadcaster + Gitea) plus a pytest harness that reproduces, as
tests that fail on master, the git-fetcher memory leak, the offline-repo
hang, the slow serial boot, and the broadcaster-disconnect gap. These
become the regression gates for the follow-up fixes.

- seed/: idempotent Gitea seeding sidecar (N policy repos) + Dockerfile.
- docker-compose.yml: 4-service stack, opal-server built from the repo's
  own docker/Dockerfile (server target), scopes on, Postgres broadcaster.
- helpers.py / conftest.py: HTTP + infra helpers and stack fixtures.
- test_leak.py / test_resilience.py / test_boot.py: the flagship tests.
- README.md: how to run and expected fail-on-master behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@linear-code

linear-code Bot commented Jun 23, 2026

Copy link
Copy Markdown

PER-15155

@netlify

netlify Bot commented Jun 23, 2026

Copy link
Copy Markdown

Deploy Preview for opal-docs canceled.

Name Link
🔨 Latest commit a502f2e
🔍 Latest deploy log https://app.netlify.com/projects/opal-docs/deploys/6a437e44dd08cb0008a0d982

dshoen619 and others added 3 commits June 23, 2026 12:37
Complete the helpers.py surface promised in the plan's file-structure
table. Both are now functional and used, not dead code:

- make_repo_unreachable(name): returns a git URL on a routable-but-dead
  TEST-NET-1 host (RFC 5737). test_offline_repo now uses it instead of an
  inlined literal.
- GiteaAdmin: host-side Gitea admin client (list_repos / repo_exists /
  create_repo / delete_repo), exposed as the `gitea_admin` pytest fixture
  for tests that need to inspect or stage repos beyond the seed sidecar.

Gitea is published on host port 13000 (uncommon, to avoid the usual :3000
clash) so GiteaAdmin can reach it; opal_server and the seed sidecar still
use the internal http://gitea:3000. README updated with the helper and
port notes.

Verified live: GiteaAdmin lists the seeded repos and round-trips
create/exists/delete against Gitea over the published port.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified test_server_recovers_after_postgres_bounce against the stack: it
PASSES on master (~14-19s). On a broadcaster drop the affected worker
triggers a graceful shutdown, gunicorn respawns it, and the sibling worker
keeps serving HTTP, so the surface recovers within the window — recovery
happens via gunicorn's in-container worker supervision, not an external
supervisor and not an in-process reconnect.

Reframe #5 as a recovery guard (not a known-broken case) in the docstring
and README; the prior "FAILS on master / needs external supervisor" wording
was wrong. PER-15065's in-process reconnect would avoid the worker churn but
recovery already holds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run the repo's pinned pre-commit formatters (black 23.1.0, isort 5.12.0,
docformatter 1.7.5) over the PR1 files to satisfy the pre-commit CI check.
Formatting only — no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619 dshoen619 marked this pull request as ready for review June 23, 2026 10:05
@dshoen619 dshoen619 requested a review from Copilot June 23, 2026 10:05
@dshoen619 dshoen619 self-assigned this Jun 23, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an off-by-default OPAL server internal diagnostics endpoint for git-fetcher cache/RSS stats, and introduces a docker-compose-based “git leak/resilience” integration testbed intended to reproduce/regress several production issues (leak, offline repo hang, boot slowness, broadcaster disconnect recovery).

Changes:

  • Add /internal/git-fetcher-cache-stats endpoint gated by OPAL_DEBUG_INTERNAL_STATS (default off), plus unit tests.
  • Add app-tests/git-leak/ docker-compose stack (opal-server + redis/postgres/gitea) and pytest harness (boot/leak/resilience tests).
  • Add .claude/ to .gitignore.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
packages/opal-server/opal_server/debug_stats.py New helper to read git-fetcher cache sizes + RSS and register a gated internal route.
packages/opal-server/opal_server/config.py Adds DEBUG_INTERNAL_STATS config flag (default False).
packages/opal-server/opal_server/server.py Mounts the internal stats route when enabled.
packages/opal-server/opal_server/tests/debug_stats_test.py Unit tests for stats dict sizing + flag default.
packages/opal-server/opal_server/tests/debug_stats_endpoint_test.py Unit tests for endpoint presence/absence when gated.
app-tests/git-leak/docker-compose.yml Compose stack for OPAL + dependencies + seeding sidecars.
app-tests/git-leak/helpers.py Host-side HTTP/compose helpers for the test harness.
app-tests/git-leak/conftest.py Pytest fixtures to boot/teardown the stack and provide clients.
app-tests/git-leak/test_leak.py Leak regression tests (churn + repeat sync).
app-tests/git-leak/test_resilience.py Offline-repo hang test + Postgres bounce recovery guard.
app-tests/git-leak/test_boot.py Boot timing/baseline test.
app-tests/git-leak/seed/seed_gitea.py Idempotent Gitea seeding script for N repos.
app-tests/git-leak/seed/Dockerfile Container image for the seeding sidecar.
app-tests/git-leak/README.md Documentation for running the new testbed locally.
.gitignore Ignores .claude/ working artifacts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread app-tests/git-leak/conftest.py
Comment thread app-tests/git-leak/helpers.py Outdated
Comment thread app-tests/git-leak/helpers.py Outdated
Comment thread app-tests/git-leak/test_leak.py Outdated
Comment thread app-tests/git-leak/test_leak.py Outdated
dshoen619 and others added 3 commits June 23, 2026 13:13
The CI `build` job runs `pytest` from the repo root with no path, which
recursed into app-tests/git-leak/ and ran the flagship tests — these are
designed to FAIL on master (they are the regression gates for PR2-PR5), so
they broke the build job.

Set `testpaths = packages` so the rootdir run collects only the unit tests
under packages/ (matching master's effective behavior, since app-tests/ had
no pytest files before). testpaths only applies when pytest is invoked from
the rootdir with no args, so `cd app-tests/git-leak && pytest` still collects
and runs the test bed. Verified both: root run -> packages only; subdir run
-> all 5 flagship tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- stats(): sample the /internal endpoint several times and merge per key
  with max(). The caches are per-process on a multi-worker server, so a
  single read can hit an empty (non-leader) worker — max-merge avoids both
  false negatives (missing a populated leader) and false positives (an
  `== 0` drain assertion passing only because it hit an empty worker).
- test_leak: assert the initial-load `_wait_until` succeeded before deleting
  / before taking baseline, so the tests can't pass vacuously when load
  never completed.
- refresh_all(): correct the misleading comment — a 404 is a no-op, there
  is no client-side fallback.
- conftest: skip the suite cleanly if docker is unavailable (defense in
  depth; it's already excluded from the default pytest run via testpaths).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two review findings on the regression gates:

- Cross-test contamination: clone paths are keyed by repo URL (source_id =
  sha256(url)+branch-shard), not scope_id, so scopes from different tests that
  point at the same seeded repo share one GitPolicyFetcher cache entry. With a
  session-scoped stack and no teardown, a leftover boot-*/stable-* scope kept
  those entries alive and would make test_churn's `repos == 0` drain assertion
  fail on fixed code. OpalServerClient now tracks created scopes and the opal
  fixture deletes them on teardown (best-effort drain wait, swallows errors so
  master — where delete never purges — doesn't fail the passing test).

- False gate: test_repeat_sync_does_not_grow re-syncs identical scopes, which a
  path-keyed cache can't grow even on master, so it could never be the leak gate
  it claimed. Reframed as an honest idempotency guard (passes on master) that
  points at test_churn_releases_caches as the real leak gate; README's
  "Expected on master" reclassifies it alongside the postgres-bounce guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619 dshoen619 requested review from Zivxx and zeevmoney June 23, 2026 10:56
@dshoen619 dshoen619 marked this pull request as draft June 23, 2026 11:37
@dshoen619 dshoen619 marked this pull request as ready for review June 23, 2026 11:49

@zeevmoney zeevmoney left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes — the regression gates do not function as gates yet

Scope check first, because it's the reassuring part: the production surface is safe. The only always-on change is a benign import plus a default-off OPAL_DEBUG_INTERNAL_STATS config key; with the flag off (the default) there is zero behavior change, and when on the handler can't crash (the class dicts always exist, len() is GIL-atomic against the event-loop mutations, and _read_rss_kb swallows errors). testpaths = packages drops no existing tests. Git hygiene is clean (nothing private leaked into the diff, no rebase needed). This PR will not crash production.

The problem is the actual deliverable — the five flagship tests meant to gate PR2–PR5.

What's blocking (must fix before this can be trusted as the gate)

  1. The load/drain gates assert on GitPolicyFetcher.repos, which the first-sync clone path never populates (see the CRITICAL inline comment on test_leak.py:26). On a fresh scope, fetch_and_notify_on_changes takes the _clone() branch, which only sets repo_locksrepos is filled solely by _get_repo() on a second sync or the bundle path, and periodic re-sync is disabled in the stack (OPAL_POLICY_REFRESH_INTERVAL: "0"). As a result test_churn_releases_caches, test_repeat_sync_does_not_grow, and test_offline_repo_does_not_block_healthy_scopes hang at their _wait_until(repos >= n) load gate for the full timeout and then fail at the load stage — they never reach the leak/offline logic they exist to test, and they would not flip green after PR2/PR3 land. Only test_boot_loads_all_scopes works, and only incidentally (compose restartpreload_scopes re-discovers the on-disk clones). This is the same root cause the PR description flags as "Known caveat #2" — it needs to be fixed, not just noted.

  2. Even once load is fixed, the cache assertions are not reliable on a 2-worker stack. stats()'s max-merge cannot prevent a == 0 drain assertion from falsely passing when the samples miss the leader worker (HIGH on helpers.py:50). The cache tests should run single-worker, or the endpoint should aggregate across workers.

  3. Failure modes are invisible. compose() swallows stdout/stderr on any compose failure (HIGH on helpers.py:218), and partial seeding is never detected before the suite runs (HIGH on conftest.py:40) — so a broken stack or a half-seeded Gitea looks like a test failure for the wrong reason.

Secondary correctness (should fix)

  • test_repeat_sync_does_not_grow is tautological — len(repos) can't grow on repeat URL-keyed sync (test_leak.py:64).
  • The offline test's TEST-NET-1 address likely fails fast instead of hanging, so it may not exercise starvation (helpers.py:205).
  • The opal fixture clears _created_scopes at setup, orphaning scopes from a failed prior test and contaminating the next (conftest.py:55).
  • The postgres-bounce test asserts only HTTP liveness, not broadcaster recovery, and doesn't --wait for Postgres (test_resilience.py:51).
  • README/docstrings say "fails on master," but the suite requires this PR's /internal endpoint (README.md:30).
  • Dead/misleading 404 branch in refresh_all() (helpers.py:128).

Minor

  • Enabling the flag exposes an unauthenticated /internal route (non-sensitive payload, off by default) (debug_stats.py:39).
  • || true masks genuine gitea-admin failures (docker-compose.yml:58).
  • Boot-timing clock starts after wait_healthy, undercounting boot-sync time (test_boot.py:24).

Bottom line

All 11 plan tasks are implemented and the production change is safe, but the suite needs to (1) re-key the load/drain assertions to a cache the sync path actually populates (or force a policy fetch), (2) make the cache reads worker-deterministic, (3) surface compose/seed failures, and then (4) be validated end-to-end against a live stack to confirm each test fails/passes for the right reason. Until then it can't be relied on as the regression gate for PR2–PR5.

Comment thread app-tests/git-leak/test_leak.py Outdated
Comment thread app-tests/git-leak/helpers.py Outdated
Comment thread app-tests/git-leak/helpers.py
Comment thread app-tests/git-leak/conftest.py
Comment thread app-tests/git-leak/helpers.py
Comment thread app-tests/git-leak/conftest.py Outdated
Comment thread app-tests/git-leak/test_resilience.py
Comment thread packages/opal-server/opal_server/debug_stats.py Outdated
Comment thread app-tests/git-leak/docker-compose.yml Outdated
Comment thread app-tests/git-leak/test_boot.py Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.

Comment thread app-tests/git-leak/helpers.py
Comment thread packages/opal-server/opal_server/debug_stats.py Outdated
Comment thread packages/opal-server/opal_server/server.py
Comment thread app-tests/git-leak/test_leak.py Outdated
…iew)

Addresses the CHANGES_REQUESTED review on PR #922. Root cause behind most
findings: a fresh scope's first sync takes the _clone() branch, which only
fills GitPolicyFetcher.repo_locks; repos/repos_last_fetched are filled on a
*second* sync. So the load gates on `repos` hung, and the 2-worker per-process
caches made `== 0` drain assertions unsafe.

Blocking:
- Single worker (UVICORN_NUM_WORKERS=1): deterministic per-process cache reads;
  removes the false `== 0` drain class.
- Load gate (CRITICAL + Zivxx HIGH): _load_scopes gates on repo_locks then
  refresh_all() to force the second sync, so repos/repos_last_fetched are
  actually populated before any drain/purge assertion.
- compose() surfaces captured stdout/stderr on failure.
- Seed completeness asserted in conftest; seed script isolates per-repo
  failures and exits non-zero with a count.

Secondary:
- test_repeat_sync asserts an RSS bound (count can't grow for any impl);
  churn asserts all three caches drain + a loose RSS backstop.
- blackhole socat sidecar replaces TEST-NET-1 (deterministic hang); offline
  test saturates the fetch executor with 40 hung clones and recovers via
  OpalServerClient.hard_reset() (stop -> redis FLUSHALL -> start).
- Per-test clean slate deletes all server scopes (fixes orphan-scope leak).
- Postgres-bounce proves broadcast recovery (PUT post-bounce, assert sync)
  and uses `up -d --wait`.
- Remove dead 404 branch in refresh_all; boot clock starts at restart;
  gitea-admin `|| true` -> "already exists"-only guard; README reworded.

opal-server:
- /internal stats route now takes the JWTAuthenticator dependency (protected
  when JWT on, no-op in the test bed); unit test asserts enforcement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619

Copy link
Copy Markdown
Author

Review addressed — 6046a10

Thanks for the thorough pass. All blocking items, both HIGHs from @Zivxx / @zeevmoney, and the secondary/minor items are addressed in 6046a10. Per-comment replies are inline; summary below.

Root cause behind most findings (and why the gates didn't gate): a fresh scope's first sync takes the _clone() branch, which only fills repo_locks; repos/repos_last_fetched are filled on a second sync. The load gates on repos therefore hung, and the 2-worker per-process caches made == 0 drains unsafe.

Blocking

  1. Load gate (CRITICAL + @Zivxx HIGH)_load_scopes gates on repo_locks (first-sync signal), then refresh_all() forces the second sync so repos/repos_last_fetched are populated before any drain/purge assertion. Churn now asserts all three caches drain to 0 (so a broken repos_last_fetched purge fails instead of passing vacuously).
  2. Worker determinism — stack is now UVICORN_NUM_WORKERS=1; removes the false-== 0-drain class outright (no more reliance on max-merge to paper over it).
  3. Failure visibilitycompose() re-raises with captured stdout/stderr; seed completeness is asserted in conftest; seed_gitea.py isolates per-repo failures and exits non-zero with a count.

Secondary

  • test_repeat_sync now asserts an rss_kb bound (count can't grow for a URL-keyed set).
  • Offline test uses a blackhole socat sidecar (deterministic hang, verified) and saturates the fetch executor with 40 hung clones; recovers the session stack via hard_reset() (stop → redis FLUSHALL → start).
  • Per-test clean slate deletes all server scopes (fixes the orphan-scope leak).
  • Postgres-bounce proves broadcast recovery (PUT post-bounce → assert sync), uses up -d --wait.
  • Dead 404 branch removed; boot clock starts at restart; gitea-admin || true → "already exists"-only guard; README/docstrings reworded to "fails on this branch without PR2/PR3", noting the suite needs this PR's /internal endpoint.

Minor

  • /internal route now carries the JWTAuthenticator dependency (protected when JWT on, no-op in the test bed); added a unit test asserting a rejecting dependency → 401.

Validation done: 5/5 opal-server unit tests pass in the Docker server image (incl. the new auth test); docker compose config valid; blackhole hang and the gitea-admin guard verified directly.

Still outstanding (flagging honestly): I have not yet re-run the full live integration suite end-to-end to confirm each flagship test fails/passes for the right reason against a live stack — that's the "validate end-to-end" item from your bottom line. I can kick that off next; it's a ~30 min run and several tests are expected-fail gates on this branch until PR2/PR3.

dshoen619 and others added 2 commits June 24, 2026 16:15
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- snapshot a single /internal stats read per poll in the churn-drain and
  boot gates (consistent multi-key observation; fewer HTTP round-trips)
- document why a 200 from the healthy scope can't be a masked default
  bundle, and why the stats route is intentionally a sync def
- pin alpine/socat and the seed image's pip deps for reproducibility

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619 dshoen619 requested review from Zivxx and zeevmoney June 24, 2026 17:06
Comment thread app-tests/git-leak/test_resilience.py Outdated
dshoen619 and others added 3 commits June 28, 2026 18:20
…repo

Ziv (PR review, round 2) caught that the offline-resilience gate can
false-PASS in the full-suite run. The "healthy" scope pointed at
policy-repo-0000 (list_seeded_repos(1)[0]), but on-disk clones are keyed
by URL-hash and survive compose restart/stop/start (opal_server mounts no
volume at /opal; only `down -v` wipes them). test_boot/test_leak run first
(alphabetical) and already clone every seeded repo, so the healthy scope
hit the existing clone via _discover_repository, skipped _clone(), and
served 200 without ever touching the saturated fetch executor — the gate
that must FAIL on this branch (no PR3 timeout) passed.

Fix: seed a reserved repo (policy-repo-healthy-probe) outside the numeric
policy-repo-NNNN range that no boot/leak test enumerates, and point the
healthy probe at it. A never-cloned repo forces a genuine fresh clone
through the starved pool, so the gate fails correctly. The seed-
completeness check in conftest now covers the reserved repo too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…view)

Follow-up to PR review of the git-leak/resilience test bed:

- hard_reset: always restart opal_server + wait_healthy via a finally, so a
  failed redis FLUSHALL can't leave the server stopped (which would fail every
  later session-scoped test and, running in a test finally, mask the result).
- delete_all_scopes drain: a transient /internal read error no longer counts as
  a successful drain (was `except: return`); keep polling to the deadline so a
  not-yet-drained cache can't leak into the next test once PR2 lands.
- use stats(samples=1) for the zero-waiting drain/empty polls (the peak-merge
  only matters for load assertions; this also drops 3x HTTP per poll).
- resilience: narrow broad `except Exception` to requests.RequestException
  (and RuntimeError for wait_healthy timeout) so harness bugs surface instead of
  masquerading as "never served"/"never recovered".
- resilience: collapse `assert opal.stats()` + redundant re-read into one read.
- debug_stats_test: assert rss_kb > 0 on Linux (was `>= 0`, which passed for the
  wrong reason where /proc is absent and RSS reads fall back to 0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619 dshoen619 removed the request for review from zeevmoney June 30, 2026 08:30
@dshoen619 dshoen619 requested a review from zeevmoney June 30, 2026 09:11
@zeevmoney

zeevmoney commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

@/private/tmp/claude-501/-Users-zeevmanilovich-NewDev-opal/deddbdb3-9e66-4bdc-86ca-0a2e991c021b/scratchpad/corrected.md

@zeevmoney zeevmoney changed the title PR1: Git leak/resilience test environment test(opal-server): git leak/resilience test environment (PR1) Jun 30, 2026

@zeevmoney zeevmoney left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — the test bed is sound and this PR has already been through several review rounds (Copilot / @Zivxx / @zeevmoney) with the substantive issues raised and resolved. A fresh python-pro pass (planned against the opal-development references) surfaced only non-blocking items; none are CRITICAL/HIGH, so they don't gate landing the test bed.

Issues to fix (tracked under PER-15155) — inline comments above:

Worth fixing before/with merge

  • MEDIUM pytest.ini:10testpaths = packages makes the README's documented cd app-tests/git-leak && pytest --boot-scopes=50 error with unrecognized arguments: --boot-scopes. Self-root the suite + use a targeted ignore.
  • MEDIUM test_resilience.py:123 — the bounce test's repo_locks > baseline signal is order-dependent across files (shared policy-repo-0000 + non-purging caches on master); assert GET /scopes/post-bounce/policy == 200 instead.

Follow-ups (LOW)

  • helpers.py:274 — add a timeout= to compose() so a wedged setup fails fast instead of hitting the CI job limit.
  • helpers.py:148delete_all_scopes burns ~40s/test waiting on a drain that can't happen on master; short-circuit it.
  • seed_gitea.py:159 — remove the unused /seed-output/token artifact + volume.
  • seed_gitea.py:108 — make push_url credential injection scheme-agnostic (urllib.parse).

Doc nit (not posted inline): the bounce-test docstring (test_resilience.py:85-93) still describes pre-PER-15065 "graceful shutdown + gunicorn respawn over the recovered broadcaster"; reconnect is now in-place — worth a 3-line update.

Recommend merging PR1 first so PR2 (#923) / PR3 (#924) can build on the test bed + /internal endpoint, then addressing the two MEDIUMs.

Comment thread pytest.ini
# not be collected by the repo's CI `pytest` run. testpaths only applies when
# pytest is invoked from the rootdir with no args, so running it explicitly
# (e.g. `cd app-tests/git-leak && pytest`) still works.
testpaths = packages

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] testpaths = packages breaks the documented way to run this suite

Problem: testpaths resolves against the rootdir (the repo root, where this pytest.ini lives) and applies whenever no positional path is given — independent of cwd. So the README's documented cd app-tests/git-leak && python -m pytest --boot-scopes=50 collects packages/ instead of this directory, never loads app-tests/git-leak/conftest.py (where --boot-scopes is registered), and errors with unrecognized arguments: --boot-scopes. It also silently drops any current/future test outside packages/ from the CI pytest run.

Suggestion: Add a pytest.ini (or a rootdir marker) under app-tests/git-leak/ so the suite is self-rooted and runnable in place, and replace the global testpaths = packages with a targeted --ignore=app-tests / norecursedirs. Please confirm by running the exact documented command.

synced = False
deadline = time.time() + 120
while time.time() < deadline:
if opal.stats()["repo_locks"] > baseline_locks:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Bounce-test signal is order-dependent across test files

Problem: The success check keys on opal.stats()["repo_locks"] > baseline_locks — a process-global, source_id(URL)-keyed counter shared with test_leak.py (both use policy-repo-0000). On master delete_scope never purges the caches (the leak this suite gates), so the opal fixture's delete_all_scopes can't drain repo_locks. If policy-repo-0000 is already cached, the post-bounce PUT adds no new key → the delta check never trips → false "broadcaster did not recover" failure. It passes today only because the preceding offline test's hard_reset() restarts the process and flushes Redis; reorder (e.g. run after test_leak without the offline test in between) and it false-fails.

Suggestion: Point the post-bounce PUT at a dedicated never-synced repo, or assert on the specific scope being served (GET /scopes/post-bounce/policy == 200) instead of a delta on a shared process-global counter.

re-raise with the captured stdout/stderr embedded, otherwise a
broken build/seed/ restart is opaque to debug.
"""
proc = subprocess.run(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] compose() has no subprocess timeout — fixture setup can hang to the CI job limit

Problem: subprocess.run(["docker", "compose", *args], capture_output=True, text=True) is called with no timeout=. @pytest.mark.timeout(...) on tests does not cover session-scoped fixture setup, so a wedged docker compose up -d --build or a compose("wait", "seed") on a stuck seed container blocks until the workflow's 60-minute job limit rather than failing fast. (Shell-injection is not a concern — args are a list, no shell=True.)

Suggestion: Pass a generous timeout= to compose() (at least for up/wait/stop/start) and raise a clear error on expiry. Optionally probe docker compose version for the availability skip instead of only shutil.which("docker").

except Exception:
pass
self._created_scopes.clear()
deadline = time.time() + drain_timeout

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] delete_all_scopes burns the full drain timeout twice per test

Problem: On master the caches never purge on delete (the leak this suite gates), so the while time.time() < deadline loop waiting for repo_locks == 0 always runs its full drain_timeout=20s. This runs in the opal fixture's setup and teardown → roughly 40s of guaranteed dead wait per test, for every test, for the entire expected-failing lifetime of the suite.

Suggestion: Short-circuit the drain on the known-not-purged path (cap it at a couple of seconds, or skip the drain in teardown), since waiting the full timeout for a state that can't occur on master is pure overhead.

return 1

# write the token where the test harness can read it
Path("/seed-output/token").write_text(token)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Dead artifact — seed token written for a harness that never reads it

Problem: Path("/seed-output/token").write_text(token) (plus the seed-output volume in docker-compose.yml) is unused. The host-side harness (GiteaAdmin / OpalServerClient in helpers.py) authenticates with HTTP basic auth (opaladmin/opaladmin), never the token, and can't read a path inside the seed container's volume anyway.

Suggestion: Remove the token write and the seed-output volume, or actually consume the token from the host. Dead test-bed plumbing rots and misleads later readers.

repo.index.commit("seed policy", author=author, committer=author)

push_url = (
base_url.replace("http://", f"http://{user}:{token}@") + f"/{user}/{name}.git"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] push_url credential injection assumes http://

Problem: push_url = base_url.replace("http://", f"http://{user}:{token}@") + f"/{user}/{name}.git" is a no-op if GITEA_URL is ever https://, silently dropping the credentials and producing an opaque auth failure that the per-repo except records as a generic seed error.

Suggestion: Inject credentials scheme-agnostically via urllib.parse (split, set netloc = f"{user}:{token}@{host}", unsplit) so it works for both http/https.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants