test(opal-server): git leak/resilience test environment (PR1) #922
Open
dshoen619 wants to merge 15 commits into master from david/per-15155-pr1-git-leakresilience-test-environment-one-big-pr
Commits
1f9c816 feat(opal-server): gated /internal git-fetcher cache stats endpoint (dshoen619)
c176ce9 test(git-leak): add OPAL git leak/resilience test bed (dshoen619)
bd49676 test(git-leak): add GiteaAdmin and make_repo_unreachable helpers (dshoen619)
afeb969 test(git-leak): correct postgres-bounce framing (passes on master) (dshoen619)
92353f6 style(git-leak): apply black/isort/docformatter (pre-commit) (dshoen619)
db54ae6 test: scope root pytest collection to packages/ (exclude git-leak bed) (dshoen619)
6f9a089 test(git-leak): address Copilot review feedback (dshoen619)
1b23ac0 test(git-leak): isolate scopes per test and fix false repeat-sync gate (dshoen619)
6046a10 test(git-leak): make the regression gates trustworthy (address PR rev… (dshoen619)
f810db8 style(git-leak): apply black/isort/docformatter (pre-commit) (dshoen619)
82ff33a test(git-leak): tighten stat polling and pin test-bed images (PR review) (dshoen619)
d719c34 Merge branch 'master' into david/per-15155-pr1-git-leakresilience-tes… (dshoen619)
75ad43a test(git-leak): isolate offline-hang healthy probe to a never-cloned … (dshoen619)
15f3cfe test(git-leak): harden harness teardown and tighten assertions (PR re… (dshoen619)
a502f2e Merge branch 'master' into david/per-15155-pr1-git-leakresilience-tes… (dshoen619)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -137,3 +137,6 @@ dmypy.json | |
| *.iml | ||
|
|
||
| .DS_Store | ||
|
|
||
| # Private Claude Code working artifacts (plans/specs) — never commit | ||
| .claude/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| # OPAL git-leak / resilience test bed | ||
|
|
||
| Reproduces (as failing tests) the four issues fixed by PR2–PR5: memory leak, | ||
| offline-repo hang, slow serial boot, broadcaster no-reconnect. | ||
|
|
||
| Every assertion is driven through `GET /internal/git-fetcher-cache-stats`, which | ||
| **this PR (PR1) adds** — it does not exist on `master`. So the suite runs against | ||
| *this branch*: the leak/offline tests fail here *until PR2/PR3 land*, then go | ||
| green. Run against true `master` they would all error at setup on the missing | ||
| endpoint, not "fail for the targeted bug." | ||
|
|
||
| ## Stack | ||
| - `opal_server` (single worker, scopes on, Postgres broadcaster, built from `docker/Dockerfile`) | ||
| - `redis`, `postgres`, `gitea` (+ one-shot `gitea-admin` and `seed` sidecars) | ||
| - `blackhole` (alpine/socat: accepts TCP then never answers — the offline repo) | ||
|
|
||
| Only `opal_server` (`:7002`) and `gitea` (`:13000` on the host) are published; | ||
| Postgres and `blackhole` are internal to the compose network. | ||
|
|
||
| ## Helpers (`helpers.py`) | ||
| - `OpalServerClient` — drive opal over HTTP (`stats`, `put_scope`, `delete_scope`, | ||
| `refresh_all`, `get_scope_policy`, `list_scope_ids`, `delete_all_scopes`). | ||
| - `GiteaAdmin` — host-side Gitea admin client (`list_repos`, `repo_exists`, | ||
| `create_repo`, `delete_repo`); also exposed as the `gitea_admin` pytest fixture. | ||
| - `make_repo_unreachable(name)` — git URL on the `blackhole` sidecar (completes | ||
| the TCP handshake, never answers) so the clone hangs for the offline-repo test. | ||
| - `bounce_postgres(down_seconds)` — stop Postgres, then `up -d --wait` it back to | ||
| simulate a broadcaster outage and await readiness before the recovery poll. | ||
|
|
||
| ## Run | ||
| ```bash | ||
| cd app-tests/git-leak | ||
| python -m pytest -v --boot-scopes=50 # full set | ||
| python -m pytest test_leak.py -v --boot-scopes=20 # just the leak gates | ||
| ``` | ||
| Useful flags: `--boot-scopes=N` (any N), `--keep-stack` (skip teardown), | ||
| env `BOOT_TARGET_SECONDS=120` (tighten the boot gate). | ||
|
|
||
| ## Expected behavior | ||
| The churn leak test (`test_churn_releases_caches`) and the offline-repo test | ||
| FAIL on this branch *without the PR2/PR3 fix* — they target unfixed bugs and | ||
| become the regression gates for PR2/PR3, flipping green when those land. The | ||
| boot test passes but fails when `BOOT_TARGET_SECONDS` is set low (PR4's gate). | ||
|
|
||
| Two tests are guards that PASS rather than reproducing a current failure: | ||
| - `test_repeat_sync_does_not_grow` — clone paths are keyed by the repo URL, so | ||
| re-syncing identical scopes reuses cache entries and the cache *counts* can't | ||
| grow for any implementation; the load-bearing assertion is therefore on RSS, | ||
| guarding against a regression that leaks per-sync allocations. | ||
| - `test_server_recovers_after_postgres_bounce` — when the broadcaster drops, the | ||
| worker is respawned by gunicorn and the broadcaster reconnects once Postgres | ||
| is back; the test PUTs a fresh scope post-bounce and asserts it syncs, proving | ||
| the broadcast path (not just HTTP) recovered. | ||
|
|
||
| ## Requires | ||
| Docker + docker compose v2, plus host Python with `pytest pytest-timeout requests GitPython`. | ||
|
|
||
| ## Notes | ||
| - Auth is disabled in the stack: `OPAL_AUTH_PUBLIC_KEY` is left unset so the JWT | ||
| verifier is disabled and the harness can call scope routes without minting JWTs. | ||
| Local test bed only; never a production setting. (The `/internal` endpoint is | ||
| registered with the same `JWTAuthenticator` dependency as the other routes, so | ||
| it is protected when JWT verification is enabled and open only here.) | ||
| - The server runs a **single** uvicorn worker. The `GitPolicyFetcher` caches read | ||
| by `/internal/git-fetcher-cache-stats` are per-process, so a multi-worker stack | ||
| would make a round-robin read miss the worker that fetched and let a `== 0` | ||
| drain assertion pass falsely. One worker makes every cache read deterministic; | ||
| the leak/boot/offline bugs all reproduce single-worker. | ||
| - First-sync of a fresh scope takes the clone path, which fills only `repo_locks`; | ||
| `repos` / `repos_last_fetched` are filled by the discover/fetch path on a second | ||
| sync, so the load helpers issue a `refresh_all()` before asserting on `repos`. | ||
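The single-worker note in the README above can be illustrated with a toy model. This is a minimal sketch, not OPAL code: `Worker`, `fetch`, `stats`, and `read_round_robin` are hypothetical names, `"repo-a"` is a placeholder URL, and strict round-robin dispatch is an assumption that a real process manager may not follow. The only details taken from the README are that the caches are per-process and that a `repos` count is read back.

```python
from __future__ import annotations

from itertools import cycle


class Worker:
    """Toy stand-in for one server process with its own in-memory cache."""

    def __init__(self) -> None:
        self.repos: dict[str, str] = {}

    def fetch(self, url: str) -> None:
        # The cache lives in this process only; other workers never see it.
        self.repos[url] = "cloned"

    def stats(self) -> dict[str, int]:
        return {"repos": len(self.repos)}


def read_round_robin(workers: list[Worker], reads: int) -> list[int]:
    """Return the `repos` count seen by each of `reads` successive requests."""
    dispatch = cycle(workers)  # assumption: strict round-robin dispatch
    return [next(dispatch).stats()["repos"] for _ in range(reads)]


if __name__ == "__main__":
    two = [Worker(), Worker()]
    two[0].fetch("repo-a")  # only the first worker ever fetched
    print("two workers:", read_round_robin(two, 4))

    one = [Worker()]
    one[0].fetch("repo-a")
    print("one worker: ", read_round_robin(one, 4))
```

In the two-worker case, every other read lands on the worker that never fetched and reports zero, which is how a `== 0` drain assertion could pass without anything having been drained. With one worker every read sees the same cache.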
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| import os | ||
| import shutil | ||
|
|
||
| import pytest | ||
| from helpers import ( | ||
| HEALTHY_PROBE_REPO, | ||
| GiteaAdmin, | ||
| OpalServerClient, | ||
| compose, | ||
| list_seeded_repos, | ||
| ) | ||
|
|
||
|
|
||
| def pytest_addoption(parser): | ||
| parser.addoption( | ||
| "--boot-scopes", | ||
| action="store", | ||
| default="50", | ||
| help="number of repos to seed/boot (default 50)", | ||
| ) | ||
| parser.addoption( | ||
| "--keep-stack", | ||
| action="store_true", | ||
| default=False, | ||
| help="do not tear the compose stack down after the run", | ||
| ) | ||
|
|
||
|
|
||
| @pytest.fixture(scope="session") | ||
| def repo_count(request) -> int: | ||
| return int(request.config.getoption("--boot-scopes")) | ||
|
|
||
|
|
||
| @pytest.fixture(scope="session") | ||
| def stack(request, repo_count): | ||
| # Defense-in-depth: this docker-compose suite is already excluded from the | ||
| # repo's default `pytest` run via `testpaths = packages` in pytest.ini, so | ||
| # the unit-test CI matrix never collects it. If it is ever collected in an | ||
| # environment without docker, skip cleanly instead of erroring. | ||
| if shutil.which("docker") is None: | ||
| pytest.skip("docker (compose) is required for the git-leak test bed") | ||
| os.environ["REPO_COUNT"] = str(repo_count) | ||
| # build + start infra; seed runs to completion then exits | ||
| compose("up", "-d", "--build") | ||
| # block until seeding sidecar has finished creating repos. compose() raises | ||
| # (with output) if the seed container exited non-zero, so a hard seed | ||
| # failure surfaces here rather than as a confusing later test failure. | ||
| compose("wait", "seed") | ||
|
|
||
| # Verify the seed actually produced all N repos before any test runs: a | ||
| # partial seed would otherwise look like a server bug when the load gate | ||
| # can't reach N. Fail loudly with the gap. | ||
| # include the reserved probe repo the resilience test relies on, so a | ||
| # partial seed of it is caught here too rather than as a later test failure | ||
| expected = set(list_seeded_repos(repo_count)) | {HEALTHY_PROBE_REPO} | ||
| present = set(GiteaAdmin().list_repos()) | ||
| missing = expected - present | ||
| assert not missing, ( | ||
| f"seed incomplete: {len(missing)}/{repo_count} repos missing " | ||
| f"(e.g. {sorted(missing)[:5]})" | ||
| ) | ||
| client = OpalServerClient() | ||
| client.wait_healthy() | ||
| yield client | ||
| if not request.config.getoption("--keep-stack"): | ||
| compose("down", "-v") | ||
|
|
||
|
|
||
| @pytest.fixture() | ||
| def opal(stack) -> OpalServerClient: | ||
| # The compose stack is session-scoped (one server for the whole run), but | ||
| # scopes must not leak between tests: clone paths are keyed by repo URL, so | ||
| # a scope left behind by one test shares a cache entry with any later test | ||
| # using the same seeded repo and would pollute its drain assertions. | ||
| # | ||
| # Delete every scope the *server* currently knows (not just this client's | ||
| # tracked set) at setup, so a scope orphaned by a prior failed test can't | ||
| # contaminate this one; then again on teardown. | ||
| stack.delete_all_scopes() | ||
| yield stack | ||
| stack.delete_all_scopes() | ||
|
|
||
|
|
||
| @pytest.fixture(scope="session") | ||
| def gitea_admin(stack) -> GiteaAdmin: | ||
| """Host-side Gitea admin client (depends on `stack` so Gitea is up).""" | ||
| return GiteaAdmin() | ||
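The seed-completeness check in the `stack` fixture above is a set difference between the repos the run expects and the repos Gitea reports. The sketch below isolates that step as a pure function so it can be exercised without Docker. `seed_gap` and `check_seed` are hypothetical names, not helpers from this PR, and the repo names in the usage example are placeholders. One deliberate difference: the message here counts against the size of the expected set (which includes the probe repo), whereas the fixture reports against `repo_count`.

```python
from __future__ import annotations

from typing import Iterable


def seed_gap(expected: Iterable[str], present: Iterable[str]) -> list[str]:
    """Return expected repo names that are absent, sorted for stable output."""
    return sorted(set(expected) - set(present))


def check_seed(
    expected: Iterable[str], present: Iterable[str], sample: int = 5
) -> None:
    """Raise AssertionError naming a sample of missing repos, if any."""
    wanted = set(expected)
    missing = seed_gap(wanted, present)
    if missing:
        raise AssertionError(
            f"seed incomplete: {len(missing)}/{len(wanted)} repos missing "
            f"(e.g. {missing[:sample]})"
        )


if __name__ == "__main__":
    # Placeholder names; the real seeded names come from the seed sidecar.
    expected = ["repo-0", "repo-1", "probe"]
    check_seed(expected, ["repo-0", "repo-1", "probe", "extra"])  # passes
    try:
        check_seed(expected, ["repo-0"])
    except AssertionError as exc:
        print(exc)
```

Extra repos on the Gitea side are ignored; only a shortfall fails the check, which matches the fixture's intent of catching a partial seed before any test runs.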
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,126 @@ | ||
| name: opal-git-leak-test | ||
|
|
||
| services: | ||
| redis: | ||
| image: redis:7-alpine | ||
| healthcheck: | ||
| test: ["CMD", "redis-cli", "ping"] | ||
| interval: 2s | ||
| timeout: 3s | ||
| retries: 30 | ||
|
|
||
| postgres: | ||
| image: postgres:16-alpine | ||
| environment: | ||
| POSTGRES_USER: opal | ||
| POSTGRES_PASSWORD: opal | ||
| POSTGRES_DB: opal | ||
| # not published to the host: only opal_server reaches it over the compose | ||
| # network, and bounce_postgres() uses `docker compose stop/start`. Publishing | ||
| # 5432 would collide with any Postgres already running on the host. | ||
| healthcheck: | ||
| test: ["CMD-SHELL", "pg_isready -U opal"] | ||
| interval: 2s | ||
| timeout: 3s | ||
| retries: 30 | ||
|
|
||
| gitea: | ||
| image: gitea/gitea:1.21 | ||
| environment: | ||
| GITEA__security__INSTALL_LOCK: "true" | ||
| GITEA__server__ROOT_URL: "http://gitea:3000/" | ||
| GITEA__database__DB_TYPE: "sqlite3" | ||
| # published on 13000 (not 3000) for the host-side GiteaAdmin helper; the | ||
| # uncommon port avoids the usual :3000 clash. opal_server and the seed | ||
| # sidecar still reach it over the compose network via http://gitea:3000. | ||
| ports: | ||
| - "13000:3000" | ||
| volumes: | ||
| - gitea-data:/data | ||
| healthcheck: | ||
| test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/v1/version || exit 1"] | ||
| interval: 3s | ||
| timeout: 5s | ||
| retries: 40 | ||
|
|
||
| gitea-admin: | ||
| # creates the admin user once gitea is healthy | ||
| image: gitea/gitea:1.21 | ||
| depends_on: | ||
| gitea: | ||
| condition: service_healthy | ||
| user: git | ||
| entrypoint: ["/bin/sh", "-c"] | ||
| # Tolerate only the idempotent "already exists" case; any other failure | ||
| # must abort so `seed` (which depends on this completing) doesn't run | ||
| # against a Gitea with no admin user and fail with a confusing 401. | ||
| command: | ||
| - > | ||
| out=$$(gitea admin user create --username opaladmin --password opaladmin | ||
| --email admin@example.com --admin --must-change-password=false | ||
| --config /data/gitea/conf/app.ini 2>&1); rc=$$?; | ||
| echo "$$out"; | ||
| if [ $$rc -ne 0 ] && ! echo "$$out" | grep -qi "already exist"; then | ||
| exit $$rc; | ||
| fi | ||
| volumes: | ||
| - gitea-data:/data | ||
| restart: "no" | ||
|
|
||
| blackhole: | ||
| # Accepts the TCP handshake then never answers — a clone connects and | ||
| # blocks reading the git smart-HTTP response, holding the fetch executor. | ||
| # Deterministic, unlike a TEST-NET-1 address which many networks reject | ||
| # fast with ICMP-unreachable (so the clone would fail fast, not hang). | ||
| image: alpine/socat:1.8.0.3 | ||
| command: ["TCP-LISTEN:80,fork,reuseaddr", "SYSTEM:sleep 3600"] | ||
|
|
||
| seed: | ||
| build: ./seed | ||
| depends_on: | ||
| gitea: | ||
| condition: service_healthy | ||
| gitea-admin: | ||
| condition: service_completed_successfully | ||
| environment: | ||
| GITEA_URL: "http://gitea:3000" | ||
| GITEA_ADMIN_USER: "opaladmin" | ||
| GITEA_ADMIN_PASSWORD: "opaladmin" | ||
| REPO_COUNT: "${REPO_COUNT:-50}" | ||
| volumes: | ||
| - seed-output:/seed-output | ||
| restart: "no" | ||
|
|
||
| opal_server: | ||
| build: | ||
| context: ../.. | ||
| dockerfile: docker/Dockerfile | ||
| target: server | ||
| environment: | ||
| # Single worker on purpose: the GitPolicyFetcher caches read by | ||
| # /internal/git-fetcher-cache-stats are per-process, so with >1 worker a | ||
| # round-robin read can miss the worker that fetched and make a `== 0` | ||
| # drain assertion pass falsely. One worker makes every cache read | ||
| # deterministic. The leak/boot/offline bugs all reproduce single-worker. | ||
| UVICORN_NUM_WORKERS: "1" | ||
| OPAL_SCOPES: "1" | ||
| OPAL_REDIS_URL: "redis://redis:6379" | ||
| OPAL_BROADCAST_URI: "postgres://opal:opal@postgres:5432/opal" | ||
| OPAL_BASE_DIR: "/opal" | ||
| OPAL_POLICY_REFRESH_INTERVAL: "0" | ||
| OPAL_DEBUG_INTERNAL_STATS: "1" | ||
| # OPAL_AUTH_PUBLIC_KEY is intentionally left unset: with no public key the | ||
| # JWT verifier is disabled, so the harness can call scope routes without | ||
| # minting JWTs. Local test bed only; never a production setting. | ||
| OPAL_LOG_FORMAT_INCLUDE_PID: "true" | ||
| ports: | ||
| - "7002:7002" | ||
| depends_on: | ||
| redis: | ||
| condition: service_healthy | ||
| postgres: | ||
| condition: service_healthy | ||
|
|
||
| volumes: | ||
| gitea-data: | ||
| seed-output: |
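The `blackhole` service above relies on one behavior: the TCP handshake completes, then the peer never sends a byte. The following is a rough standard-library analogue of that behavior, useful for seeing why a client without a read timeout would block. It is a sketch, not what the compose stack runs (that is `alpine/socat`); `start_blackhole` and `probe` are hypothetical names, and the request line sent by `probe` is only an illustrative payload.

```python
import socket
import threading


def start_blackhole(host: str = "127.0.0.1"):
    """Listen on an ephemeral port, accept connections, and never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, 0))  # port 0 lets the OS pick a free port
    srv.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_loop() -> None:
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, held


def probe(host: str, port: int, timeout: float = 0.5) -> str:
    """Connect, send a request, and report whether any reply arrives."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"GET /info/refs HTTP/1.1\r\nHost: blackhole\r\n\r\n")
        try:
            sock.recv(1)
        except socket.timeout:
            return "connected, no response"
    return "got response or EOF"


if __name__ == "__main__":
    server, _held = start_blackhole()
    port = server.getsockname()[1]
    print(probe("127.0.0.1", port))
    server.close()
```

The connect succeeds immediately and only the read times out. A client that sets no read timeout would wait indefinitely at the same point, which is the hang the offline-repo test is built to provoke.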