Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
15 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -137,3 +137,6 @@ dmypy.json
*.iml

.DS_Store

# Private Claude Code working artifacts (plans/specs) — never commit
.claude/
71 changes: 71 additions & 0 deletions app-tests/git-leak/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# OPAL git-leak / resilience test bed

Reproduces (as failing tests) the four issues fixed by PR2–PR5: memory leak,
offline-repo hang, slow serial boot, broadcaster no-reconnect.

Every assertion is driven through `GET /internal/git-fetcher-cache-stats`, which
**this PR (PR1) adds** — it does not exist on `master`. So the suite runs against
*this branch*: the leak/offline tests fail here *until PR2/PR3 land*, then go
green. Run against true `master` they would all error at setup on the missing
endpoint, not "fail for the targeted bug."

## Stack
- `opal_server` (single worker, scopes on, Postgres broadcaster, built from `docker/Dockerfile`)
- `redis`, `postgres`, `gitea` (+ one-shot `gitea-admin` and `seed` sidecars)
- `blackhole` (alpine/socat: accepts TCP then never answers — the offline repo)

Only `opal_server` (`:7002`) and `gitea` (`:13000` on the host) are published;
Postgres and `blackhole` are internal to the compose network.

## Helpers (`helpers.py`)
- `OpalServerClient` — drive opal over HTTP (`stats`, `put_scope`, `delete_scope`,
`refresh_all`, `get_scope_policy`, `list_scope_ids`, `delete_all_scopes`).
- `GiteaAdmin` — host-side Gitea admin client (`list_repos`, `repo_exists`,
`create_repo`, `delete_repo`); also exposed as the `gitea_admin` pytest fixture.
- `make_repo_unreachable(name)` — git URL on the `blackhole` sidecar (completes
the TCP handshake, never answers) so the clone hangs for the offline-repo test.
- `bounce_postgres(down_seconds)` — stop Postgres, then `up -d --wait` it back to
simulate a broadcaster outage and await readiness before the recovery poll.

## Run
```bash
cd app-tests/git-leak
python -m pytest -v --boot-scopes=50 # full set
python -m pytest test_leak.py -v --boot-scopes=20 # just the leak gates
```
Useful flags: `--boot-scopes=N` (any N), `--keep-stack` (skip teardown),
env `BOOT_TARGET_SECONDS=120` (tighten the boot gate).

## Expected behavior
The churn leak test (`test_churn_releases_caches`) and the offline-repo test
Comment thread
dshoen619 marked this conversation as resolved.
FAIL on this branch *without the PR2/PR3 fix* — they target unfixed bugs and
become the regression gates for PR2/PR3, flipping green when those land. The
boot test passes but fails when `BOOT_TARGET_SECONDS` is set low (PR4's gate).

Two tests are guards that PASS rather than reproducing a current failure:
- `test_repeat_sync_does_not_grow` — clone paths are keyed by the repo URL, so
re-syncing identical scopes reuses cache entries and the cache *counts* can't
grow for any implementation; the load-bearing assertion is therefore on RSS,
guarding against a regression that leaks per-sync allocations.
- `test_server_recovers_after_postgres_bounce` — when the broadcaster drops, the
worker is respawned by gunicorn and the broadcaster reconnects once Postgres
is back; the test PUTs a fresh scope post-bounce and asserts it syncs, proving
the broadcast path (not just HTTP) recovered.

## Requires
Docker + docker compose v2, plus host Python with `pytest pytest-timeout requests GitPython`.

## Notes
- Auth is disabled in the stack: `OPAL_AUTH_PUBLIC_KEY` is left unset so the JWT
verifier is disabled and the harness can call scope routes without minting JWTs.
Local test bed only; never a production setting. (The `/internal` endpoint is
registered with the same `JWTAuthenticator` dependency as the other routes, so
it is protected when JWT verification is enabled and open only here.)
- The server runs a **single** uvicorn worker. The `GitPolicyFetcher` caches read
by `/internal/git-fetcher-cache-stats` are per-process, so a multi-worker stack
would make a round-robin read miss the worker that fetched and let a `== 0`
drain assertion pass falsely. One worker makes every cache read deterministic;
the leak/boot/offline bugs all reproduce single-worker.
- First-sync of a fresh scope takes the clone path, which fills only `repo_locks`;
`repos` / `repos_last_fetched` are filled by the discover/fetch path on a second
sync, so the load helpers issue a `refresh_all()` before asserting on `repos`.
86 changes: 86 additions & 0 deletions app-tests/git-leak/conftest.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
import os
import shutil

import pytest
from helpers import (
HEALTHY_PROBE_REPO,
GiteaAdmin,
OpalServerClient,
compose,
list_seeded_repos,
)


def pytest_addoption(parser):
parser.addoption(
"--boot-scopes",
action="store",
default="50",
help="number of repos to seed/boot (default 50)",
)
parser.addoption(
"--keep-stack",
action="store_true",
default=False,
help="do not tear the compose stack down after the run",
)


@pytest.fixture(scope="session")
def repo_count(request) -> int:
return int(request.config.getoption("--boot-scopes"))


@pytest.fixture(scope="session")
def stack(request, repo_count):
# Defense-in-depth: this docker-compose suite is already excluded from the
# repo's default `pytest` run via `testpaths = packages` in pytest.ini, so
# the unit-test CI matrix never collects it. If it is ever collected in an
# environment without docker, skip cleanly instead of erroring.
if shutil.which("docker") is None:
pytest.skip("docker (compose) is required for the git-leak test bed")
os.environ["REPO_COUNT"] = str(repo_count)
# build + start infra; seed runs to completion then exits
compose("up", "-d", "--build")
# block until seeding sidecar has finished creating repos. compose() raises
# (with output) if the seed container exited non-zero, so a hard seed
# failure surfaces here rather than as a confusing later test failure.
compose("wait", "seed")
Comment thread
dshoen619 marked this conversation as resolved.
Comment thread
dshoen619 marked this conversation as resolved.
# Verify the seed actually produced all N repos before any test runs: a
# partial seed would otherwise look like a server bug when the load gate
# can't reach N. Fail loudly with the gap.
# include the reserved probe repo the resilience test relies on, so a
# partial seed of it is caught here too rather than as a later test failure
expected = set(list_seeded_repos(repo_count)) | {HEALTHY_PROBE_REPO}
present = set(GiteaAdmin().list_repos())
missing = expected - present
assert not missing, (
f"seed incomplete: {len(missing)}/{repo_count} repos missing "
f"(e.g. {sorted(missing)[:5]})"
)
client = OpalServerClient()
client.wait_healthy()
yield client
if not request.config.getoption("--keep-stack"):
compose("down", "-v")


@pytest.fixture()
def opal(stack) -> OpalServerClient:
# The compose stack is session-scoped (one server for the whole run), but
# scopes must not leak between tests: clone paths are keyed by repo URL, so
# a scope left behind by one test shares a cache entry with any later test
# using the same seeded repo and would pollute its drain assertions.
#
# Delete every scope the *server* currently knows (not just this client's
# tracked set) at setup, so a scope orphaned by a prior failed test can't
# contaminate this one; then again on teardown.
stack.delete_all_scopes()
yield stack
stack.delete_all_scopes()


@pytest.fixture(scope="session")
def gitea_admin(stack) -> GiteaAdmin:
"""Host-side Gitea admin client (depends on `stack` so Gitea is up)."""
return GiteaAdmin()
126 changes: 126 additions & 0 deletions app-tests/git-leak/docker-compose.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,126 @@
name: opal-git-leak-test

services:
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 2s
timeout: 3s
retries: 30

postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: opal
POSTGRES_PASSWORD: opal
POSTGRES_DB: opal
# not published to the host: only opal_server reaches it over the compose
# network, and bounce_postgres() uses `docker compose stop/start`. Publishing
# 5432 would collide with any Postgres already running on the host.
healthcheck:
test: ["CMD-SHELL", "pg_isready -U opal"]
interval: 2s
timeout: 3s
retries: 30

gitea:
image: gitea/gitea:1.21
environment:
GITEA__security__INSTALL_LOCK: "true"
GITEA__server__ROOT_URL: "http://gitea:3000/"
GITEA__database__DB_TYPE: "sqlite3"
# published on 13000 (not 3000) for the host-side GiteaAdmin helper; the
# uncommon port avoids the usual :3000 clash. opal_server and the seed
# sidecar still reach it over the compose network via http://gitea:3000.
ports:
- "13000:3000"
volumes:
- gitea-data:/data
healthcheck:
test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/v1/version || exit 1"]
interval: 3s
timeout: 5s
retries: 40

gitea-admin:
# creates the admin user once gitea is healthy
image: gitea/gitea:1.21
depends_on:
gitea:
condition: service_healthy
user: git
entrypoint: ["/bin/sh", "-c"]
# Tolerate only the idempotent "already exists" case; any other failure
# must abort so `seed` (which depends on this completing) doesn't run
# against a Gitea with no admin user and fail with a confusing 401.
command:
- >
out=$$(gitea admin user create --username opaladmin --password opaladmin
--email admin@example.com --admin --must-change-password=false
--config /data/gitea/conf/app.ini 2>&1); rc=$$?;
echo "$$out";
if [ $$rc -ne 0 ] && ! echo "$$out" | grep -qi "already exist"; then
exit $$rc;
fi
volumes:
- gitea-data:/data
restart: "no"

blackhole:
# Accepts the TCP handshake then never answers — a clone connects and
# blocks reading the git smart-HTTP response, holding the fetch executor.
# Deterministic, unlike a TEST-NET-1 address which many networks reject
# fast with ICMP-unreachable (so the clone would fail fast, not hang).
image: alpine/socat:1.8.0.3
command: ["TCP-LISTEN:80,fork,reuseaddr", "SYSTEM:sleep 3600"]

seed:
build: ./seed
depends_on:
gitea:
condition: service_healthy
gitea-admin:
condition: service_completed_successfully
environment:
GITEA_URL: "http://gitea:3000"
GITEA_ADMIN_USER: "opaladmin"
GITEA_ADMIN_PASSWORD: "opaladmin"
REPO_COUNT: "${REPO_COUNT:-50}"
volumes:
- seed-output:/seed-output
restart: "no"

opal_server:
build:
context: ../..
dockerfile: docker/Dockerfile
target: server
environment:
# Single worker on purpose: the GitPolicyFetcher caches read by
# /internal/git-fetcher-cache-stats are per-process, so with >1 worker a
# round-robin read can miss the worker that fetched and make a `== 0`
# drain assertion pass falsely. One worker makes every cache read
# deterministic. The leak/boot/offline bugs all reproduce single-worker.
UVICORN_NUM_WORKERS: "1"
OPAL_SCOPES: "1"
OPAL_REDIS_URL: "redis://redis:6379"
OPAL_BROADCAST_URI: "postgres://opal:opal@postgres:5432/opal"
OPAL_BASE_DIR: "/opal"
OPAL_POLICY_REFRESH_INTERVAL: "0"
OPAL_DEBUG_INTERNAL_STATS: "1"
# OPAL_AUTH_PUBLIC_KEY is intentionally left unset: with no public key the
# JWT verifier is disabled, so the harness can call scope routes without
# minting JWTs. Local test bed only; never a production setting.
OPAL_LOG_FORMAT_INCLUDE_PID: "true"
ports:
- "7002:7002"
depends_on:
redis:
condition: service_healthy
postgres:
condition: service_healthy

volumes:
gitea-data:
seed-output:
Loading
Loading