feat(opal-server): parallel scope loading — bounded-concurrency boot (PR4)#932
Open
Zivxx wants to merge 31 commits into
Open
feat(opal-server): parallel scope loading — bounded-concurrency boot (PR4)#932Zivxx wants to merge 31 commits into
Zivxx wants to merge 31 commits into
Conversation
Add an off-by-default diagnostics endpoint so tests can observe the in-memory GitPolicyFetcher cache sizes (repo_locks/repos/repos_last_fetched) and process RSS that the upcoming memory-leak fix eliminates. - debug_stats.py: read-only git_fetcher_cache_stats() helper + a register_internal_stats_route() registrar that mounts GET /internal/git-fetcher-cache-stats only when enabled. - config.py: new OPAL_DEBUG_INTERNAL_STATS flag, default False. - server.py: register the route, gated by the flag, beside /healthcheck. No production behavior change when the flag is off (the default). Also ignore .claude/ so private planning artifacts are never committed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A self-contained docker-compose stack (opal-server x2 workers + Redis + Postgres broadcaster + Gitea) plus a pytest harness that reproduces, as tests that fail on master, the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect gap. These become the regression gates for the follow-up fixes. - seed/: idempotent Gitea seeding sidecar (N policy repos) + Dockerfile. - docker-compose.yml: 4-service stack, opal-server built from the repo's own docker/Dockerfile (server target), scopes on, Postgres broadcaster. - helpers.py / conftest.py: HTTP + infra helpers and stack fixtures. - test_leak.py / test_resilience.py / test_boot.py: the flagship tests. - README.md: how to run and expected fail-on-master behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Complete the helpers.py surface promised in the plan's file-structure table. Both are now functional and used, not dead code: - make_repo_unreachable(name): returns a git URL on a routable-but-dead TEST-NET-1 host (RFC 5737). test_offline_repo now uses it instead of an inlined literal. - GiteaAdmin: host-side Gitea admin client (list_repos / repo_exists / create_repo / delete_repo), exposed as the `gitea_admin` pytest fixture for tests that need to inspect or stage repos beyond the seed sidecar. Gitea is published on host port 13000 (uncommon, to avoid the usual :3000 clash) so GiteaAdmin can reach it; opal_server and the seed sidecar still use the internal http://gitea:3000. README updated with the helper and port notes. Verified live: GiteaAdmin lists the seeded repos and round-trips create/exists/delete against Gitea over the published port. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified test_server_recovers_after_postgres_bounce against the stack: it PASSES on master (~14-19s). On a broadcaster drop the affected worker triggers a graceful shutdown, gunicorn respawns it, and the sibling worker keeps serving HTTP, so the surface recovers within the window — recovery happens via gunicorn's in-container worker supervision, not an external supervisor and not an in-process reconnect. Reframe #5 as a recovery guard (not a known-broken case) in the docstring and README; the prior "FAILS on master / needs external supervisor" wording was wrong. PER-15065's in-process reconnect would avoid the worker churn but recovery already holds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run the repo's pinned pre-commit formatters (black 23.1.0, isort 5.12.0, docformatter 1.7.5) over the PR1 files to satisfy the pre-commit CI check. Formatting only — no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CI `build` job runs `pytest` from the repo root with no path, which recursed into app-tests/git-leak/ and ran the flagship tests — these are designed to FAIL on master (they are the regression gates for PR2-PR5), so they broke the build job. Set `testpaths = packages` so the rootdir run collects only the unit tests under packages/ (matching master's effective behavior, since app-tests/ had no pytest files before). testpaths only applies when pytest is invoked from the rootdir with no args, so `cd app-tests/git-leak && pytest` still collects and runs the test bed. Verified both: root run -> packages only; subdir run -> all 5 flagship tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- stats(): sample the /internal endpoint several times and merge per key with max(). The caches are per-process on a multi-worker server, so a single read can hit an empty (non-leader) worker — max-merge avoids both false negatives (missing a populated leader) and false positives (an `== 0` drain assertion passing only because it hit an empty worker). - test_leak: assert the initial-load `_wait_until` succeeded before deleting / before taking baseline, so the tests can't pass vacuously when load never completed. - refresh_all(): correct the misleading comment — a 404 is a no-op, there is no client-side fallback. - conftest: skip the suite cleanly if docker is unavailable (defense in depth; it's already excluded from the default pytest run via testpaths). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two review findings on the regression gates: - Cross-test contamination: clone paths are keyed by repo URL (source_id = sha256(url)+branch-shard), not scope_id, so scopes from different tests that point at the same seeded repo share one GitPolicyFetcher cache entry. With a session-scoped stack and no teardown, a leftover boot-*/stable-* scope kept those entries alive and would make test_churn's `repos == 0` drain assertion fail on fixed code. OpalServerClient now tracks created scopes and the opal fixture deletes them on teardown (best-effort drain wait, swallows errors so master — where delete never purges — doesn't fail the passing test). - False gate: test_repeat_sync_does_not_grow re-syncs identical scopes, which a path-keyed cache can't grow even on master, so it could never be the leak gate it claimed. Reframed as an honest idempotency guard (passes on master) that points at test_churn_releases_caches as the real leak gate; README's "Expected on master" reclassifies it alongside the postgres-bounce guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire scope clone/fetch through run_in_git_executor with SCOPES_GIT_FETCH_TIMEOUT, and broaden the _clone except to catch asyncio.TimeoutError so a hung clone is logged and the scope skipped instead of crashing the caller. Drop the now-unused run_sync import. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iew) Addresses the CHANGES_REQUESTED review on PR #922. Root cause behind most findings: a fresh scope's first sync takes the _clone() branch, which only fills GitPolicyFetcher.repo_locks; repos/repos_last_fetched are filled on a *second* sync. So the load gates on `repos` hung, and the 2-worker per-process caches made `== 0` drain assertions unsafe. Blocking: - Single worker (UVICORN_NUM_WORKERS=1): deterministic per-process cache reads; removes the false `== 0` drain class. - Load gate (CRITICAL + Zivxx HIGH): _load_scopes gates on repo_locks then refresh_all() to force the second sync, so repos/repos_last_fetched are actually populated before any drain/purge assertion. - compose() surfaces captured stdout/stderr on failure. - Seed completeness asserted in conftest; seed script isolates per-repo failures and exits non-zero with a count. Secondary: - test_repeat_sync asserts an RSS bound (count can't grow for any impl); churn asserts all three caches drain + a loose RSS backstop. - blackhole socat sidecar replaces TEST-NET-1 (deterministic hang); offline test saturates the fetch executor with 40 hung clones and recovers via OpalServerClient.hard_reset() (stop -> redis FLUSHALL -> start). - Per-test clean slate deletes all server scopes (fixes orphan-scope leak). - Postgres-bounce proves broadcast recovery (PUT post-bounce, assert sync) and uses `up -d --wait`. - Remove dead 404 branch in refresh_all; boot clock starts at restart; gitea-admin `|| true` -> "already exists"-only guard; README reworded. opal-server: - /internal stats route now takes the JWTAuthenticator dependency (protected when JWT on, no-op in the test bed); unit test asserts enforcement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- snapshot a single /internal stats read per poll in the churn-drain and boot gates (consistent multi-key observation; fewer HTTP round-trips) - document why a 200 from the healthy scope can't be a masked default bundle, and why the stats route is intentionally a sync def - pin alpine/socat and the seed image's pip deps for reproducibility Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On Python < 3.11 asyncio.TimeoutError is a distinct class from the builtin TimeoutError, so run_in_git_executor's wait_for timeout was not caught by `pytest.raises(TimeoutError)` — failing build (3.9)/(3.10). Normalize to the builtin TimeoutError so the documented contract holds on every supported Python, and update the _clone catch site to match. Also apply black/isort/docformatter formatting to satisfy pre-commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- run_in_git_executor: use asyncio.get_running_loop() instead of the deprecated get_event_loop() inside an async function - fetch_and_notify_on_changes: set repos_last_fetched only after a successful fetch so a timeout/error does not wrongly suppress a later force_fetch via _was_fetched_after - fetch_timeout_test: measure elapsed time with time.monotonic() Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fetch path let TimeoutError propagate to sync_scope's catch-all, which logged a full traceback at ERROR level for the expected unreachable-repo case — inconsistent with the clone path's quiet logger.error. Catch TimeoutError at the fetch site and log without a traceback, then skip (repos_last_fetched stays stale so the next cycle retries). Also shorten the hanging-thread sleeps in the timeout tests so the lingering pool thread doesn't delay process teardown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tuck-on-an-offline-repo
…t-environment-one-big-pr
…repo Ziv (PR review, round 2) caught that the offline-resilience gate can false-PASS in the full-suite run. The "healthy" scope pointed at policy-repo-0000 (list_seeded_repos(1)[0]), but on-disk clones are keyed by URL-hash and survive compose restart/stop/start (opal_server mounts no volume at /opal; only `down -v` wipes them). test_boot/test_leak run first (alphabetical) and already clone every seeded repo, so the healthy scope hit the existing clone via _discover_repository, skipped _clone(), and served 200 without ever touching the saturated fetch executor — the gate that must FAIL on this branch (no PR3 timeout) passed. Fix: seed a reserved repo (policy-repo-healthy-probe) outside the numeric policy-repo-NNNN range that no boot/leak test enumerates, and point the healthy probe at it. A never-cloned repo forces a genuine fresh clone through the starved pool, so the gate fails correctly. The seed- completeness check in conftest now covers the reserved repo too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…view) Follow-up to PR review of the git-leak/resilience test bed: - hard_reset: always restart opal_server + wait_healthy via a finally, so a failed redis FLUSHALL can't leave the server stopped (which would fail every later session-scoped test and, running in a test finally, mask the result). - delete_all_scopes drain: a transient /internal read error no longer counts as a successful drain (was `except: return`); keep polling to the deadline so a not-yet-drained cache can't leak into the next test once PR2 lands. - use stats(samples=1) for the zero-waiting drain/empty polls (the peak-merge only matters for load assertions; this also drops 3x HTTP per poll). - resilience: narrow broad `except Exception` to requests.RequestException (and RuntimeError for wait_healthy timeout) so harness bugs surface instead of masquerading as "never served"/"never recovered". - resilience: collapse `assert opal.stats()` + redundant re-read into one read. - debug_stats_test: assert rss_kb > 0 on Linux (was `>= 0`, which passed for the wrong reason where /proc is absent and RSS reads fall back to 0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tuck-on-an-offline-repo
…t-environment-one-big-pr
… repo handles Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture the deleted scope's source before scanning siblings (fixing the loop-variable shadowing bug where the clone path was computed from the last-iterated scope). Drop the cached pygit2.Repository handle via forget_repo and pop repos_last_fetched when no sibling shares the source. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…terating) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ce-never-stuck-on-an-offline-repo' into worktree-opal-improvements-pr4
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…per-source lock Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
✅ Deploy Preview for opal-docs canceled.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR4 — Parallel Scope Loading (fix the ~20-minute boot)
Turns the serial scope-loading loop into a bounded-concurrency loop so a fleet of repositories loads in parallel (≈10 at a time by default), cutting the ~20-minute boot to a couple of minutes — while keeping per-source serialization and adding a delete-vs-fetch lock to stay correct under concurrency.
Part of the opal-server-git-fixes project (PER-15158). Depends on PR2 (#923) and PR3 (#924).
PR4 commits
feat(opal-server): add SCOPES_SYNC_CONCURRENCY config (default 10)refactor(opal-server): expose stable per-source lock accessorfix(opal-server): serialize scope delete against in-flight fetch via per-source lockfeat(opal-server): bounded-concurrency scope sync (parallel boot)What changed
sync_scopesrewritten into a two-pass, bounded-concurrency loop via a new_sync_scope_batchhelper:asyncio.Semaphore(max(1, N))+asyncio.gather, with a per-itemtry/exceptso one failing scope never sinks the batch. The two passes and the per-source dedup (oneforce_fetchper distinctsource_id; shared sources only re-check locally) are preserved exactly.OPAL_SCOPES_SYNC_CONCURRENCY(default10) bounds in-flight fetches. Higher = faster boot but more memory / file descriptors / concurrent git connections. Operators with many pods × many sources can tune it down.GitPolicyFetcher.source_lock(source_id)— a stable, process-shared per-sourceasyncio.Lockaccessor;_get_repo_locknow delegates to it (single lock-creation path).delete_scopenow holdssource_lock(deleted_source_id)around thermtree+ cache-purge block, so a delete cannot race an in-flight fetch of the same source. Sibling-scan reads stay above the lock;_scopes.deletestays after it.Correctness properties (all covered by tests)
Semaphore+gather, Python-3.9-safe — noTaskGroup);max(1, …)guards against aSemaphore(0)deadlock.Tests
packages/opal-server/opal_server/tests/sync_concurrency_test.py:test_sync_concurrency_default— config default is 10test_source_lock_is_stable_per_source— samesource_id→ same lock objecttest_delete_waits_for_in_flight_fetch— delete blocks while the per-source lock is held, proceeds on releasetest_sync_scopes_runs_in_parallel_bounded— peak concurrency respects the cap and exceeds 1 (real parallelism)test_sync_scopes_isolates_one_failure— a mid-flight failure doesn't stop the batch or cancel siblings; assertspeak > 1so it exercises the concurrent path, not a serial fallbackFull opal-server suite: 77 passed.
black/isort/docformatterclean.Task 5 (the docker-based boot/resilience regression gate under
app-tests/git-leak) is not run here; the in-process concurrency + isolation unit tests cover the same logic.🤖 Generated with Claude Code