feat(opal-server): parallel scope loading — bounded-concurrency boot (PR4) by Zivxx · Pull Request #932 · permitio/opal

Zivxx · 2026-07-01T00:47:06Z

PR4 — Parallel Scope Loading (fix the ~20-minute boot)

Turns the serial scope-loading loop into a bounded-concurrency loop so a fleet of repositories loads in parallel (≈10 at a time by default), cutting the ~20-minute boot to a couple of minutes — while keeping per-source serialization and adding a delete-vs-fetch lock to stay correct under concurrency.

Part of the opal-server-git-fixes project (PER-15158). Depends on PR2 (#923) and PR3 (#924).

⚠️ Base = master for visibility. PR4 has a diamond dependency (it needs both PR2 and PR3, which are separate branches), so this PR can't cleanly stack on a single base. For now it bundles PR1 (#922) + PR2 (#923) + PR3 (#924) + PR4 so the team can see it's ready. Once PR2/PR3 land on master, this will be rebased down to a clean 4-commit diff (or reopened fresh). Review only the 4 PR4 commits below.

PR4 commits

feat(opal-server): add SCOPES_SYNC_CONCURRENCY config (default 10)
refactor(opal-server): expose stable per-source lock accessor
fix(opal-server): serialize scope delete against in-flight fetch via per-source lock
feat(opal-server): bounded-concurrency scope sync (parallel boot)

What changed

sync_scopes rewritten into a two-pass, bounded-concurrency loop via a new _sync_scope_batch helper: asyncio.Semaphore(max(1, N)) + asyncio.gather, with a per-item try/except so one failing scope never sinks the batch. The two passes and the per-source dedup (one force_fetch per distinct source_id; shared sources only re-check locally) are preserved exactly.
OPAL_SCOPES_SYNC_CONCURRENCY (default 10) bounds in-flight fetches. Higher = faster boot but more memory / file descriptors / concurrent git connections. Operators with many pods × many sources can tune it down.
GitPolicyFetcher.source_lock(source_id) — a stable, process-shared per-source asyncio.Lock accessor; _get_repo_lock now delegates to it (single lock-creation path).
delete_scope now holds source_lock(deleted_source_id) around the rmtree + cache-purge block, so a delete cannot race an in-flight fetch of the same source. Sibling-scan reads stay above the lock; _scopes.delete stays after it.

Correctness properties (all covered by tests)

Distinct sources run in parallel; same source still serializes via the shared per-source lock.
Per-source dedup preserved — no redundant re-clones.
Failure isolation — a mid-flight failure neither stops the batch nor cancels concurrently-running siblings.
Bounded concurrency (Semaphore + gather, Python-3.9-safe — no TaskGroup); max(1, …) guards against a Semaphore(0) deadlock.
delete/fetch race closed by the shared per-source lock.

Tests

packages/opal-server/opal_server/tests/sync_concurrency_test.py:

test_sync_concurrency_default — config default is 10
test_source_lock_is_stable_per_source — same source_id → same lock object
test_delete_waits_for_in_flight_fetch — delete blocks while the per-source lock is held, proceeds on release
test_sync_scopes_runs_in_parallel_bounded — peak concurrency respects the cap and exceeds 1 (real parallelism)
test_sync_scopes_isolates_one_failure — a mid-flight failure doesn't stop the batch or cancel siblings; asserts peak > 1 so it exercises the concurrent path, not a serial fallback

Full opal-server suite: 77 passed. black / isort / docformatter clean.

Task 5 (the docker-based boot/resilience regression gate under app-tests/git-leak) is not run here; the in-process concurrency + isolation unit tests cover the same logic.

🤖 Generated with Claude Code

Add an off-by-default diagnostics endpoint so tests can observe the in-memory GitPolicyFetcher cache sizes (repo_locks/repos/repos_last_fetched) and process RSS that the upcoming memory-leak fix eliminates. - debug_stats.py: read-only git_fetcher_cache_stats() helper + a register_internal_stats_route() registrar that mounts GET /internal/git-fetcher-cache-stats only when enabled. - config.py: new OPAL_DEBUG_INTERNAL_STATS flag, default False. - server.py: register the route, gated by the flag, beside /healthcheck. No production behavior change when the flag is off (the default). Also ignore .claude/ so private planning artifacts are never committed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A self-contained docker-compose stack (opal-server x2 workers + Redis + Postgres broadcaster + Gitea) plus a pytest harness that reproduces, as tests that fail on master, the git-fetcher memory leak, the offline-repo hang, the slow serial boot, and the broadcaster-disconnect gap. These become the regression gates for the follow-up fixes. - seed/: idempotent Gitea seeding sidecar (N policy repos) + Dockerfile. - docker-compose.yml: 4-service stack, opal-server built from the repo's own docker/Dockerfile (server target), scopes on, Postgres broadcaster. - helpers.py / conftest.py: HTTP + infra helpers and stack fixtures. - test_leak.py / test_resilience.py / test_boot.py: the flagship tests. - README.md: how to run and expected fail-on-master behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Complete the helpers.py surface promised in the plan's file-structure table. Both are now functional and used, not dead code: - make_repo_unreachable(name): returns a git URL on a routable-but-dead TEST-NET-1 host (RFC 5737). test_offline_repo now uses it instead of an inlined literal. - GiteaAdmin: host-side Gitea admin client (list_repos / repo_exists / create_repo / delete_repo), exposed as the `gitea_admin` pytest fixture for tests that need to inspect or stage repos beyond the seed sidecar. Gitea is published on host port 13000 (uncommon, to avoid the usual :3000 clash) so GiteaAdmin can reach it; opal_server and the seed sidecar still use the internal http://gitea:3000. README updated with the helper and port notes. Verified live: GiteaAdmin lists the seeded repos and round-trips create/exists/delete against Gitea over the published port. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Verified test_server_recovers_after_postgres_bounce against the stack: it PASSES on master (~14-19s). On a broadcaster drop the affected worker triggers a graceful shutdown, gunicorn respawns it, and the sibling worker keeps serving HTTP, so the surface recovers within the window — recovery happens via gunicorn's in-container worker supervision, not an external supervisor and not an in-process reconnect. Reframe #5 as a recovery guard (not a known-broken case) in the docstring and README; the prior "FAILS on master / needs external supervisor" wording was wrong. PER-15065's in-process reconnect would avoid the worker churn but recovery already holds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Run the repo's pinned pre-commit formatters (black 23.1.0, isort 5.12.0, docformatter 1.7.5) over the PR1 files to satisfy the pre-commit CI check. Formatting only — no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The CI `build` job runs `pytest` from the repo root with no path, which recursed into app-tests/git-leak/ and ran the flagship tests — these are designed to FAIL on master (they are the regression gates for PR2-PR5), so they broke the build job. Set `testpaths = packages` so the rootdir run collects only the unit tests under packages/ (matching master's effective behavior, since app-tests/ had no pytest files before). testpaths only applies when pytest is invoked from the rootdir with no args, so `cd app-tests/git-leak && pytest` still collects and runs the test bed. Verified both: root run -> packages only; subdir run -> all 5 flagship tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- stats(): sample the /internal endpoint several times and merge per key with max(). The caches are per-process on a multi-worker server, so a single read can hit an empty (non-leader) worker — max-merge avoids both false negatives (missing a populated leader) and false positives (an `== 0` drain assertion passing only because it hit an empty worker). - test_leak: assert the initial-load `_wait_until` succeeded before deleting / before taking baseline, so the tests can't pass vacuously when load never completed. - refresh_all(): correct the misleading comment — a 404 is a no-op, there is no client-side fallback. - conftest: skip the suite cleanly if docker is unavailable (defense in depth; it's already excluded from the default pytest run via testpaths). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two review findings on the regression gates: - Cross-test contamination: clone paths are keyed by repo URL (source_id = sha256(url)+branch-shard), not scope_id, so scopes from different tests that point at the same seeded repo share one GitPolicyFetcher cache entry. With a session-scoped stack and no teardown, a leftover boot-*/stable-* scope kept those entries alive and would make test_churn's `repos == 0` drain assertion fail on fixed code. OpalServerClient now tracks created scopes and the opal fixture deletes them on teardown (best-effort drain wait, swallows errors so master — where delete never purges — doesn't fail the passing test). - False gate: test_repeat_sync_does_not_grow re-syncs identical scopes, which a path-keyed cache can't grow even on master, so it could never be the leak gate it claimed. Reframed as an honest idempotency guard (passes on master) that points at test_churn_releases_caches as the real leak gate; README's "Expected on master" reclassifies it alongside the postgres-bounce guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wire scope clone/fetch through run_in_git_executor with SCOPES_GIT_FETCH_TIMEOUT, and broaden the _clone except to catch asyncio.TimeoutError so a hung clone is logged and the scope skipped instead of crashing the caller. Drop the now-unused run_sync import. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…iew) Addresses the CHANGES_REQUESTED review on PR #922. Root cause behind most findings: a fresh scope's first sync takes the _clone() branch, which only fills GitPolicyFetcher.repo_locks; repos/repos_last_fetched are filled on a *second* sync. So the load gates on `repos` hung, and the 2-worker per-process caches made `== 0` drain assertions unsafe. Blocking: - Single worker (UVICORN_NUM_WORKERS=1): deterministic per-process cache reads; removes the false `== 0` drain class. - Load gate (CRITICAL + Zivxx HIGH): _load_scopes gates on repo_locks then refresh_all() to force the second sync, so repos/repos_last_fetched are actually populated before any drain/purge assertion. - compose() surfaces captured stdout/stderr on failure. - Seed completeness asserted in conftest; seed script isolates per-repo failures and exits non-zero with a count. Secondary: - test_repeat_sync asserts an RSS bound (count can't grow for any impl); churn asserts all three caches drain + a loose RSS backstop. - blackhole socat sidecar replaces TEST-NET-1 (deterministic hang); offline test saturates the fetch executor with 40 hung clones and recovers via OpalServerClient.hard_reset() (stop -> redis FLUSHALL -> start). - Per-test clean slate deletes all server scopes (fixes orphan-scope leak). - Postgres-bounce proves broadcast recovery (PUT post-bounce, assert sync) and uses `up -d --wait`. - Remove dead 404 branch in refresh_all; boot clock starts at restart; gitea-admin `|| true` -> "already exists"-only guard; README reworded. opal-server: - /internal stats route now takes the JWTAuthenticator dependency (protected when JWT on, no-op in the test bed); unit test asserts enforcement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- snapshot a single /internal stats read per poll in the churn-drain and boot gates (consistent multi-key observation; fewer HTTP round-trips) - document why a 200 from the healthy scope can't be a masked default bundle, and why the stats route is intentionally a sync def - pin alpine/socat and the seed image's pip deps for reproducibility Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

On Python < 3.11 asyncio.TimeoutError is a distinct class from the builtin TimeoutError, so run_in_git_executor's wait_for timeout was not caught by `pytest.raises(TimeoutError)` — failing build (3.9)/(3.10). Normalize to the builtin TimeoutError so the documented contract holds on every supported Python, and update the _clone catch site to match. Also apply black/isort/docformatter formatting to satisfy pre-commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- run_in_git_executor: use asyncio.get_running_loop() instead of the deprecated get_event_loop() inside an async function - fetch_and_notify_on_changes: set repos_last_fetched only after a successful fetch so a timeout/error does not wrongly suppress a later force_fetch via _was_fetched_after - fetch_timeout_test: measure elapsed time with time.monotonic() Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The fetch path let TimeoutError propagate to sync_scope's catch-all, which logged a full traceback at ERROR level for the expected unreachable-repo case — inconsistent with the clone path's quiet logger.error. Catch TimeoutError at the fetch site and log without a traceback, then skip (repos_last_fetched stays stale so the next cycle retries). Also shorten the hanging-thread sleeps in the timeout tests so the lingering pool thread doesn't delay process teardown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tuck-on-an-offline-repo

…t-environment-one-big-pr

…repo Ziv (PR review, round 2) caught that the offline-resilience gate can false-PASS in the full-suite run. The "healthy" scope pointed at policy-repo-0000 (list_seeded_repos(1)[0]), but on-disk clones are keyed by URL-hash and survive compose restart/stop/start (opal_server mounts no volume at /opal; only `down -v` wipes them). test_boot/test_leak run first (alphabetical) and already clone every seeded repo, so the healthy scope hit the existing clone via _discover_repository, skipped _clone(), and served 200 without ever touching the saturated fetch executor — the gate that must FAIL on this branch (no PR3 timeout) passed. Fix: seed a reserved repo (policy-repo-healthy-probe) outside the numeric policy-repo-NNNN range that no boot/leak test enumerates, and point the healthy probe at it. A never-cloned repo forces a genuine fresh clone through the starved pool, so the gate fails correctly. The seed- completeness check in conftest now covers the reserved repo too. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…view) Follow-up to PR review of the git-leak/resilience test bed: - hard_reset: always restart opal_server + wait_healthy via a finally, so a failed redis FLUSHALL can't leave the server stopped (which would fail every later session-scoped test and, running in a test finally, mask the result). - delete_all_scopes drain: a transient /internal read error no longer counts as a successful drain (was `except: return`); keep polling to the deadline so a not-yet-drained cache can't leak into the next test once PR2 lands. - use stats(samples=1) for the zero-waiting drain/empty polls (the peak-merge only matters for load assertions; this also drops 3x HTTP per poll). - resilience: narrow broad `except Exception` to requests.RequestException (and RuntimeError for wait_healthy timeout) so harness bugs surface instead of masquerading as "never served"/"never recovered". - resilience: collapse `assert opal.stats()` + redundant re-read into one read. - debug_stats_test: assert rss_kb > 0 on Linux (was `>= 0`, which passed for the wrong reason where /proc is absent and RSS reads fall back to 0). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tuck-on-an-offline-repo

…t-environment-one-big-pr

… repo handles Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Capture the deleted scope's source before scanning siblings (fixing the loop-variable shadowing bug where the clone path was computed from the last-iterated scope). Drop the cached pygit2.Repository handle via forget_repo and pop repos_last_fetched when no sibling shares the source. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…terating) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ce-never-stuck-on-an-offline-repo' into worktree-opal-improvements-pr4

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…per-source lock Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

linear-code · 2026-07-01T00:47:09Z

PER-15158

netlify · 2026-07-01T00:47:10Z

✅ Deploy Preview for opal-docs canceled.

Name	Link
🔨 Latest commit	`a20489a`
🔍 Latest deploy log	https://app.netlify.com/projects/opal-docs/deploys/6a44638c113df300085fc2ea

dshoen619 and others added 30 commits June 23, 2026 12:24

feat(opal-server): add scope git fetch timeout + dedicated pool config

245f80b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(opal-server): dedicated git thread pool with hard per-op timeout

6abbc2a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

style(git-leak): apply black/isort/docformatter (pre-commit)

f810db8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge branch 'master' into david/per-15157-pr3-git-resilience-never-s…

f778edc

…tuck-on-an-offline-repo

Merge branch 'master' into david/per-15155-pr1-git-leakresilience-tes…

d719c34

…t-environment-one-big-pr

Merge branch 'master' into david/per-15157-pr3-git-resilience-never-s…

c353ceb

…tuck-on-an-offline-repo

Merge branch 'master' into david/per-15155-pr1-git-leakresilience-tes…

a502f2e

…t-environment-one-big-pr

feat(opal-server): add GitPolicyFetcher.forget_repo to release cached…

ac29c2e

… repo handles Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(opal-server): clean all finished webhook tasks (no remove-while-i…

8db1b6b

…terating) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/david/per-15157-pr3-git-resilien…

0051383

…ce-never-stuck-on-an-offline-repo' into worktree-opal-improvements-pr4

feat(opal-server): add SCOPES_SYNC_CONCURRENCY config (default 10)

7a18ab4

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

refactor(opal-server): expose stable per-source lock accessor

7ce2a83

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(opal-server): serialize scope delete against in-flight fetch via …

170a221

…per-source lock Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(opal-server): bounded-concurrency scope sync (parallel boot)

a20489a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(opal-server): parallel scope loading — bounded-concurrency boot (PR4)#932

feat(opal-server): parallel scope loading — bounded-concurrency boot (PR4)#932
Zivxx wants to merge 31 commits into
masterfrom
ziv/per-15158-pr4-parallel-scope-loading

Zivxx commented Jul 1, 2026

Uh oh!

linear-code Bot commented Jul 1, 2026

Uh oh!

netlify Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Zivxx commented Jul 1, 2026

PR4 — Parallel Scope Loading (fix the ~20-minute boot)

PR4 commits

What changed

Correctness properties (all covered by tests)

Tests

Uh oh!

linear-code Bot commented Jul 1, 2026

Uh oh!

netlify Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for opal-docs canceled.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

netlify Bot commented Jul 1, 2026 •

edited

Loading