Skip to content

feat(opal-server): parallel scope loading — bounded-concurrency boot (PR4)#932

Open
Zivxx wants to merge 31 commits into
masterfrom
ziv/per-15158-pr4-parallel-scope-loading
Open

feat(opal-server): parallel scope loading — bounded-concurrency boot (PR4)#932
Zivxx wants to merge 31 commits into
masterfrom
ziv/per-15158-pr4-parallel-scope-loading

Conversation

@Zivxx

@Zivxx Zivxx commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

PR4 — Parallel Scope Loading (fix the ~20-minute boot)

Turns the serial scope-loading loop into a bounded-concurrency loop so a fleet of repositories loads in parallel (≈10 at a time by default), cutting the ~20-minute boot to a couple of minutes — while keeping per-source serialization and adding a delete-vs-fetch lock to stay correct under concurrency.

Part of the opal-server-git-fixes project (PER-15158). Depends on PR2 (#923) and PR3 (#924).

⚠️ Base = master for visibility. PR4 has a diamond dependency (it needs both PR2 and PR3, which are separate branches), so this PR can't cleanly stack on a single base. For now it bundles PR1 (#922) + PR2 (#923) + PR3 (#924) + PR4 so the team can see it's ready. Once PR2/PR3 land on master, this will be rebased down to a clean 4-commit diff (or reopened fresh). Review only the 4 PR4 commits below.

PR4 commits

  • feat(opal-server): add SCOPES_SYNC_CONCURRENCY config (default 10)
  • refactor(opal-server): expose stable per-source lock accessor
  • fix(opal-server): serialize scope delete against in-flight fetch via per-source lock
  • feat(opal-server): bounded-concurrency scope sync (parallel boot)

What changed

  • sync_scopes rewritten into a two-pass, bounded-concurrency loop via a new _sync_scope_batch helper: asyncio.Semaphore(max(1, N)) + asyncio.gather, with a per-item try/except so one failing scope never sinks the batch. The two passes and the per-source dedup (one force_fetch per distinct source_id; shared sources only re-check locally) are preserved exactly.
  • OPAL_SCOPES_SYNC_CONCURRENCY (default 10) bounds in-flight fetches. Higher = faster boot but more memory / file descriptors / concurrent git connections. Operators with many pods × many sources can tune it down.
  • GitPolicyFetcher.source_lock(source_id) — a stable, process-shared per-source asyncio.Lock accessor; _get_repo_lock now delegates to it (single lock-creation path).
  • delete_scope now holds source_lock(deleted_source_id) around the rmtree + cache-purge block, so a delete cannot race an in-flight fetch of the same source. Sibling-scan reads stay above the lock; _scopes.delete stays after it.

Correctness properties (all covered by tests)

  • Distinct sources run in parallel; same source still serializes via the shared per-source lock.
  • Per-source dedup preserved — no redundant re-clones.
  • Failure isolation — a mid-flight failure neither stops the batch nor cancels concurrently-running siblings.
  • Bounded concurrency (Semaphore + gather, Python-3.9-safe — no TaskGroup); max(1, …) guards against a Semaphore(0) deadlock.
  • delete/fetch race closed by the shared per-source lock.

Tests

packages/opal-server/opal_server/tests/sync_concurrency_test.py:

  • test_sync_concurrency_default — config default is 10
  • test_source_lock_is_stable_per_source — same source_id → same lock object
  • test_delete_waits_for_in_flight_fetch — delete blocks while the per-source lock is held, proceeds on release
  • test_sync_scopes_runs_in_parallel_bounded — peak concurrency respects the cap and exceeds 1 (real parallelism)
  • test_sync_scopes_isolates_one_failure — a mid-flight failure doesn't stop the batch or cancel siblings; asserts peak > 1 so it exercises the concurrent path, not a serial fallback

Full opal-server suite: 77 passed. black / isort / docformatter clean.

Task 5 (the docker-based boot/resilience regression gate under app-tests/git-leak) is not run here; the in-process concurrency + isolation unit tests cover the same logic.

🤖 Generated with Claude Code

dshoen619 and others added 30 commits June 23, 2026 12:24
Add an off-by-default diagnostics endpoint so tests can observe the
in-memory GitPolicyFetcher cache sizes (repo_locks/repos/repos_last_fetched)
and process RSS that the upcoming memory-leak fix eliminates.

- debug_stats.py: read-only git_fetcher_cache_stats() helper + a
  register_internal_stats_route() registrar that mounts
  GET /internal/git-fetcher-cache-stats only when enabled.
- config.py: new OPAL_DEBUG_INTERNAL_STATS flag, default False.
- server.py: register the route, gated by the flag, beside /healthcheck.

No production behavior change when the flag is off (the default). Also
ignore .claude/ so private planning artifacts are never committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A self-contained docker-compose stack (opal-server x2 workers + Redis +
Postgres broadcaster + Gitea) plus a pytest harness that reproduces, as
tests that fail on master, the git-fetcher memory leak, the offline-repo
hang, the slow serial boot, and the broadcaster-disconnect gap. These
become the regression gates for the follow-up fixes.

- seed/: idempotent Gitea seeding sidecar (N policy repos) + Dockerfile.
- docker-compose.yml: 4-service stack, opal-server built from the repo's
  own docker/Dockerfile (server target), scopes on, Postgres broadcaster.
- helpers.py / conftest.py: HTTP + infra helpers and stack fixtures.
- test_leak.py / test_resilience.py / test_boot.py: the flagship tests.
- README.md: how to run and expected fail-on-master behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Complete the helpers.py surface promised in the plan's file-structure
table. Both are now functional and used, not dead code:

- make_repo_unreachable(name): returns a git URL on a routable-but-dead
  TEST-NET-1 host (RFC 5737). test_offline_repo now uses it instead of an
  inlined literal.
- GiteaAdmin: host-side Gitea admin client (list_repos / repo_exists /
  create_repo / delete_repo), exposed as the `gitea_admin` pytest fixture
  for tests that need to inspect or stage repos beyond the seed sidecar.

Gitea is published on host port 13000 (uncommon, to avoid the usual :3000
clash) so GiteaAdmin can reach it; opal_server and the seed sidecar still
use the internal http://gitea:3000. README updated with the helper and
port notes.

Verified live: GiteaAdmin lists the seeded repos and round-trips
create/exists/delete against Gitea over the published port.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verified test_server_recovers_after_postgres_bounce against the stack: it
PASSES on master (~14-19s). On a broadcaster drop the affected worker
triggers a graceful shutdown, gunicorn respawns it, and the sibling worker
keeps serving HTTP, so the surface recovers within the window — recovery
happens via gunicorn's in-container worker supervision, not an external
supervisor and not an in-process reconnect.

Reframe #5 as a recovery guard (not a known-broken case) in the docstring
and README; the prior "FAILS on master / needs external supervisor" wording
was wrong. PER-15065's in-process reconnect would avoid the worker churn but
recovery already holds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Run the repo's pinned pre-commit formatters (black 23.1.0, isort 5.12.0,
docformatter 1.7.5) over the PR1 files to satisfy the pre-commit CI check.
Formatting only — no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CI `build` job runs `pytest` from the repo root with no path, which
recursed into app-tests/git-leak/ and ran the flagship tests — these are
designed to FAIL on master (they are the regression gates for PR2-PR5), so
they broke the build job.

Set `testpaths = packages` so the rootdir run collects only the unit tests
under packages/ (matching master's effective behavior, since app-tests/ had
no pytest files before). testpaths only applies when pytest is invoked from
the rootdir with no args, so `cd app-tests/git-leak && pytest` still collects
and runs the test bed. Verified both: root run -> packages only; subdir run
-> all 5 flagship tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- stats(): sample the /internal endpoint several times and merge per key
  with max(). The caches are per-process on a multi-worker server, so a
  single read can hit an empty (non-leader) worker — max-merge avoids both
  false negatives (missing a populated leader) and false positives (an
  `== 0` drain assertion passing only because it hit an empty worker).
- test_leak: assert the initial-load `_wait_until` succeeded before deleting
  / before taking baseline, so the tests can't pass vacuously when load
  never completed.
- refresh_all(): correct the misleading comment — a 404 is a no-op, there
  is no client-side fallback.
- conftest: skip the suite cleanly if docker is unavailable (defense in
  depth; it's already excluded from the default pytest run via testpaths).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two review findings on the regression gates:

- Cross-test contamination: clone paths are keyed by repo URL (source_id =
  sha256(url)+branch-shard), not scope_id, so scopes from different tests that
  point at the same seeded repo share one GitPolicyFetcher cache entry. With a
  session-scoped stack and no teardown, a leftover boot-*/stable-* scope kept
  those entries alive and would make test_churn's `repos == 0` drain assertion
  fail on fixed code. OpalServerClient now tracks created scopes and the opal
  fixture deletes them on teardown (best-effort drain wait, swallows errors so
  master — where delete never purges — doesn't fail the passing test).

- False gate: test_repeat_sync_does_not_grow re-syncs identical scopes, which a
  path-keyed cache can't grow even on master, so it could never be the leak gate
  it claimed. Reframed as an honest idempotency guard (passes on master) that
  points at test_churn_releases_caches as the real leak gate; README's
  "Expected on master" reclassifies it alongside the postgres-bounce guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire scope clone/fetch through run_in_git_executor with
SCOPES_GIT_FETCH_TIMEOUT, and broaden the _clone except to catch
asyncio.TimeoutError so a hung clone is logged and the scope skipped
instead of crashing the caller. Drop the now-unused run_sync import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iew)

Addresses the CHANGES_REQUESTED review on PR #922. Root cause behind most
findings: a fresh scope's first sync takes the _clone() branch, which only
fills GitPolicyFetcher.repo_locks; repos/repos_last_fetched are filled on a
*second* sync. So the load gates on `repos` hung, and the 2-worker per-process
caches made `== 0` drain assertions unsafe.

Blocking:
- Single worker (UVICORN_NUM_WORKERS=1): deterministic per-process cache reads;
  removes the false `== 0` drain class.
- Load gate (CRITICAL + Zivxx HIGH): _load_scopes gates on repo_locks then
  refresh_all() to force the second sync, so repos/repos_last_fetched are
  actually populated before any drain/purge assertion.
- compose() surfaces captured stdout/stderr on failure.
- Seed completeness asserted in conftest; seed script isolates per-repo
  failures and exits non-zero with a count.

Secondary:
- test_repeat_sync asserts an RSS bound (count can't grow for any impl);
  churn asserts all three caches drain + a loose RSS backstop.
- blackhole socat sidecar replaces TEST-NET-1 (deterministic hang); offline
  test saturates the fetch executor with 40 hung clones and recovers via
  OpalServerClient.hard_reset() (stop -> redis FLUSHALL -> start).
- Per-test clean slate deletes all server scopes (fixes orphan-scope leak).
- Postgres-bounce proves broadcast recovery (PUT post-bounce, assert sync)
  and uses `up -d --wait`.
- Remove dead 404 branch in refresh_all; boot clock starts at restart;
  gitea-admin `|| true` -> "already exists"-only guard; README reworded.

opal-server:
- /internal stats route now takes the JWTAuthenticator dependency (protected
  when JWT on, no-op in the test bed); unit test asserts enforcement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- snapshot a single /internal stats read per poll in the churn-drain and
  boot gates (consistent multi-key observation; fewer HTTP round-trips)
- document why a 200 from the healthy scope can't be a masked default
  bundle, and why the stats route is intentionally a sync def
- pin alpine/socat and the seed image's pip deps for reproducibility

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On Python < 3.11 asyncio.TimeoutError is a distinct class from the
builtin TimeoutError, so run_in_git_executor's wait_for timeout was not
caught by `pytest.raises(TimeoutError)` — failing build (3.9)/(3.10).
Normalize to the builtin TimeoutError so the documented contract holds
on every supported Python, and update the _clone catch site to match.

Also apply black/isort/docformatter formatting to satisfy pre-commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- run_in_git_executor: use asyncio.get_running_loop() instead of the
  deprecated get_event_loop() inside an async function
- fetch_and_notify_on_changes: set repos_last_fetched only after a
  successful fetch so a timeout/error does not wrongly suppress a later
  force_fetch via _was_fetched_after
- fetch_timeout_test: measure elapsed time with time.monotonic()

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fetch path let TimeoutError propagate to sync_scope's catch-all,
which logged a full traceback at ERROR level for the expected
unreachable-repo case — inconsistent with the clone path's quiet
logger.error. Catch TimeoutError at the fetch site and log without a
traceback, then skip (repos_last_fetched stays stale so the next cycle
retries). Also shorten the hanging-thread sleeps in the timeout tests so
the lingering pool thread doesn't delay process teardown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…repo

Ziv (PR review, round 2) caught that the offline-resilience gate can
false-PASS in the full-suite run. The "healthy" scope pointed at
policy-repo-0000 (list_seeded_repos(1)[0]), but on-disk clones are keyed
by URL-hash and survive compose restart/stop/start (opal_server mounts no
volume at /opal; only `down -v` wipes them). test_boot/test_leak run first
(alphabetical) and already clone every seeded repo, so the healthy scope
hit the existing clone via _discover_repository, skipped _clone(), and
served 200 without ever touching the saturated fetch executor — the gate
that must FAIL on this branch (no PR3 timeout) passed.

Fix: seed a reserved repo (policy-repo-healthy-probe) outside the numeric
policy-repo-NNNN range that no boot/leak test enumerates, and point the
healthy probe at it. A never-cloned repo forces a genuine fresh clone
through the starved pool, so the gate fails correctly. The seed-
completeness check in conftest now covers the reserved repo too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…view)

Follow-up to PR review of the git-leak/resilience test bed:

- hard_reset: always restart opal_server + wait_healthy via a finally, so a
  failed redis FLUSHALL can't leave the server stopped (which would fail every
  later session-scoped test and, running in a test finally, mask the result).
- delete_all_scopes drain: a transient /internal read error no longer counts as
  a successful drain (was `except: return`); keep polling to the deadline so a
  not-yet-drained cache can't leak into the next test once PR2 lands.
- use stats(samples=1) for the zero-waiting drain/empty polls (the peak-merge
  only matters for load assertions; this also drops 3x HTTP per poll).
- resilience: narrow broad `except Exception` to requests.RequestException
  (and RuntimeError for wait_healthy timeout) so harness bugs surface instead of
  masquerading as "never served"/"never recovered".
- resilience: collapse `assert opal.stats()` + redundant re-read into one read.
- debug_stats_test: assert rss_kb > 0 on Linux (was `>= 0`, which passed for the
  wrong reason where /proc is absent and RSS reads fall back to 0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… repo handles

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture the deleted scope's source before scanning siblings (fixing the
loop-variable shadowing bug where the clone path was computed from the
last-iterated scope). Drop the cached pygit2.Repository handle via
forget_repo and pop repos_last_fetched when no sibling shares the source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…terating)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ce-never-stuck-on-an-offline-repo' into worktree-opal-improvements-pr4
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…per-source lock

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@linear-code

linear-code Bot commented Jul 1, 2026

Copy link
Copy Markdown

PER-15158

@netlify

netlify Bot commented Jul 1, 2026

Copy link
Copy Markdown

Deploy Preview for opal-docs canceled.

Name Link
🔨 Latest commit a20489a
🔍 Latest deploy log https://app.netlify.com/projects/opal-docs/deploys/6a44638c113df300085fc2ea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants