fix(opal-server): git resilience — never stuck on an offline repo (PR3)#924
dshoen619 wants to merge 8 commits into
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire scope clone/fetch through run_in_git_executor with SCOPES_GIT_FETCH_TIMEOUT, and broaden the _clone except to catch asyncio.TimeoutError so a hung clone is logged and the scope skipped instead of crashing the caller. Drop the now-unused run_sync import. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
✅ Deploy Preview for opal-docs canceled.
Pull request overview
This PR improves opal-server resilience when syncing scope policy repos by ensuring git clone/fetch operations can’t block indefinitely or starve the server’s shared executor.
Changes:
- Add a dedicated, bounded `ThreadPoolExecutor` and `run_in_git_executor(...)` helper to run blocking pygit2 operations with an `asyncio.wait_for` timeout.
- Apply the helper + new timeout config to scope repo clone and fetch paths.
- Add server config keys for timeout and executor sizing, plus focused unit tests for timeout behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/opal-server/opal_server/git_fetcher.py | Introduces dedicated git executor + timeout helper; routes scope clone/fetch through it. |
| packages/opal-server/opal_server/config.py | Adds SCOPES_GIT_FETCH_TIMEOUT and SCOPES_GIT_MAX_WORKERS configuration. |
| packages/opal-server/opal_server/tests/git_executor_test.py | Tests config defaults and run_in_git_executor basic behavior. |
| packages/opal-server/opal_server/tests/fetch_timeout_test.py | Tests that a hanging git op times out quickly (doesn’t block). |
| .claude/plans/docs/05-config-reference.md | Internal config reference entry for the new env vars and caveat. |
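For illustration only, the two new keys could be read like this. The sketch uses plain `os.environ` as a stand-in and is not how `config.py` declares them; the `load_scopes_git_config` function and the returned dict are hypothetical, and the defaults (`120.0` and `10`) are taken from the PR description.

```python
import os


def load_scopes_git_config(env=None):
    """Read the two scope-git settings from an environment mapping."""
    env = os.environ if env is None else env
    # Timeout for a single clone/fetch; 0 disables the limit.
    timeout = float(env.get("OPAL_SCOPES_GIT_FETCH_TIMEOUT", "120.0"))
    # Size of the dedicated git thread pool.
    max_workers = int(env.get("OPAL_SCOPES_GIT_MAX_WORKERS", "10"))
    if max_workers < 1:
        raise ValueError("OPAL_SCOPES_GIT_MAX_WORKERS must be at least 1")
    return {"timeout": timeout, "max_workers": max_workers}


defaults = load_scopes_git_config({})
overridden = load_scopes_git_config(
    {"OPAL_SCOPES_GIT_FETCH_TIMEOUT": "0", "OPAL_SCOPES_GIT_MAX_WORKERS": "4"}
)
```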
| loop = asyncio.get_event_loop() | ||
| fut = loop.run_in_executor(_get_git_executor(), partial(func, *args, **kwargs)) |
Good catch — switched to asyncio.get_running_loop() in 5dd53a2.
| GitPolicyFetcher.repos_last_fetched[ | ||
| self._source_id | ||
| ] = datetime.datetime.now() | ||
| await run_sync( | ||
| await run_in_git_executor( | ||
| repo.remotes[self._remote].fetch, |
Fixed in 5dd53a2 — repos_last_fetched is now updated only after the fetch completes successfully, so a timeout/error leaves it stale and won't suppress a later force_fetch via _was_fetched_after.
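The ordering matters because the timestamp acts as a "recently fetched" marker. A minimal sketch of stamping only on success, assuming a coroutine stands in for the pygit2 fetch; `fetch_then_stamp` and the module-level dict are hypothetical names for the example:

```python
import asyncio
import datetime

repos_last_fetched = {}  # source id -> time of the last successful fetch


async def fetch_then_stamp(source_id, fetch, timeout):
    """Stamp the source only after the fetch finishes within `timeout` seconds."""
    await asyncio.wait_for(fetch(), timeout=timeout)
    # Reached only on success: a timeout or error leaves the old value in place.
    repos_last_fetched[source_id] = datetime.datetime.now()


async def _demo():
    async def hung():
        await asyncio.sleep(1)

    async def healthy():
        return None

    try:
        await fetch_then_stamp("offline", hung, timeout=0.05)
    except asyncio.TimeoutError:
        pass
    await fetch_then_stamp("healthy", healthy, timeout=0.05)
    return sorted(repos_last_fetched)


stamped = asyncio.run(_demo())
```

Had the stamp been written before the await, the offline source would look freshly fetched and a later forced fetch could be skipped.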
On Python < 3.11 asyncio.TimeoutError is a distinct class from the builtin TimeoutError, so run_in_git_executor's wait_for timeout was not caught by `pytest.raises(TimeoutError)` — failing build (3.9)/(3.10). Normalize to the builtin TimeoutError so the documented contract holds on every supported Python, and update the _clone catch site to match. Also apply black/isort/docformatter formatting to satisfy pre-commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
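The normalization described in this commit can be sketched as below. `wait_with_builtin_timeout` is a hypothetical name for the example; the actual change lives inside `run_in_git_executor`.

```python
import asyncio


async def wait_with_builtin_timeout(awaitable, timeout):
    """Await with a time limit and always surface the builtin TimeoutError."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except asyncio.TimeoutError as exc:
        # Before Python 3.11 asyncio.TimeoutError is a separate class, so
        # re-raise as the builtin to give callers one exception type to catch.
        raise TimeoutError(f"operation exceeded {timeout}s") from exc


async def _demo():
    try:
        await wait_with_builtin_timeout(asyncio.sleep(1), timeout=0.05)
    except TimeoutError as exc:
        return type(exc) is TimeoutError
    return False


caught_builtin = asyncio.run(_demo())
```

On Python 3.11 and later the two names refer to the same class, so the re-raise is harmless there.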
- run_in_git_executor: use asyncio.get_running_loop() instead of the deprecated get_event_loop() inside an async function
- fetch_and_notify_on_changes: set repos_last_fetched only after a successful fetch so a timeout/error does not wrongly suppress a later force_fetch via _was_fetched_after
- fetch_timeout_test: measure elapsed time with time.monotonic()

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fetch path let TimeoutError propagate to sync_scope's catch-all, which logged a full traceback at ERROR level for the expected unreachable-repo case — inconsistent with the clone path's quiet logger.error. Catch TimeoutError at the fetch site and log without a traceback, then skip (repos_last_fetched stays stale so the next cycle retries). Also shorten the hanging-thread sleeps in the timeout tests so the lingering pool thread doesn't delay process teardown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
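The lingering pool thread mentioned here is easy to reproduce. The following sketch is not the PR's test; `timeout_leaves_thread_running` and its parameters are made up for the example, and `time.sleep` stands in for a hung pygit2 call. It measures elapsed time with `time.monotonic()` and shows that the await returns promptly while the worker thread keeps running until its blocking call ends.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


async def timeout_leaves_thread_running(hang_seconds=0.5, timeout=0.05):
    pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="opal-git")
    finished = threading.Event()

    def hanging_op():
        time.sleep(hang_seconds)  # stand-in for a blocked network call
        finished.set()

    loop = asyncio.get_running_loop()
    start = time.monotonic()
    try:
        await asyncio.wait_for(loop.run_in_executor(pool, hanging_op), timeout=timeout)
    except asyncio.TimeoutError:
        pass
    elapsed = time.monotonic() - start
    still_running = not finished.is_set()  # the thread was not interrupted
    # The worker cannot be killed, only waited out; wait off the event loop.
    await asyncio.to_thread(pool.shutdown)
    return elapsed, still_running, finished.is_set()


elapsed, still_running, finished_after_shutdown = asyncio.run(
    timeout_leaves_thread_running()
)
```

Because the pool's worker threads are joined at interpreter exit, a long sleep in such a test delays teardown by the remaining sleep time, which is why short sleeps are preferable.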
…tuck-on-an-offline-repo
Review notes — overlap + gate location (planned with the
PR3 — Git Resilience: never stuck on an offline repo
Closes PER-15157.
Problem
Scope git clone/fetch use pygit2 with no timeout, dispatched through `run_sync(...)` onto the shared default executor with no time limit on the await:
- `clone_repository(...)` in `GitPolicyFetcher._clone`
- `repo.remotes[self._remote].fetch(...)` in `fetch_and_notify_on_changes`

`POLICY_REPO_CLONE_TIMEOUT` is wired only to the legacy non-scopes path, so it does not apply in scopes mode. The consequences:
- Boot (`gunicorn_conf` → `preload_scopes()` → `asyncio.run(sync_scopes(...))`) blocks indefinitely on a single unreachable repo.

Fix
A hard per-operation timeout plus an isolated, bounded thread pool so a hung repo fails fast, is skipped, and never starves anything else.
- A dedicated, bounded `ThreadPoolExecutor` (`opal-git` threads, lazily built) for scope git work, isolated from the default executor.
- A helper `run_in_git_executor(func, *args, timeout=..., **kwargs)` that runs the blocking call on that pool and wraps the await in `asyncio.wait_for`. `timeout <= 0` means no limit.
- Scope clone and fetch are wired through the helper with `SCOPES_GIT_FETCH_TIMEOUT`.
- `_clone` error handling broadened to `(pygit2.GitError, asyncio.TimeoutError)`: a hung clone is logged and the scope skipped instead of crashing the caller. Per-scope isolation in `ScopesService.sync_scope` (`try/except Exception`) already handles a timed-out fetch, making boot best-effort once operations can no longer hang forever.
- Dropped the now-unused `run_sync` import from `git_fetcher.py`.

New config keys (opal-server, server-only)
| Env var | Default | Notes |
|---|---|---|
| `OPAL_SCOPES_GIT_FETCH_TIMEOUT` | `120.0` | Per-operation timeout for scope git clone/fetch. `0` = no timeout. |
| `OPAL_SCOPES_GIT_MAX_WORKERS` | `10` | Size of the dedicated git thread pool. |

`asyncio.wait_for` cancels the await, which unblocks the event loop and the awaiting coroutine, but the underlying pygit2 call keeps running on its pool thread until the OS network timeout. The dedicated bounded pool (`OPAL_SCOPES_GIT_MAX_WORKERS`) isolates those lingering threads so they cannot affect bundle serving or other scopes. Hard-kill via subprocess is explicitly out of scope (spec §6).

Tests
- `tests/git_executor_test.py`: config defaults; helper returns value; helper times out; `timeout=0` means no limit.
- `tests/fetch_timeout_test.py`: a hanging op surfaces `TimeoutError` and `wait_for` unblocks promptly (< 2s with a 0.2s timeout).

Local run (Python 3.12 venv, editable installs of all three packages):
Not runnable in this PR's CI yet
Task 4 of the plan includes the PR1 regression gate `app-tests/git-leak/test_resilience.py::test_offline_repo_does_not_block_healthy_scopes` and a live `/healthcheck` smoke test. Both depend on PR1 (test environment) being merged, which is the stated prerequisite, and on a running stack, so they should be exercised once PR1 lands. This PR ships after PR2 and is independent of it.

Review checklist (opal-development)
- … `OPAL_` double-prefix); mandatory `description=` present (covered by `test_config.py`); exactly one declaration per env name (no collision).
- … `opal_client`/`opal_common` symbol touched; no PDP override of these names.
- … `concurrent.futures` + `asyncio` only.
- … `sync_scopes` relies on this PR's guarantee that each fetch terminates, so its bounded `gather` can never wait on an infinite member.

🤖 Generated with Claude Code