Skip to content

fix(opal-server): git resilience — never stuck on an offline repo (PR3)#924

Open
dshoen619 wants to merge 8 commits into
masterfrom
david/per-15157-pr3-git-resilience-never-stuck-on-an-offline-repo
Open

fix(opal-server): git resilience — never stuck on an offline repo (PR3)#924
dshoen619 wants to merge 8 commits into
masterfrom
david/per-15157-pr3-git-resilience-never-stuck-on-an-offline-repo

Conversation

@dshoen619

Copy link
Copy Markdown

PR3 — Git Resilience: never stuck on an offline repo

Closes PER-15157.

Problem

Scope git clone/fetch use pygit2 with no timeout, dispatched through run_sync(...) onto the shared default executor with no time limit on the await:

  • clone_repository(...) in GitPolicyFetcher._clone
  • repo.remotes[self._remote].fetch(...) in fetch_and_notify_on_changes

POLICY_REPO_CLONE_TIMEOUT is wired only to the legacy non-scopes path, so it does not apply in scopes mode. The consequences:

  • Boot (gunicorn_confpreload_scopes()asyncio.run(sync_scopes(...))) blocks indefinitely on a single unreachable repo.
  • Normal operation: a hung op holds the per-source lock and a shared-pool thread. Enough hung ops starve bundle serving and other server work that shares the default executor.

Fix

A hard per-operation timeout plus an isolated, bounded thread pool so a hung repo fails fast, is skipped, and never starves anything else.

  • New module-level dedicated ThreadPoolExecutor (opal-git threads, lazily built) for scope git work, isolated from the default executor.
  • New helper run_in_git_executor(func, *args, timeout=..., **kwargs) that runs the blocking call on that pool and wraps the await in asyncio.wait_for. timeout <= 0 means no limit.
  • Clone + fetch now go through the helper with SCOPES_GIT_FETCH_TIMEOUT.
  • _clone error handling broadened to (pygit2.GitError, asyncio.TimeoutError) — a hung clone is logged and the scope skipped instead of crashing the caller. Per-scope isolation in ScopesService.sync_scope (try/except Exception) already handles a timed-out fetch, making boot best-effort once operations can no longer hang forever.
  • Dropped the now-unused run_sync import from git_fetcher.py.

New config keys (opal-server, server-only)

Env var Type Default Purpose
OPAL_SCOPES_GIT_FETCH_TIMEOUT float (s) 120.0 Hard timeout for a single scope git clone/fetch. On timeout the op is abandoned and the scope marked failed, so one unreachable repo can never block boot or other scopes. 0 = no timeout.
OPAL_SCOPES_GIT_MAX_WORKERS int 10 Size of the dedicated thread pool for scope git operations; isolates git work from the default executor.

⚠️ Caveat — the timeout is soft, not a hard kill

asyncio.wait_for cancels the await — unblocking the event loop and the awaiting coroutine — but the underlying pygit2 call keeps running on its pool thread until the OS network timeout. The dedicated bounded pool (OPAL_SCOPES_GIT_MAX_WORKERS) isolates those lingering threads so they cannot affect bundle serving or other scopes. Hard-kill via subprocess is explicitly out of scope (spec §6).

Tests

  • tests/git_executor_test.py — config defaults; helper returns value; helper times out; timeout=0 means no limit.
  • tests/fetch_timeout_test.py — a hanging op surfaces TimeoutError and wait_for unblocks promptly (< 2s with a 0.2s timeout).

Local run (Python 3.12 venv, editable installs of all three packages):

  • New tests: 5 passed
  • Full opal-server suite: 14 passed
  • Full opal-common suite: 64 passed (warnings are pre-existing deprecations)

Not runnable in this PR's CI yet

Task 4 of the plan includes the PR1 regression gate app-tests/git-leak/test_resilience.py::test_offline_repo_does_not_block_healthy_scopes and a live /healthcheck smoke test. Both depend on PR1 (test environment) being merged — the stated prerequisite — and a running stack, so they should be exercised once PR1 lands. This PR ships after PR2 and is independent of it.

Review checklist (opal-development)

  • Config keys: server-only consumption (correct component); bare env names (no OPAL_ double-prefix); mandatory description= present (covered by test_config.py); exactly one declaration per env name (no collision).
  • PDP impact: server-only keys; no opal_client/opal_common symbol touched; no PDP override of these names.
  • Pub/sub: no topic, broadcaster, or fan-out change.
  • No new dependency — stdlib concurrent.futures + asyncio only.
  • Interlock with PR4: PR4's parallel sync_scopes relies on this PR's guarantee that each fetch terminates, so its bounded gather can never wait on an infinite member.

🤖 Generated with Claude Code

dshoen619 and others added 3 commits June 23, 2026 14:19
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire scope clone/fetch through run_in_git_executor with
SCOPES_GIT_FETCH_TIMEOUT, and broaden the _clone except to catch
asyncio.TimeoutError so a hung clone is logged and the scope skipped
instead of crashing the caller. Drop the now-unused run_sync import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@linear-code

linear-code Bot commented Jun 23, 2026

Copy link
Copy Markdown

PER-15157

@netlify

netlify Bot commented Jun 23, 2026

Copy link
Copy Markdown

Deploy Preview for opal-docs canceled.

Name Link
🔨 Latest commit c353ceb
🔍 Latest deploy log https://app.netlify.com/projects/opal-docs/deploys/6a437e031090a70008fd0b52

@dshoen619 dshoen619 marked this pull request as draft June 23, 2026 11:35
@dshoen619 dshoen619 requested a review from Copilot June 24, 2026 16:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves opal-server resilience when syncing scope policy repos by ensuring git clone/fetch operations can’t block indefinitely or starve the server’s shared executor.

Changes:

  • Add a dedicated, bounded ThreadPoolExecutor and run_in_git_executor(...) helper to run blocking pygit2 operations with an asyncio.wait_for timeout.
  • Apply the helper + new timeout config to scope repo clone and fetch paths.
  • Add server config keys for timeout and executor sizing, plus focused unit tests for timeout behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/opal-server/opal_server/git_fetcher.py Introduces dedicated git executor + timeout helper; routes scope clone/fetch through it.
packages/opal-server/opal_server/config.py Adds SCOPES_GIT_FETCH_TIMEOUT and SCOPES_GIT_MAX_WORKERS configuration.
packages/opal-server/opal_server/tests/git_executor_test.py Tests config defaults and run_in_git_executor basic behavior.
packages/opal-server/opal_server/tests/fetch_timeout_test.py Tests that a hanging git op times out quickly (doesn’t block).
.claude/plans/docs/05-config-reference.md Internal config reference entry for the new env vars and caveat.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +65 to +66
loop = asyncio.get_event_loop()
fut = loop.run_in_executor(_get_git_executor(), partial(func, *args, **kwargs))

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — switched to asyncio.get_running_loop() in 5dd53a2.

Comment on lines 225 to 229
GitPolicyFetcher.repos_last_fetched[
self._source_id
] = datetime.datetime.now()
await run_sync(
await run_in_git_executor(
repo.remotes[self._remote].fetch,

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5dd53a2repos_last_fetched is now updated only after the fetch completes successfully, so a timeout/error leaves it stale and won't suppress a later force_fetch via _was_fetched_after.

Comment thread packages/opal-server/opal_server/tests/fetch_timeout_test.py Outdated
dshoen619 and others added 2 commits June 24, 2026 20:09
On Python < 3.11 asyncio.TimeoutError is a distinct class from the
builtin TimeoutError, so run_in_git_executor's wait_for timeout was not
caught by `pytest.raises(TimeoutError)` — failing build (3.9)/(3.10).
Normalize to the builtin TimeoutError so the documented contract holds
on every supported Python, and update the _clone catch site to match.

Also apply black/isort/docformatter formatting to satisfy pre-commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- run_in_git_executor: use asyncio.get_running_loop() instead of the
  deprecated get_event_loop() inside an async function
- fetch_and_notify_on_changes: set repos_last_fetched only after a
  successful fetch so a timeout/error does not wrongly suppress a later
  force_fetch via _was_fetched_after
- fetch_timeout_test: measure elapsed time with time.monotonic()

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619 dshoen619 marked this pull request as ready for review June 24, 2026 17:15
@dshoen619 dshoen619 self-assigned this Jun 24, 2026
The fetch path let TimeoutError propagate to sync_scope's catch-all,
which logged a full traceback at ERROR level for the expected
unreachable-repo case — inconsistent with the clone path's quiet
logger.error. Catch TimeoutError at the fetch site and log without a
traceback, then skip (repos_last_fetched stays stale so the next cycle
retries). Also shorten the hanging-thread sleeps in the timeout tests so
the lingering pool thread doesn't delay process teardown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dshoen619 dshoen619 requested review from Zivxx and zeevmoney June 24, 2026 17:23
@zeevmoney

Copy link
Copy Markdown
Contributor

Review notes — overlap + gate location (planned with the opal-development skill: references/add-config-key.md, references/debug-pubsub.md)

Faithful to the PR3 plan, with two improvements over it: normalizing asyncio.TimeoutError → builtin TimeoutError (so callers catch the builtin on 3.9/3.10 where they're distinct), and moving repos_last_fetched to after a successful fetch — a timed-out fetch no longer falsely marks the source "fresh," and it's lock-safe because _should_fetch runs inside the per-source repo_lock. The two config keys follow references/add-config-key.md (server-only OpalServerConfig, bare names, mandatory descriptions, no double-prefix).

Three things to resolve:

  1. Direct overlap with Fix git clone/fetch hanging indefinitely on unreachable repos #875 (PER-13817). Same root cause — pygit2 clone_repository / remotes.fetch going through run_sync with no timeout — and the same edited blocks in git_fetcher.py (_clone and fetch_and_notify_on_changes). They cannot both merge; whichever lands second will conflict. This PR is the stronger of the two:

    Recommend closing Fix git clone/fetch hanging indefinitely on unreachable repos #875 in favor of this PR.

  2. Its regression gate lives in PR1 (test(opal-server): git leak/resilience test environment (PR1) #922). test_offline_repo_does_not_block_healthy_scopes is the fail-now/pass-after gate for this fix and only exists on test(opal-server): git leak/resilience test environment (PR1) #922. So this can't be validated end-to-end until test(opal-server): git leak/resilience test environment (PR1) #922 merges (or is merged into this branch). Suggested order: test(opal-server): git leak/resilience test environment (PR1) #922 → this.

  3. This PR is currently check-blocked by branch protection (required checks / review), not by a merge conflict — needs CI green + an approval.

Minor:

  • The dedicated pool reads SCOPES_GIT_MAX_WORKERS once, lazily, and caches the executor for the process lifetime — it isn't runtime-reconfigurable and is never shut down. Matches the plan's design; worth a one-line note in the docstring.
  • The line numbers cited in .claude/plans/docs/05-config-reference.md for the two keys are stale vs where they actually land on this branch (cosmetic).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants