Skip to content

fix(opal-server): memory leak — purge GitPolicyFetcher caches on scope delete + webhook task cleanup (PR2)#923

Open
Zivxx wants to merge 3 commits into
david/per-15155-pr1-git-leakresilience-test-environment-one-big-prfrom
ziv/per-15156-pr2-memory-leak-fix-gitpolicyfetcher-caches-webhook-task
Open

fix(opal-server): memory leak — purge GitPolicyFetcher caches on scope delete + webhook task cleanup (PR2)#923
Zivxx wants to merge 3 commits into
david/per-15155-pr1-git-leakresilience-test-environment-one-big-prfrom
ziv/per-15156-pr2-memory-leak-fix-gitpolicyfetcher-caches-webhook-task

Conversation

@Zivxx

@Zivxx Zivxx commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

PR2 — Memory leak fix (GitPolicyFetcher caches + webhook task list)

Linear: PER-15156

Stacked on PR1 (PER-15155, branch david/per-15155-…). This PR targets PR1's branch so the diff shows only PR2's changes; PR1's leak tests are the regression gate this PR turns green. Re-base onto master once PR1 merges.

What

Stops OPAL-server memory growth by releasing the leaked pygit2.Repository handle when a scope is deleted, and fixes the webhook task-list cleanup bug. Serial only — no concurrency introduced (the delete/fetch race is deferred to PR4). repo_locks deliberately left untouched (lock-identity hazard, deferred).

Changes

  • GitPolicyFetcher.forget_repo(path) (git_fetcher.py) — pops repos[path] and calls Repository.free() when available (guarded; falls back to GC), releasing OS fds + mmapped pack indexes.
  • ScopesService.delete_scope (scopes/service.py) — captures the deleted scope's source before scanning siblings (fixes a latent loop-variable shadowing bug that computed the clone path from the wrong scope), then removes the clone dir + forget_repo + pops repos_last_fetched when no sibling shares the source.
  • BasePolicyWatcherTask._on_webhook (policy/watcher/task.py) — replaces remove-while-iterating (which skipped elements, leaking finished tasks) with a list rebuild.

Review-driven fix (beyond the original plan)

The clone-deletion gate originally used url_shared, but clones are keyed strictly per-source_id (base_dir/git_sources/<source_id>). With OPAL_SCOPES_REPO_CLONES_SHARDS > 1, same-url/different-branch scopes resolve to different clone dirs — so url_shared would skip deleting a clone that no sibling actually uses, leaking the exact handle this PR targets. Both gates now use source_id_shared (strictly stronger; url_shared was redundant), with a dedicated SHARDS>1 test.

Tests

New unit tests (TDD): forget_repo_test.py (3), delete_scope_cache_purge_test.py (3, incl. the shards>1 divergence case), webhook_task_cleanup_test.py (1, asserts both purge and that the new trigger task is scheduled).

Verification

  • opal-server unit suite: 20 passed
  • PR1 leak gate in a real docker stack (20 scopes, ~11 min): test_churn_releases_cachestest_repeat_sync_does_not_grow
  • pygit2 in env: 1.14.1, Repository.free present
  • black/isort/docformatter (pinned) clean; zero opal_common/opal_client changes; no repo_locks mutation; no config key added

🤖 Generated with Claude Code

Zivxx and others added 3 commits June 23, 2026 12:50
… repo handles

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture the deleted scope's source before scanning siblings (fixing the
loop-variable shadowing bug where the clone path was computed from the
last-iterated scope). Drop the cached pygit2.Repository handle via
forget_repo and pop repos_last_fetched when no sibling shares the source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…terating)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@linear-code

linear-code Bot commented Jun 23, 2026

Copy link
Copy Markdown

PER-15156

@Zivxx Zivxx marked this pull request as ready for review June 28, 2026 14:03
@zeevmoney

Copy link
Copy Markdown
Contributor

Review notes (planned with the opal-development skill)

Faithful to the plan, and the source_id_shared gate (instead of the plan's url_shared) is a genuine improvement: with SCOPES_REPO_CLONES_SHARDS > 1, same-URL/different-branch siblings get different source_ids and therefore different clone dirs, so gating clone deletion on the URL would have skipped deleting the deleted scope's own clone — leaking exactly what this PR fixes. The added shards=4 test covers it, and the loop-variable shadowing bug is fixed by capturing the deleted source before the sibling scan. repo_locks correctly left untouched.

Three small notes (none blocking):

  1. Its only effective gate in PR1 (test(opal-server): git leak/resilience test environment (PR1) #922) is test_churn_releases_caches. The sibling test_repeat_sync_does_not_grow is non-gating — caches are source_id-keyed, so a re-sync of identical scopes can't grow the counts for any implementation; it's fine as a loose RSS guard but shouldn't be read as proof of this fix.
  2. forget_repo's except / logger.debug GC-fallback branch is untested — i.e. the path where Repository.free() itself raises. A one-line test (a fake repo whose free() raises, asserting the entry is still popped and no exception escapes) would close it.
  3. Tracking note: deferring the repo_locks purge is the right call (the lock-identity hazard is real), but it does leave a tiny, bounded residual leak — one asyncio.Lock per distinct source ever seen. Worth keeping the structural-cleanup follow-up on the board so it isn't forgotten.

Depends on #922 for its end-to-end gate; suggested merge order is #922 → this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants