Skip to content

fix: Windows worktree creation falls back gracefully instead of using temp dirs#440

Merged
PureWeen merged 3 commits intomainfrom
fix/windows-worktree-fallback
Apr 21, 2026
Merged

fix: Windows worktree creation falls back gracefully instead of using temp dirs#440
PureWeen merged 3 commits intomainfrom
fix/windows-worktree-fallback

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

On Windows, "Implement & Challenge" multi-agent groups end up with sessions pointing to temp directories instead of proper git worktrees. This works fine on Mac.

Root Cause

CreateGroupFromPresetAsync has try/catch blocks around worktree creation that silently swallow exceptions (only Debug log). When worktree creation fails on Windows (long paths, file locking), orchWorkDir stays null, CreateSessionAsync gets null workingDirectory, and creates a temp dir.

Additionally, GroupShared strategy was falling into the wrong code path, creating -orchestrator- branches instead of -shared- branches.

Fix

  1. RepoManager.cs: Set core.longpaths on new worktrees for Windows
  2. CopilotService.Organization.cs:
    • Handle GroupShared strategy explicitly with -shared- branch naming
    • Fall back to existing worktree (TryGetExistingWorktreePath) when creation fails
    • Better error logging with exception type
  3. WorktreeStrategyTests.cs: 4 new regression tests for fallback behavior

Testing

  • All 38 WorktreeStrategy tests pass (including 4 new ones)
  • Tests cover: GroupShared failure fallback, fallback with existing workingDirectory, FullyIsolated failure fallback, and shared branch naming

@PureWeen PureWeen force-pushed the fix/windows-worktree-fallback branch from 931d56b to 68aa1e9 Compare March 26, 2026 20:47
@PureWeen
Copy link
Copy Markdown
Owner Author

PR #440 Review — R1

Tests: ✅ 2970/2970 passed
Verdict: ⚠️ Request Changes

The core approach is correct — GroupShared now properly creates -shared- branches and falls back gracefully instead of using temp dirs. However, there are two substantive issues and a few improvements worth making before merge.


🔴 Major: TryGetExistingWorktreePath picks an arbitrary worktree

FirstOrDefault(w => w.RepoId == repoId) returns the first worktree in insertion order with no preference logic. If a user has multiple worktrees for the same repo (e.g., one from a previous group, one from manual worktree creation), this could return a worktree that is:

  • Already being used by another active group (shared mutation risk)
  • On a stale or dirty branch inappropriate for new work
  • At a path that no longer exists on disk (the path is stored in state but WorktreeInfo.Path isn't validated)
private string? TryGetExistingWorktreePath(string repoId, ref string? worktreeId, SessionGroup group)
{
    var existing = _repoManager.Worktrees.FirstOrDefault(w => w.RepoId == repoId); // arbitrary

Recommended fix: At minimum, validate that existing.Path exists on disk before returning it. Consider adding a comment that this is intentionally read-only (no CreatedWorktreeIds.Add), and/or prefer worktrees not already claimed by an active group.


🔴 Major: Corrupted bare repo deletion failure leads to fetch on corrupted data

In EnsureRepoCloneInCurrentRootAsync, the corrupted repo is deleted with a silent catch { }. On Windows, file locks (antivirus, git index readers) can cause this delete to fail silently. The code then falls through to Directory.Exists(targetBarePath) — which is still true — and calls git fetch on the corrupted directory, which will fail with a confusing git error rather than a diagnostic about the lock:

if (Directory.Exists(targetBarePath) && !IsValidBareRepository(targetBarePath))
{
    Console.WriteLine($"[RepoManager] Bare clone corrupted (missing HEAD/refs), removing: {targetBarePath}");
    try { Directory.Delete(targetBarePath, recursive: true); } catch { }  // ← silent failure
}

if (Directory.Exists(targetBarePath))
{
    // ← still reaches here with corrupt data on Windows file-lock failure
    onProgress?.Invoke($"Fetching {repo.Id}…");

Recommended fix: After the delete, check !IsValidBareRepository(targetBarePath) again and throw with a clear message if the directory is still corrupted/locked:

catch (Exception ex)
{
    Console.WriteLine($"[RepoManager] Cannot remove corrupted bare clone at {targetBarePath}: {ex.Message}");
    throw new InvalidOperationException(
        $"Bare clone at {targetBarePath} is corrupted and locked. Close any tools accessing it and retry.", ex);
}

🟡 Minor: core.longpaths gap for existing bare clones

The core.longpaths fix is applied:

  • ✅ To new bare clones (after git clone --bare)
  • ✅ To worktrees (after git worktree add, line ~763)
  • Missing for existing bare clones that are reused (the Directory.Exists(targetBarePath) branch just fetches)

Since git worktree add uses the bare clone's config, users with a pre-existing bare clone will still get long-path failures on worktree add. The worktree-level config set after add helps subsequent operations but not the add itself.

Recommended fix: Apply core.longpaths unconditionally when the bare clone is reused, not only at clone time:

if (Directory.Exists(targetBarePath))
{
    onProgress?.Invoke($"Fetching {repo.Id}…");
    await RunGitWithProgressAsync(...);
    if (OperatingSystem.IsWindows())
        try { await RunGitAsync(targetBarePath, ct, "config", "core.longpaths", "true"); } catch { }
}

🟡 Minor: GroupShared success path silently overrides provided workingDirectory

The block condition is string.IsNullOrEmpty(worktreeId) — it doesn't check workingDirectory. So if a caller passes workingDirectory="/some/path" + no worktreeId, worktree creation succeeds, and orchWorkDir is overwritten with the new worktree path, silently discarding the caller's directory. The failure path correctly preserves workingDirectory (due to if (orchWorkDir == null) guard). This asymmetry is intentional per the comment, but there's no test for the success-with-workingDirectory case, and the behavior could surprise callers.

Recommended fix: Add a test for the workingDirectory + creation-succeeds path and clarify in the comment that GroupShared always creates its own branch regardless of workingDirectory.


🟡 Minor: Worker worktree failure has no explicit fallback in logs

When a worker's CreateWorktreeAsync throws (the FullyIsolated worker loop), the catch only logs "Failed to create worker-N worktree" without indicating the actual fallback path. Workers implicitly fall back to orchWorkDir at line ~3410, which is correct, but the log makes it look like a complete failure:

catch (Exception ex)
{
    Debug($"[WorktreeStrategy] Failed to create worker-{i + 1} worktree: {ex.GetType().Name}: {ex.Message}");
    // no mention of orchWorkDir fallback
}

Recommended: Add Debug($"[WorktreeStrategy] Worker-{i+1} will fall back to orchestrator dir: {orchWorkDir}") so diagnostics are clear.


🟡 Minor: IsValidBareRepository is a heuristic, not a full validity check

Checking HEAD + refs/ misses objects/ — a repo missing its object store would pass validation but fail on any git operation. Also, a partially-cloned repo with HEAD + refs/ but no objects/ or config would be falsely accepted.

Recommended: Add Directory.Exists(Path.Combine(barePath, "objects")) to the check, and/or add a comment that this is intentionally a lightweight heuristic for the specific corruption pattern (only objects/ survived).


🟡 Minor: Reflection in test stubs is fragile and triplicated

FailingRepoManager, FakeRepoManager, and FailingRepoManagerWithExistingWorktree all set private _state/_loaded via reflection. If these fields are renamed, the failure is at runtime (NullReferenceException) not compile time.

Recommended: Add an internal constructor or internal static RepoManager CreateForTesting(RepositoryState state) factory with [assembly: InternalsVisibleTo("PolyPilot.Tests")] to eliminate the reflection pattern. Minor improvement but worth doing given how many tests depend on it.


Nit: Missing blank line in RepoManagerTests.cs

    private static void CreateFakeBareRepoSkeleton(string bareDir)
    {
        ...
    }
    [Theory]   // ← needs a blank line before [Theory]

The GroupShared strategy fix and -shared- branch naming correction are exactly right. The main blockers are the corrupted-delete silent failure path and the arbitrary-worktree selection in the fallback. The core.longpaths gap for existing users also seems worth addressing since that's the primary Windows use case.

@PureWeen
Copy link
Copy Markdown
Owner Author

🔄 Re-Review Round 2

No new commits since R1 (still 1 commit, last updated 2026-03-26). The R1 findings remain open.

R1 Findings Status

Finding Status
🔴 C1 — TryGetExistingWorktreePath returns arbitrary worktree (no path validation, no preference logic) ⏳ Still present
🔴 C2 — Silent delete failure on locked corrupted bare clone → fetch on corrupt data ⏳ Still present
🟡 M1 — core.longpaths not applied to existing bare clones on reuse ⏳ Still present
🟡 M2 — GroupShared silently overrides workingDirectory on creation success (no test) ⏳ Still present
🟡 M3 — Worker fallback-to-orchWorkDir not logged ⏳ Still present
🟡 M4 — IsValidBareRepository missing objects/ check ⏳ Still present
🟡 M5 — Reflection in test stubs is fragile and triplicated ⏳ Still present

⚠️ Still requesting changes

When the two 🔴 issues (C1 + C2) are addressed, happy to re-review. The core feature (GroupShared with -shared- branch naming + graceful fallback) is solid.

@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 3, 2026

🔍 Multi-Model Code Review — PR #440 (Re-Review R5, Final)

Branch: fix/windows-worktree-fallbackmain
Commits: 3 (rebased on main + fixes from R4 review + critical session-metadata fix)
CI: No checks configured on this branch


Previous Findings Status

# Severity Finding Status
1 🔴 CRITICAL TryGetExistingWorktreePath set group.WorktreeId on borrowed worktree → DeleteGroup destroys it ✅ FIXED (R4)
2 🔴 CRITICAL Borrowed worktree ID leaks through session metadata (orchWtIdorchMeta.WorktreeId) → DeleteGroup still destroys it ✅ FIXED (R5)
3 🟡 MODERATE No disk validation (Directory.Exists) on fallback worktree ✅ FIXED
4 🟡 MODERATE FullyIsolated test did not verify WorkingDirectory ✅ FIXED
5 🟢 MINOR Worker fallback-to-orchWorkDir not logged ✅ FIXED
6 🟢 MINOR Missing blank line in RepoManagerTests.cs ✅ FIXED
7 🟢 MINOR Reflection in test stubs fragile ➖ N/A (pre-existing pattern, unchanged)

R5 Fix: Session Metadata Worktree ID Leak (3/3 consensus)

All 3 reviewers independently identified that R4's fix was incomplete:

  • R4 protected group.WorktreeId and group.CreatedWorktreeIds
  • But the borrowed ID still leaked via ref worktreeIdorchWtId → session metadata
  • DeleteGroup collects from session WorktreeId too (line 1101-1102), bypassing the R4 protection

Fix: Removed the ref string? worktreeId parameter from TryGetExistingWorktreePath entirely. The method now returns only the path. orchWtId stays null on fallback paths, so the downstream guards (if (orchWtId != null)) skip writing borrowed IDs onto session metadata.

Test coverage: Both GroupShared and FullyIsolated fallback tests now assert Assert.Null(m.WorktreeId) on all session metadata in the group — verifying the borrowed ID does not leak to any level (group or session).

Test Results

3510/3510 tests pass (including 38 worktree strategy tests)

Recommendation

Approve and merge — all findings addressed, critical data-loss bug fully fixed with behavioral test coverage at both group and session metadata levels.

… temp dirs

Rebased on main — dropped RepoManager.cs changes (core.longpaths,
IsValidBareRepository, corruption cleanup) that were already merged.

Remaining changes:
- CopilotService.Organization.cs: fallback to existing worktree
  (TryGetExistingWorktreePath) when worktree creation fails
- WorktreeStrategyTests.cs: 4 regression tests for fallback behavior
- RepoManagerTests.cs: CreateFakeBareRepoSkeleton helper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/windows-worktree-fallback branch from 68aa1e9 to 3a53014 Compare April 21, 2026 01:12
PureWeen and others added 2 commits April 20, 2026 23:19
… add disk validation

🔴 CRITICAL: TryGetExistingWorktreePath no longer sets group.WorktreeId.
Borrowed worktrees are not owned — DeleteGroup must not destroy them.
Only orchWtId is set (for session metadata), group.WorktreeId stays null.

🟡 MODERATE: Added Directory.Exists check before returning a fallback
worktree path. Stale/deleted worktrees are skipped.

🟡 MODERATE: FullyIsolated test now asserts WorkingDirectory for all
sessions (matching the GroupShared test pattern).

🟢 MINOR: Worker fallback logs which directory it falls back to.
🟢 MINOR: Tests use real temp dirs (disk validation works in tests).
🟢 MINOR: Missing blank line in RepoManagerTests.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TryGetExistingWorktreePath now returns only the path — the ref worktreeId
parameter is removed entirely. This prevents the borrowed ID from propagating
to orchWtId → session metadata (WorktreeId), which DeleteGroup collects
from all sessions in a group and deletes. Without this fix, deleting a group
that used a borrowed worktree would destroy a worktree owned by another group.

All 3 reviewers flagged this as a remaining critical issue after the previous
fix (which only protected group.WorktreeId but not session-level metadata).

Tests updated to assert session metadata WorktreeId is null for borrowed
worktrees, not just group.WorktreeId.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 5a791ba into main Apr 21, 2026
@PureWeen PureWeen deleted the fix/windows-worktree-fallback branch April 21, 2026 04:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant