Skip to content

Handle failing push better#1193

Open
Soph wants to merge 9 commits into
mainfrom
soph/handle-failing-push-better
Open

Handle failing push better#1193
Soph wants to merge 9 commits into
mainfrom
soph/handle-failing-push-better

Conversation

@Soph
Copy link
Copy Markdown
Collaborator

@Soph Soph commented May 12, 2026

https://entire.io/gh/entireio/cli/trails/365

Make a stuck entire/checkpoints/v1 push diagnosable and bounded

A colleague's git push hung with an unbounded stream of dots on the pre-push hook's [entire] Pushing entire/checkpoints/v1 to origin... line. Only Ctrl-C broke it, and the resulting Warning: couldn't sync ...: fetch failed: had nothing after the colon — the doctor bundle showed no push events at all.

Root cause

exec.CommandContext's default cancellation only SIGKILLs the direct child. git push over HTTPS spawns git-remote-https + credential helpers that inherit our stdio pipes; killing git leaves them alive holding stdout, so cmd.CombinedOutput() blocks past tryPushSessionsCommon's 2-minute deadline. The dots are 1-Hz progress emitted in-process; with no cancellation, they tick forever.

Fixes

  • execx.KillOnCancel(cmd, waitDelay) — Unix uses Setpgid + Cancel killing the process group; Windows uses CREATE_NEW_PROCESS_GROUP + taskkill.exe /F /T /PID for descendant-tree termination. WaitDelay backstops pipe drain. newCommand in checkpoint/remote applies it to every git subprocess.
  • doPushBranch UX: returns distinct errPushTimedOut/errFetchTimedOut; ends the dot line with " timed out" instead of trailing silently; skips the sync→retry cascade on timeout (same network condition would just trip the fetch deadline); falls back to the wrapped fetch error when git's output is empty so no more bare fetch failed:.
  • INFO logging of every push attempt (branch, target alias, elapsed, classification — never raw URLs/tokens) so doctor bundles capture push history at the default log level.
  • Outer-ctx cancellation (Ctrl-C / hook deadline) bails silently at every stage of the cascade — first attempt, sync stage, and retry — instead of cascading into misleading warnings.

Tests

  • execx: pins the stdlib hang (TestExecCommandContext_HangsOnGrandchildHoldingPipe) and verifies KillOnCancel returns within WaitDelay (~200 ms vs 5 s).
  • strategy: covers timeout suffix, sync-cascade skip, outer-cancel bail, INFO log writes, sentinel propagation, classifyForLog, and the reportPushFailure chokepoint under -race.

Net effect for the colleague's symptom

Next time GitHub stalls a push, the 2-minute timeout actually fires, the dot line ends with timed out, a hint points at log_level=DEBUG, and the bundle records the attempt. No infinite dots, no Ctrl-C needed.


Note

Medium Risk
Touches git push/fetch execution and hook messaging, including new process-group kill behavior and timeout handling; while scoped to checkpoint syncing, it could change cancellation semantics across platforms and affect reliability if misconfigured.

Overview
Prevents checkpoint git push/fetch from hanging indefinitely on cancellation. All git subprocesses created in checkpoint/remote/newCommand are now configured with a new cross-platform execx.KillOnCancel helper that kills the entire process tree/process group and applies a bounded WaitDelay to unblock CombinedOutput when helpers keep stdio pipes open.

Improves checkpoint push UX and diagnosability. doPushBranch now logs push attempts at INFO with a sanitized class field, introduces explicit push/fetch timeout sentinels, prints clearer timeout messaging (including a DEBUG logging hint), skips the sync→retry cascade on push timeouts, bails quietly when the outer context is cancelled, and ensures fetch failures include a non-empty underlying error when git produced no output.

Adds focused regression tests. New unix-only tests pin the stdlib hang behavior and verify KillOnCancel wiring, plus strategy tests covering timeout messaging, cancellation behavior, logging, and error classification.

Reviewed by Cursor Bugbot for commit 8246e78. Configure here.

Soph and others added 5 commits May 12, 2026 17:10
git push/fetch over HTTPS spawns helpers (git-remote-https, credential
helpers) that inherit the parent's stdio pipes. The default
exec.CommandContext cancellation only SIGKILLs the direct child, so the
helpers keep stdout open and CombinedOutput blocks on Read past the
context deadline. tryPushSessionsCommon's 2-minute timeout was therefore
unbounded in practice — observed in the wild as a runaway dot stream on
[entire] Pushing entire/checkpoints/v1 to origin... that only Ctrl-C
could end.

Add execx.KillOnCancel(cmd, waitDelay) which sets Setpgid (or
CREATE_NEW_PROCESS_GROUP on Windows), a Cancel that SIGKILLs the whole
group, and WaitDelay as a pipe-drain backstop. newCommand in
checkpoint/remote applies it to every git subprocess routed through the
package.

Tests:
- TestExecCommandContext_HangsOnGrandchildHoldingPipe pins the stdlib
  behavior that motivates the fix (5s hang vs 200ms deadline).
- TestKillOnCancel_TerminatesGrandchildHoldingPipe verifies the helper
  returns within WaitDelay in the same scenario.
- TestNewCommand_AppliesKillOnCancel confirms the wiring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 24d37e721664
A colleague's stuck push surfaced four gaps in how doPushBranch
communicates failure. Fix each, plus add tests that pin the new
behaviour.

1. fetchAndRebaseSessionsCommon now returns errFetchTimedOut when its
   inner deadline fires (distinct from server errors) and falls back to
   the wrapped fetch error when git's output is empty — so users no
   longer see a bare "Warning: couldn't sync ...: fetch failed:" with
   nothing after the colon.

2. tryPushSessionsCommon emits errPushTimedOut on its inner deadline.
   doPushBranch pairs this with " timed out" as the dot-line suffix, so
   the user sees the failure mode instead of the dots silently trailing
   off into a "Syncing" line that looks like normal progression.

3. Every push attempt logs at INFO with branch, target (display alias —
   no URLs/tokens), elapsed, and a coarse class field. Doctor bundles
   now record push history even at default log level; previously the
   only push trace was a DEBUG line that nobody had enabled.

4. doPushBranch short-circuits in two cases that previously cascaded
   confusingly:
   - outer ctx cancelled (user Ctrl-C, hook timeout) → bail silently
     instead of running fetch+rebase that will also fail.
   - inner push timed out → skip the sync→retry cascade entirely
     (same network condition would just trip the fetch timeout) and
     print a hint pointing the user at log_level=DEBUG.

A pushStderr package writer replaces direct os.Stderr writes from
doPushBranch so tests can capture the new output without globally
swapping os.Stderr; the legacy captureStderr helper patches both vars
for backward compatibility with existing hint tests.

Tests in push_messaging_test.go cover each behavior change plus
classifyForLog. pushAttemptTimeout is now a var (not const) so tests
can shorten it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 6819da9e1737
Findings from /simplify cleanup pass:

- Extract reportPushFailure to dedupe the sync-failure and retry-failure
  tails of doPushBranch (suffix selection, INFO log, warning, hint).
  Drops ~15 lines and keeps the two branches in sync by construction.

- Replace classifyForLog's bare string literals with named constants
  (pushClassTimedOut, pushClassProtectedRef, pushClassNonFastForward,
  pushClassOther) so log consumers and tests share one source of truth.

- Fix aliasing smell in fetchAndRebaseSessionsCommon: drop the
  `localCtx, cancel := …; ctx = localCtx` half-rename in favour of the
  original `ctx, cancel := context.WithTimeout(…)` pattern.

- Rewrite pushStderr doc comment to describe the actual capture path
  (captureStderr) and document why the standalone hint helpers stay on
  os.Stderr — the function withPushStderr it referenced never existed.

- Strengthen TestFetchAndRebase_FetchFailure_IncludesUnderlyingError:
  assert on "fetch failed: " prefix + non-empty trimmed detail instead of
  just len(msg) > prefixLen, which would pass with any single character.

- Drop sleep 5 -> sleep 3 in the execx regression-pinning test. Keeps
  the 2s assertion gap, saves ~2s on every test run (execx package time
  dropped 5.2s -> 3.4s).

- Remove numbered // (1) // (2) // (4b) scaffolding from test docstrings
  and trim WHAT-comments ("Var (not const)…", "Try pushing again after
  rebase.", "Push failed — likely non-fast-forward…").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: bbc8a4322628
Two review findings against the previous two commits:

Finding 1: outer-context cancellation was only honored after the first
push attempt. doPushBranch checked ctx.Err() once before deciding to
sync, but the sync-failure and retry-failure tails unconditionally
flowed through reportPushFailure, so a Ctrl-C or hook deadline arriving
during fetch+rebase or during the retry push still produced misleading
"Warning: couldn't sync …" / "failed to push … after sync" output. Move
the ctx.Err() check into reportPushFailure itself so every cancellation
point in the cascade collapses to the same silent bail: empty dot
suffix, no log, no warning. New TestReportPushFailure_BailsOnOuter-
ContextCancel pins the contract.

Finding 2: KillOnCancel on Windows only terminated the direct child.
CREATE_NEW_PROCESS_GROUP isolates console signals but does not propagate
SIGKILL-equivalent semantics to descendants, and Process.Kill walks no
tree — git-remote-https and credential helpers survived the cancel and
kept the inherited stdio pipes open. Replace the direct kill with
taskkill.exe /F /T /PID which walks the descendant chain. Fall back to
Process.Kill if taskkill is unavailable; WaitDelay still bounds the
user-visible hang in that case. Cross-builds under GOOS=windows;
runtime test would require a Windows environment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: d14d590faf48
TestReportPushFailure_BailsOnOuterContextCancel previously called
t.Parallel() and then reassigned the package-global pushStderr, which
races with any non-parallel test that uses captureStderr (the shared
helper that also swaps pushStderr). Under the race detector this would
surface as a data race; in practice the t.Cleanup ordering could leak
a closed pipe into another test.

Add an `out io.Writer` field to reportPushFailureArgs and have
doPushBranch pass pushStderr at the two call sites. The test now
injects a local strings.Builder and never touches the global, so
t.Parallel() is sound. Verified with `go test -race`.

pushStderr itself remains as the writer used directly by doPushBranch
(dot lines, timed-out warning, hint). Plumbing it through to every
call site is a larger refactor and out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: c87f60db628c
Copilot AI review requested due to automatic review settings May 12, 2026 20:08
@Soph Soph marked this pull request as ready for review May 12, 2026 20:08
@Soph Soph requested a review from a team as a code owner May 12, 2026 20:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes checkpoint branch pushing (entire/checkpoints/v1) bounded and diagnosable by ensuring git subprocess cancellation reliably terminates descendant helper processes (preventing indefinite hangs), and by improving timeout/error messaging plus INFO-level logging so doctor bundles capture push attempts.

Changes:

  • Add execx.KillOnCancel to kill the full descendant process tree on context cancellation (Unix via process groups; Windows via taskkill /T), with WaitDelay as a pipe-drain backstop.
  • Improve doPushBranch push UX and control flow: distinct push/fetch timeout sentinels, clearer dot-line suffixes, skip sync→retry cascade on push timeout, and avoid empty fetch failed: messaging.
  • Add INFO logging for push attempts and add regression/unit tests covering the cancellation and messaging behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
cmd/entire/cli/strategy/push_messaging_test.go New tests covering timeout UX, cascade suppression, cancellation bail-out, INFO logging, and error classification.
cmd/entire/cli/strategy/push_common.go Adds timeout sentinels, improved messaging/control flow, INFO logs, and shared failure-reporting helper.
cmd/entire/cli/strategy/push_common_test.go Updates stderr capture helper to also redirect pushStderr for new output plumbing.
cmd/entire/cli/execx/killoncancel_windows.go Windows implementation of KillOnCancel (process-group isolation + taskkill /T).
cmd/entire/cli/execx/killoncancel_unix.go Unix implementation of KillOnCancel (Setpgid + kill process group).
cmd/entire/cli/execx/killoncancel_unix_test.go Unix tests pinning the stdlib hang and verifying KillOnCancel unblocks promptly.
cmd/entire/cli/checkpoint/remote/git.go Wires KillOnCancel into all git subprocesses created by newCommand.
cmd/entire/cli/checkpoint/remote/git_killoncancel_unix_test.go Verifies newCommand applies KillOnCancel wiring on Unix.
Comments suppressed due to low confidence (5)

cmd/entire/cli/execx/killoncancel_unix_test.go:59

  • Same portability issue as above: the test guards with LookPath("sh") but still invokes /bin/sh. Prefer using the LookPath result (or sh by name) so the test doesn’t depend on /bin/sh existing.
	if _, err := exec.LookPath("sh"); err != nil {
		t.Skip("sh not available")
	}

	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	cmd := exec.CommandContext(ctx, "/bin/sh", "-c", grandchildHoldingStdoutCmd)
	KillOnCancel(cmd, 500*time.Millisecond)

cmd/entire/cli/strategy/push_messaging_test.go:167

  • Same flakiness risk here: pushAttemptTimeout = 1*time.Millisecond may not reliably expire before remote.Push returns when pushing to a local bare repo. Use an immediate timeout (0 duration) or another deterministic approach so the test consistently exercises the errPushTimedOut path.
	withShortPushTimeout(t, 1*time.Millisecond)

	_, err := tryPushSessionsCommon(context.Background(), bareDir, paths.MetadataBranchName)
	require.Error(t, err)
	assert.ErrorIs(t, err, errPushTimedOut,
		"err must wrap errPushTimedOut; got %v (class=%q)", err, classifyForLog(err))
}

cmd/entire/cli/strategy/push_common.go:137

  • push attempt failed logs omit the target alias. Adding the same redacted/aliased target field used in push attempt start would make failure events diagnosable without relying on surrounding context.
	logging.Info(ctx, "push attempt failed",
		slog.String("branch", branchName),
		slog.Duration("elapsed", time.Since(pushStart)),
		slog.String("class", classifyForLog(err)),
	)

cmd/entire/cli/strategy/push_common.go:209

  • push retry completed logs omit the target alias, which makes it harder to interpret retries in aggregated logs. Consider logging the same redacted/aliased target field as the initial attempt.
	logging.Info(ctx, "push retry completed",
		slog.String("branch", branchName),
		slog.Duration("elapsed", time.Since(retryStart)),
		slog.Bool("up_to_date", retryResult.upToDate),
	)

cmd/entire/cli/strategy/push_common.go:252

  • reportPushFailure INFO logs include branch/elapsed/class but not the push target. Since reportPushFailureArgs already carries target, consider logging a redacted/aliased target value here as well to make sync/retry failures attributable in doctor bundles.
	logging.Info(ctx, a.logMsg,
		slog.String("branch", a.branch),
		slog.Duration("elapsed", time.Since(a.start)),
		slog.String("class", classifyForLog(a.err)),
	)

Comment on lines +27 to +35
if _, err := exec.LookPath("sh"); err != nil {
t.Skip("sh not available")
}

ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
defer cancel()

cmd := exec.CommandContext(ctx, "/bin/sh", "-c", grandchildHoldingStdoutCmd)

Comment on lines +58 to +65
// Force every push attempt to time out immediately. KillOnCancel +
// WaitDelay ensures the call returns within milliseconds.
withShortPushTimeout(t, 1*time.Millisecond)

restore := captureStderr(t)
err := doPushBranch(context.Background(), bareDir, paths.MetadataBranchName)
output := restore()

Comment on lines +124 to +128
logging.Info(ctx, "push attempt completed",
slog.String("branch", branchName),
slog.Duration("elapsed", time.Since(pushStart)),
slog.Bool("up_to_date", result.upToDate),
)
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 8246e78. Configure here.

slog.String("error", fetchErr.Error()),
slog.Duration("after", fetchAttemptTimeout),
)
return errFetchTimedOut
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fetch timeout check can't distinguish inner from outer deadline

Medium Severity

In fetchAndRebaseSessionsCommon, ctx is shadowed at line 538 by context.WithTimeout, so the new timeout check at line 572 (errors.Is(ctx.Err(), context.DeadlineExceeded)) cannot distinguish whether the inner fetch deadline or the outer context deadline fired. In contrast, tryPushSessionsCommon correctly avoids this by using localCtx and guarding with ctx.Err() == nil. If the outer context's deadline fires during the fetch (e.g., hook-level timeout shorter than fetchAttemptTimeout), this incorrectly returns errFetchTimedOut instead of letting the outer cancellation propagate naturally.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8246e78. Configure here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like we should fix it 😅

computermode
computermode previously approved these changes May 12, 2026
Soph added 4 commits May 13, 2026 14:00
In fetchAndRebaseSessionsCommon, the timeout-wrapped context shadowed the
outer ctx, so a hook-level deadline shorter than fetchAttemptTimeout was
misreported as errFetchTimedOut. Mirror tryPushSessionsCommon's pattern:
hold the wrapped context as localCtx and gate the sentinel on the outer
ctx still being alive.

Entire-Checkpoint: cce2e7fcfcd8
The tests guard with exec.LookPath("sh") but were invoking /bin/sh
directly, which would skip the guard's intent on systems where sh
lives outside /bin (some non-FHS layouts). Use the PATH-resolved
form so the guard's outcome actually governs the test.

Entire-Checkpoint: bd7c6a83418a
Forcing pushAttemptTimeout to 1ms still raced against a fast local-bare-repo
push that could complete before the deadline fired. Setting it to 0 makes
localCtx fire DeadlineExceeded synchronously, so the test always reaches
the errPushTimedOut path.

Entire-Checkpoint: faf57c42d164
Only "push attempt start" carried the target field; completion, failure,
retry, sync-failure, and retry-failure entries omitted it. Doctor bundles
collecting these had to infer the target from surrounding context, which
breaks when multiple remotes (origin + checkpoint remote) push in the same
hook invocation. Add target (redacted via displayPushTarget for URL targets)
to every push-related log line.

Entire-Checkpoint: 2c5cd80e2128
@Soph Soph force-pushed the soph/handle-failing-push-better branch from 3dcadb8 to 7d7276b Compare May 13, 2026 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants