
fix(platform): bound Windows shell-outs so wire status/up/doctor can't hang (#284.1) #298

Merged
laulpogan merged 1 commit into main from fix/284-1-bounded-status-up-doctor-timeouts
Jun 15, 2026

Conversation

@laulpogan

Collaborator

Part 1 of #284 (Willard's Windows report).

Summary

wire status, wire up, and wire doctor hung indefinitely when the Windows process-enumeration shell-outs they rely on wedged. On Willard's box, 254 stale wire.exe processes under heavy WMI contention made Get-CimInstance Win32_Process legitimately slow; a corrupted CIM repository would hang it outright. With no timeout, every CLI surface that ran a process probe blocked forever.

Fix

A new crate::platform::run_with_timeout(cmd, dur):

  1. Spawn the child with piped stdout/stderr.
  2. Hand wait_with_output to a reader thread that pushes the result through an mpsc::channel.
  3. The main thread waits on the channel with recv_timeout.
  4. On timeout, kill the wedged child by PID via the OS-native tool — taskkill /F /T /PID on Windows, kill -9 on POSIX — so the reader thread unblocks and the child tree exits with the wrapper.
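The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the `Option<Output>` return type is inferred from the test names (`returns_some` / `returns_none`), and the helper `kill_by_pid` is a hypothetical name introduced here.

```rust
use std::io;
use std::process::{Command, Output, Stdio};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `cmd` and returns its output, or `None` if it fails to spawn,
/// fails while waiting, or does not finish within `dur`.
/// (Signature is an assumption; the PR only names `run_with_timeout(cmd, dur)`.)
pub fn run_with_timeout(mut cmd: Command, dur: Duration) -> Option<Output> {
    // Step 1: spawn the child with piped stdout/stderr.
    let child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()
        .ok()?;
    // Remember the PID now: `wait_with_output` takes ownership of the child.
    let pid = child.id();

    // Step 2: a reader thread blocks in `wait_with_output` and reports back
    // through a channel.
    let (tx, rx) = mpsc::channel::<io::Result<Output>>();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that.
        let _ = tx.send(child.wait_with_output());
    });

    // Step 3: the calling thread waits at most `dur` for the result.
    match rx.recv_timeout(dur) {
        Ok(result) => result.ok(),
        Err(_) => {
            // Step 4: kill the wedged child by PID so the reader thread
            // unblocks instead of leaking.
            kill_by_pid(pid);
            None
        }
    }
}

/// Hypothetical helper: force-kills `pid` with the OS-native tool.
fn kill_by_pid(pid: u32) {
    let pid_arg = pid.to_string();
    let mut killer = if cfg!(windows) {
        // /F = force, /T = include the whole child tree.
        let mut c = Command::new("taskkill");
        c.args(["/F", "/T", "/PID", pid_arg.as_str()]);
        c
    } else {
        let mut c = Command::new("kill");
        c.args(["-9", pid_arg.as_str()]);
        c
    };
    // Best effort: if the kill tool itself fails there is nothing more to do.
    let _ = killer.stdout(Stdio::null()).stderr(Stdio::null()).status();
}
```

Killing by PID is needed because `wait_with_output` consumes the `Child` handle, so the calling thread can no longer call `Child::kill` once the reader thread owns it.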

Applied to every Windows shell-out in this module:

| Call | What | Used by |
| --- | --- | --- |
| `process_alive` | `tasklist /FI "PID eq <pid>"` | wire status, doctor health checks, daemon liveness |
| `find_processes_by_cmdline` | PowerShell `Get-CimInstance Win32_Process` | wire status orphan-pid scan, wire upgrade daemon kill, identity-collision check (#247.4) |
| `pid_cmdline` | PowerShell `Get-CimInstance Win32_Process` filtered to one pid | orphan-pid annotation in wire status |
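For orientation, the two kinds of probe listed above might be built like this before being handed to the timeout wrapper. This is a sketch under assumptions: the function names `tasklist_probe` and `cmdline_probe` are hypothetical, and the PowerShell flags and filter expression are illustrative choices, not taken from the PR.

```rust
use std::process::Command;

/// Hypothetical builder for the liveness probe:
/// `tasklist /FI "PID eq <pid>"`.
pub fn tasklist_probe(pid: u32) -> Command {
    let filter = format!("PID eq {pid}");
    let mut cmd = Command::new("tasklist");
    // The filter is passed as a single argument; Rust quotes it on Windows.
    cmd.args(["/FI", filter.as_str()]);
    cmd
}

/// Hypothetical builder for the single-pid command-line probe.
/// The exact script used by the project is not shown in the PR; this one
/// reads the `CommandLine` property of the matching `Win32_Process` row.
pub fn cmdline_probe(pid: u32) -> Command {
    let script = format!(
        "(Get-CimInstance Win32_Process -Filter 'ProcessId = {pid}').CommandLine"
    );
    let mut cmd = Command::new("powershell");
    cmd.args(["-NoProfile", "-NonInteractive", "-Command", script.as_str()]);
    cmd
}
```

Building the `Command` separately from running it keeps the probe definition testable on any platform, since the arguments can be inspected with `Command::get_args` without spawning anything.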

Default timeout 5s (well past the ≤100ms a healthy host needs for any of these probes), overridable via WIRE_PLATFORM_TIMEOUT_SECS. On timeout each call falls through to its existing tool-error fallback (false for liveness, empty Vec for the enumerator, None for cmdline) — same shape the old Err(_) | Ok(non-success) arms produced, so callers don't need to handle a new "timed out" state. wire status / wire doctor now return promptly with whatever local state is readable instead of blocking on a wedged probe.
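A rough sketch of the timeout lookup and the fallback shapes described above. Only `WIRE_PLATFORM_TIMEOUT_SECS` and the 5s default come from the PR; the function names, the treatment of `0` and unparseable values, and the assumed probe output formats are all assumptions made for illustration.

```rust
use std::process::Output;
use std::time::Duration;

const DEFAULT_TIMEOUT_SECS: u64 = 5;

/// Turns a raw override into a timeout. Assumption: unparseable or zero
/// values fall back to the default rather than disabling the bound.
pub fn timeout_from(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Reads the override from the environment (function name is hypothetical).
pub fn platform_shell_timeout() -> Duration {
    let raw = std::env::var("WIRE_PLATFORM_TIMEOUT_SECS").ok();
    timeout_from(raw.as_deref())
}

/// Collapses "timed out", "failed to run", and "non-zero exit" into `None`,
/// so a timeout looks exactly like the pre-existing tool-error case.
fn stdout_if_success(probe: Option<Output>) -> Option<String> {
    let out = probe?;
    if !out.status.success() {
        return None;
    }
    Some(String::from_utf8_lossy(&out.stdout).into_owned())
}

/// Liveness falls back to `false`. Assumes the PID appears as a
/// whitespace-separated token in the probe output (a simplification).
pub fn liveness_or_false(probe: Option<Output>, pid: u32) -> bool {
    let needle = pid.to_string();
    stdout_if_success(probe)
        .map(|text| text.split_whitespace().any(|tok| tok == needle))
        .unwrap_or(false)
}

/// The enumerator falls back to an empty `Vec`. Assumes one PID per line.
pub fn pids_or_empty(probe: Option<Output>) -> Vec<u32> {
    stdout_if_success(probe)
        .map(|text| {
            text.lines()
                .filter_map(|line| line.trim().parse::<u32>().ok())
                .collect::<Vec<u32>>()
        })
        .unwrap_or_default()
}

/// The command-line lookup falls back to `None`.
pub fn cmdline_or_none(probe: Option<Output>) -> Option<String> {
    stdout_if_success(probe)
        .map(|text| text.trim().to_string())
        .filter(|s| !s.is_empty())
}
```

Because every helper takes the wrapper's `Option<Output>` and returns the same type the caller already expected, no call site needs a separate "timed out" branch.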

POSIX shell-outs (pgrep, kill, /proc/<pid> reads) are unchanged — they're either pure-fs reads or known-fast tool calls with their own timeouts.

Tests

3 new in platform::tests, all green on x86_64-pc-windows-msvc (rustc 1.96.0):

  • run_with_timeout_returns_some_on_fast_command: echo / cmd.exe /C echo completes inside 5s and stdout is captured.
  • run_with_timeout_returns_none_and_kills_on_slow_command: sleep 60 / Start-Sleep -Seconds 60 is killed inside a 500ms timeout, and the wrapper returns inside 10s (not 60s). Verifies the kill actually fires.
  • platform_shell_timeout_default_is_5s: env var override works and the default is 5s.

Full lib suite: 489 passed; 0 failed; 7 ignored.

Out of scope (left for follow-ups)

Stack

Stacks on top of #294 (Windows test/clippy hygiene). Rebase onto main once #294 lands.

Sibling Windows-cluster PRs already open: #296 (#284.5 stale relay.lock reclaim), #297 (#247.4 Windows identity-collision adapter).

Test plan

  • cargo fmt --check clean on Windows.
  • cargo clippy --all-targets -- -D warnings clean on Windows.
  • cargo test --lib: 489 passed, 0 failed, 7 ignored on Windows.
  • CI green (install-smoke-windows, demo, docs-lint).
  • Manual repro: stage a host with hundreds of wire.exe procs (or run the wrapper directly against powershell.exe Start-Sleep 60 via the new unit test), confirm wire status / wire doctor finish promptly.

🤖 Generated with Claude Code

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 15, 2026


Deploying wireup-landing with Cloudflare Pages

Latest commit: f2d19b6
Status: ✅  Deploy successful!
Preview URL: https://ecb651c8.wireup-landing.pages.dev
Branch Preview URL: https://fix-284-1-bounded-status-up.wireup-landing.pages.dev


@laulpogan laulpogan force-pushed the fix/284-1-bounded-status-up-doctor-timeouts branch 4 times, most recently from 2631838 to 5723d6f, June 15, 2026 01:54
…t hang (#284.1)

Issue #284 part 1 (from Willard's Windows report): `wire status`,
`wire up`, and `wire doctor` hang indefinitely when the relay stream
long-poll is down OR when the Windows process-enumeration shell-outs
themselves wedge. Per Willard's 254-stale-wire.exe pile-up, the
PowerShell + `Get-CimInstance Win32_Process` query the daemon-liveness
helper and the doctor's orphan scan rely on can take many seconds —
and on a corrupted CIM repository it can hang outright. With no
timeout wrapping that shell-out, every CLI surface that touches
process probes was unbounded.

Fix: a new `crate::platform::run_with_timeout(cmd, dur)`. Spawns the
child with piped stdout/stderr, hands `wait_with_output` to a reader
thread, `recv_timeout`s on the main thread, and on timeout kills the
wedged child by PID via the OS-native tool (`taskkill /F /T /PID` on
Windows, `kill -9` on POSIX) so the reader thread unblocks and the
child tree exits with the wrapper.

Applied to every Windows shell-out in this module:

- `process_alive` → `tasklist /FI "PID eq <pid>"`
- `find_processes_by_cmdline` → PowerShell `Get-CimInstance Win32_Process`
  with the `wire*` / cmdline filter
- `pid_cmdline` → PowerShell `Get-CimInstance Win32_Process` with the
  pid filter

Default timeout 5s (well past the ≤100ms a healthy host needs for
any of these probes), overridable via `WIRE_PLATFORM_TIMEOUT_SECS`.
On timeout each call falls through to its existing tool-error
fallback (`false` for liveness, empty `Vec` for the enumerator,
`None` for cmdline) — same shape the old `Err(_) | Ok(non-success)`
arms produced, so callers don't need to handle a new "timed out"
state. `wire status` / `wire doctor` now return promptly with
whatever local state is readable instead of blocking on a wedged
probe.

POSIX shell-outs (`pgrep`, `kill`, `/proc/<pid>` reads) are
unchanged — they're either pure-fs reads or known-fast tool calls
with their own timeouts.

Tests: 3 new in `platform::tests`:

- `run_with_timeout_returns_some_on_fast_command` — `echo` /
  `cmd.exe /C echo` completes inside 5s and the stdout is captured.
- `run_with_timeout_returns_none_and_kills_on_slow_command` —
  `sleep 60` / `Start-Sleep -Seconds 60` is killed inside a 500ms
  timeout, and the wrapper returns inside 10s (not 60s).
- `platform_shell_timeout_default_is_5s` — env var override works.

Full lib suite: 489 passed; 0 failed; 7 ignored on
`x86_64-pc-windows-msvc` (rustc 1.96.0).

Stacks on top of #294 (Windows test/clippy hygiene); rebase onto main
once #294 lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: laul.pogan <paul@zaibatsuheavy.industries>
@laulpogan laulpogan force-pushed the fix/284-1-bounded-status-up-doctor-timeouts branch from 5723d6f to f2d19b6, June 15, 2026 02:01
@laulpogan laulpogan merged commit 00edb89 into main Jun 15, 2026
12 checks passed
@laulpogan laulpogan deleted the fix/284-1-bounded-status-up-doctor-timeouts branch June 15, 2026 02:06