fix(platform): bound Windows shell-outs so wire status/up/doctor can't hang (#284.1) #298
Merged
This was referenced Jun 14, 2026
7423ef4 to 166e19a
Deploying wireup-landing

| Latest commit: | f2d19b6 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://ecb651c8.wireup-landing.pages.dev |
| Branch Preview URL: | https://fix-284-1-bounded-status-up.wireup-landing.pages.dev |
2631838 to 5723d6f
…t hang (#284.1)

Issue #284 part 1 (from Willard's Windows report): `wire status`, `wire up`, and `wire doctor` hang indefinitely when the relay stream long-poll is down OR when the Windows process-enumeration shell-outs themselves wedge.

Per Willard's 254-stale-wire.exe pile-up, the PowerShell + `Get-CimInstance Win32_Process` query the daemon-liveness helper and the doctor's orphan scan rely on can take many seconds, and on a corrupted CIM repository it can hang outright. With no timeout wrapping that shell-out, every CLI surface that touches process probes was unbounded.

Fix: a new `crate::platform::run_with_timeout(cmd, dur)`. Spawns the child with piped stdout/stderr, hands `wait_with_output` to a reader thread, `recv_timeout`s on the main thread, and on timeout kills the wedged child by PID via the OS-native tool (`taskkill /F /T /PID` on Windows, `kill -9` on POSIX) so the reader thread unblocks and the child tree exits with the wrapper.

Applied to every Windows shell-out in this module:

- `process_alive` → `tasklist /FI "PID eq <pid>"`
- `find_processes_by_cmdline` → PowerShell `Get-CimInstance Win32_Process` with the `wire*` / cmdline filter
- `pid_cmdline` → PowerShell `Get-CimInstance Win32_Process` with the pid filter

Default timeout 5s (well past the ≤100ms a healthy host needs for any of these probes), overridable via `WIRE_PLATFORM_TIMEOUT_SECS`. On timeout each call falls through to its existing tool-error fallback (`false` for liveness, empty `Vec` for the enumerator, `None` for cmdline), the same shape the old `Err(_) | Ok(non-success)` arms produced, so callers don't need to handle a new "timed out" state. `wire status` / `wire doctor` now return promptly with whatever local state is readable instead of blocking on a wedged probe.

POSIX shell-outs (`pgrep`, `kill`, `/proc/<pid>` reads) are unchanged: they're either pure-fs reads or known-fast tool calls with their own timeouts.
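The timeout default, the `WIRE_PLATFORM_TIMEOUT_SECS` override, and the fallback shapes described above could look roughly like the sketch below. This is not the code from the PR: the function and constant names (`parse_timeout`, `platform_shell_timeout`, `liveness_or_false`, `pids_or_empty`, `DEFAULT_TIMEOUT_SECS`) are hypothetical, and treating zero or non-numeric values as "use the default" is an assumption.

```rust
use std::env;
use std::time::Duration;

/// Default named in the text above; the constant name is hypothetical.
const DEFAULT_TIMEOUT_SECS: u64 = 5;

/// Turns a raw override into a timeout. Unset, non-numeric, or zero values
/// fall back to the default (rejecting zero is an assumption of this sketch).
fn parse_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Reads the override from the environment variable named in the text.
fn platform_shell_timeout() -> Duration {
    let raw = env::var("WIRE_PLATFORM_TIMEOUT_SECS").ok();
    parse_timeout(raw.as_deref())
}

/// `probe` is `None` when the bounded shell-out timed out or the tool failed.
/// A timed-out liveness probe reports "not alive", like any other tool error.
fn liveness_or_false(probe: Option<bool>) -> bool {
    probe.unwrap_or(false)
}

/// A timed-out enumeration reports "no processes found".
fn pids_or_empty(probe: Option<Vec<u32>>) -> Vec<u32> {
    probe.unwrap_or_default()
}
```

Because a timeout collapses into the same `None` as a tool error, callers in this sketch never see a distinct "timed out" variant, which matches the behavior the text describes.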
Tests: 3 new in `platform::tests`:

- `run_with_timeout_returns_some_on_fast_command`: `echo` / `cmd.exe /C echo` completes inside 5s and the stdout is captured.
- `run_with_timeout_returns_none_and_kills_on_slow_command`: `sleep 60` / `Start-Sleep -Seconds 60` is killed inside a 500ms timeout, and the wrapper returns inside 10s (not 60s).
- `platform_shell_timeout_default_is_5s`: env var override works.

Full lib suite: 489 passed; 0 failed; 7 ignored on `x86_64-pc-windows-msvc` (rustc 1.96.0).

Stacks on top of #294 (Windows test/clippy hygiene); rebase onto main once #294 lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: laul.pogan <paul@zaibatsuheavy.industries>
5723d6f to f2d19b6
Part 1 of #284 (Willard's Windows report).
Summary
`wire status`, `wire up`, and `wire doctor` hung indefinitely when the Windows process-enumeration shell-outs they rely on wedged. On Willard's box, 254 stale `wire.exe` processes under heavy WMI contention made `Get-CimInstance Win32_Process` legitimately slow; a corrupted CIM repository would hang it outright. With no timeout, every CLI surface that ran a process probe blocked forever.

Fix

A new `crate::platform::run_with_timeout(cmd, dur)`:

- Spawns the child with piped stdout/stderr.
- Hands `wait_with_output` to a reader thread that pushes the result through an `mpsc::channel`.
- `recv_timeout`s on the main thread.
- On timeout, kills the wedged child by PID via the OS-native tool (`taskkill /F /T /PID` on Windows, `kill -9` on POSIX) so the reader thread unblocks and the child tree exits with the wrapper.

Applied to every Windows shell-out in this module:

| Function | Shell-out | Used by |
| --- | --- | --- |
| `process_alive` | `tasklist /FI "PID eq <pid>"` | `wire status`, doctor health checks, daemon liveness |
| `find_processes_by_cmdline` | PowerShell `Get-CimInstance Win32_Process` | `wire status` orphan-pid scan, `wire upgrade` daemon kill, identity-collision check (#247.4) |
| `pid_cmdline` | PowerShell `Get-CimInstance Win32_Process` filtered to one pid | `wire status` |

Default timeout 5s (well past the ≤100ms a healthy host needs for any of these probes), overridable via `WIRE_PLATFORM_TIMEOUT_SECS`. On timeout each call falls through to its existing tool-error fallback (`false` for liveness, empty `Vec` for the enumerator, `None` for cmdline), the same shape the old `Err(_) | Ok(non-success)` arms produced, so callers don't need to handle a new "timed out" state. `wire status` / `wire doctor` now return promptly with whatever local state is readable instead of blocking on a wedged probe.

POSIX shell-outs (`pgrep`, `kill`, `/proc/<pid>` reads) are unchanged: they're either pure-fs reads or known-fast tool calls with their own timeouts.

Tests

3 new in `platform::tests`, all green on `x86_64-pc-windows-msvc` (rustc 1.96.0):

- `run_with_timeout_returns_some_on_fast_command`: `echo` / `cmd.exe /C echo` completes inside 5s and stdout is captured.
- `run_with_timeout_returns_none_and_kills_on_slow_command`: `sleep 60` / `Start-Sleep -Seconds 60` is killed inside a 500ms timeout, and the wrapper returns inside 10s (not 60s). Verifies the kill actually fires.
- `platform_shell_timeout_default_is_5s`: env var override works and the default is 5s.

Full lib suite: 489 passed; 0 failed; 7 ignored.
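For readers who want to see the shape of the wrapper described under Fix and exercised by the tests above, here is a minimal sketch. It is not the PR's implementation: taking `Command` by value, returning `Option<Output>`, and the helper name `kill_by_pid` are assumptions made for illustration. Only the overall flow (spawn with piped output, reader thread, `recv_timeout`, kill by PID) comes from the description.

```rust
use std::process::{Command, Output, Stdio};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `cmd` and returns its output, or `None` if it fails to spawn,
/// errors while waiting, or does not finish within `dur`.
/// (Signature is an assumption; the PR only names `run_with_timeout(cmd, dur)`.)
fn run_with_timeout(mut cmd: Command, dur: Duration) -> Option<Output> {
    let child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()
        .ok()?;
    let pid = child.id();

    // The reader thread owns the child and blocks in wait_with_output,
    // so the main thread stays free to enforce the deadline.
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(child.wait_with_output());
    });

    match rx.recv_timeout(dur) {
        Ok(result) => result.ok(),
        Err(_) => {
            // Deadline passed: kill the child so the reader thread unblocks.
            kill_by_pid(pid);
            None
        }
    }
}

/// Hypothetical helper: kills a process by PID with the OS-native tool.
fn kill_by_pid(pid: u32) {
    let pid_arg = pid.to_string();
    #[cfg(windows)]
    let mut killer = {
        let mut c = Command::new("taskkill");
        c.args(["/F", "/T", "/PID", pid_arg.as_str()]);
        c
    };
    #[cfg(not(windows))]
    let mut killer = {
        let mut c = Command::new("kill");
        c.args(["-9", pid_arg.as_str()]);
        c
    };
    // Best effort: ignore failures, the caller already treats this as a timeout.
    let _ = killer.stdout(Stdio::null()).stderr(Stdio::null()).status();
}
```

One design point worth noting in this sketch: the child is moved into the reader thread, so the main thread cannot call `Child::kill` on it after a timeout. Killing by PID through an external tool sidesteps that ownership problem, and on Windows `taskkill /T` also takes down the child's process tree.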
Out of scope (left for follow-ups)
- `daemon-stream: error error decoding response body: operation timed out` log spam is a daemon-side bug, not a CLI hang: separate issue.
- `wire upgrade`'s prebuilt-download hang (no progress bar, just silent hang when crates.io + github are both reachable) is #284.3.
- The `until wire status | grep daemon_running:true; do sleep 3; done` SessionStart loop that piled up the 254 wire.exe procs is #284.2. With this PR `wire status` returns promptly even with that loop still running, but the loop should also become bounded so a never-healthy state doesn't accumulate procs forever.

Stack

Stacks on top of #294 (Windows test/clippy hygiene). Rebase onto main once #294 lands.

Sibling Windows-cluster PRs already open: #296 (#284.5 stale `relay.lock` reclaim), #297 (#247.4 Windows identity-collision adapter).

Test plan

- `cargo fmt --check` clean on Windows.
- `cargo clippy --all-targets -- -D warnings` clean on Windows.
- `cargo test --lib` 489/0/7 on Windows.
- … (`install-smoke-windows`, demo, docs-lint).
- … `wire.exe` procs (or run the wrapper directly against `powershell.exe Start-Sleep 60` via the new unit test), confirm `wire status` / `wire doctor` finish promptly.

🤖 Generated with Claude Code