Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ the PR description linked in each section.

- **`relay.lock` acquisition is now bounded + stale-owner-reclaimable** (#284.5): a hung wire daemon (or any wire process stuck in a relay long-poll — see #284.1) held `relay.lock` indefinitely, since the kernel only releases the flock at PID exit and the wedged process never exited. Every subsequent `wire status` / `wire send` / `wire daemon` then blocked on `lock_exclusive()` forever; on Windows this was the engine behind the 254-`wire.exe`-process pile-up Willard's SessionStart `until wire status …` loop spawned. The acquire path now (1) opens + `try_lock_exclusive` non-blocking; (2) on contention, reads the owner PID from a new `relay.lock.owner` sidecar and consults a pure-logic classifier — if the owner is dead/absent it retries immediately (the OS auto-released the flock on PID exit), if the owner is live it bounded-exponential-backoffs up to `WIRE_RELAY_LOCK_TIMEOUT_SECS` (default 10s) and then fails with the holder PID surfaced in the error so `wire doctor` can name a target. The sidecar pattern is needed because Windows `LockFileEx` denies reads to other handles against the locked file, so "who holds this?" can't be read out of `relay.lock` itself. No behavior change in the uncontended fast path.
- **`warn_on_identity_collision` now actually catches collisions on Windows** (#247 finding 4): the v0.6.10 collision check shelled out to POSIX `pgrep` and read `/proc/<pid>/environ` to compare `WIRE_HOME` between live wire processes — both POSIX-only, so Windows silently no-op'd. Two `wire mcp` servers in the same cwd shared one identity, raced the inbox cursor, and operators burned hours on "they look identical." The enumerator now routes through `crate::platform::find_processes_by_cmdline` (the existing PowerShell + `Get-CimInstance Win32_Process` adapter), and the cross-process `WIRE_HOME` lookup on Windows walks `list_sessions()` × every inbox-owning role's pidfile (`<session_home>/state/wire/<role>.pid`) and reverse-maps the candidate PID to its serving home. To make that reverse-map work for every role — not just daemon — `wire mcp`, `wire monitor`, and `wire notify` now drop their own `<role>.pid` on startup via a new `ensure_up::write_self_role_pid`, mirroring what `write_self_daemon_pid` has always done for the daemon. Pure-logic `find_home_for_pid` is split out and unit-tested; the path-based reader is `session::session_role_pid(home, role)`. POSIX path is unchanged (still `/proc/<pid>/environ` / `ps -E`).
- **`wire status` / `wire up` / `wire doctor` are now bounded on a wedged probe path** (#284.1): the Windows process-enumeration shell-outs in `platform.rs` (`Get-CimInstance Win32_Process` and `tasklist`) had no timeout, so any wedged CIM call — observed on Willard's box where 254 stale `wire.exe` processes piled up under heavy WMI contention, but the same shape on any corrupted CIM repository — blocked every CLI command that ran a process probe forever. A new `crate::platform::run_with_timeout(cmd, dur)` wraps every Windows shell-out in this module: spawn, hand `wait_with_output` to a reader thread, `recv_timeout` on the main thread, kill the wedged child by PID on timeout via `taskkill /F /T /PID`. Default 5s, overridable via `WIRE_PLATFORM_TIMEOUT_SECS`. On timeout each call falls through to its existing tool-error fallback (empty Vec, `None`, `false`) so the caller sees "no answer" instead of hanging forever, and `wire status` / `wire doctor` return promptly with whatever local state is readable.
- **Windows local dev — `cargo test` / `cargo clippy` gates now compile clean** (developer-experience hygiene): three small Windows-only pre-existing breakages blocked any local TDD on a Windows host. (1) `config::tests::private_key_is_mode_0600` used unix-only `PermissionsExt::mode()` without a `cfg(unix)` guard, failing to compile under `cargo test --lib` on MSVC. (2) `cli::upgrade::tests::no_shadow_warning_when_active_symlink_resolves_to_current_exe` declared `let link` that's only used inside its `#[cfg(unix)]` symlink call, triggering `unused_variables` on Windows. (3) `cli::lifecycle` had a collapsible-if pattern in the Windows-only `taskkill` enumerator and an `unneeded_return` on the Windows-only `purge_binary_and_shell` branch, both firing clippy under `-D warnings` on MSVC. All three are test/scaffold-only, no runtime behavior change; CI's `install-smoke-windows` job is untouched.
- **`wire_whoami` (MCP) flags `stale_binary` when the server's code has drifted from the daemon** (#247 finding 5): a long-lived `wire mcp` server keeps serving the binary it was spawned from, so after `wire upgrade` swaps the daemon it silently runs pre-upgrade code in memory (the "ghost identity" drift behind today's 0.14.1-vs-0.16.0 confusion). `wire_whoami` now compares the baked `server_version` against the live daemon's recorded version and emits `server_version`, `daemon_version`, and on mismatch `stale_binary:true` + a note to `/mcp` reconnect.
- **`peer_unknown` on a send now says WHY, and the MCP `wire_send` schema exposes `queue`** (#284 parts 6-7): a VERIFIED/pinned peer whose relay slot had an empty `slot_token` (the `pair_drop_ack` hadn't landed — common right after a daemon/MCP restart) returned the misleading "peer not pinned — run wire dial." The reason is now classified against live state — *not pinned* vs *pinned but no endpoint* vs *pinned but slot has no token* — and points at the **full `<peer>@<relay>` dial** (the bare nickname reports `already_pinned` without re-registering the slot). Separately, the MCP `wire_send` tool documented `queue:true` but omitted it from its input schema (the handler already supported it) — now exposed.
Expand Down
213 changes: 199 additions & 14 deletions src/platform.rs
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,104 @@
//! Each helper returns conservative defaults on tool failure (empty
//! Vec, `false`) so callers can chain them without aborting an upgrade
//! mid-flight when one query hiccups.
//!
//! ### Bounded shell-out (#284.1)
//!
//! Every Windows shell-out below is wrapped in [`run_with_timeout`].
//! PowerShell's `Get-CimInstance` can wedge — observed on a host with
//! 254 stale `wire.exe` processes piled up by a broken SessionStart
//! loop, but also any corrupted CIM repository — and any `wire status`
//! / `wire up` / `wire doctor` call that lands on a wedged enumeration
//! would block forever waiting on the child. The wrapper kills the
//! child after `WIRE_PLATFORM_TIMEOUT_SECS` (default 5s) and the
//! caller falls through to its existing tool-error fallback (empty
//! Vec, `None`, etc.), so a probe that can't answer in 5s reads as
//! "no answer" rather than "wedge the whole CLI".

use std::process::{Command, Output, Stdio};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

use std::process::Command;
/// Bounded timeout for Windows shell-outs in this module. Override via
/// `WIRE_PLATFORM_TIMEOUT_SECS`. Default 5s — every probe in this
/// module is a single PowerShell / tasklist call that completes in
/// well under 500ms on a healthy host. POSIX builds never call this
/// at runtime (the test module does, hence not `#[cfg(windows)]`-only),
/// so silence the dead-code lint there.
#[cfg_attr(not(windows), allow(dead_code))]
fn platform_shell_timeout() -> Duration {
std::env::var("WIRE_PLATFORM_TIMEOUT_SECS")
.ok()
.and_then(|s| s.parse::<u64>().ok())
.map(Duration::from_secs)
.unwrap_or_else(|| Duration::from_secs(5))
}

/// Run `cmd` with a wall-clock timeout. Returns `Some(Output)` on
/// completion, or `None` on timeout (or spawn failure / wait failure).
/// On timeout the child is killed best-effort via a platform-native
/// shell-out (`taskkill /F /T /PID` on Windows, `kill -9` on POSIX) so
/// the wedged process tree exits with the wrapper.
///
/// `Stdio` defaults: stdin null, stdout/stderr piped. Callers may
/// override `stdin` before calling but should leave the pipes alone —
/// the reader thread relies on them being captured to drain output
/// while we wait.
///
/// Implementation: spawn the child, hand `wait_with_output` to a
/// background thread that sends the result through a channel, then
/// `recv_timeout` on the main thread. On timeout we kill the PID via
/// the OS-native tool (we can't call `Child::kill` here because the
/// `Child` moved into the reader thread).
pub fn run_with_timeout(mut cmd: Command, timeout: Duration) -> Option<Output> {
cmd.stdin(Stdio::null())
.stdout(Stdio::piped())
.stderr(Stdio::piped());
let child = cmd.spawn().ok()?;
let pid = child.id();
let (tx, rx) = mpsc::channel::<Output>();
thread::spawn(move || {
if let Ok(out) = child.wait_with_output() {
let _ = tx.send(out);
}
});
match rx.recv_timeout(timeout) {
Ok(out) => Some(out),
Err(_) => {
// Kill the wedged child by PID. Best-effort: a failure here
// just means the reader thread keeps waiting; the main
// thread already moved on with `None`.
kill_pid_best_effort(pid);
None
}
}
}

fn kill_pid_best_effort(pid: u32) {
#[cfg(unix)]
{
let _ = Command::new("kill")
.args(["-9", &pid.to_string()])
.stdin(Stdio::null())
.stdout(Stdio::null())
.stderr(Stdio::null())
.status();
}
#[cfg(windows)]
{
let _ = Command::new("taskkill.exe")
.args(["/F", "/T", "/PID", &pid.to_string()])
.stdin(Stdio::null())
.stdout(Stdio::null())
.stderr(Stdio::null())
.status();
}
#[cfg(not(any(unix, windows)))]
{
let _ = pid;
}
}

/// True iff pid is alive.
///
Expand All @@ -44,15 +140,22 @@ pub fn process_alive(pid: u32) -> bool {
}
#[cfg(windows)]
{
let out = Command::new("tasklist.exe")
.args(["/FI", &format!("PID eq {pid}"), "/FO", "CSV", "/NH"])
.output();
match out {
Ok(o) if o.status.success() => {
// Bounded: a wedged `tasklist` would hang every `wire status` /
// `wire doctor` it touches. 5s default — `tasklist /FI "PID eq …"`
// completes in well under 100ms on a healthy host.
let mut cmd = Command::new("tasklist.exe");
cmd.args(["/FI", &format!("PID eq {pid}"), "/FO", "CSV", "/NH"]);
match run_with_timeout(cmd, platform_shell_timeout()) {
Some(o) if o.status.success() => {
let s = String::from_utf8_lossy(&o.stdout);
let trimmed = s.trim();
!trimmed.is_empty() && !trimmed.starts_with("INFO:")
}
// Timeout / failure → conservative `false` (treat as dead).
// Same fallback the old `Err(_) | Ok(non-success)` arm
// produced; `wire status` already handles "daemon missing"
// cleanly, and surfacing "timed out probing" is part of
// #284.1's bounded-but-loud story.
_ => false,
}
}
Expand Down Expand Up @@ -129,10 +232,13 @@ pub fn find_processes_by_cmdline(pattern: &str) -> Vec<u32> {
Where-Object {{ $_.Name -like 'wire*' -and $_.ProcessId -ne $PID -and $_.CommandLine -like '*{escaped}*' }} | \
Select-Object -ExpandProperty ProcessId"
);
Command::new("powershell.exe")
.args(["-NoProfile", "-NonInteractive", "-Command", &ps])
.output()
.ok()
// Bounded: a wedged `Get-CimInstance` (corrupted CIM repo, or
// simply slow under heavy WMI contention on a host with
// hundreds of stale `wire.exe` processes — see #284.1 / #284.2)
// would hang every CLI invocation it's reached from. 5s default.
let mut cmd = Command::new("powershell.exe");
cmd.args(["-NoProfile", "-NonInteractive", "-Command", &ps]);
run_with_timeout(cmd, platform_shell_timeout())
.filter(|o| o.status.success())
.map(|o| {
String::from_utf8_lossy(&o.stdout)
Expand Down Expand Up @@ -204,10 +310,9 @@ pub fn pid_cmdline(pid: u32) -> Option<String> {
Where-Object {{ $_.ProcessId -eq {pid} }} | \
Select-Object -ExpandProperty CommandLine"
);
let out = Command::new("powershell.exe")
.args(["-NoProfile", "-NonInteractive", "-Command", &ps])
.output()
.ok()?;
let mut cmd = Command::new("powershell.exe");
cmd.args(["-NoProfile", "-NonInteractive", "-Command", &ps]);
let out = run_with_timeout(cmd, platform_shell_timeout())?;
if !out.status.success() {
return None;
}
Expand Down Expand Up @@ -447,4 +552,84 @@ mod tests {
let dead = 4_000_000_002;
let _ = kill_process(dead, false);
}

// ---------- #284.1: run_with_timeout ----------

use std::time::Instant;

#[test]
fn run_with_timeout_returns_some_on_fast_command() {
// Pick a tiny command that exists on every platform.
#[cfg(unix)]
let cmd = {
let mut c = Command::new("echo");
c.arg("hello");
c
};
#[cfg(windows)]
let cmd = {
let mut c = Command::new("cmd.exe");
c.args(["/C", "echo hello"]);
c
};
let out = run_with_timeout(cmd, Duration::from_secs(5));
assert!(out.is_some(), "echo must complete inside 5s");
let out = out.unwrap();
assert!(out.status.success());
let s = String::from_utf8_lossy(&out.stdout);
assert!(
s.contains("hello"),
"stdout should contain `hello`; got {s:?}"
);
}

#[test]
fn run_with_timeout_returns_none_and_kills_on_slow_command() {
// Sleep WAY past the timeout so we can prove the wrapper
// returns inside the timeout window, not at sleep completion.
#[cfg(unix)]
let cmd = {
let mut c = Command::new("sleep");
c.arg("60");
c
};
#[cfg(windows)]
let cmd = {
let mut c = Command::new("powershell.exe");
c.args([
"-NoProfile",
"-NonInteractive",
"-Command",
"Start-Sleep -Seconds 60",
]);
c
};
let started = Instant::now();
let out = run_with_timeout(cmd, Duration::from_millis(500));
let elapsed = started.elapsed();
assert!(out.is_none(), "slow command must time out, got {out:?}");
// Generous upper bound — taskkill / kill spawning takes a beat,
// and CI runners are not real-time. The point is "not 60s".
assert!(
elapsed < Duration::from_secs(10),
"must return well inside the wedged child's runtime; elapsed={elapsed:?}"
);
}

#[test]
fn platform_shell_timeout_default_is_5s() {
// SAFETY: serial tests + this test only reads / restores its own var.
// Save and restore any existing value so a sibling test isn't
// perturbed (no global ENV_LOCK in this module).
let prev = std::env::var("WIRE_PLATFORM_TIMEOUT_SECS").ok();
unsafe { std::env::remove_var("WIRE_PLATFORM_TIMEOUT_SECS") };
assert_eq!(platform_shell_timeout(), Duration::from_secs(5));
unsafe { std::env::set_var("WIRE_PLATFORM_TIMEOUT_SECS", "12") };
assert_eq!(platform_shell_timeout(), Duration::from_secs(12));
// Restore.
match prev {
Some(v) => unsafe { std::env::set_var("WIRE_PLATFORM_TIMEOUT_SECS", v) },
None => unsafe { std::env::remove_var("WIRE_PLATFORM_TIMEOUT_SECS") },
}
}
}