fix(config): bound relay.lock acquisition + reclaim from stale owners (#284.5)#296
Merged
Merged
Conversation
This was referenced Jun 14, 2026
3a99093 to
4ea64ed
Compare
Deploying wireup-landing with
|
| Latest commit: |
21e6f22
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://7fd730db.wireup-landing.pages.dev |
| Branch Preview URL: | https://fix-284-5-stale-relay-lock-r.wireup-landing.pages.dev |
4ea64ed to
4589b67
Compare
…#284.5) Issue #284 part 5 (from Willard's Windows report): a hung `wire daemon` — or any wire process stuck in a relay long-poll (#284.1) — would hold `relay.lock` forever, since the kernel only releases the flock at PID exit and the wedged process never exited. Every subsequent `wire status` / `wire send` / `wire daemon` then blocked on `lock_exclusive()` indefinitely. On Windows that was the engine behind Willard's 254-`wire.exe`-process pile-up: the SessionStart `until wire status …` loop kept spawning fresh status invocations every 3s, each one blocking forever on the held lock. Acquisition is now bounded + reclaim-aware: 1. Open the lock file and try `try_lock_exclusive` non-blocking. 2. On success: stamp our PID into a new sidecar `relay.lock.owner` file. The sidecar is intentionally separate from the flock file — Windows `LockFileEx` denies reads to other handles against the locked file, so a waiter cannot read a "who owns this?" PID body off of `relay.lock` itself. The sidecar is plain-text and never byte-range-locked, so any waiter can read it without contending for the flock. 3. On contention: read the sidecar, consult `classify_contention`. - Dead/absent owner → retry immediately with a 1ms sleep. The OS has already released the underlying flock when the owning PID exited; the next try will succeed. - Live owner → exponential backoff (10ms → 200ms) until `WIRE_RELAY_LOCK_TIMEOUT_SECS` (default 10s) elapses, then fail with the holder PID surfaced in the error so `wire doctor` and the SessionStart loop can name a target to kill. The classification logic (`classify_contention`) is split out as a pure function over `(body: &[u8], is_alive: impl Fn(u32) -> bool)`, so the dead-PID-says-reclaim / live-PID-says-wait policy is unit-testable on every platform without spinning real subprocesses. Integration tests cover the stamps-PID-on-acquire, reclaim-when-owner-dead, and times-out-with-PID-when-owner-alive paths against the actual fs2 flock implementation. No behavior change in the uncontended fast path. The Windows lock reclaim relies on `crate::platform::process_alive`, which already has the Windows `tasklist`-based implementation that v0.7.3 hardened. Tests: 8 new (5 pure-logic `classify_contention_*`, 3 integration `acquire_relay_lock_*`), all green on `x86_64-pc-windows-msvc` (rustc 1.96.0). Full lib suite: 494 passed; 0 failed. Stacks on top of #294 (Windows test/clippy hygiene); rebase onto main once #294 lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: laul.pogan <paul@zaibatsuheavy.industries>
4589b67 to
21e6f22
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Part 5 of #284 (Willard's Windows report).
Summary
A hung
wire daemon— or any wire process stuck in a relay long-poll (#284.1) — heldrelay.lockforever, since the kernel only releases the flock at PID exit and the wedged process never exited. Every subsequentwire status/wire send/wire daemonthen blocked onlock_exclusive()indefinitely. On Windows this was the engine behind Willard's 254-wire.exe-process pile-up: the SessionStartuntil wire status …loop kept spawning fresh status invocations every 3s, each one blocking forever on the held lock.Acquisition is now bounded + reclaim-aware. No behavior change in the uncontended fast path.
How it works
try_lock_exclusivenon-blocking.relay.lock.ownerfile. The sidecar is intentionally separate from the flock file — WindowsLockFileExdenies reads to other handles against the locked file, so a waiter cannot read a "who owns this?" PID body off ofrelay.lockitself. The sidecar is plain-text and never byte-range-locked, so any waiter can read it without contending for the flock.classify_contention:WIRE_RELAY_LOCK_TIMEOUT_SECS(default 10s) elapses, then fail with the holder PID surfaced in the error sowire doctorand the SessionStart loop can name a target to kill.The classification logic (
classify_contention) is split out as a pure function over(body: &[u8], is_alive: impl Fn(u32) -> bool), so the dead-PID-says-reclaim / live-PID-says-wait policy is unit-testable on every platform without spinning real subprocesses.Why the sidecar (and not the lock file body)
The first cut wrote the PID directly into
relay.lock's body. On Windows that fails:LockFileExis byte-range-mandatory against any other handle, so a waiter'sfs::readreturnsERROR_LOCK_VIOLATION (33). POSIX byte-range locks are advisory, so it works on Linux/macOS — but the half-platform answer is a footgun. Splitting the owner PID into a sidecar gives every platform the same "owner is dead → reclaim immediately" path and keeps the flock file purely an OS-level lock token.Tests
8 new, all green on
x86_64-pc-windows-msvc(rustc 1.96.0):Pure-logic (
classify_contention):classify_contention_dead_pid_says_reclaimclassify_contention_live_pid_says_waitclassify_contention_empty_body_says_reclaimclassify_contention_garbage_body_says_reclaimclassify_contention_trims_whitespaceIntegration (
acquire_relay_lockvs real fs2):acquire_relay_lock_stamps_our_pid_into_owner_sidecaracquire_relay_lock_reclaims_when_owner_pid_is_dead— writesu32::MAXinto the sidecar (never an assigned PID on Linux + Windows), confirms acquire wins well under the timeout.acquire_relay_lock_times_out_when_owner_is_alive— holds the lock from inside the test, writes our own (live) PID into the sidecar, confirms acquire respects the bounded timeout AND surfaces the holder PID in the error.Full lib suite: 494 passed; 0 failed; 7 ignored.
Out of scope (left for follow-ups)
trust.lock(Org membership: no member-cert expiry + no key rotation; write_trust not atomic #246) has the same acquisition pattern. Migrating it toacquire_lockwould be a natural follow-up but isn't strictly needed for the Windows: status/up/doctor + upgrade hang, SessionStart wait-loop pileup (254 procs), and wire_send peer_unknown on VERIFIED peer #284 cluster — keeping this PR narrowly to the relay-lock path saffron called out.until …wait-loop).This PR turns the symptom (every
wirecommand blocks forever once one holder wedges) into a bounded, diagnosable failure mode. #284.1 + #284.2 then attack the trigger.Stack
This PR stacks on top of #294 (Windows test/clippy hygiene — three tiny pre-existing breakages that block any local
cargo test --lib/cargo clippy -- -D warningson MSVC). When #294 lands I'll rebase this onto main; in the meantime CI should be fine since #294 only touches test/scaffold paths.Test plan
cargo fmt --checkclean on Windows.cargo clippy --all-targets -- -D warningsclean on Windows.cargo test --lib494/0/7 on Windows.install-smoke-windows, demo, docs-lint).wire statusreclaims immediately. (Difficult to fully simulate on a hung-but-alive daemon without Windows: status/up/doctor + upgrade hang, SessionStart wait-loop pileup (254 procs), and wire_send peer_unknown on VERIFIED peer #284.1, but the unit + integration coverage exercises both branches of the policy.)🤖 Generated with Claude Code