fix(config): bound relay.lock acquisition + reclaim from stale owners (#284.5)#296

Merged: laulpogan merged 1 commit into main from fix/284-5-stale-relay-lock-reclaim on Jun 15, 2026

@laulpogan

Part 5 of #284 (Willard's Windows report).

Summary

A hung wire daemon, or any wire process stuck in a relay long-poll (#284.1), held relay.lock forever: the kernel releases the flock only when the holder closes its handle or exits, and the wedged process did neither. Every subsequent wire status / wire send / wire daemon then blocked on lock_exclusive() indefinitely. On Windows this was the engine behind Willard's pile-up of 254 wire.exe processes: the SessionStart until wire status … loop spawned a fresh status invocation every 3s, each one blocking forever on the held lock.

Acquisition is now bounded + reclaim-aware. No behavior change in the uncontended fast path.

How it works

  1. Open the lock file and try try_lock_exclusive non-blocking.
  2. On success: stamp our PID into a new sidecar relay.lock.owner file. The sidecar is intentionally separate from the flock file — Windows LockFileEx denies reads to other handles against the locked file, so a waiter cannot read a "who owns this?" PID body off of relay.lock itself. The sidecar is plain-text and never byte-range-locked, so any waiter can read it without contending for the flock.
  3. On contention: read the sidecar, consult classify_contention:
    • Dead/absent owner → retry after a 1ms sleep. The OS released the underlying flock when the owning PID exited, so the next try should succeed unless another waiter wins the race.
    • Live owner → exponential backoff (10ms → 200ms) until WIRE_RELAY_LOCK_TIMEOUT_SECS (default 10s) elapses, then fail with the holder PID surfaced in the error so wire doctor and the SessionStart loop can name a target to kill.
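
The steps above can be sketched as a small retry loop. This is a minimal illustration, not the code from this PR: `acquire_bounded`, `Holder`, `next_backoff`, and `lock_timeout` are hypothetical names, and the real `try_lock_exclusive` call and sidecar read are abstracted into closures so the sketch needs only the standard library. It also assumes the deadline applies to both branches, which the description does not spell out.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// What a waiter learned about the current holder (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Holder {
    DeadOrAbsent,
    Alive(u32),
}

/// Doubles the delay, capped at 200ms (the upper bound quoted above).
pub fn next_backoff(current: Duration) -> Duration {
    (current * 2).min(Duration::from_millis(200))
}

/// Reads WIRE_RELAY_LOCK_TIMEOUT_SECS, falling back to the 10s default.
pub fn lock_timeout() -> Duration {
    let secs = std::env::var("WIRE_RELAY_LOCK_TIMEOUT_SECS")
        .ok()
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(10);
    Duration::from_secs(secs)
}

/// Retries `try_lock` until it succeeds or `timeout` elapses.
/// `try_lock` stands in for a non-blocking lock attempt;
/// `probe_holder` stands in for "read the sidecar and classify it".
pub fn acquire_bounded(
    mut try_lock: impl FnMut() -> bool,
    mut probe_holder: impl FnMut() -> Holder,
    timeout: Duration,
) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    let mut backoff = Duration::from_millis(10);
    loop {
        if try_lock() {
            return Ok(());
        }
        let holder = probe_holder();
        if Instant::now() >= deadline {
            // Surface the holder PID so the caller can name a process to kill.
            return Err(match holder {
                Holder::Alive(pid) => {
                    format!("relay.lock still held by live PID {pid} after {timeout:?}")
                }
                Holder::DeadOrAbsent => {
                    format!("relay.lock not acquired within {timeout:?}; no live owner recorded")
                }
            });
        }
        match holder {
            // Owner is gone: the OS lock should already be free, so retry quickly.
            Holder::DeadOrAbsent => sleep(Duration::from_millis(1)),
            // Owner is alive: wait with exponential backoff, 10ms up to 200ms.
            Holder::Alive(_) => {
                sleep(backoff);
                backoff = next_backoff(backoff);
            }
        }
    }
}
```

A caller would pass `lock_timeout()` as the third argument. Because the last sleep can overshoot the deadline, the total wait in this sketch may exceed the timeout by up to one backoff interval.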

The classification logic (classify_contention) is split out as a pure function over (body: &[u8], is_alive: impl Fn(u32) -> bool), so the dead-PID-says-reclaim / live-PID-says-wait policy is unit-testable on every platform without spinning real subprocesses.
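
As a rough illustration of that shape, the sketch below takes the sidecar body and an injected liveness probe and returns a verdict. Only the argument list comes from the description above; the `Contention` enum and its variants are assumptions made for this example, since the PR text does not show the real return type.

```rust
/// Hypothetical verdict type; the real return type is not shown in the PR text.
#[derive(Debug, PartialEq, Eq)]
pub enum Contention {
    /// Owner is dead, absent, or the body is unparseable: retry right away.
    Reclaim,
    /// Owner PID is alive: back off and remember who holds the lock.
    Wait(u32),
}

/// Pure policy: no filesystem or process access, so it runs on any platform.
pub fn classify_contention(body: &[u8], is_alive: impl Fn(u32) -> bool) -> Contention {
    // Non-UTF-8, empty, or non-numeric bodies all collapse to `None`.
    let pid = std::str::from_utf8(body)
        .ok()
        .map(str::trim)
        .and_then(|s| s.parse::<u32>().ok());
    match pid {
        Some(pid) if is_alive(pid) => Contention::Wait(pid),
        _ => Contention::Reclaim,
    }
}
```

Injecting `is_alive` as a closure is what lets a test pass `|_| false` or `|_| true` instead of spawning and killing a real process.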

Why the sidecar (and not the lock file body)

The first cut wrote the PID directly into relay.lock's body. On Windows that fails: LockFileEx is byte-range-mandatory against any other handle, so a waiter's fs::read returns ERROR_LOCK_VIOLATION (33). POSIX byte-range locks are advisory, so it works on Linux/macOS — but the half-platform answer is a footgun. Splitting the owner PID into a sidecar gives every platform the same "owner is dead → reclaim immediately" path and keeps the flock file purely an OS-level lock token.
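
A minimal sketch of the sidecar idea, using only the standard library. The helper names `owner_sidecar`, `stamp_owner`, and `read_owner` are hypothetical and may not match the functions in this PR; the only detail taken from the description is that the owner PID lives in plain text at `relay.lock.owner`, next to the lock file.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Derives the sidecar path by appending ".owner" to the lock file path.
pub fn owner_sidecar(lock_path: &Path) -> PathBuf {
    let mut name = lock_path.as_os_str().to_os_string();
    name.push(".owner");
    PathBuf::from(name)
}

/// Called by the holder right after a successful acquire: record our PID.
pub fn stamp_owner(lock_path: &Path) -> io::Result<()> {
    fs::write(owner_sidecar(lock_path), std::process::id().to_string())
}

/// Called by a waiter. The sidecar is never locked, so this read does not
/// contend with the holder. Any read or parse failure means "no known owner".
pub fn read_owner(lock_path: &Path) -> Option<u32> {
    let body = fs::read_to_string(owner_sidecar(lock_path)).ok()?;
    body.trim().parse().ok()
}
```

Treating a missing or unreadable sidecar as "no known owner" keeps the waiter on the reclaim path rather than turning a stale file into a hard error.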

Tests

8 new, all green on x86_64-pc-windows-msvc (rustc 1.96.0):

Pure-logic (classify_contention):

  • classify_contention_dead_pid_says_reclaim
  • classify_contention_live_pid_says_wait
  • classify_contention_empty_body_says_reclaim
  • classify_contention_garbage_body_says_reclaim
  • classify_contention_trims_whitespace

Integration (acquire_relay_lock vs real fs2):

  • acquire_relay_lock_stamps_our_pid_into_owner_sidecar
  • acquire_relay_lock_reclaims_when_owner_pid_is_dead — writes u32::MAX into the sidecar (never an assigned PID on Linux + Windows), confirms acquire wins well under the timeout.
  • acquire_relay_lock_times_out_when_owner_is_alive — holds the lock from inside the test, writes our own (live) PID into the sidecar, confirms acquire respects the bounded timeout AND surfaces the holder PID in the error.

Full lib suite: 494 passed; 0 failed; 7 ignored.

Out of scope (left for follow-ups)

This PR turns the symptom (every wire command blocks forever once one holder wedges) into a bounded, diagnosable failure mode. #284.1 + #284.2 then attack the trigger.

Stack

This PR stacks on top of #294 (Windows test/clippy hygiene — three tiny pre-existing breakages that block any local cargo test --lib / cargo clippy -- -D warnings on MSVC). When #294 lands I'll rebase this onto main; in the meantime CI should be fine since #294 only touches test/scaffold paths.

Test plan

🤖 Generated with Claude Code

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 15, 2026

Deploying wireup-landing with Cloudflare Pages

Latest commit: 21e6f22
Status: ✅  Deploy successful!
Preview URL: https://7fd730db.wireup-landing.pages.dev
Branch Preview URL: https://fix-284-5-stale-relay-lock-r.wireup-landing.pages.dev

@laulpogan force-pushed the fix/284-5-stale-relay-lock-reclaim branch from 4ea64ed to 4589b67 on June 15, 2026 00:37
…#284.5)

Issue #284 part 5 (from Willard's Windows report): a hung `wire daemon` —
or any wire process stuck in a relay long-poll (#284.1) — would hold
`relay.lock` forever, since the kernel only releases the flock at PID
exit and the wedged process never exited. Every subsequent `wire status`
/ `wire send` / `wire daemon` then blocked on `lock_exclusive()`
indefinitely. On Windows that was the engine behind Willard's
254-`wire.exe`-process pile-up: the SessionStart `until wire status …`
loop kept spawning fresh status invocations every 3s, each one blocking
forever on the held lock.

Acquisition is now bounded + reclaim-aware:

  1. Open the lock file and try `try_lock_exclusive` non-blocking.
  2. On success: stamp our PID into a new sidecar `relay.lock.owner`
     file. The sidecar is intentionally separate from the flock file —
     Windows `LockFileEx` denies reads to other handles against the
     locked file, so a waiter cannot read a "who owns this?" PID body
     off of `relay.lock` itself. The sidecar is plain-text and never
     byte-range-locked, so any waiter can read it without contending
     for the flock.
  3. On contention: read the sidecar, consult `classify_contention`.
     - Dead/absent owner → retry immediately with a 1ms sleep. The OS
       has already released the underlying flock when the owning PID
       exited; the next try will succeed.
     - Live owner → exponential backoff (10ms → 200ms) until
       `WIRE_RELAY_LOCK_TIMEOUT_SECS` (default 10s) elapses, then
       fail with the holder PID surfaced in the error so `wire doctor`
       and the SessionStart loop can name a target to kill.

The classification logic (`classify_contention`) is split out as a pure
function over `(body: &[u8], is_alive: impl Fn(u32) -> bool)`, so the
dead-PID-says-reclaim / live-PID-says-wait policy is unit-testable on
every platform without spinning real subprocesses. Integration tests
cover the stamps-PID-on-acquire, reclaim-when-owner-dead, and
times-out-with-PID-when-owner-alive paths against the actual fs2 flock
implementation.

No behavior change in the uncontended fast path. The Windows lock
reclaim relies on `crate::platform::process_alive`, which already has
the Windows `tasklist`-based implementation that v0.7.3 hardened.

Tests: 8 new (5 pure-logic `classify_contention_*`, 3 integration
`acquire_relay_lock_*`), all green on `x86_64-pc-windows-msvc`
(rustc 1.96.0). Full lib suite: 494 passed; 0 failed.

Stacks on top of #294 (Windows test/clippy hygiene); rebase onto main
once #294 lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: laul.pogan <paul@zaibatsuheavy.industries>

@laulpogan force-pushed the fix/284-5-stale-relay-lock-reclaim branch from 4589b67 to 21e6f22 on June 15, 2026 01:48
@laulpogan merged commit eb4012a into main on Jun 15, 2026
12 checks passed
@laulpogan deleted the fix/284-5-stale-relay-lock-reclaim branch on June 15, 2026 01:53