fix(config): bound relay.lock acquisition + reclaim from stale owners (#284.5) by laulpogan · Pull Request #296 · SlanchaAi/wire

laulpogan · 2026-06-14T23:14:10Z

Part 5 of #284 (Willard's Windows report).

Summary

A hung wire daemon — or any wire process stuck in a relay long-poll (#284.1) — held relay.lock forever, since the kernel only releases the flock at PID exit and the wedged process never exited. Every subsequent wire status / wire send / wire daemon then blocked on lock_exclusive() indefinitely. On Windows this was the engine behind Willard's 254-wire.exe-process pile-up: the SessionStart until wire status … loop kept spawning fresh status invocations every 3s, each one blocking forever on the held lock.

Acquisition is now bounded + reclaim-aware. No behavior change in the uncontended fast path.

How it works

Open the lock file and try try_lock_exclusive non-blocking.
On success: stamp our PID into a new sidecar relay.lock.owner file. The sidecar is intentionally separate from the flock file — Windows LockFileEx denies reads to other handles against the locked file, so a waiter cannot read a "who owns this?" PID body off of relay.lock itself. The sidecar is plain-text and never byte-range-locked, so any waiter can read it without contending for the flock.
On contention: read the sidecar, consult classify_contention:
- Dead/absent owner → retry immediately with a 1ms sleep. The OS has already released the underlying flock when the owning PID exited; the next try will succeed.
- Live owner → exponential backoff (10ms → 200ms) until WIRE_RELAY_LOCK_TIMEOUT_SECS (default 10s) elapses, then fail with the holder PID surfaced in the error so wire doctor and the SessionStart loop can name a target to kill.

The classification logic (classify_contention) is split out as a pure function over (body: &[u8], is_alive: impl Fn(u32) -> bool), so the dead-PID-says-reclaim / live-PID-says-wait policy is unit-testable on every platform without spinning real subprocesses.

Why the sidecar (and not the lock file body)

The first cut wrote the PID directly into relay.lock's body. On Windows that fails: LockFileEx is byte-range-mandatory against any other handle, so a waiter's fs::read returns ERROR_LOCK_VIOLATION (33). POSIX byte-range locks are advisory, so it works on Linux/macOS — but the half-platform answer is a footgun. Splitting the owner PID into a sidecar gives every platform the same "owner is dead → reclaim immediately" path and keeps the flock file purely an OS-level lock token.

Tests

8 new, all green on x86_64-pc-windows-msvc (rustc 1.96.0):

Pure-logic (classify_contention):

classify_contention_dead_pid_says_reclaim
classify_contention_live_pid_says_wait
classify_contention_empty_body_says_reclaim
classify_contention_garbage_body_says_reclaim
classify_contention_trims_whitespace

Integration (acquire_relay_lock vs real fs2):

acquire_relay_lock_stamps_our_pid_into_owner_sidecar
acquire_relay_lock_reclaims_when_owner_pid_is_dead — writes u32::MAX into the sidecar (never an assigned PID on Linux + Windows), confirms acquire wins well under the timeout.
acquire_relay_lock_times_out_when_owner_is_alive — holds the lock from inside the test, writes our own (live) PID into the sidecar, confirms acquire respects the bounded timeout AND surfaces the holder PID in the error.

Full lib suite: 494 passed; 0 failed; 7 ignored.

Out of scope (left for follow-ups)

trust.lock (Org membership: no member-cert expiry + no key rotation; write_trust not atomic #246) has the same acquisition pattern. Migrating it to acquire_lock would be a natural follow-up but isn't strictly needed for the Windows: status/up/doctor + upgrade hang, SessionStart wait-loop pileup (254 procs), and wire_send peer_unknown on VERIFIED peer #284 cluster — keeping this PR narrowly to the relay-lock path saffron called out.
The "spin up 254 procs in the first place" loop is Windows: status/up/doctor + upgrade hang, SessionStart wait-loop pileup (254 procs), and wire_send peer_unknown on VERIFIED peer #284.2 (a separate PR on bounding the SessionStart until … wait-loop).
The "long-poll hangs the daemon in the first place" root cause is Windows: status/up/doctor + upgrade hang, SessionStart wait-loop pileup (254 procs), and wire_send peer_unknown on VERIFIED peer #284.1.

This PR turns the symptom (every wire command blocks forever once one holder wedges) into a bounded, diagnosable failure mode. #284.1 + #284.2 then attack the trigger.

Stack

This PR stacks on top of #294 (Windows test/clippy hygiene — three tiny pre-existing breakages that block any local cargo test --lib / cargo clippy -- -D warnings on MSVC). When #294 lands I'll rebase this onto main; in the meantime CI should be fine since #294 only touches test/scaffold paths.

Test plan

cargo fmt --check clean on Windows.
cargo clippy --all-targets -- -D warnings clean on Windows.
cargo test --lib 494/0/7 on Windows.
CI green (install-smoke-windows, demo, docs-lint).
Manual repro: stage a held lock with a dead PID in the sidecar, confirm next wire status reclaims immediately. (Difficult to fully simulate on a hung-but-alive daemon without Windows: status/up/doctor + upgrade hang, SessionStart wait-loop pileup (254 procs), and wire_send peer_unknown on VERIFIED peer #284.1, but the unit + integration coverage exercises both branches of the policy.)

🤖 Generated with Claude Code

cloudflare-workers-and-pages · 2026-06-15T00:22:31Z

Deploying wireup-landing with Cloudflare Pages

Latest commit:	`21e6f22`
Status:	✅ Deploy successful!
Preview URL:	https://7fd730db.wireup-landing.pages.dev
Branch Preview URL:	https://fix-284-5-stale-relay-lock-r.wireup-landing.pages.dev

View logs

…#284.5) Issue #284 part 5 (from Willard's Windows report): a hung `wire daemon` — or any wire process stuck in a relay long-poll (#284.1) — would hold `relay.lock` forever, since the kernel only releases the flock at PID exit and the wedged process never exited. Every subsequent `wire status` / `wire send` / `wire daemon` then blocked on `lock_exclusive()` indefinitely. On Windows that was the engine behind Willard's 254-`wire.exe`-process pile-up: the SessionStart `until wire status …` loop kept spawning fresh status invocations every 3s, each one blocking forever on the held lock. Acquisition is now bounded + reclaim-aware: 1. Open the lock file and try `try_lock_exclusive` non-blocking. 2. On success: stamp our PID into a new sidecar `relay.lock.owner` file. The sidecar is intentionally separate from the flock file — Windows `LockFileEx` denies reads to other handles against the locked file, so a waiter cannot read a "who owns this?" PID body off of `relay.lock` itself. The sidecar is plain-text and never byte-range-locked, so any waiter can read it without contending for the flock. 3. On contention: read the sidecar, consult `classify_contention`. - Dead/absent owner → retry immediately with a 1ms sleep. The OS has already released the underlying flock when the owning PID exited; the next try will succeed. - Live owner → exponential backoff (10ms → 200ms) until `WIRE_RELAY_LOCK_TIMEOUT_SECS` (default 10s) elapses, then fail with the holder PID surfaced in the error so `wire doctor` and the SessionStart loop can name a target to kill. The classification logic (`classify_contention`) is split out as a pure function over `(body: &[u8], is_alive: impl Fn(u32) -> bool)`, so the dead-PID-says-reclaim / live-PID-says-wait policy is unit-testable on every platform without spinning real subprocesses. Integration tests cover the stamps-PID-on-acquire, reclaim-when-owner-dead, and times-out-with-PID-when-owner-alive paths against the actual fs2 flock implementation. No behavior change in the uncontended fast path. The Windows lock reclaim relies on `crate::platform::process_alive`, which already has the Windows `tasklist`-based implementation that v0.7.3 hardened. Tests: 8 new (5 pure-logic `classify_contention_*`, 3 integration `acquire_relay_lock_*`), all green on `x86_64-pc-windows-msvc` (rustc 1.96.0). Full lib suite: 494 passed; 0 failed. Stacks on top of #294 (Windows test/clippy hygiene); rebase onto main once #294 lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: laul.pogan <paul@zaibatsuheavy.industries>

loganclaw9000 mentioned this pull request Jun 15, 2026

🤖 hourly mesh-triage log SlanchaAi/slancha-mesh#30

Open

laulpogan force-pushed the fix/284-5-stale-relay-lock-reclaim branch from 3a99093 to 4ea64ed Compare June 15, 2026 00:22

laulpogan force-pushed the fix/284-5-stale-relay-lock-reclaim branch from 4ea64ed to 4589b67 Compare June 15, 2026 00:37

laulpogan force-pushed the fix/284-5-stale-relay-lock-reclaim branch from 4589b67 to 21e6f22 Compare June 15, 2026 01:48

laulpogan merged commit eb4012a into main Jun 15, 2026
12 checks passed

laulpogan deleted the fix/284-5-stale-relay-lock-reclaim branch June 15, 2026 01:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(config): bound relay.lock acquisition + reclaim from stale owners (#284.5)#296

fix(config): bound relay.lock acquisition + reclaim from stale owners (#284.5)#296
laulpogan merged 1 commit into
mainfrom
fix/284-5-stale-relay-lock-reclaim

laulpogan commented Jun 14, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

laulpogan commented Jun 14, 2026

Summary

How it works

Why the sidecar (and not the lock file body)

Tests

Out of scope (left for follow-ups)

Stack

Test plan

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying wireup-landing with Cloudflare Pages

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cloudflare-workers-and-pages Bot commented Jun 15, 2026 •

edited

Loading