Skip to content

Fix four file-watcher crash signatures (kevent panic, dangling watch paths, unlocked Windows scan, watchFile join overflow)#31695

Open
robobun wants to merge 5 commits into
mainfrom
farm/6cadb927/fix-watcher-crashes
Open

Fix four file-watcher crash signatures (kevent panic, dangling watch paths, unlocked Windows scan, watchFile join overflow)#31695
robobun wants to merge 5 commits into
mainfrom
farm/6cadb927/fix-watcher-crashes

Conversation

@robobun

@robobun robobun commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Fixes four crash signatures reported from 1.4 canary, all in or around bun_watcher. None have Zig analogs: two are checked-Rust versions of silent Zig UB, two are port-introduced lifetime/locking gaps.

1. BUN-3AVT — panic int cast: TryFromIntError(NegOverflow) in the watcher loop (macOS/FreeBSD)

Cause: KEventWatcher::watch_loop_cycle called raw kevent(2) and did usize::try_from(count).expect("int cast") on its return value. kevent returns -1 on failure (EINTR being the everyday case), so the watcher thread panicked. Zig's @intCast/@max(0, count) let the same -1 degrade into an empty event batch.

Fix: use the existing EINTR-retrying bun_sys::kevent wrapper. Real errors now propagate to watch_loopon_error instead of panicking; the 0.1 ms coalescing re-read is best-effort (deliver what we already have, a persistent kqueue error resurfaces on the next blocking call).

2. BUN-3EAD — SIGSEGV in DevServer::on_file_updateappend_dir/append_file

Cause: BundleV2::on_parse_task_complete appended watcher entries with CLONE_FILE_PATH = cfg!(windows), i.e. borrowed on POSIX. The borrow is only sound for paths interned process-lifetime in FilenameStore/DirnameStore. dupe_alloc's disjoint text/pretty branch (imports from outside the project root, whose pretty path starts with ../, and bake's ssr:-prefixed pretty paths — i.e. every SSR-graph module) instead allocates path.text in the per-build bundle arena (graph.heap), with an explicit comment that the consuming Path must not outlive the build. DevServer frees that arena at the end of every build, while the watcher keeps the pointer for the process lifetime and reads it on the watcher thread on every event for that file (and for its auto-watched parent directory — the append_dir flavor of the crash).

Fix: always clone the path into the watchlist (add_file::<true>); cloning only happens the first time a file is added (add_file early-returns on a hash hit).

Repro: test/bake/dev/bundle.test.ts "hot reload of a file imported from outside the project root". On an unfixed ASAN debug build the dev server dies with AddressSanitizer: SEGV ... READ memory access on the FileWatcher thread, 3/3 runs (a few rebuilds first so the freed arena pages get recycled); fixed build hot-reloads the file, 3/3 runs.

3. BUN-3D64 — SIGSEGV in is_parent_or_equal from the Windows watcher

Cause: windows_watcher::watch_loop_cycle scanned watchlist.items_file_path() without this.mutex, while the JS thread / transpiler workers / bundle thread append entries under that mutex. MultiArrayList::set_capacity frees the old backing slab on growth, so the unlocked scan could read a freed slab and hand is_parent_or_equal a garbage Cow pointer. INotifyWatcher already carries the locked-snapshot divergence for exactly this race (its PORT NOTE documents the realloc crash); the Windows backend never got it. Zig scanned unlocked too — same race, but Zig's column snapshot + interned strings made it quieter.

Fix: hold this.mutex for the watchlist scan, dropping/re-acquiring it around process_watch_event_batch (which locks the same non-recursive mutex internally). The lock is never held across a blocking wait: platform.next() (GetQueuedCompletionStatus) runs before the guard is taken, and iter.next() only parses the already-filled completion buffer. The item index is re-checked against the live length after relock since on_file_update can evict + compact.

4. BUN-3C2V — panic in slice inside NodeFS::watch_file

Cause: StatWatcher::init (fs.watchFile) joined the user path with the cwd via the unchecked join_abs_string_buf into a fixed pooled PathBuffer; when the joined result exceeds the buffer the join panics on a slice write. The sibling FSWatcher::init (fs.watch) was already hardened to join_abs_string_buf_checked + ENAMETOOLONG for exactly this shape; watchFile was missed. (Current fs.watchFile pre-resolves in JS and byte-validates in PathLike::from_js, which front-runs the overflow for the inputs I could construct — so this one is belt-and-braces at the exact panic-capable site rather than a reproducible-on-HEAD fix.)

Fix: use the checked join; on overflow fall back to a heap scratch (sized like join_abs_string_buf_checked's slow path) so the watcher is still created and its stat polls fail with ENAMETOOLONG → the listener observes zeroed stats, matching Node (which never throws for un-stat-able watchFile paths, verified against v24).

Verification

  • test/bake/dev/bundle.test.ts -t "outside the project root": fails 3/3 (ASAN SEGV on the watcher thread) with src/ stashed, passes 3/3 with the fix.
  • Full test/bake/dev/bundle.test.ts (21), test/bake/dev/hot.test.ts (9), test/js/node/watch/fs.watchFile.test.ts (7+2 skip) pass. (fs.watch.test.ts's two EACCES tests fail in this container because it runs as root — pre-existing, unrelated.)
  • cargo check/clippy -p bun_watcher clean for linux, darwin and windows-msvc targets; clippy -p bun_bundler -p bun_runtime clean.
  • Fix ?? operator  #1 and Copy source lines when generating error messages #3 are macOS-/Windows-only code paths that can't be exercised from this Linux container; existing watcher/hot-reload suites cover them on those CI lanes.

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 67075ae1-0122-416a-a4bb-1bcdf50fa16f

📥 Commits

Reviewing files that changed from the base of the PR and between 73ab3dd and cf75042.

📒 Files selected for processing (5)
  • src/bundler/bundle_v2.rs
  • src/runtime/node/node_fs_stat_watcher.rs
  • src/watcher/KEventWatcher.rs
  • src/watcher/WindowsWatcher.rs
  • test/bake/dev/bundle.test.ts

Walkthrough

This PR hardens the watcher subsystem across four critical paths: ensuring all watcher paths are cloned to prevent use-after-free, handling buffer overflows gracefully in path normalization, eliminating unsafe FFI calls in kqueue polling, and protecting watchlist traversal with proper locking on Windows. A regression test validates the fixes for outside-root dependency hot reloading.

Changes

Watcher Safety and Concurrency Improvements

Layer / File(s) Summary
Force watcher path cloning for all platforms
src/bundler/bundle_v2.rs
File path registration now unconditionally clones paths for the watcher thread, changing from conditional cfg!(windows) to always-true cloning to ensure safe ownership and eliminate lifetime issues.
Stat watcher buffer overflow fallback
src/runtime/node/node_fs_stat_watcher.rs
Path normalization uses checked absolute-path joining with fallback heap allocation when the pooled buffer would overflow, preventing panics on oversized inputs.
KEvent watcher unsafe FFI elimination
src/watcher/KEventWatcher.rs
Polling logic refactored to call the bun_sys::kevent wrapper instead of direct unsafe syscalls, with error propagation via ? for real errors, best-effort coalescing, and RAII flush_guard for automatic output flushing.
Windows watcher mutex-guarded scan
src/watcher/WindowsWatcher.rs
Watch event scanning now acquires a mutex guard for consistent watchlist snapshots, uses a dynamic while loop to tolerate list changes between batch flushes, and explicitly drops/reacquires the lock around batch processing to avoid deadlock.
Regression test for outside-root hot reload
test/bake/dev/bundle.test.ts
New DevServer test exercises hot reloading when a route imports from a dependency outside the project root, with assertions on response value changes across file rewrites and dependency updates.

Possibly Related PRs

  • oven-sh/bun#31106: Changes watcher registration to force path ownership via unconditional cloning in add_directory::<true> for dev-server directory registration, mirroring the same path-lifetime fix applied here to add_file.

Suggested Reviewers

  • RiskyMH
  • dylan-conway
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the four distinct crash signatures being fixed: kevent panic, dangling watch paths, Windows unlocked scan, and watchFile join overflow.
Description check ✅ Passed The description exceeds template requirements with comprehensive detail on each crash cause, fix, and verification for all four issues, though it deviates from the minimal two-section template.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the claude label Jun 2, 2026
@robobun

robobun commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Found 2 issues this PR may fix:

  1. bug: when tried to run bun run dev (shadcn + tailwind) react template #21017 - Windows watcher crash with stack trace in Watcher.zig:219 threadMain; matches the BUN-3D64 fix for unlocked watchlist scan race condition
  2. bun dev server crashes when trying to restart the server under certain circumstances #23743 - Windows dev server crash in WindowsWatcher.zig:296 processWatchEventBatch / watchLoopCycle; matches the BUN-3D64 fix for the mutex race in Windows watcher

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #21017
Fixes #23743

🤖 Generated with Claude Code

@robobun

robobun commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

Checked both suggested issues — neither is a confident match, so I'm not adding Fixes lines:

@robobun

robobun commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

Heads-up: #31697 (bake dev server crash fixes) contains the same one-line add_file::<{ cfg!(windows) }>add_file::<true> change for the dangling watcher paths (BUN-3EAD here; found independently via ASAN while fuzzing the SCB crash there). The change is semantically identical in both PRs — only the comment text differs — so whichever merges second gets a trivial comment-only conflict in bundle_v2.rs. The other three fixes in this PR and the other two in #31697 are fully disjoint.

#31697 keeps its copy of the hunk because its regression test ("hot update touching multiple client component boundaries") exercises file edits under separateSSRGraph and would be killed by the watcher UAF regardless of merge order.

Comment thread test/bake/dev/bundle.test.ts
Comment thread src/watcher/WindowsWatcher.rs
Comment thread src/watcher/KEventWatcher.rs

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my earlier findings are addressed and nothing new surfaced, but the Windows watcher's drop/re-acquire of the non-recursive mutex mid-scan and the kevent error-propagation change are concurrency-sensitive enough (and unverifiable from the Linux env) that a human pass on the lock discipline is worth having.

Extended reasoning...

Overview

Four independent crash fixes in bun_watcher plus one regression test: (1) KEventWatcher::watch_loop_cycle switches from raw kevent(2) + usize::try_from(count).expect(...) to the EINTR-retrying bun_sys::kevent wrapper, with the coalescing re-read made best-effort and Output::flush moved to an RAII guard; (2) bundle_v2.rs flips add_file::<{cfg!(windows)}> to add_file::<true> so watcher paths are always cloned rather than borrowed from the per-build arena; (3) WindowsWatcher::watch_loop_cycle now holds this.mutex over the watchlist scan, dropping/re-acquiring around process_watch_event_batch and converting the inner for to a while that re-reads the live length; (4) StatWatcher::init switches to join_abs_string_buf_checked with a heap-scratch fallback. A new bake dev test covers (2), now skip: ["win32"] per my earlier comment.

Security risks

None identified. No auth/crypto/permissions surface; the changes harden against panics and use-after-free in the watcher thread. The new locking cannot deadlock against process_watch_event_batch because the guard is explicitly dropped first, and is never held across the blocking platform.next() (taken after it returns). The watchFile overflow fallback allocates a bounded heap scratch (cwd.len + slice.len + 3) and is not attacker-amplifiable beyond what PathLike::from_js already admits.

Level of scrutiny

Medium-high. This is production watcher/hot-reload code that runs on a dedicated thread and coordinates with the JS thread, transpiler workers, and the bundle thread via a non-recursive mutex. Fix (3) in particular changes lock scope and introduces a drop/relock pattern whose correctness rests on invariants spelled out in comments (lock not held across blocking wait; item_idx re-checked against live length after relock; documented swap-behind-cursor trade-off). Fix (1) changes error semantics on the macOS/FreeBSD path. Both are platform-gated and cannot be exercised from the Linux container the PR was verified in — they rely on the macOS/Windows CI lanes.

Other factors

I previously flagged three issues on this PR (Windows test hang, swap-remove scan-skip trade-off, stale flush-justification comment); all were addressed in 663a055 and bb488e3 and the threads are resolved. The current bug-hunting pass found nothing further. The robobun build comment shows a musl LTO link failure that is an infrastructure issue unrelated to these source changes. Given the concurrency sensitivity and the two unverifiable-here platform paths, I'm deferring rather than approving — not because I see a problem, but because mutex drop/re-acquire patterns in a hot watcher loop are exactly where a human reviewer's pass on lock discipline pays for itself.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three of my earlier inline concerns are addressed (win32 skip, scan-compaction trade-off documented, RAII flush guard). I'm not auto-approving because the Windows watcher lock drop/re-acquire around process_watch_event_batch and the kevent error-propagation change are concurrency-sensitive and couldn't be exercised on their target platforms from the author's environment — worth human eyes.

Extended reasoning...

Overview

Four independent crash fixes in the file-watcher stack: (1) KEventWatcher::watch_loop_cycle switches from a raw kevent(2) call whose -1 return panicked a checked usize::try_from to the EINTR-retrying bun_sys::kevent wrapper with ? propagation; (2) bundle_v2.rs flips add_file::<{cfg!(windows)}> to add_file::<true> so watcher entries always clone the path instead of borrowing arena-backed bytes that DevServer frees per build; (3) WindowsWatcher::watch_loop_cycle now holds this.mutex while scanning watchlist.items_file_path(), dropping and re-acquiring it around the mid-batch process_watch_event_batch flush (which locks the same non-recursive mutex internally), and converts the for 0..n_items loop into a while item_idx < live_len loop; (4) StatWatcher::init uses join_abs_string_buf_checked with a heap-scratch fallback so an over-long cwd-joined path creates a watcher whose stat polls fail with ENAMETOOLONG instead of panicking. A new bake regression test covers (2) on POSIX. I previously flagged three issues (Windows test hang, swap_remove-behind-cursor skip, stale flush justification comment); all three were addressed in 663a055 / bb488e3.

Security risks

None apparent. No auth/crypto/permissions surface; the changes tighten lifetime and locking discipline in an internal watcher subsystem. The path-join fallback allocates a bounded Vec (cwd.len() + slice.len() + 3) and does not expose any new untrusted-input parsing.

Level of scrutiny

High. This is core infrastructure (dev-server hot reload, --watch, fs.watchFile) and three of the four fixes are concurrency- or lifetime-sensitive. The WindowsWatcher change in particular introduces a lock → drop → callback-that-relocks → re-acquire → continue pattern with a documented best-effort trade-off; correctness depends on process_watch_event_batch being the only thing that needs the lock dropped, on platform.next() / iter.next() not blocking under the lock, and on the re-read length guard being sufficient after compaction. The KEventWatcher change alters error semantics (propagate vs. swallow) for the watch loop. The author explicitly notes that (1) and (3) target macOS/Windows code paths that could not be run in their Linux verification environment, so CI on those lanes is the first real exercise.

Other factors

  • All of my earlier inline findings are resolved and the resolutions look correct in the current diff (skip: ["win32"] is present, the swap-behind-cursor comment is in place, Output::flush_guard() replaces the trailing flush + stale comment).
  • The bug-hunting system found nothing new on the current revision.
  • The robobun CI comment shows failures on bb488e3 followed by a ci: retrigger commit; I can't see the post-retrigger result, so a human should confirm green CI on macOS/Windows lanes before merge.
  • There is a known one-line overlap with PR #31697 in bundle_v2.rs (same add_file::<true> fix), noted by the author — comment-only conflict for whichever lands second.

Given the concurrency subtlety and the untested-on-target-platform caveat, this should get a human reviewer rather than an auto-approval.

@robobun

robobun commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

CI status for maintainers: the diff is green on every lane that ran — 70/73 checks pass, including all build lanes on every platform (darwin rust/cpp/bun builds compile the KEventWatcher change), every Windows test lane (exercises the new WindowsWatcher lock discipline via the watch/hot suites), and the Linux ASAN lane (runs the new bake regression test that reproduces the BUN-3EAD use-after-free).

The only 3 failures — darwin-14-aarch64-test-bun, darwin-14-x64-test-bun, darwin-26-aarch64-test-bun — are jobs that Expired in the Buildkite queue without running a single test, on two consecutive builds (#59828 and the #59862 retrigger). That's darwin agent starvation, not a test failure; there is nothing in the diff those lanes could be reacting to. They need a manual job retry once the darwin fleet has capacity, or a merge decision without them.

robobun and others added 5 commits June 6, 2026 11:36
- KEventWatcher: kevent(2) returning -1 (e.g. EINTR) panicked the watch
  loop on usize::try_from(count); use the EINTR-retrying bun_sys::kevent
  wrapper, treat the coalescing re-read as best-effort, and propagate
  real errors.
- bundle_v2: watcher paths were appended borrowed (CLONE_FILE_PATH=false
  on POSIX). Paths from dupe_alloc's disjoint text/pretty branch (imports
  outside the project root, bake's ssr:-prefixed pretty paths) live in
  the per-build bundle arena, which DevServer frees after every build,
  leaving dangling watchlist entries the watcher thread later reads in
  on_file_update. Always clone the path into the watchlist.
- WindowsWatcher: the watchlist scan in watch_loop_cycle ran without
  this.mutex while other threads append entries under it; MultiArrayList
  growth frees the old backing slab mid-scan and is_parent_or_equal
  dereferenced freed memory. Scan under the mutex, dropping it around
  process_watch_event_batch (mirrors INotifyWatcher's locked snapshot).
- StatWatcher (fs.watchFile): switch the cwd-join to
  join_abs_string_buf_checked with a heap-scratch fallback so an
  over-long joined path polls stat (ENAMETOOLONG -> zeroed stats, like
  Node) instead of overflowing the pooled path buffer.
…n trade-off

The Windows watcher refuses outside-root paths ("will not be watched"),
so the test's dev.write would wait forever for a watch event there — and
the dangling-borrow bug it guards was POSIX-only (Windows always cloned).

Also note in WindowsWatcher::watch_loop_cycle why the scan cursor is not
rewound after a mid-batch flush compacts the watchlist via swap_remove.
@robobun robobun force-pushed the farm/6cadb927/fix-watcher-crashes branch from cbc6890 to cf75042 Compare June 6, 2026 11:43
@robobun

robobun commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

Updated CI status after the rebase (build 61112): all watcher-related lanes are green — the new bake regression test passes everywhere including linux-x64-asan, and the Windows/macOS lanes build and run their test suites against the changed code.

The remaining failures are not from this diff:

  • test/cli/install/bunx.test.ts fails identically on ~12 lanes (Windows, macOS, Linux glibc/musl, ASAN). These tests call the live npm registry and GitHub: "should handle package that requires node 24" runs bunx @angular/cli@latest --help and a new upstream release now exits 3; the "github repository" tests fetch live tarballs. Reproduced locally with the identical failure set; nothing in this PR executes on the bun x path (no watcher, no fs.watchFile).
  • test/cli/update_interactive_install.test.ts on one Windows lane, same registry-dependent suite family.
  • darwin-26-aarch64-test-bun expired in queue again (agent starvation; darwin-14 lanes ran and only hit the bunx flake).

One ci: retrigger was already spent on the earlier darwin queue expiry, so I'm not re-rolling again. Needs either a maintainer retry once the upstream flake settles, or a merge decision based on the watcher-relevant lanes being green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant