fs: reuse DirEntry in bustEntriesCache instead of orphaning it#29919
fs: reuse DirEntry in bustEntriesCache instead of orphaning it#29919robobun wants to merge 2 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughDirectory-entry caching gains a ChangesDirectory-entry stale invalidation + Router reload memory fix
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Found 4 issues this PR may fix:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
The fix looks correct and well-reasoned to me, but since it changes cache-invalidation semantics in the resolver/fs directory cache (shared by HMR, watch, bundler, and install), I'd like a human to confirm the stale-flag approach has no edge cases I'm missing.
Extended reasoning...
Overview
This PR fixes a memory leak in RealFS.bustEntriesCache() by replacing entries.remove(path) with a stale flag on the existing DirEntry, so the next read takes the pre-existing in_place re-scan path (reusing the DirEntry heap allocation, EntryMap, and Entry/FilenameStore slots) instead of orphaning them and allocating fresh ones. It threads the stale check through the three cache-hit sites (fs.zig:entriesAt, fs.zig:readDirectoryWithIterator, and two spots in resolver.zig). It also fixes a real use-after-free in FileSystemRouter.reload() where extension strings were shallow-duped from an arena that gets freed in the same call. A new memory-growth regression test is added.
Security risks
None. This is internal cache lifecycle management with no auth/crypto/input-handling implications.
Level of scrutiny
High. bustEntriesCache is called via Resolver.bustDirCache, which is on the hot path for HMR, --watch, --hot, bun install, and FileSystemRouter.reload(). The change is semantically equivalent to the existing generation-bump path (the in_place reuse logic already exists and is exercised), and I verified that all three re-scan sites overwrite the DirEntry via ptr.* = new_entry (from DirEntry.init()), which resets stale back to false. The .err/NotFound fallback to remove() is preserved. Still, this is core resolver/fs caching shared across every build/resolve operation on every platform, so a subtle regression here would have broad impact.
Other factors
- The UAF fix in
filesystem_router.zig(deep-copying extension strings into the new arena) is straightforward and correct. - The new test follows the pattern of the adjacent
MatchedRoute.params does not leaktest (subprocess + RSS delta), with a generous 32 MB threshold vs. the ~1 MB observed post-fix. - Existing test
reload() works with new dirs/filescovers the functional correctness of the stale flag (directory is actually re-read from disk). - Only a nit was flagged (explicit test timeout, which follows local file precedent).
- No CODEOWNERS cover these files.
|
Heads-up: the |
bcd3e71 to
ba8181e
Compare
StatusLatest: Rebase notes:
Fix (
|
| RSS growth | |
|---|---|
| without fix | ~65+ MB, eventually aborts (BSSMap overflow list exhausted) |
| with fix | ~17 MB |
Re-ran test/js/bun/util/filesystem_router.test.ts after this rebase: 23/24 pass including my leak test, and MatchedRoute.params does not leak (previously failing on plain main in debug builds) now passes — that main regression has been fixed. The single local failure is #31221's hash-collision test timing out on my slow debug box only: its pure-JS 600k-iteration wyhash search alone takes ~6.4 s there, past the 5 s default timeout, before any router code runs. It passes in CI.
Notes
- Deferred follow-ups (pre-existing):
entries_atwriter-writer locking,FrameworkRouter.scanInnersnapshot, closing oldDirEntry.fdat the otherin_placesites understore_fd. - Previous full CI run on this diff (build 56945) was green: 242 jobs, 0 test failures. The later run (57877) was red only on the main-wide Windows
Bun.Transpiler stack overflowstest breakage, which hit every PR build at the time.
CI on the rebased push (build 61139)
The only error-level failure is test/cli/install/bunx.test.ts ("should handle package that requires node 24"), which fails identically on every current PR build (61098, 61103, 61104, 61130, 61134, 61135, 61138, ...) — it depends on a live npm package's engine requirements, so it's main-wide breakage, not this diff. One flaky-then-passed retry: test/cli/hot/hot.test.ts "deleted and rewritten" got 4 reloads instead of 3 once on debian-asan. Since this PR touches the hot reloader I checked it specifically: the test passes 8/8 locally under the ASAN debug build of this branch, the same file flakes on unrelated PR builds (61096, 61117, 61127), and it has a history of deflake commits (a04817c, 1e75cd5). The extra-reload mode is fs-event timing (delete and rewrite landing in separate watcher wakeups on a slow machine), which predates this change.
No unresolved review threads; claude[bot]'s latest pass reports no new issues and defers to human sign-off on the locking invariants. The diff itself is green — ready for human review/merge.
ba8181e to
72b3853
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/resolver/resolver.zig (1)
2972-2978:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winGuard the union tag before reading
cached_entry.entries.This stale-aware branch is missing the
cached_entry.* == .entriescheck that Line 2266 already has. If a prior lookup cached.err, Line 2973 will read the wrong tagged-union field and trip a safety check instead of falling back to the in-place refresh path.Suggested fix
if (rfs.entries.atIndex(cached_dir_entry_result.index)) |cached_entry| { - if (!cached_entry.entries.stale and cached_entry.entries.generation >= r.generation) { - dir_entries_option = cached_entry; - needs_iter = false; - } else { - in_place = cached_entry.entries; + if (cached_entry.* == .entries) { + if (!cached_entry.entries.stale and cached_entry.entries.generation >= r.generation) { + dir_entries_option = cached_entry; + needs_iter = false; + } else { + in_place = cached_entry.entries; + } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/resolver/resolver.zig` around lines 2972 - 2978, The code reads cached_entry.entries without checking the union tag, which can panic if cached_entry contains .err; update the branch around rfs.entries.atIndex(cached_dir_entry_result.index) to first test the tag (cached_entry.* == .entries) before accessing cached_entry.entries: when the tag is .entries and not stale and generation >= r.generation set dir_entries_option and needs_iter=false; when the tag is .entries but stale set in_place = cached_entry.entries; otherwise (tag != .entries) treat it as a cache miss/fallback and proceed with the in-place refresh path (do not read cached_entry.entries).src/fs.zig (1)
621-643:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve the cached directory FD when refreshing a stale entry.
Now that Line 621 routes invalidations through
entriesAt(), this helper also needs the same FD lifecycle asreadDirectoryWithIterator(). Today it always callsreaddir(false, ...)and always closeshandle, so refreshing a staleDirEntrythat already hadfdset will overwrite it with.invalidat Line 643 and leak the old descriptor. That breaks callers that rely onentries.fdafter invalidation and turns repeated busts into FD churn.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fs.zig` around lines 621 - 643, When refreshing a stale DirEntry you must preserve any previously-set directory FD instead of overwriting and leaking it: if existing.entries.fd is valid, avoid closing or discarding that descriptor when calling bun.openDirForIteration() / readdir(), and after readdir returns ensure the new_entry retains the original FD (assign new_entry.fd = existing.entries.fd when readdir did not produce a new FD) before you clear/free existing.entries.data and overwrite existing.entries.*; only close the handle if you opened a fresh descriptor for this refresh and did not reuse the existing.entries.fd.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/bun.js/api/filesystem_router.zig`:
- Around line 265-268: The new Router is being initialized with
.asset_prefix_path pointing at this.router.config.asset_prefix_path which lives
in the old arena that is freed right after reload; update the Router.init call
to deep-copy the asset prefix by duplicating
this.router.config.asset_prefix_path via the allocator (similar to how .dir is
created with allocator.dupe) so Router.config.asset_prefix_path owns its own
memory and won't dangle when the old arena is freed; ensure you use
allocator.dupe(u8, this.router.config.asset_prefix_path) (or equivalent) and
handle errors consistently with the existing catch/unreachable pattern.
In `@src/fs.zig`:
- Around line 783-801: bustEntriesCache currently mutates rfs.entries (setting
entry.entries.stale and calling rfs.entries.remove) without taking
entries_mutex, racing with readDirectoryWithIterator and dirInfoCachedMaybeLog
which expect the map to be mutex-protected; fix by either acquiring
entries_mutex at the start of bustEntriesCache and releasing it after the
mutation, or require/assert that the caller holds entries_mutex (e.g., add an
assertion using entries_mutex.isHeld() or similar) and document that contract;
ensure the lock covers both the lookup (rfs.entries.get) and the subsequent
mutation/removal to prevent races on entry.entries.stale and
rfs.entries.remove().
In `@test/js/bun/util/filesystem_router.test.ts`:
- Around line 613-628: The test currently asserts stderr is exactly empty
(expect(stderr).toBe("")) but debug/ASAN builds can emit a known warning from
the spawned process; update the assertion in the test that spawns proc (using
Bun.spawn with bunExe() and bunEnv) to either remove the strict stderr equality
check or filter out the ASAN warning string (e.g., lines containing "WARNING:
ASAN interferes with JSC signal handlers" or similar) from stderr before
asserting the remainder is empty so unrelated ASAN noise doesn't fail the
leak/regression assertions for proc, stdout, stderr, and exitCode.
---
Outside diff comments:
In `@src/fs.zig`:
- Around line 621-643: When refreshing a stale DirEntry you must preserve any
previously-set directory FD instead of overwriting and leaking it: if
existing.entries.fd is valid, avoid closing or discarding that descriptor when
calling bun.openDirForIteration() / readdir(), and after readdir returns ensure
the new_entry retains the original FD (assign new_entry.fd = existing.entries.fd
when readdir did not produce a new FD) before you clear/free
existing.entries.data and overwrite existing.entries.*; only close the handle if
you opened a fresh descriptor for this refresh and did not reuse the
existing.entries.fd.
In `@src/resolver/resolver.zig`:
- Around line 2972-2978: The code reads cached_entry.entries without checking
the union tag, which can panic if cached_entry contains .err; update the branch
around rfs.entries.atIndex(cached_dir_entry_result.index) to first test the tag
(cached_entry.* == .entries) before accessing cached_entry.entries: when the tag
is .entries and not stale and generation >= r.generation set dir_entries_option
and needs_iter=false; when the tag is .entries but stale set in_place =
cached_entry.entries; otherwise (tag != .entries) treat it as a cache
miss/fallback and proceed with the in-place refresh path (do not read
cached_entry.entries).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 282efbb6-841f-47da-8d06-959d3431a34e
📒 Files selected for processing (4)
src/bun.js/api/filesystem_router.zigsrc/fs.zigsrc/resolver/resolver.zigtest/js/bun/util/filesystem_router.test.ts
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/fs.zig (2)
621-644:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftRefresh-to-error still drops the reused
DirEntry.Both stale refresh branches clear the old
EntryMapand then route failures throughreadDirectoryError(). If that replaces the cache slot with.err/NotFound, the heap*DirEntryis no longer referenced, so delete/recreate or permission-flap loops can still growEntryStore/FilenameStoreagain on the next successful read.Also applies to: 1095-1139
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fs.zig` around lines 621 - 644, The bug is that existing.entries.data.clearAndFree(...) is called before calling readDirectoryError() and before the new DirEntry is successfully assigned, which can drop the only heap reference to the old DirEntry when readDirectoryError() replaces the cache with an error, allowing resource-growth on subsequent reads; fix by deferring clearing/freeing of existing.entries.data until after new_entry is successfully obtained and assigned (i.e., remove the pre-error clearAndFree calls and only call existing.entries.data.clearAndFree(bun.default_allocator) after existing.entries.* = new_entry), and apply the same change to the other stale-refresh branch (lines ~1095-1139) so that the old DirEntry remains referenced if readDirectoryError() winds up writing an error into the cache.
618-644:⚠️ Potential issue | 🟠 Major | ⚡ Quick winLock
entriesAt()while it rewrites cached entries.
bustEntriesCache()now serializes.stalewrites withentries_mutex, butentriesAt()still clearsexisting.entries.dataand overwritesexisting.entries.*without taking that mutex. That leaves the stale-refresh path open to lost invalidations and concurrent cache mutation. If callers are supposed to hold the lock already, please assert/document that contract here instead of leaving it implicit.🔒 Suggested fix
pub fn entriesAt(this: *RealFS, index: allocators.IndexType, generation: bun.Generation) ?*EntriesOption { + this.entries_mutex.lock(); + defer this.entries_mutex.unlock(); + var existing = this.entries.atIndex(index) orelse return null; if (existing.* == .entries) { if (existing.entries.stale or existing.entries.generation < generation) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fs.zig` around lines 618 - 644, entriesAt() updates cached entries (clearing existing.entries.data and writing existing.entries.*) without holding entries_mutex, which races with bustEntriesCache() that serializes .stale writes; fix by taking the same entries_mutex around the stale-refresh section in entriesAt() (or assert the lock is already held) so cache mutation is serialized: acquire entries_mutex before calling bun.openDirForIteration/readdir and before clearAndFree/assignment, release after assigning new_entry, and/or add an explicit debug assert in entriesAt() if the locking contract is that callers must hold entries_mutex (referencing entriesAt, bustEntriesCache, entries_mutex, existing.entries.data and existing.entries.*).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/fs.zig`:
- Around line 621-644: The bug is that existing.entries.data.clearAndFree(...)
is called before calling readDirectoryError() and before the new DirEntry is
successfully assigned, which can drop the only heap reference to the old
DirEntry when readDirectoryError() replaces the cache with an error, allowing
resource-growth on subsequent reads; fix by deferring clearing/freeing of
existing.entries.data until after new_entry is successfully obtained and
assigned (i.e., remove the pre-error clearAndFree calls and only call
existing.entries.data.clearAndFree(bun.default_allocator) after
existing.entries.* = new_entry), and apply the same change to the other
stale-refresh branch (lines ~1095-1139) so that the old DirEntry remains
referenced if readDirectoryError() winds up writing an error into the cache.
- Around line 618-644: entriesAt() updates cached entries (clearing
existing.entries.data and writing existing.entries.*) without holding
entries_mutex, which races with bustEntriesCache() that serializes .stale
writes; fix by taking the same entries_mutex around the stale-refresh section in
entriesAt() (or assert the lock is already held) so cache mutation is
serialized: acquire entries_mutex before calling bun.openDirForIteration/readdir
and before clearAndFree/assignment, release after assigning new_entry, and/or
add an explicit debug assert in entriesAt() if the locking contract is that
callers must hold entries_mutex (referencing entriesAt, bustEntriesCache,
entries_mutex, existing.entries.data and existing.entries.*).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a4a7b9b2-b800-452d-a1cd-069f268a9cad
📒 Files selected for processing (2)
src/bun.js/api/filesystem_router.zigsrc/fs.zig
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/fs.zig (1)
621-648:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winDon't free the cached directory path on a failed in-place refresh.
These two refresh paths now pass
existing.entries.dir/existing.dirback intoreaddir()for reuse, butreaddir()still doeserrdefer dir.deinit(...). If the rescan fails, that cleanup frees the borrowed path, and the subsequentreadDirectoryError(dir, err)runs with a dangling slice while the cachedDirEntrystill points at freed memory. SincebustEntriesCache()now invalidates by settingstale, transient watch/HMR read failures will hit this path regularly.🛠️ Suggested fix
fn readdir( fs: *RealFS, store_fd: bool, prev_map: ?*DirEntry.EntryMap, _dir: string, generation: bun.Generation, handle: std.fs.Dir, comptime Iterator: type, iterator: Iterator, ) !DirEntry { _ = fs; var iter = bun.iterateDir(.fromStdDir(handle)); var dir = DirEntry.init(_dir, generation); const allocator = bun.default_allocator; - errdefer dir.deinit(allocator); + errdefer { + dir.data.deinit(allocator); + if (prev_map == null) allocator.free(dir.dir); + } if (store_fd) { FileSystem.setMaxFd(handle.fd); dir.fd = .fromStdDir(handle); }As per coding guidelines, "In Zig code, be careful with allocators and use defer for cleanup."
Also applies to: 1098-1151
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/fs.zig` around lines 621 - 648, The in-place refresh path calls readdir(...) with an owned dir slice borrowed from existing.entries.dir but still lets readdir() errdefer dir.deinit(...) free that slice on error, leaving existing.entries pointing at freed memory; update the error paths so you do not free the cached path when readdir fails: when calling readdir from the refresh branches (the block using existing.entries.dir / existing.dir and function readdir), remove or conditionally skip the cleanup that deallocates the borrowed dir slice on error (i.e., avoid calling existing.entries.data.clearAndFree or dir.deinit/free of the cached path before calling readDirectoryError), and instead ensure readDirectoryError receives a valid borrowed slice or clone the path into a temporary owned buffer before calling readdir so cleanup is safe; apply the same fix to the other refresh region (around lines 1098–1151) referencing the same symbols.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/fs.zig`:
- Around line 621-648: The in-place refresh path calls readdir(...) with an
owned dir slice borrowed from existing.entries.dir but still lets readdir()
errdefer dir.deinit(...) free that slice on error, leaving existing.entries
pointing at freed memory; update the error paths so you do not free the cached
path when readdir fails: when calling readdir from the refresh branches (the
block using existing.entries.dir / existing.dir and function readdir), remove or
conditionally skip the cleanup that deallocates the borrowed dir slice on error
(i.e., avoid calling existing.entries.data.clearAndFree or dir.deinit/free of
the cached path before calling readDirectoryError), and instead ensure
readDirectoryError receives a valid borrowed slice or clone the path into a
temporary owned buffer before calling readdir so cleanup is safe; apply the same
fix to the other refresh region (around lines 1098–1151) referencing the same
symbols.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 16da5bf4-cbc9-496e-9ff0-117a43b83081
📒 Files selected for processing (2)
src/fs.zigsrc/resolver/resolver.zig
…chedMaybeLog (#30170) ## Problem `EntriesOption` is a tagged union: ```zig pub const EntriesOption = union(Tag) { entries: *DirEntry, err: DirEntry.Err, // { original_err: anyerror, canonical_error: anyerror } }; ``` When `readDirectory()` fails with a non-ENOENT error (EACCES, EMFILE, …), `readDirectoryError` stores `.err` in `rfs.entries` (fs.zig:1002). If the directory later becomes openable and `dirInfoCachedMaybeLog` processes it as queue slot [0] (the target path itself, which is never pre-checked against `rfs.entries` during queue construction), it reached: ```zig if (rfs.entries.atIndex(cached_dir_entry_result.index)) |cached_entry| { if (cached_entry.entries.generation >= r.generation) { // no tag check ``` Reading `.entries` while `.err` is active reinterprets the two `anyerror` values as a `*DirEntry` pointer and dereferences it. In debug this is a safety panic; in release it's a segfault at e.g. `0x1D401D401F8`. ## Repro ```js chmodSync(bad, 0o000); try { Bun.resolveSync('./bad/index.js', root); } catch {} // caches .err for 'bad' chmodSync(bad, 0o755); Bun.resolveSync('./bad', root); // dirInfoCached(bad): open OK, cache has .err -> crash ``` Before: ``` panic(main thread): access of union field 'entries' while field 'err' is active resolver.zig:2973:33 in dirInfoCachedMaybeLog ``` ## Fix Add the same `if (cached_entry.* == .entries)` guard that `dirInfoForResolution` already has (resolver.zig:2265). When the cached entry is `.err`, fall through with `needs_iter = true` so the directory is re-iterated now that it can be opened. ## Verification Added a test to `test/js/bun/resolve/resolve.test.ts` that reproduces the exact sequence. Runs as the current user when non-root, or drops to `nobody` via `runuser` when root (since root bypasses DAC). Skipped on Windows. - Without fix: panics with `access of union field 'entries' while field 'err' is active` - With fix: resolves to `bad/index.js` --- Note: #29919 independently adds this same guard as part of a larger memory-leak fix; this PR is the minimal targeted crash fix with a dedicated regression test. --------- Co-authored-by: robobun <robobun@users.noreply.github.com> Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
c05672a to
dd26f56
Compare
7328c03 to
87be2a7
Compare
There was a problem hiding this comment.
No new issues found and all prior review threads are resolved — but this swaps an RCU-via-leak for explicit synchronization across the watcher/JS/bundle threads on the shared rfs.entries cache, and took several rounds of race-condition fixes to settle (with two related items deferred to follow-ups), so it's worth a human pass on the concurrency reasoning before merge.
Extended reasoning...
Overview
This PR fixes a memory leak in RealFS.bustEntriesCache() where each cache bust orphaned a heap *DirEntry (plus its EntryMap) and forced append-only growth of EntryStore/FilenameStore. The fix adds a stale: bool flag to DirEntry and has bustEntriesCache mark the existing entry stale (under entries_mutex) instead of removing the key, so the next read takes the pre-existing in_place rebuild path and reuses all allocations. Touches: src/resolver/fs.zig (core cache + entriesAt fd-preservation), src/resolver/resolver.zig (two !stale guards), src/jsc/hot_reloader.zig (mutex around affected-files reads + bust reordering), src/router/router.zig and src/runtime/api/filesystem_router.zig (snapshot subdirs before recursing; deep-copy extensions/asset_prefix_path on reload), plus a new RSS-leak regression test.
Security risks
None. No auth/crypto/permissions/parsing of untrusted input is touched. The risk surface here is memory-safety (UAF / torn reads on a shared hashmap), not security.
Level of scrutiny
High. The leak being removed was — as the PR's own review threads document — the de-facto RCU that made several unlocked reads of *DirEntry sound. Removing it required compensating synchronization that was iterated on across three separate race findings during review (within-event reader/writer race → reorder bust; cross-event race → lock the affected-files loop; same-thread reentrancy in RouteLoader.load/bustDirCacheRecursive → snapshot subdirs before recursing). rfs.entries is a process-global singleton shared across the JS-thread resolver, the watcher thread, and BundleThread, so the blast radius of a mistake here is HMR/--hot/--watch crashes under load. Two related issues were explicitly deferred to follow-ups (the pre-existing unlocked entriesAt writer-writer race, and the FrameworkRouter.scanInner snapshot). That's reasonable scoping, but it means a human should confirm the boundary drawn is acceptable.
Other factors
All seven of my prior inline review threads are resolved with specific commits and rationale; coderabbit's three findings are also resolved. CI build #50578 passes on 71 jobs including all ASAN shards; the only failures are pre-existing fleet-wide flakes unrelated to this change. The new regression test is solid (subprocess RSS-delta assertion with a generous 32 MB bound vs. ~72 MB pre-fix). The bug-hunting system found no new issues on the current revision. I'm deferring rather than approving solely because the concurrency reasoning is subtle enough — and was wrong in non-obvious ways multiple times during this review cycle — that a human sign-off on the final synchronization story is warranted.
87be2a7 to
081f6c2
Compare
There was a problem hiding this comment.
All my concerns from earlier rounds have been addressed and I have nothing further to flag, but given how subtle the in-place-reuse concurrency invariants turned out to be (three separate race fixes during review), this one warrants a human pass before merge.
Extended reasoning...
Overview
This PR changes RealFS.bustEntriesCache() from removing the BSSMap key (orphaning the backing *DirEntry slot) to marking the existing DirEntry as .stale so the next read takes the existing in_place re-scan path and reuses the allocation. It touches six files: the core fs cache (src/resolver/fs.zig), the resolver's stale/generation checks (src/resolver/resolver.zig), the hot-reloader watcher thread (src/jsc/hot_reloader.zig), the route loader (src/router/router.zig), the JS FileSystemRouter API (src/runtime/api/filesystem_router.zig), and adds a leak regression test.
Security risks
None identified. The change is internal to the resolver's directory-entry cache and HMR watcher; no auth, crypto, network, or untrusted-input parsing is involved. The risk surface is memory-safety/concurrency (UAF, torn reads), not security.
Level of scrutiny
High. The previous orphaning behavior was effectively poor-man's RCU: leaked *DirEntry slots were never written again, which made several unlocked reads of entries.data (hot_reloader watcher thread, RouteLoader.load recursion, bustDirCacheRecursive) accidentally sound. Switching to in-place reuse removes that implicit guarantee, and the PR had to re-establish it explicitly via entries_mutex locking in the watcher and snapshot-before-recurse in the route loaders. Three distinct races were found and fixed during review (within-event, cross-event, and same-thread reentrancy), and two related issues were explicitly deferred to follow-ups (entriesAt unlocked writer-writer, FrameworkRouter.scanInner). Code that needed three rounds of concurrency fixes to converge is not something I should approve without a human look.
Other factors
- All review threads (mine and coderabbit's) are resolved; the current revision has no outstanding bug findings.
- CI passes on 71 lanes including all ASAN shards; the 3 failures are documented pre-existing fleet-wide flakes unrelated to this change.
- The
filesystem_router.zigdeep-copy ofextensions/asset_prefix_pathfixes a real pre-existing UAF that the new test surfaced under ASAN. - The resolver and fs cache are on the hot path for every module resolution in Bun; a regression here would be high-impact.
081f6c2 to
9eecfda
Compare
There was a problem hiding this comment.
All prior threads are resolved and the bug hunter found nothing new on this revision, but this reworks resolver cache invalidation from orphan-as-RCU to in-place reuse across the watcher/JS-thread boundary (with two acknowledged follow-ups deferred), so it warrants a human sign-off.
Extended reasoning...
Overview
This PR changes RealFS.bustEntriesCache from removing the BSSMap key (orphaning the slot) to marking the existing DirEntry as stale so the next read reuses it via the in_place path. That core change ripples into fs.zig (entriesAt, readDirectoryWithIterator), resolver.zig (two in_place trigger sites), hot_reloader.zig (mutex + bust ordering on the watcher thread), and router.zig / filesystem_router.zig (snapshot-before-recurse to avoid same-thread iterator invalidation). It also fixes a pre-existing UAF in reload() (shallow-duped extensions/asset_prefix_path) and adds an RSS-growth regression test.
Security risks
None identified — no auth, crypto, network, or untrusted-input parsing is touched. The risk class here is memory safety (UAF / iterator invalidation) under concurrency, not security.
Level of scrutiny
High. The orphaned-slot leak being fixed was the de-facto RCU that made several unlocked reads of *DirEntry sound; replacing it with in-place reuse required three rounds of race-condition fixes during review (within-event, cross-event, and same-thread reentrancy via RouteLoader.load/bustDirCacheRecursive). Two related issues were explicitly deferred to follow-ups: the pre-existing entriesAt unlocked writer-writer race (can't trivially lock due to non-recursive entries_mutex reentry from dirInfoUncached), and the structurally-identical FrameworkRouter.scanInner site. This is production-critical-path code (bun --hot, --watch, Bun.FileSystemRouter, bundler watch) where concurrency bugs surface as rare, hard-to-reproduce crashes.
Other factors
All inline review threads (mine and CodeRabbit's) are resolved; the per-author status comment reports CI green (build 50578, 71/71 modulo fleet-wide flakes) including ASAN lanes; the new test exercises the leak fix and the existing reload() works with new dirs/files test covers correctness of the stale-flag re-read. None of the touched files are under CODEOWNERS. Given the iteration history and the deferred follow-ups, a human reviewer should confirm the final mutex/ordering story before merge rather than relying on auto-approval.
…chedMaybeLog (oven-sh#30170) ## Problem `EntriesOption` is a tagged union: ```zig pub const EntriesOption = union(Tag) { entries: *DirEntry, err: DirEntry.Err, // { original_err: anyerror, canonical_error: anyerror } }; ``` When `readDirectory()` fails with a non-ENOENT error (EACCES, EMFILE, …), `readDirectoryError` stores `.err` in `rfs.entries` (fs.zig:1002). If the directory later becomes openable and `dirInfoCachedMaybeLog` processes it as queue slot [0] (the target path itself, which is never pre-checked against `rfs.entries` during queue construction), it reached: ```zig if (rfs.entries.atIndex(cached_dir_entry_result.index)) |cached_entry| { if (cached_entry.entries.generation >= r.generation) { // no tag check ``` Reading `.entries` while `.err` is active reinterprets the two `anyerror` values as a `*DirEntry` pointer and dereferences it. In debug this is a safety panic; in release it's a segfault at e.g. `0x1D401D401F8`. ## Repro ```js chmodSync(bad, 0o000); try { Bun.resolveSync('./bad/index.js', root); } catch {} // caches .err for 'bad' chmodSync(bad, 0o755); Bun.resolveSync('./bad', root); // dirInfoCached(bad): open OK, cache has .err -> crash ``` Before: ``` panic(main thread): access of union field 'entries' while field 'err' is active resolver.zig:2973:33 in dirInfoCachedMaybeLog ``` ## Fix Add the same `if (cached_entry.* == .entries)` guard that `dirInfoForResolution` already has (resolver.zig:2265). When the cached entry is `.err`, fall through with `needs_iter = true` so the directory is re-iterated now that it can be opened. ## Verification Added a test to `test/js/bun/resolve/resolve.test.ts` that reproduces the exact sequence. Runs as the current user when non-root, or drops to `nobody` via `runuser` when root (since root bypasses DAC). Skipped on Windows. - Without fix: panics with `access of union field 'entries' while field 'err' is active` - With fix: resolves to `bad/index.js` --- Note: oven-sh#29919 independently adds this same guard as part of a larger memory-leak fix; this PR is the minimal targeted crash fix with a dedicated regression test. --------- Co-authored-by: robobun <robobun@users.noreply.github.com> Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
9eecfda to
6a91fbd
Compare
9d43bef to
539ddee
Compare
There was a problem hiding this comment.
No new issues on this revision, but this changes the aliasing/lifetime contract of a process-global cache (the prior leak was load-bearing as accidental RCU) and threads new entries_mutex synchronization + unsafe pointer snapshots through the hot-reloader and router — worth a human pass, and CI is currently red on build-rust across most platforms for 539ddee.
Extended reasoning...
Overview
This PR fixes a per-reload memory leak in RealFS::bust_entries_cache by replacing BSSMap::remove() (which orphaned the backing DirEntry slot) with an in-place stale flag so the next read reuses the allocation via the existing in_place path. Because the old orphaning behavior was effectively an accidental RCU that made several unlocked readers sound, the fix required compensating synchronization at every site that previously captured a *DirEntry and read it across a potential rewrite: hot_reloader::on_file_update now holds entries_mutex around the affected-files loop and busts after releasing it; RouteLoader::load and FileSystemRouter::bust_dir_cache_recursive snapshot subdir *Entry pointers before recursing; entries_at carries over the prior fd. Touches 6 Rust files across resolver/router/hot-reloader plus a regression test.
Security risks
No direct security surface (no auth/crypto/network input parsing). The risk class here is memory safety: the change rewires lifetime/aliasing invariants for a process-global cache shared across the watcher thread and JS thread, with several new unsafe { &*entry_ptr } dereferences whose validity depends on EntryStore being append-only and on the new lock ordering. The review history found and fixed three distinct UAF/iterator-invalidation races introduced by earlier revisions; the current revision looks correct to the bug hunter, but the blast radius of a missed case is a UAF in --hot/reload().
Level of scrutiny
High. This is concurrency- and aliasing-sensitive core infrastructure (module resolver directory cache, HMR watcher thread). The PR itself documents that correctness now hinges on subtle invariants ("previously never written again (accidental RCU)", "bust after releasing the lock since bust_entries_cache takes it internally", "*Entry values live in the append-only EntryStore"). Three follow-up items were explicitly deferred as pre-existing (entries_at writer-writer locking, FrameworkRouter.scanInner snapshot, fd close at other in_place sites). A human reviewer should confirm the lock ordering and the EntryStore-stability argument, and decide whether the deferred items are acceptable to ship separately.
Other factors
- CI on the head commit (539ddee) shows build-rust/build-cpp failures across linux x64/aarch64/musl, freebsd, android, and macOS x64 — needs to go green before merge regardless.
- The PR went through ~8 revision rounds with multiple 🔴 findings (all resolved); that history itself signals this is not a mechanical change.
- Regression test added and passing locally per the author's status comment (20/21, the 1 failure is a pre-existing main regression).
843746d to
ffb9993
Compare
BSSMap::remove() only drops the hash→index mapping; the backing slot and the heap DirEntry it points at are orphaned. The next lookup allocates a fresh slot + fresh DirEntry, which skips the in_place re-scan path — so every bust_dir_cache (HMR/watch, FileSystemRouter .reload()) leaked one DirEntry + EntryMap per directory and appended N Entry/FilenameStore slots per file, unbounded. Keep the slot and flag the DirEntry as stale instead. The existing in_place logic in read_directory_with_iterator/entries_at/ dir_info_cached_maybe_log then re-reads the directory while reusing the DirEntry allocation, its EntryMap, and the Entry/FilenameStore slots via prev_map. Since the orphaned slot was previously never written again (accidental RCU), callers that captured a *DirEntry and read its EntryMap across calls that could now trigger an in_place rewrite need explicit synchronization: - hot_reloader on_file_update: hold entries_mutex around the dir_ent.entries reads (covers both within-event and cross-event races with a concurrent JS-thread resolve); bust after releasing the lock since bust_entries_cache takes it internally. - RouteLoader::load / bust_dir_cache_recursive: snapshot subdir *Entry pointers (stable in the append-only EntryStore) before recursing, so read_dir_info(child) can safely rewrite the parent. - entries_at: carry over the previous DirEntry.fd through the refresh so DirInfo::get_file_descriptor keeps working. 500 reloads of 4 dirs × 20 files: ~50 MB RSS growth → ~8 MB.
ffb9993 to
f3f83c3
Compare
There was a problem hiding this comment.
No new issues found on this revision, but given the change swaps the resolver's de-facto RCU (orphan-on-bust) for in-place DirEntry rewrites under entries_mutex — with several rounds of cross-thread race fixes already applied — this warrants a human pass over the locking/snapshot invariants before merge.
Extended reasoning...
Overview
This PR changes RealFS::bust_entries_cache from removing the BSSMap key (orphaning the heap DirEntry and its EntryMap) to flagging the existing DirEntry as stale so the next read takes the pre-existing in_place re-scan path. Touches src/resolver/{fs,lib,resolver}.rs (the stale flag, entries_at fd carry-over, both in_place guard sites), src/jsc/hot_reloader.rs (watcher-thread entries_mutex hold + bust reordering), and src/router/lib.rs + src/runtime/api/filesystem_router.rs (snapshot subdir *Entry pointers before recursing). Adds an RSS-leak regression test.
Security risks
None in the auth/crypto sense. The risk surface is memory-safety: the old orphaning behavior was an accidental RCU that made several unlocked *DirEntry reads sound; reusing the slot in place required adding explicit synchronization at three caller sites (hot_reloader affected-files loop, RouteLoader::load, bust_dir_cache_recursive). The PR went through three rounds of race-condition fixes (within-event, cross-event, and same-thread iterator-invalidation via recursive read_dir_info) and explicitly defers three pre-existing issues (entries_at writer-writer locking, FrameworkRouter.scanInner snapshot, fd close at other in_place sites).
Level of scrutiny
High. This is production-critical resolver/HMR code with unsafe raw-pointer dereferences, non-recursive mutex ordering constraints, and cross-thread invariants whose correctness depends on which caller holds entries_mutex vs. dir_cache state at each step. The fix is well-reasoned and well-documented in-source, but the class of bug it's been iterating on (captured *DirEntry rewritten under the reader) is exactly the kind that merits a maintainer's eyes on the final locking topology and the deferred follow-up list.
Other factors
All prior inline review threads are resolved; CI was green on a previous full run and the latest push is just a rebase + rustfmt. The new test exercises the single-threaded reload() path but not the --hot watcher-thread races the synchronization changes target. The author has flagged the PR as ready for human review.
What
RealFS::bust_entries_cache()leaked a heapDirEntry(plus itsEntryMap) per call and forced append-only growth ofEntryStore/FilenameStore. Triggered on every HMR/watch file change and everyFileSystemRouter.reload().Repro
4 subdirs × 20 files → ~50 MB RSS growth over 500 reloads.
Cause
BSSMap::remove()only drops the hash→index mapping. The backing slot — and theEntriesOption::Entries(&mut DirEntry)it holds — stays, orphaned. On the nextget_or_put()the key is gone so.index = Unassigned,at_index()returnsNone,in_place = None, andread_directory_with_iterator/dir_info_cached_maybe_logallocate a freshDirEntryand consume a freshBSSMapslot. The oldDirEntry(with itsEntryMaphashmap) is never freed, and becauseprev_mapisNoneevery file in the directory gets a newEntryappended toEntryStoreand (for long names) new strings appended toFilenameStore.Fix
Don't remove the key. Flag the existing
DirEntryasstaleand let the next read take thein_placepath that already exists for the generation-bump case — it reuses theDirEntryallocation, passes the oldEntryMapasprev_mapsoEntrystructs and filename strings are reused, and writes back to the sameBSSMapslot.Err/NotFoundentries (noDirEntry) still fall back toremove().Since the orphaned slot was previously never written again (accidental RCU), callers that captured a
*DirEntryand read itsEntryMapacross calls that could now trigger anin_placerewrite need explicit synchronization:hot_reloader::on_file_update: holdentries_mutexaround thedir_ent.entriesreads (covers both within-event and cross-event races with a concurrent JS-thread resolve); bust after releasing the lock sincebust_entries_cachetakes it internally.RouteLoader::load/FileSystemRouter::bust_dir_cache_recursive: snapshot subdir*Entrypointers (stable in the append-onlyEntryStore) before recursing, soread_dir_info(child)can safely rewrite the parent.entries_at: carry over the previousDirEntry.fdthrough the refresh soDirInfo::get_file_descriptorkeeps working.Verification
reload()s(The residual is
dir_cache.remove()—DirInfoslots — a separate, smaller leak not addressed here because recycling those slots would invalidateDirInfo.parentindex references held by children.)All 21
filesystem_router.test.tstests pass, includingreload() works with new dirs/fileswhich verifies the stale flag still forces a disk re-read.Rebase note: this PR was originally implemented against the Zig resolver and re-targeted at the Rust port after #30412. The Rust
reload()already deep-copiesextensions/asset_prefix_pathvia.clone(), and the resolver sites already pattern-match onEntriesOption::Entries(_), so those follow-ups from the Zig review history are not needed here.