
worker: make directory-cache entries already-writable#2357

Closed
erneestoc wants to merge 3 commits into
TraceMachina:mainfrom
erneestoc:ec/pr2243-cache-writable-dirs

@erneestoc

@erneestoc erneestoc commented May 22, 2026

Contributor

Problem

DirectoryCache locks each cache entry down with set_readonly_recursive after
construction. That helper previously set the entire entry tree to mode 0o555,
directories included, so every materialization had to follow up with a separate
set_dir_writable_recursive chmod walk in prepare_action_inputs to re-add write
permission to directories (Bazel actions declare outputs at paths nested inside
input subdirectories).

That per-materialization post-walk is redundant work.

Fix

Directories are not hardlink-shared between cache entries — only file content
inodes are — so directory mode can safely be set once, at the cache entry,
instead of on every materialization.

set_readonly_recursive now makes only files read-only (0o555) and leaves
directories writable (0o755). Both materialization paths then produce a
directly-usable tree:

  • macOS clonefile(2) copies the source's modes verbatim → writable dirs,
    read-only files.
  • The Linux per-file hardlink walk creates fresh (writable) directories and
    hardlinks files (which keep the source inode's read-only mode).

Files stay read-only on both paths, so the hermeticity contract and the
CAS-hardlink shared-inode invariant (#2347) are preserved. The now-redundant
set_dir_writable_recursive call is removed from prepare_action_inputs;
set_dir_writable_recursive itself is unchanged and still used by the cache
eviction cleanup path.
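
To illustrate the mode split described above, here is a minimal Unix-only sketch. It is not the code in nativelink-util: `lock_down_entry` and `mode_bits` are hypothetical names, and skipping symlinks is an assumption borrowed from the companion PR rather than something this PR states.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Hypothetical stand-in for the lockdown walk: regular files become 0o555,
/// directories become 0o755. Symlinks are skipped so chmod never follows one.
pub fn lock_down_entry(root: &Path) -> io::Result<()> {
    let meta = fs::symlink_metadata(root)?;
    let file_type = meta.file_type();
    if file_type.is_symlink() {
        return Ok(());
    }
    if file_type.is_dir() {
        // Directories stay writable so an action can create outputs inside them.
        fs::set_permissions(root, fs::Permissions::from_mode(0o755))?;
        for entry in fs::read_dir(root)? {
            lock_down_entry(&entry?.path())?;
        }
    } else {
        // Files are read-only (and keep the execute bits) to protect shared inodes.
        fs::set_permissions(root, fs::Permissions::from_mode(0o555))?;
    }
    Ok(())
}

/// Helper for inspecting the permission bits of a path (does not follow symlinks).
pub fn mode_bits(path: &Path) -> io::Result<u32> {
    Ok(fs::symlink_metadata(path)?.permissions().mode() & 0o777)
}
```

After such a walk, creating a nested output file inside any subdirectory works without a second chmod pass, while the input files themselves remain read-only.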

Testing

  • fs_util: test_set_readonly_recursive now also asserts directories stay
    writable; the macOS clonefile tests assert cloned subdirs are writable and
    that a nested output file can be created with no set_dir_writable_recursive
    walk.
  • directory_cache: new test_materialized_tree_dirs_writable_files_readonly
    asserts that after get_or_create on both the fresh-materialize and
    cache-hit paths, every directory is writable and every file read-only.
  • bazel test //nativelink-worker/... //nativelink-util/... — 27/27 pass;
    clippy + rustfmt aspects clean.

Note

Companion to #2358 (util: make permission walks symlink-safe). Both PRs touch
nativelink-util/src/fs_util.rs (hardlink_directory_tree_recursive);
whichever lands second needs a trivial rebase.

🤖 Generated with Claude Code



`DirectoryCache` locks each cache entry down with `set_readonly_recursive`
after construction. Previously that helper made the entire entry tree mode
0o555 — directories included — so every materialization had to follow up
with a separate `set_dir_writable_recursive` recursive chmod walk in
`prepare_action_inputs` to re-add write permission to directories (Bazel
actions declare outputs at paths nested inside input subdirectories).

That post-walk is redundant work. Directories are not hardlink-shared
between cache entries — only file content inodes are — so directory mode
can safely be made writable once, at the cache entry, instead of on every
materialization.

`set_readonly_recursive` now locks a tree down as a cache entry by making
only FILES read-only (0o555) and leaving DIRECTORIES writable (0o755).
Both materialization paths then produce a directly-usable tree:

- macOS `clonefile(2)` copies the source's modes verbatim, so the clone's
  directories are writable and its files read-only.
- The Linux per-file hardlink walk creates fresh directories (writable)
  and hardlinks files (which keep the source inode's read-only mode).

Files stay read-only on both paths, so the hermeticity contract and the
CAS-hardlink shared-inode invariant (PR TraceMachina#2347) are preserved. With the
materialized tree already correct, the `set_dir_writable_recursive` call
is removed from `prepare_action_inputs`. `set_dir_writable_recursive`
itself is unchanged and still used by the cache eviction cleanup path.

Tests:
- fs_util: `test_set_readonly_recursive` now also asserts directories stay
  writable; the macOS clonefile tests assert cloned subdirs are writable
  and that a nested output can be created with no `set_dir_writable_recursive`
  walk; `test_set_dir_writable_recursive_walks_nested_dirs` keeps covering
  the eviction-cleanup helper.
- directory_cache: new `test_materialized_tree_dirs_writable_files_readonly`
  builds a nested tree and asserts that, after `get_or_create` on both the
  fresh-materialize and cache-hit paths, every directory is writable and
  every file is read-only, with no separate chmod walk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erneestoc erneestoc force-pushed the ec/pr2243-cache-writable-dirs branch from 8ad0ce7 to 2bc32f9 Compare May 22, 2026 03:16
@MarcusSorealheis

Member

@erneestoc I need time to test it. Also, could you increment the versions of all the top-level Cargo.toml files, in line with the instructions in CONTRIBUTING.md#creating-releases? We are going to release patch versions a lot more frequently to tighten the feedback loop.

@MarcusSorealheis

Member

Also, for the past two versions Claude Code has been able to follow the instructions quite well so you could hand it off there.

Bump the workspace version 1.3.0 -> 1.3.1 in MODULE.bazel, the root Cargo.toml, and every nativelink-* crate manifest, per CONTRIBUTING.md step 1 (Creating releases). Cargo.lock refreshed for the 13 workspace members.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erneestoc

Contributor Author

@MarcusSorealheis bumped the workspace version 1.3.0 -> 1.3.1 in MODULE.bazel, the root Cargo.toml, and every nativelink-* crate manifest, per CONTRIBUTING.md#creating-releases step 1; Cargo.lock refreshed for the 13 workspace members. The branch had already picked up the 1.3.0 release via the main merge, so this is the patch bump on top. Bazel build is green.

@MarcusSorealheis

Member

this one was merged by @2362, or will be.

MarcusSorealheis added a commit that referenced this pull request May 25, 2026
* util: make permission walks symlink-safe

The recursive permission walk `set_perms_recursive_impl` (driving both
`set_readonly_recursive` and `set_dir_writable_recursive`) used
`fs::metadata` (stat), which follows symlinks. On input trees containing
symlinks - e.g. `.venv/bin/python3` produced by rules_python /
rules_apple venv tooling - this had two failure modes:

  * A symlink to a directory reported `is_dir() == true`, so the walk
    recursed *through* the link, escaping the materialized tree or
    descending into an unrelated directory.
  * A symlink was passed to `set_permissions`; `chmod` follows symlinks,
    so it mutated the link's target. When the target did not exist (a
    dangling link - common when a venv points outside the action's
    input set) the `chmod` returned ENOENT and failed the entire walk.

That ENOENT failure surfaced as `set_readonly_recursive` erroring inside
`DirectoryCache::get_or_create`, which made `prepare_action_inputs` log
"Directory cache failed, falling back to traditional download" and take
the slow `download_to_directory` path.

Fix: `set_perms_recursive_impl` now uses `symlink_metadata` (lstat) and
returns early on symlink entries - it never chmods a symlink and never
recurses through one. Regular files keep their existing read-only
(0o555) treatment, so the CAS-hardlinked-inode hermeticity contract
(PR #2347) is unchanged.

`hardlink_directory_tree_recursive` already recreated symlinks as
symlinks; its symlink branch is reordered ahead of the `is_dir()` /
`is_file()` branches to make the symlink-first intent explicit and
robust.

Adds regression tests covering set-readonly, set-dir-writable, and
hardlink/clone walks over a tree containing a symlink to an in-tree
file, a dangling relative symlink, and a symlink to an in-tree
directory, asserting each walk succeeds and the symlinks are preserved
with their targets intact.
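
The stat versus lstat distinction above can be sketched as follows. This is an illustration only, not the `set_perms_recursive_impl` implementation: `chmod_candidates` is a hypothetical helper that merely collects the paths a permission walk would touch, so the symlink handling can be observed without changing any modes.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collects the paths a permission walk would chmod, skipping symlinks.
pub fn chmod_candidates(root: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    // `symlink_metadata` is lstat: it describes the link itself, so a
    // dangling link is not an error and a link to a directory is not a dir.
    let meta = fs::symlink_metadata(root)?;
    if meta.file_type().is_symlink() {
        // Never chmod a symlink and never recurse through one.
        return Ok(());
    }
    out.push(root.to_path_buf());
    if meta.is_dir() {
        for entry in fs::read_dir(root)? {
            chmod_candidates(&entry?.path(), out)?;
        }
    }
    Ok(())
}
```

With `fs::metadata` in place of `fs::symlink_metadata`, the same walk would fail on a dangling link and would descend into the target of a directory link.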

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* worker: make directory-cache entries already-writable

`DirectoryCache` locks each cache entry down with `set_readonly_recursive`
after construction. Previously that helper made the entire entry tree mode
0o555 — directories included — so every materialization had to follow up
with a separate `set_dir_writable_recursive` recursive chmod walk in
`prepare_action_inputs` to re-add write permission to directories (Bazel
actions declare outputs at paths nested inside input subdirectories).

That post-walk is redundant work. Directories are not hardlink-shared
between cache entries — only file content inodes are — so directory mode
can safely be made writable once, at the cache entry, instead of on every
materialization.

`set_readonly_recursive` now locks a tree down as a cache entry by making
only FILES read-only (0o555) and leaving DIRECTORIES writable (0o755).
Both materialization paths then produce a directly-usable tree:

- macOS `clonefile(2)` copies the source's modes verbatim, so the clone's
  directories are writable and its files read-only.
- The Linux per-file hardlink walk creates fresh directories (writable)
  and hardlinks files (which keep the source inode's read-only mode).

Files stay read-only on both paths, so the hermeticity contract and the
CAS-hardlink shared-inode invariant (PR #2347) are preserved. With the
materialized tree already correct, the `set_dir_writable_recursive` call
is removed from `prepare_action_inputs`. `set_dir_writable_recursive`
itself is unchanged and still used by the cache eviction cleanup path.

Tests:
- fs_util: `test_set_readonly_recursive` now also asserts directories stay
  writable; the macOS clonefile tests assert cloned subdirs are writable
  and that a nested output can be created with no `set_dir_writable_recursive`
  walk; `test_set_dir_writable_recursive_walks_nested_dirs` keeps covering
  the eviction-cleanup helper.
- directory_cache: new `test_materialized_tree_dirs_writable_files_readonly`
  builds a nested tree and asserts that, after `get_or_create` on both the
  fresh-materialize and cache-hit paths, every directory is writable and
  every file is read-only, with no separate chmod walk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* worker: hardlink CAS blobs in directory-cache construct

`DirectoryCache::construct_directory` previously materialized every file by
fetching the whole blob into RAM (`get_part_unchunked`) and writing a full
copy (`fs::write`). For a cache that exists to avoid re-fetching from the
CAS, this is the dominant cost on a miss.

Switch the cache-entry file build to hardlink the FilesystemStore CAS blob
directly into the cache entry — zero-copy, metadata-only — exactly the way
`download_to_directory` already does on the fallback path:
`populate_fast_store` then `get_file_entry_for_digest` /
`get_file_path_locked` / `fs::hard_link`.

Correctness:
  * A hardlinked CAS blob shares its inode with the CAS store and every
    other action that hardlinked the same blob, so it must never be
    chmod'd (the inode-corruption bug PR #2347 fixed). Executable files
    (`FileNode.is_executable`) therefore get their own private inode via
    fetch+write and are chmod'd 0o555 on that unshared copy — never
    hardlinked.
  * When the blob is not locally hardlinkable (the fast tier is not a
    FilesystemStore, or the blob is absent / evicted from it), the file
    falls back to fetch+write rather than failing the build.
  * Zero-byte files keep their existing direct-write special case.
  * The post-construction lockdown switches from `set_readonly_recursive`
    (which chmods files, and would corrupt the shared CAS inode) to
    `set_dir_writable_recursive`, which only touches directories.
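
A minimal sketch of the hardlink-or-fallback decision listed above, assuming a Unix filesystem. The names `materialize_file`, `Materialized`, and `mode_and_inode` are hypothetical, and the `fetch` closure stands in for the store read; the real code paths (`populate_fast_store`, `get_file_entry_for_digest`, and so on) are not reproduced here.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::{MetadataExt, PermissionsExt};
use std::path::Path;

#[derive(Debug, PartialEq, Eq)]
pub enum Materialized {
    Hardlinked,
    Copied,
}

/// Hardlinks `local_blob` into `dest` when possible; otherwise fetches the
/// bytes and writes a private copy. Executables always get a private inode.
pub fn materialize_file(
    local_blob: Option<&Path>,
    fetch: impl FnOnce() -> io::Result<Vec<u8>>,
    dest: &Path,
    is_executable: bool,
) -> io::Result<Materialized> {
    if !is_executable {
        if let Some(blob) = local_blob {
            // Zero-copy, metadata-only. The shared inode is never chmod'd.
            if fs::hard_link(blob, dest).is_ok() {
                return Ok(Materialized::Hardlinked);
            }
        }
    }
    // Fallback: the blob is not locally linkable, or the file needs +x.
    fs::write(dest, fetch()?)?;
    if is_executable {
        // Safe to chmod: this inode belongs to this file alone.
        fs::set_permissions(dest, fs::Permissions::from_mode(0o555))?;
    }
    Ok(Materialized::Copied)
}

/// Returns (permission bits, inode number) for inspection.
pub fn mode_and_inode(path: &Path) -> io::Result<(u32, u64)> {
    let meta = fs::metadata(path)?;
    Ok((meta.permissions().mode() & 0o777, meta.ino()))
}
```

The fetch is a closure so that the bytes are only read when the hardlink path is unavailable.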

`DirectoryCache::new` now takes the worker's `Arc<FastSlowStore>` so it can
reach `populate_fast_store` and downcast the fast tier to FilesystemStore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* worker: drop the two redundant full-tree walks in directory-cache build

After `construct_directory`, the cache-miss path walked the materialized
tree twice more: `calculate_directory_size` (an `fs::metadata` per file)
to compute the LRU size, and a recursive permission pass to normalize
directory modes. Both are now folded into construction itself.

Size: `construct_directory` returns the total tree size, accumulated from
`FileNode.digest.size_bytes` in the `Directory` protos it already decodes.
This is also more correct than the old filesystem walk — it counts each
file once by its CAS size and never follows symlinks into possibly-shared
or external targets. Symlinks contribute nothing.
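
The size accounting can be illustrated with a simplified tree. `SketchDir` and `SketchFile` are hypothetical types used only for this example; they flatten the digest indirection of the real `Directory` protos into a nested structure.

```rust
/// Hypothetical, simplified view of a decoded file node.
pub struct SketchFile {
    pub size_bytes: u64,
}

/// Hypothetical, simplified view of a decoded directory.
pub struct SketchDir {
    pub files: Vec<SketchFile>,
    /// Symlink targets; they are listed but never counted.
    pub symlinks: Vec<String>,
    pub directories: Vec<SketchDir>,
}

/// Sums the declared size of every file, without touching the filesystem.
pub fn tree_size(dir: &SketchDir) -> u64 {
    let own: u64 = dir.files.iter().map(|f| f.size_bytes).sum();
    let nested: u64 = dir.directories.iter().map(tree_size).sum();
    own + nested
}
```

Because the total comes from metadata that is already decoded, no extra `fs::metadata` call per file is needed, and symlink targets can never inflate the number.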

Directory mode: each cache-entry directory is chmod'd 0o755 the moment it
is created (`create_dir_writable`), umask-independent. The directory is
writable while it is populated and that is its stable final mode, so the
separate post-construction `set_dir_writable_recursive` walk is gone.

Cache-entry files are still never chmod'd here — they may be CAS-blob
hardlinks (OPT #1) and mutating their mode would corrupt the shared inode.

Reconciliation with PR #2357: that PR reworks `set_readonly_recursive` so
the recursive walk leaves dirs 0o755 / files 0o555. This commit removes
the directory-cache build's dependence on any such recursive walk
entirely — modes are set at creation. Whichever lands second, the rebase
is a straight delete of the now-unused call site; there is no semantic
conflict because both converge on 0o755 directories, and #2357's file
handling is irrelevant here since the cache build no longer touches file
modes at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* worker: narrow the directory-cache lock and single-flight construction

The cache write lock was held across syscall-heavy I/O, serializing every
concurrent `get_or_create`:

  * On the cache-hit paths, `cache.write()` was held across the whole
    `hardlink_directory_tree` (clonefile / per-file hardlink) materialization.
  * `evict_lru` ran `set_dir_writable_recursive` + `remove_dir_all` on the
    evicted tree under the write lock during a cache miss.

Lock narrowing:
  * `acquire_entry` / `release_entry` take the write lock only to bump and
    drop a `ref_count` pin and snapshot the entry path; the
    `hardlink_directory_tree` materialization runs fully unlocked. The pin is
    what makes this safe — `evict_lru` never selects an entry with
    `ref_count > 0`, so the cache tree cannot be deleted mid-hardlink. The
    newly constructed entry is likewise inserted pre-pinned (`ref_count: 1`)
    and unpinned only after its destination hardlink completes; otherwise a
    concurrent miss for an unrelated digest could evict the brand-new entry
    (its `last_access` is recent but it is the only unpinned one) while this
    caller is still hardlinking from it.
  * `evict_if_needed` / `evict_lru` are now pure in-memory: they select
    victims and remove them from the map under the lock, returning the
    victim paths. `dispatch_evictions` then performs the chmod + removal on
    a `background_spawn` task, off the lock.

Single-flight: the existing per-digest construction mutex already ensures a
digest is constructed once while N callers wait; this commit additionally
unmaps the per-digest mutex (`forget_construction_lock`) once construction
finishes so `construction_locks` no longer grows unbounded over the worker's
lifetime. Unmapping is race-free: a waiter has already cloned the `Arc<Mutex>`
before blocking, and a late arrival that creates a fresh mutex still re-checks
the cache, finds the entry, and takes the fast hardlink path — never a
redundant construct.

`ref_count` / `CachedDirectoryMetadata` semantics are unchanged; the
hit/miss return contract is unchanged.
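
The pin-based lock narrowing can be sketched like this. It is a simplified model under stated assumptions, not the `DirectoryCache` code: `PinnedCache`, `Entry`, and the method names are hypothetical, timestamps are plain integers, and a synchronous `std::sync::RwLock` replaces whatever lock the worker actually uses.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::RwLock;

pub struct Entry {
    pub path: PathBuf,
    pub ref_count: usize,
    pub last_access: u64,
}

#[derive(Default)]
pub struct PinnedCache {
    entries: RwLock<HashMap<String, Entry>>,
}

impl PinnedCache {
    /// Inserts a freshly built entry already pinned (ref_count = 1).
    pub fn insert_pinned(&self, digest: &str, path: PathBuf, now: u64) {
        let mut map = self.entries.write().unwrap();
        map.insert(digest.to_string(), Entry { path, ref_count: 1, last_access: now });
    }

    /// Holds the lock only long enough to pin the entry and snapshot its path.
    pub fn acquire(&self, digest: &str, now: u64) -> Option<PathBuf> {
        let mut map = self.entries.write().unwrap();
        let entry = map.get_mut(digest)?;
        entry.ref_count += 1;
        entry.last_access = now;
        Some(entry.path.clone())
    }

    /// Drops one pin after the caller has finished materializing.
    pub fn release(&self, digest: &str) {
        let mut map = self.entries.write().unwrap();
        if let Some(entry) = map.get_mut(digest) {
            entry.ref_count = entry.ref_count.saturating_sub(1);
        }
    }

    /// Pure in-memory eviction: removes the least recently used unpinned
    /// entry from the map and returns its path for deletion off the lock.
    pub fn evict_lru(&self) -> Option<PathBuf> {
        let mut map = self.entries.write().unwrap();
        let victim = map
            .iter()
            .filter(|(_, e)| e.ref_count == 0)
            .min_by_key(|(_, e)| e.last_access)
            .map(|(digest, _)| digest.clone())?;
        map.remove(&victim).map(|e| e.path)
    }
}
```

A caller would `acquire`, run the hardlink or clone with no lock held, then `release`; the path returned by `evict_lru` can be handed to a background task for the chmod and removal.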

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* address merge interactions with read-only CAS

* remove inode stat in test

* update dependencies to 1.3.1

* worker: materialize executable inputs by hardlink to a created-once 0o555 variant (fix ETXTBSY)

Materializing an executable input via a per-action `std::fs::copy` opened a
writable fd in the worker's hot prepare path. Under fork-heavy concurrency a
sibling action's forked child could inherit that fd, and a concurrent `execve`
of the executable then failed with `ETXTBSY` ("Text file busy", os error 26) —
seen on Linux RBE (k8) building rules_go's `builder`. macOS was largely shielded
because its directory-cache path uses APFS `clonefile(2)` (a distinct COW inode
per action), but the per-file `download_to_directory` fallback hardlinks on both
platforms, so the regression spanned both.

Fix (keep the hot path hardlink-only — no writable fd):
- nativelink-store: add `FilesystemStore::get_executable_hardlink_source`. The
  CAS blob is read-only 0o444 and shared by hardlink, so it cannot carry +x and
  must never be chmod'd (#2347). This creates a per-digest 0o555 variant exactly
  once (single-flight), copy -> chmod -> fsync -> atomic rename, so the writer fd
  is closed before the inode is ever hardlinked or executed. Stored in a sibling
  `{content_path}.exec` dir (ignored by the content/temp scan + prune) and
  cleared on startup. On APFS the copy is itself a `clonefile`.
- download_to_directory: executables now hardlink that shared 0o555 variant and
  non-executables hardlink the 0o444 CAS blob. A private copy is used only for
  the rare custom unix_mode / mtime case, applied to a private inode.
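
The copy -> chmod -> fsync -> atomic rename sequence can be sketched as below. This is an assumption-laden illustration, not `get_executable_hardlink_source`: `ensure_exec_variant` and `mode_and_inode` are hypothetical names, the `.tmp` suffix is invented for the example, and the single-flight guard that prevents concurrent creation is omitted.

```rust
use std::ffi::OsString;
use std::fs::{self, File};
use std::io;
use std::os::unix::fs::{MetadataExt, PermissionsExt};
use std::path::{Path, PathBuf};

/// Creates a 0o555 variant of `blob` at `variant` if it does not exist yet.
/// Not safe against concurrent callers; the real code is single-flight.
pub fn ensure_exec_variant(blob: &Path, variant: &Path) -> io::Result<()> {
    if fs::symlink_metadata(variant).is_ok() {
        return Ok(()); // Already created: reuse the same inode.
    }
    let mut tmp_name: OsString = variant.as_os_str().to_owned();
    tmp_name.push(".tmp"); // Hypothetical temp suffix for this sketch.
    let tmp = PathBuf::from(tmp_name);
    let _ = fs::remove_file(&tmp); // Clear leftovers from an interrupted run.

    // copy: the writable handle used by `fs::copy` is closed when it returns.
    fs::copy(blob, &tmp)?;
    // chmod: applied to the private copy, never to the shared CAS inode.
    fs::set_permissions(&tmp, fs::Permissions::from_mode(0o555))?;
    // fsync through a read-only handle, so no writable fd is reopened.
    File::open(&tmp)?.sync_all()?;
    // atomic rename: the variant only becomes visible fully written.
    fs::rename(&tmp, variant)
}

/// Returns (permission bits, inode number) for inspection.
pub fn mode_and_inode(path: &Path) -> io::Result<(u32, u64)> {
    let meta = fs::metadata(path)?;
    Ok((meta.permissions().mode() & 0o777, meta.ino()))
}
```

Once the variant exists, each action can `fs::hard_link` it into place, so the hot path never opens a writable file descriptor on an inode that may be executed.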

The macOS `clonefile` materialization (`hardlink_directory_tree`, #2349) and the
directory cache's executable handling are left untouched, preserving the macOS
speedup.

Test: executable_hardlink_source_created_once_and_readonly asserts the variant is
0o555, a separate inode from the 0o444 blob, stable across calls, leaves the blob
untouched, and hardlinks into an executable. nativelink-store 243/0,
nativelink-worker 88/0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* filesystem_store: use nativelink spawn_blocking! macro (clippy disallowed_methods)

tokio::task::spawn_blocking is banned by clippy.toml in favor of
nativelink-util's spawn_blocking! macro (adds the tracing span +
JoinHandleDropGuard). Fixes the -D clippy::disallowed-methods CI failure on
get_executable_hardlink_source's executable-variant creation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* filesystem_store: gate executable-variant machinery to unix (fix Windows build)

The executable 0o555 variant (and its single-flight map, variant path, .exec
dir, and spawn_blocking copy) only exists to carry the unix executable bit and
dodge the unix ETXTBSY race. On Windows it was dead code, failing the build
under -D warnings (unused import spawn_blocking, never-read executable_locks,
never-used executable_variant_path). Gate all of it (and the HashMap / Mutex /
EXECUTABLE_DIR_SUFFIX it pulls in) behind #[cfg(unix)]; the existing
#[cfg(not(unix))] get_executable_hardlink_source just hardlinks the CAS blob.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ernesto Cambuston <e.cambuston@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MarcusSorealheis

MarcusSorealheis commented May 27, 2026

Member

merged as a part of #2366
