Skip to content

fast_slow_store: has_with_results consults fast store and in-flight slow writes#2343

Merged
MarcusSorealheis merged 6 commits into
TraceMachina:mainfrom
erneestoc:ec/pr2243-has-with-results-correctness
May 17, 2026
Merged

fast_slow_store: has_with_results consults fast store and in-flight slow writes#2343
MarcusSorealheis merged 6 commits into
TraceMachina:mainfrom
erneestoc:ec/pr2243-has-with-results-correctness

Conversation

@erneestoc

@erneestoc erneestoc commented May 17, 2026

Copy link
Copy Markdown
Contributor

Summary

`FastSlowStore::has_with_results` previously only consulted the slow store. This PR adds two layers of awareness so existence checks reflect what the store actually has:

  1. In-flight slow writes — a new `in_flight_slow_writes: Mutex<HashMap<StoreKey, u64>>` map tracked via a cancel-safe RAII guard (`InFlightSlowWriteGuard`) registered at the start of every slow-store write path.
  2. Fast-store fallback — for any keys still missing after slow + in-flight consultation.

Order: slow store first (authoritative for downstream consumers) → in-flight map → fast store.

The bugs being fixed

Two distinct correctness gaps that both surfaced as redundant work under parallel load:

  1. Fast-only / pre-slow-write blobs are invisible. A blob in the fast store but not yet pushed to slow (or in a fast-only configuration) returns `NotFound`, causing callers to trigger a redundant slow-store fetch for content that's already local.

  2. Concurrent writers can't see each other. Two parallel uploaders of the same digest both see `NotFound` in the slow store and both upload — wasting bandwidth and CPU, and potentially racing in the slow-store backend. Common pattern: parallel SwiftCompile actions uploading the same shared SDK module.

Cancel safety

`InFlightSlowWriteGuard` uses `Weak` (no Arc cycle) and removes its entry on `Drop`, so a cancelled `update()` future cleans up the map even if the slow-store write never completes. There is a dedicated test for this property.

Test plan

  • `cargo test -p nativelink-store --test fast_slow_store_test` — 21 passed
  • Empirical regression proof: reverted `fast_slow_store.rs` to its pre-fix state while keeping the new tests; all 4 fix-related tests failed (including the two cancel-safety / mixed-key gap-closures). Re-applied the fix → 21/21 pass.

New tests (all deterministic, no sleeps — concurrency via `oneshot` channels and a `GatedSlowStore`):

  • `has_does_not_consult_fast_store_when_slow_store_hits` — performance: confirms slow-hit short-circuits

  • `fast_store_only_value_is_reported_by_has` — fast-only fallback

  • `has_checks_fast_store_when_noop` — `NoopDownloads` slow store fall-through

  • `has_sees_in_flight_slow_writes` — gated slow write blocks; concurrent `has_with_results` returns Found via the in-flight map

  • `dropping_update_future_cleans_up_in_flight_entry` — uses `NoopStore` fast + `ReadOnly` direction so the in-flight map is the only presence signal; proves the Drop guard removes the entry

  • `has_with_results_handles_mixed_key_sources` — single call with 4 keys spread across slow/in-flight/fast-only/missing; asserts each is independently resolved

  • `cargo build -p nativelink-store` clean

  • `cargo clippy -p nativelink-store --lib --tests -- -D warnings` clean

  • `cargo fmt --check` clean

Provenance

Equivalent to commits `f69aaf8e8` ("Fix has_with_results to check fast store, downgrade CRITICAL to warn") and `2d770d913e` ("Fix has_with_results to check in-flight slow writes") from #2243, combined into one atomic fix because they share the same observable contract.

🤖 Generated with Claude Code


This change is Reviewable

erneestoc and others added 5 commits May 16, 2026 19:00
…ght slow writes

`FastSlowStore::has_with_results` previously only consulted the slow store,
which caused two known bugs:

  1. Fast-only writes (and the brief window between fast-store insert and
     slow-store write start) were invisible — callers got NotFound for blobs
     that were locally present, triggering redundant fetches.

  2. Concurrent writers racing on the same digest could not see each other's
     in-flight slow-store writes, so the second writer re-uploaded what the
     first had nearly finished pushing.

The fix layers three consultations in order: slow store first (authoritative
for downstream consumers), then a `in_flight_slow_writes: Mutex<HashMap>`
tracking active slow writes, then fast store as a final fallback for
fast-only / pre-slow-write hits.

In-flight tracking uses a cancel-safe RAII guard (`InFlightSlowWriteGuard`)
registered at the start of each `update`/`update_with_whole_file` path that
writes to the slow store, so a cancelled write future correctly removes
itself from the map on Drop.

Tests cover all four reachable cases: slow-hit short-circuit, fast-only
hit, in-flight slow write visibility, and noop-slow-store fall-through.

Provenance: equivalent to upstream commits f69aaf8 and 2d770d9 from
TraceMachina/nativelink PR TraceMachina#2243, ported atomically to current main.
…s tests

The previous commit added tests for the four reachable cases of the
fast/slow/in-flight has() lookup, but two correctness properties were
asserted only indirectly:

  1. Cancel safety of `InFlightSlowWriteGuard`. The fix's central claim
     is that aborting an in-progress update() removes the key from the
     in_flight_slow_writes map via Drop. The previous in-flight test
     only verified the happy-path (writer completes normally) and the
     fast-store fallback masked any leak.

  2. Per-key independence in batched has_with_results. With multiple
     keys spanning different storage tiers, an off-by-one in the
     missing_indices fallback or an over-broad in-flight match would
     silently corrupt results without any existing test noticing.

New tests:

* dropping_update_future_cleans_up_in_flight_entry: uses a
  gated-slow-store with NoopStore as fast (so the fast-store fallback
  cannot mask a leak), spawns a writer, waits on a oneshot started
  signal, aborts the writer mid-update, then asserts has() returns
  None — proving the guard's Drop ran.

* has_with_results_handles_mixed_key_sources: builds a request with
  four keys — slow-only, in-flight, fast-only, and missing — and
  asserts each result independently matches its source. Catches
  index-mapping regressions in the batched fallback path.

Both tests are deterministic (oneshot channels, no sleeps). Verified
that reverting the fix in fast_slow_store.rs causes both new tests
plus the existing has_sees_in_flight_slow_writes and
fast_store_only_value_is_reported_by_has to fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…possible_truncation

Hoist the `MapBackedSlow` test helper (struct, StoreDriver impl, and
`default_health_status_indicator!` macro invocation) out of
`has_with_results_handles_mixed_key_sources` to module scope so that
items are declared before statements, satisfying
`clippy::items_after_statements`.

Replace `as usize` casts of the per-key `u64` sizes with
`usize::try_from(...).unwrap()` to satisfy
`clippy::cast_possible_truncation`, matching the existing conversion
pattern elsewhere in this test file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MarcusSorealheis

Copy link
Copy Markdown
Member

Wow @erneestoc this is a pretty great PR. I can help with the Nix test path as that can be quite annoying. Maybe the clankers can help.

@MarcusSorealheis

Copy link
Copy Markdown
Member

actually that was a Nix infra flakiness instance

@MarcusSorealheis MarcusSorealheis enabled auto-merge (squash) May 17, 2026 10:17
@MarcusSorealheis MarcusSorealheis merged commit 2f42265 into TraceMachina:main May 17, 2026
29 of 31 checks passed
@erneestoc erneestoc deleted the ec/pr2243-has-with-results-correctness branch May 17, 2026 22:29
chrisstaite-menlo added a commit that referenced this pull request May 29, 2026
The PR #2343 updated the fast-slow store such that it ignored
files missing in the slow store for the `has` calls and also
said that a file was in the slow store while it was still being
uploaded to it.

This causes significant regressions to the function of the
fast-slow store such that files missing in the slow store do
not get re-uploaded when they have been purged from an upstream
cache and also starts actions downloading files from the slow
store before they've been committed.

Here we fix both of those by reverting the logic and putting
back the regression test for the former issue and also adding
in a async Mutex which only reports the slow store upload after
it's completed.  In order to perform this correctly the store
trait is updated to return the uploaded size in the successful
state.
chrisstaite-menlo added a commit that referenced this pull request May 29, 2026
The PR #2343 updated the fast-slow store such that it ignored
files missing in the slow store for the `has` calls and also
said that a file was in the slow store while it was still being
uploaded to it.

This causes significant regressions to the function of the
fast-slow store such that files missing in the slow store do
not get re-uploaded when they have been purged from an upstream
cache and also starts actions downloading files from the slow
store before they've been committed.

Here we fix both of those by reverting the logic and putting
back the regression test for the former issue and also adding
in a async Mutex which only reports the slow store upload after
it's completed.  In order to perform this correctly the store
trait is updated to return the uploaded size in the successful
state.
chrisstaite-menlo added a commit that referenced this pull request May 29, 2026
The PR #2343 updated the fast-slow store such that it ignored
files missing in the slow store for the `has` calls and also
said that a file was in the slow store while it was still being
uploaded to it.

This causes significant regressions to the function of the
fast-slow store such that files missing in the slow store do
not get re-uploaded when they have been purged from an upstream
cache and also starts actions downloading files from the slow
store before they've been committed.

Here we fix both of those by reverting the logic and putting
back the regression test for the former issue and also adding
in a async Mutex which only reports the slow store upload after
it's completed.  In order to perform this correctly the store
trait is updated to return the uploaded size in the successful
state.
chrisstaite-menlo added a commit that referenced this pull request May 29, 2026
The PR #2343 updated the fast-slow store such that it ignored
files missing in the slow store for the `has` calls and also
said that a file was in the slow store while it was still being
uploaded to it.

This causes significant regressions to the function of the
fast-slow store such that files missing in the slow store do
not get re-uploaded when they have been purged from an upstream
cache and also starts actions downloading files from the slow
store before they've been committed.

Here we fix both of those by reverting the logic and putting
back the regression test for the former issue and also adding
in a async Mutex which only reports the slow store upload after
it's completed.  In order to perform this correctly the store
trait is updated to return the uploaded size in the successful
state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants