fast_slow_store: has_with_results consults fast store and in-flight slow writes#2343
Merged
MarcusSorealheis merged 6 commits intoMay 17, 2026
Conversation
…ght slow writes
`FastSlowStore::has_with_results` previously only consulted the slow store,
which caused two known bugs:
1. Fast-only writes (and the brief window between fast-store insert and
slow-store write start) were invisible — callers got NotFound for blobs
that were locally present, triggering redundant fetches.
2. Concurrent writers racing on the same digest could not see each other's
in-flight slow-store writes, so the second writer re-uploaded what the
first had nearly finished pushing.
The fix layers three consultations in order: slow store first (authoritative
for downstream consumers), then a `in_flight_slow_writes: Mutex<HashMap>`
tracking active slow writes, then fast store as a final fallback for
fast-only / pre-slow-write hits.
In-flight tracking uses a cancel-safe RAII guard (`InFlightSlowWriteGuard`)
registered at the start of each `update`/`update_with_whole_file` path that
writes to the slow store, so a cancelled write future correctly removes
itself from the map on Drop.
Tests cover all four reachable cases: slow-hit short-circuit, fast-only
hit, in-flight slow write visibility, and noop-slow-store fall-through.
Provenance: equivalent to upstream commits f69aaf8 and 2d770d9 from
TraceMachina/nativelink PR TraceMachina#2243, ported atomically to current main.
…s tests
The previous commit added tests for the four reachable cases of the
fast/slow/in-flight has() lookup, but two correctness properties were
asserted only indirectly:
1. Cancel safety of `InFlightSlowWriteGuard`. The fix's central claim
is that aborting an in-progress update() removes the key from the
in_flight_slow_writes map via Drop. The previous in-flight test
only verified the happy-path (writer completes normally) and the
fast-store fallback masked any leak.
2. Per-key independence in batched has_with_results. With multiple
keys spanning different storage tiers, an off-by-one in the
missing_indices fallback or an over-broad in-flight match would
silently corrupt results without any existing test noticing.
New tests:
* dropping_update_future_cleans_up_in_flight_entry: uses a
gated-slow-store with NoopStore as fast (so the fast-store fallback
cannot mask a leak), spawns a writer, waits on a oneshot started
signal, aborts the writer mid-update, then asserts has() returns
None — proving the guard's Drop ran.
* has_with_results_handles_mixed_key_sources: builds a request with
four keys — slow-only, in-flight, fast-only, and missing — and
asserts each result independently matches its source. Catches
index-mapping regressions in the batched fallback path.
Both tests are deterministic (oneshot channels, no sleeps). Verified
that reverting the fix in fast_slow_store.rs causes both new tests
plus the existing has_sees_in_flight_slow_writes and
fast_store_only_value_is_reported_by_has to fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…possible_truncation Hoist the `MapBackedSlow` test helper (struct, StoreDriver impl, and `default_health_status_indicator!` macro invocation) out of `has_with_results_handles_mixed_key_sources` to module scope so that items are declared before statements, satisfying `clippy::items_after_statements`. Replace `as usize` casts of the per-key `u64` sizes with `usize::try_from(...).unwrap()` to satisfy `clippy::cast_possible_truncation`, matching the existing conversion pattern elsewhere in this test file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member
|
Wow @erneestoc this is a pretty great PR. I can help with the Nix test path as that can be quite annoying. Maybe the clankers can help. |
Member
|
actually that was a Nix infra flakiness instance |
MarcusSorealheis
approved these changes
May 17, 2026
5 tasks
chrisstaite-menlo
added a commit
that referenced
this pull request
May 29, 2026
The PR #2343 updated the fast-slow store such that it ignored files missing in the slow store for the `has` calls and also said that a file was in the slow store while it was still being uploaded to it. This causes significant regressions to the function of the fast-slow store such that files missing in the slow store do not get re-uploaded when they have been purged from an upstream cache and also starts actions downloading files from the slow store before they've been committed. Here we fix both of those by reverting the logic and putting back the regression test for the former issue and also adding in a async Mutex which only reports the slow store upload after it's completed. In order to perform this correctly the store trait is updated to return the uploaded size in the successful state.
chrisstaite-menlo
added a commit
that referenced
this pull request
May 29, 2026
The PR #2343 updated the fast-slow store such that it ignored files missing in the slow store for the `has` calls and also said that a file was in the slow store while it was still being uploaded to it. This causes significant regressions to the function of the fast-slow store such that files missing in the slow store do not get re-uploaded when they have been purged from an upstream cache and also starts actions downloading files from the slow store before they've been committed. Here we fix both of those by reverting the logic and putting back the regression test for the former issue and also adding in a async Mutex which only reports the slow store upload after it's completed. In order to perform this correctly the store trait is updated to return the uploaded size in the successful state.
chrisstaite-menlo
added a commit
that referenced
this pull request
May 29, 2026
The PR #2343 updated the fast-slow store such that it ignored files missing in the slow store for the `has` calls and also said that a file was in the slow store while it was still being uploaded to it. This causes significant regressions to the function of the fast-slow store such that files missing in the slow store do not get re-uploaded when they have been purged from an upstream cache and also starts actions downloading files from the slow store before they've been committed. Here we fix both of those by reverting the logic and putting back the regression test for the former issue and also adding in a async Mutex which only reports the slow store upload after it's completed. In order to perform this correctly the store trait is updated to return the uploaded size in the successful state.
chrisstaite-menlo
added a commit
that referenced
this pull request
May 29, 2026
The PR #2343 updated the fast-slow store such that it ignored files missing in the slow store for the `has` calls and also said that a file was in the slow store while it was still being uploaded to it. This causes significant regressions to the function of the fast-slow store such that files missing in the slow store do not get re-uploaded when they have been purged from an upstream cache and also starts actions downloading files from the slow store before they've been committed. Here we fix both of those by reverting the logic and putting back the regression test for the former issue and also adding in a async Mutex which only reports the slow store upload after it's completed. In order to perform this correctly the store trait is updated to return the uploaded size in the successful state.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
`FastSlowStore::has_with_results` previously only consulted the slow store. This PR adds two layers of awareness so existence checks reflect what the store actually has:
Order: slow store first (authoritative for downstream consumers) → in-flight map → fast store.
The bugs being fixed
Two distinct correctness gaps that both surfaced as redundant work under parallel load:
Fast-only / pre-slow-write blobs are invisible. A blob in the fast store but not yet pushed to slow (or in a fast-only configuration) returns `NotFound`, causing callers to trigger a redundant slow-store fetch for content that's already local.
Concurrent writers can't see each other. Two parallel uploaders of the same digest both see `NotFound` in the slow store and both upload — wasting bandwidth and CPU, and potentially racing in the slow-store backend. Common pattern: parallel SwiftCompile actions uploading the same shared SDK module.
Cancel safety
`InFlightSlowWriteGuard` uses `Weak` (no Arc cycle) and removes its entry on `Drop`, so a cancelled `update()` future cleans up the map even if the slow-store write never completes. There is a dedicated test for this property.
Test plan
New tests (all deterministic, no sleeps — concurrency via `oneshot` channels and a `GatedSlowStore`):
`has_does_not_consult_fast_store_when_slow_store_hits` — performance: confirms slow-hit short-circuits
`fast_store_only_value_is_reported_by_has` — fast-only fallback
`has_checks_fast_store_when_noop` — `NoopDownloads` slow store fall-through
`has_sees_in_flight_slow_writes` — gated slow write blocks; concurrent `has_with_results` returns Found via the in-flight map
`dropping_update_future_cleans_up_in_flight_entry` — uses `NoopStore` fast + `ReadOnly` direction so the in-flight map is the only presence signal; proves the Drop guard removes the entry
`has_with_results_handles_mixed_key_sources` — single call with 4 keys spread across slow/in-flight/fast-only/missing; asserts each is independently resolved
`cargo build -p nativelink-store` clean
`cargo clippy -p nativelink-store --lib --tests -- -D warnings` clean
`cargo fmt --check` clean
Provenance
Equivalent to commits `f69aaf8e8` ("Fix has_with_results to check fast store, downgrade CRITICAL to warn") and `2d770d913e` ("Fix has_with_results to check in-flight slow writes") from #2243, combined into one atomic fix because they share the same observable contract.
🤖 Generated with Claude Code
This change is