Retriable reads: check version ref before retrying on pruned-key errors#3145
Retriable reads: check version ref before retrying on pruned-key errors#3145jamesblackburn wants to merge 3 commits into
Conversation
| node_futures.reserve(keys.size()); | ||
| for (const auto& key : keys) { | ||
| node_futures.emplace_back(read_frame_for_version(store(), key, read_query, read_options, handler_data)); | ||
| } catch (const storage::NoDataFoundException&) { |
There was a problem hiding this comment.
This is the change - the rest is whitespace.
4218f6d to
21eb476
Compare
When a read races with a concurrent eager prune (writer deletes a version's keys immediately after superseding it), the reader catches NoDataFoundException / KeyNotFoundException and retries. Previously the retry called invalidate_cached_entry(), which forced a full version-chain reload (O(N) storage reads for N live versions). This change replaces that with reload_from_version_ref_if_changed(): - Reads the VERSION_REF key once (outside the lock) to get the current head from storage. - If the head is unchanged the data is genuinely missing (not a race), so the exception is re-raised without consuming a retry slot. - If the head changed, the cache entry is repopulated in-place from the ref data (mirroring the LOAD_LATEST shortcut in follow_version_chain), so the subsequent retry is a cache hit requiring 0 additional reads. Net cost per retry: exactly 1 VERSION_REF read and 0 VERSION reads, regardless of how many live versions exist in storage. New tests: - Python unit tests in test_read_retry.py covering the happy path, the no-retry error path, non-retriable versions, per-symbol scoping, and O(1) read count with N=15 live versions. - C++ unit test ReloadFromVersionRefIfChangedUpdatesCacheAndReturnsFalseWhenUnchanged. - Stress test test_concurrent_read_write_eager_prune using two LMDB handles with concurrent readers and eager-pruning writers.
21eb476 to
d6c151e
Compare
|
@- |
maxim-morozov
left a comment
There was a problem hiding this comment.
This has a real potential to DDos the storage. In case of storage slowdown or some network blip, we should check what exception we are getting, to make sure we dont retry in those cases. Otherwise, we will make things pretty bad in situations like this. The AWS sdk has retries as well, so it will quickly scale exponentially in terms of storage requests.
| for (const auto& key : keys) { | ||
| node_futures.emplace_back(read_frame_for_version(store(), key, read_query, read_options, handler_data)); | ||
| } catch (const storage::NoDataFoundException&) { | ||
| if (attempt >= max_attempts || !version_map()->invalidate_if_version_ref_changed(store(), stream_id)) |
There was a problem hiding this comment.
On a retry we'll end up reading the version ref twice:
- Once in
invalidate_if_version_ref_changedto check whether it changed - Once in
read_frame_for_versionto read the version chain
Probably not a big deal since retries will be fairly infrequent.
Ideally we could read it only once and short circuit if the same, but I think the current is good enough
There was a problem hiding this comment.
Yeh invalidate_if_version_ref_changed isn't the right way to do this. If the ref key has changed, we should continue LOAD_LATEST_UNDELETED and make this the new entry
| // raises E_NO_SUCH_VERSION (not caught here) and still fails fast. Preloaded-index reads carry | ||
| // their own index segment, so re-resolution cannot help them. | ||
| const bool is_preloaded = std::holds_alternative<std::shared_ptr<PreloadedIndexQuery>>(version_query.content_); | ||
| const int64_t max_attempts = is_preloaded ? 1 : ConfigsMap::instance()->get_int("VersionStore.ReadRetries", 3) + 1; |
There was a problem hiding this comment.
What do you think about retrying only on latest version queries?
It is unlikely for a failing a version chain v2->v1->v0 a read(as_of=0) to get deleted. People usually either always prune previous or never do it.
There was a problem hiding this comment.
Agreed, reading as_of specific versions, timestamps and snapshots will not benefit from retrying
| // raises E_NO_SUCH_VERSION (not caught here) and still fails fast. Preloaded-index reads carry | ||
| // their own index segment, so re-resolution cannot help them. | ||
| const bool is_preloaded = std::holds_alternative<std::shared_ptr<PreloadedIndexQuery>>(version_query.content_); | ||
| const int64_t max_attempts = is_preloaded ? 1 : ConfigsMap::instance()->get_int("VersionStore.ReadRetries", 3) + 1; |
There was a problem hiding this comment.
Also I think the default retries should be fewer. Even just 1.
If someone races with their reads so frequently people are unlikely to get the result they want either way.
There was a problem hiding this comment.
Also agreed, if writes are happening so fast that a single retry doesn't help, then we create a thundering herd problem by retrying more times
|
|
||
| std::lock_guard lock(map_mutex_); | ||
| auto it = map_.find(stream_id); | ||
| const std::optional<AtomKey> cached_head = (it != map_.end()) ? it->second->head_ : std::nullopt; |
There was a problem hiding this comment.
I think this logic would be easier to follow if we short circuit the case where it == map_.end() with return true. There was no ref key to have changed.
Theoretically it is possible for the invalidate_if_version_ref_changed to return false if there was no cached entry for the symbol and the version ref contains no link to a version key (which should not be possible) but still makes reasoning about this harder imo
| writer.terminate() | ||
| reader.terminate() | ||
|
|
||
| assert exceptions_in_reader.empty() |
There was a problem hiding this comment.
It feels like this test might be flaky. It doesn't seem impossible for the writer process to invalidate the reader multiple times.
There was a problem hiding this comment.
Agreed, this test will not be reliable, better to use the storage failure simulator for this sort of thing
| node_futures.reserve(keys.size()); | ||
| for (const auto& key : keys) { | ||
| node_futures.emplace_back(read_frame_for_version(store(), key, read_query, read_options, handler_data)); | ||
| } catch (const storage::NoDataFoundException&) { |
There was a problem hiding this comment.
I guess the NoDataFoundException is needed because of exception reraises like this
I agree it's the correct thing for this PR but I think we should leave the more precise KeyNotFound exception in those reraises.
There was a problem hiding this comment.
Unfortunately the current API can raise either depending on exactly when the failure occurs
| # One retry: one ref read for the changed-ref check, one more for storage_reload's LOAD_LATEST | ||
| # shortcut. Both are VERSION_REF reads; no full VERSION-chain traversal. | ||
| assert _version_ref_reads(raced_stats) == 2 | ||
| assert _version_reads(raced_stats) == 0 |
There was a problem hiding this comment.
It would be useful to show that the index key is read just once
|
|
||
| qs.enable() | ||
| qs.reset_stats() | ||
| result = reader.read(sym) # stale cache -> one retry |
There was a problem hiding this comment.
It would be nice to also test a similar version chain with read(as_of=0) depending on what we decide for this comment this could mean no retries or a different exception being raised.
| std::make_move_iterator(node_trys.end()), | ||
| std::back_inserter(node_results), | ||
| [](auto&& try_result) { return std::move(try_result).value(); } | ||
| ARCTICDB_DEBUG( |
There was a problem hiding this comment.
This macro gets compiled out of release builds. I would make the log info level - we want to know when this is happening
| return false; | ||
|
|
||
| if (it != map_.end()) | ||
| map_.erase(it); |
There was a problem hiding this comment.
The symbol should also be erased from the lock table if it is present
| from arcticdb.util.test import config_context, query_stats_operation_count | ||
|
|
||
| # Large enough that the reader's cached version chain never expires during a test. | ||
| STICKY_RELOAD_INTERVAL = 2_000_000_000_000 |
There was a problem hiding this comment.
This is in nanoseconds, so isn't very long at all
| with ( | ||
| config_context("VersionMap.ReloadInterval", STICKY_RELOAD_INTERVAL), | ||
| config_context("VersionStore.ReadRetries", 0), | ||
| ): |
| @@ -0,0 +1,152 @@ | |||
| """Deterministic tests for the read-retry behaviour in read_dataframe_version_internal. | |||
|
This doesn't handle |
- Retry only for latest-version reads (std::monostate): pinned queries (as_of=N, timestamp, snapshot, preloaded) use max_attempts=1 to avoid silently returning a different version. - Reduce default VersionStore.ReadRetries from 3 to 1. - Promote retry log line from ARCTICDB_DEBUG to info so it appears in release builds. - Refactor invalidate_if_version_ref_changed: early-return true when no cached entry exists; after detecting a changed ref, proactively call follow_version_chain (LATEST/UNDELETED_ONLY) with the already- read ref_entry and stamp last_reload_time_, so the retry's check_reload is a pure cache hit with no second VERSION_REF read. - Re-add test_concurrent_read_write_eager_prune stress test skipped in CI (RUNS_ON_GITHUB) to avoid non-deterministic failures there. - test_read_retry.py: copyright header, config_context_multi, larger STICKY_RELOAD_INTERVAL (2**62 ns), _index_reads helper, updated _version_ref_reads assertions 2→1, new pinned-query no-retry tests.
f168015 to
83a0bb4
Compare
ArcticDB Code Review SummaryRe-reviewed the latest commits (delta
One item still needs attention: Documentation
|
…fy retry Extend the single-retry-on-concurrent-prune behaviour to every version-resolving read path, and address review of the initial batch implementation. - Batch reads (read/metadata/descriptor) previously retried inside a .thenTry continuation that blocked on nested .get()s while running on an async::cpu_executor thread - the deadlock anti-pattern, since the nested read also needs that pool. Retries now run in a post-collectAll().get() loop on the caller thread, where blocking is safe. The iteration + gate + invalidate + log lives once in retry_failed_reads_on_concurrent_prune; each batch site passes only a retry_fn(idx). - batch_read_and_join: replaced the bespoke for/should_retry_join/try-catch loop with a multi-symbol overload of retry_read_on_concurrent_prune, matching the single-symbol paths. Retries once if any latest-version symbol's version ref changed. - read_column_stats_impl / get_column_stats_info_impl now throw KeyNotFoundException rather than the generic E_KEY_NOT_FOUND raise (which throws the base StorageException). The retry primitive catches KeyNotFoundException, so the column-stats wrapping was previously a no-op. Both still surface as E_KEY_NOT_FOUND to callers. - Added deterministic race tests for read_modify_write (recovers source + writes target exactly once), read_column_stats and get_column_stats_info.
What does this implement or fix?
When a reader resolves a symbol from its cached version chain just before a concurrent writer
supersedes and eagerly prunes the old version's keys, the read fails with
KeyNotFoundExceptionor
NoDataFoundException. Rather than surfacing this error, the retry loop now recoverstransparently:
VERSION_REFkey to compare the storage head against the cached head.immediately without consuming a retry slot.
LOAD_LATESTshortcut, reading only theVERSION_REF— not the full version chain.Net cost per retry: 2
VERSION_REFreads and 0VERSIONreads, regardless of how manylive versions exist for the symbol.
Tests added
test_read_retry.py— five Python unit tests covering the happy path, the no-retry path(genuinely missing version), per-symbol cache scoping (unrelated symbols unaffected), and O(1)
read count with N=15 live versions asserted via
query_stats.InvalidateIfVersionRefChangedReturnsTrueWhenChangedFalseWhenUnchanged— C++ unit testverifying the method returns
trueand invalidates the cache on a ref change, andfalsewhenunchanged.
test_concurrent_read_write_eager_prune— stress test using two LMDB handles with concurrentreaders and eager-pruning writers.