Reduce symbol list peak heap usage #3165
  // (e.g. when a previous compaction round failed in the deletion step), we don't want to delete the former
  if (atom_key != exclude)
-     variant_keys.emplace_back(atom_key);
+     to_remove.emplace_back(std::move(atom_key));
This refactor broke delete_keys. Kept keys are now pushed into to_remove — the same vector being iterated by the range-for — instead of variant_keys. This is undefined behaviour (modifying the container during iteration can reallocate/invalidate the iterator) and, separately, variant_keys stays empty so remove_keys_sync(variant_keys) deletes nothing.
- to_remove.emplace_back(std::move(atom_key));
+ if (atom_key != exclude)
+     variant_keys.emplace_back(std::move(atom_key));
delete_keys is still called from backwards_compat_compact (test helper), so the backwards-compat path no longer deletes old keys.
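The hazard described in the review comment can be illustrated with a minimal, self-contained sketch. It is not the ArcticDB implementation: `Key` is a hypothetical stand-in for `AtomKey`, and `collect_keys_to_delete` is an illustrative name. The point it shows is that the loop reads from one vector and appends to a different one, so the range-for iterators are never invalidated and the output vector actually receives the kept keys.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for AtomKey; the real type carries more fields.
using Key = std::string;

// Moves every key except `exclude` out of `to_remove` into a separate vector.
// Appending to a different container than the one being iterated keeps the
// range-for iterators valid, unlike appending to `to_remove` itself.
std::vector<Key> collect_keys_to_delete(std::vector<Key>&& to_remove, const Key& exclude) {
    std::vector<Key> variant_keys;
    variant_keys.reserve(to_remove.size());
    for (auto& key : to_remove) {
        if (key != exclude)
            variant_keys.emplace_back(std::move(key));
    }
    return variant_keys;
}
```

With this shape, whatever is passed on to the deletion call is exactly the set of keys that passed the `exclude` check.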
ArcticDB Code Review Summary
A well-structured memory optimization for symbol-list loading. The single-pass streaming load, the 32-byte
Latest commit (
Correctness
PR Title & Description
JournalEntryData referenced ActionType before the enum was declared, causing 'unknown type name' on GCC/Clang and cascading std::sort errors. Move the enum above JournalEntryData.
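The ordering fix in this commit message can be sketched as follows. The enumerator names and the struct fields are assumptions chosen for illustration (the real definitions are not shown here); only the declaration order is the point: the enum has to be visible before the struct that uses it, otherwise the struct fails to compile and anything instantiated with it, such as `std::sort`, fails too.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Declared first so that the struct below can name it.
// Enumerator names are hypothetical.
enum class ActionType : std::uint8_t { Add, Remove };

// Simplified stand-in for JournalEntryData; field set is an assumption.
struct JournalEntryData {
    std::int64_t creation_ts;
    ActionType action;  // would be "unknown type name" if the enum came later
};

// Sorting compiles only once JournalEntryData is a complete, valid type.
inline void sort_by_creation_ts(std::vector<JournalEntryData>& entries) {
    std::sort(entries.begin(), entries.end(),
              [](const JournalEntryData& a, const JournalEntryData& b) {
                  return a.creation_ts < b.creation_ts;
              });
}
```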
The deep-nesting tests hardcoded 256/255 levels and a '255 levels' message, all derived from msgpack's DEFAULT_RECURSE_LIMIT being 511. msgpack 1.2.0 bumped it to 512, so the flattener's limit (LIMIT//2) became 256 and the over-limit test stopped raising. Derive the bounds from DEFAULT_RECURSE_LIMIT so the tests track msgpack's value.
1.
b73841165: Change load approach to stream the entries rather than keeping all of them
What it does:
Replaces the two-pass approach (`get_all_symbol_list_keys` + `load_journal_keys`) with a single streaming pass (`load_journal_streaming`). Instead of materialising all N `AtomKey`s into a `vector<AtomKey>` and then building the update map from it, the streaming pass builds the `MapType` (symbol → `vector<SymbolEntryData>`) directly during iteration.
Result:
The scan phase no longer requires a temporary `vector<AtomKey>`. The update map is built in one iteration rather than two. The `all_keys` list (compaction path only) still holds N × `AtomKey` (~160 B each), so peak memory during compaction is unchanged at this stage.
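The single-pass idea can be sketched roughly as below. This is a simplified illustration, not the actual `load_journal_streaming`: `ScannedKey`, `EntrySketch`, `UpdateMapSketch` and the visitor-style `for_each_key` parameter are all hypothetical names. It shows the map being filled while keys are visited, with no intermediate vector of keys.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical view of one journal key as seen during the storage scan.
struct ScannedKey {
    std::string symbol;
    std::int64_t version_id;
    std::int64_t creation_ts;
};

// Hypothetical per-entry payload kept in the update map.
struct EntrySketch {
    std::int64_t version_id;
    std::int64_t creation_ts;
};

using UpdateMapSketch = std::unordered_map<std::string, std::vector<EntrySketch>>;

// `for_each_key` is assumed to call the supplied visitor once per key.
// Each key is folded into the map as it arrives, so only the map grows.
template <typename ForEachKey>
UpdateMapSketch load_journal_streaming_sketch(ForEachKey&& for_each_key) {
    UpdateMapSketch update_map;
    for_each_key([&update_map](const ScannedKey& key) {
        update_map[key.symbol].push_back(EntrySketch{key.version_id, key.creation_ts});
    });
    return update_map;
}
```

A two-pass variant would first push every `ScannedKey` into a vector and then loop over that vector, which is the temporary allocation this commit removes from the scan phase.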
2.
1dc284d78: Bug fixes of the first commit
What it does:
A set of correctness fixes and clean-up on top of commit 1:
- Adds `total_key_count` tracking to `JournalResult` and wires it into `LoadResult.total_key_count_`
- Changes `needs_compaction` to use `total_key_count_` instead of `symbol_list_keys_.size()`, so the compaction threshold counts all journal keys (not just those kept for deletion)
- Replaces `std::get<StringIndex>(key.start_index())`, which would crash on numeric symbols; the new code uses `symbols_in_merge.count(...)` and handles both string and numeric `StreamId`
- Removes `LoadResult.timestamp_` (unused)
- Changes `write_symbols` to take `CollectionType` by value
3.
faa552974: Refactor the SL update map to store a more compacted version of the SL keys
What it does:
The main memory reduction commit. Replaces all `SymbolEntryData` / `MapType` / `vector<AtomKey>` storage for journal entries with a single 32-byte `JournalEntryData` struct and a unified `JournalMapType`.
Key changes:
- `JournalEntryData` (32 B): stores `key_version_id`, `creation_ts`, `content_hash`, `action`, `is_new_style`, which is enough to reconstruct the full `AtomKey` for deletion without holding the symbol string (the map key provides that)
- `MapType` (symbol → `vector<SymbolEntryData>`) replaced by `JournalMapType` (symbol → `vector<JournalEntryData>`) on all paths; removes the `collect_keys` boolean and its branching throughout `load_journal_streaming` and `attempt_load`
- `LoadResult`: `symbol_list_keys_: vector<VariantKey>` (N × ~160 B) replaced by `update_map_: JournalMapType` (N × 32 B) + `old_compaction_keys_: vector<VariantKey>` (0–1 entries)
- `compact_internal`: reconstructs `AtomKey`s from `JournalEntryData` in batches of 10 000, erasing each symbol's entry after dispatch to free memory incrementally
- Removes `key_sort_comparator`, `add_update_map_entry`, `sort_update_map_entries`, and the old `merge_existing_with_journal_map` (`MapType` version); `merge_existing_with_journal_map` is now the single merge function operating on the `JournalMapType`
Result (measured with `BM_symbol_list_compaction` release benchmarks):
The 1K×1K peak of 33.5 MB is close to the theoretical N × 32 B = 32 MB floor (small overhead from `unordered_map` buckets and the `seen_in_existing` set in the merge path).
4.
a3cff203a: Add C++ benchmarks
What it does:
Adds `cpp/arcticdb/version/test/benchmark_symbol_list.cpp` with the `BM_symbol_list_compaction` Google Benchmark suite. The benchmark:
- Calls `list_symbols` (which triggers compaction) and records wall time and peak RSS memory
- Covers `(N_symbols, N_versions)` pairs: (100×100), (1K×100), (1K×1K), (10K×10), (10K×100), plus an S3-mock variant for (1K×100). Most are commented out so that they do not interfere with CI but remain usable for local benchmarking
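To make the 32-byte figure and the batched reconstruction from commit faa552974 concrete, here is a rough sketch under stated assumptions. The field names follow the description, but the field types, `ActionSketch`, `FullKeySketch`, `JournalMapSketch` and `drain_in_batches` are hypothetical and not taken from the ArcticDB source. Three 8-byte fields plus two 1-byte fields occupy 26 bytes, which 8-byte alignment pads to 32; at 1K×1K entries that gives 10^6 × 32 B = 32 MB. The drain function is a simplification: it erases a symbol's entry once its keys have been copied into a batch, and hands each full batch to a caller-supplied callback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

enum class ActionSketch : std::uint8_t { Add, Remove };  // hypothetical names

// Compact per-entry record; the symbol itself lives only in the map key.
struct alignas(8) JournalEntrySketch {
    std::uint64_t key_version_id;
    std::int64_t creation_ts;
    std::uint64_t content_hash;
    ActionSketch action;
    bool is_new_style;
};
static_assert(sizeof(JournalEntrySketch) == 32, "26 B of fields padded to 32 B");

// Hypothetical stand-in for the full key rebuilt at deletion time.
struct FullKeySketch {
    std::string symbol;
    std::uint64_t version_id;
    std::int64_t creation_ts;
    std::uint64_t content_hash;
};

using JournalMapSketch = std::unordered_map<std::string, std::vector<JournalEntrySketch>>;

// Rebuilds full keys batch by batch and frees map entries as it goes.
// Returns the number of batches passed to `dispatch`.
template <typename Dispatch>
std::size_t drain_in_batches(JournalMapSketch& map, std::size_t batch_size, Dispatch&& dispatch) {
    std::vector<FullKeySketch> batch;
    std::size_t batches = 0;
    auto flush = [&] {
        if (batch.empty()) return;
        dispatch(batch);
        batch.clear();
        ++batches;
    };
    for (auto it = map.begin(); it != map.end();) {
        for (const auto& e : it->second) {
            batch.push_back(FullKeySketch{it->first, e.key_version_id, e.creation_ts, e.content_hash});
            if (batch.size() >= batch_size) flush();
        }
        it = map.erase(it);  // release this symbol's entries before moving on
    }
    flush();  // dispatch any final partial batch
    return batches;
}
```

A caller would pass a batch size of 10 000 and a callback that issues the deletion, so at most one batch of full-size keys is alive at a time.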