Skip to content

Reduce symbol list peak heap usage#3165

Open
G-D-Petrov wants to merge 7 commits into
masterfrom
gpetrov/fix_sl_refactor
Open

Reduce symbol list peak heap usage#3165
G-D-Petrov wants to merge 7 commits into
masterfrom
gpetrov/fix_sl_refactor

Conversation

@G-D-Petrov

Copy link
Copy Markdown
Collaborator

1. b73841165 — Change load approach to stream the entries rather than keeping all of them

What it does:

Replaces the two-pass approach (get_all_symbol_list_keys + load_journal_keys) with a single
streaming pass (load_journal_streaming). Instead of materialising all N AtomKeys into a
vector<AtomKey> and then building the update map from it, the streaming pass builds the
MapType (symbol → vector<SymbolEntryData>) directly during iteration.

Result:

Scan phase no longer requires a temporary vector<AtomKey>. The update map is built in one
iteration rather than two. The all_keys list (compaction path only) still holds N×AtomKey
(~160 B each), so peak memory during compaction is unchanged at this stage.


2. 1dc284d78 — Bug fixes of the first commit

What it does:

A set of correctness fixes and clean-up on top of commit 1:

  • Adds total_key_count tracking to JournalResult and wires it into LoadResult.total_key_count_
  • Changes needs_compaction to use total_key_count_ instead of symbol_list_keys_.size(),
    so the compaction threshold counts all journal keys (not just those kept for deletion)
  • Fixes the "Would delete unseen key" check: the old code called
    std::get<StringIndex>(key.start_index()) which would crash on numeric symbols; the new
    code uses symbols_in_merge.count(...) and handles both string and numeric StreamId
  • Removes LoadResult.timestamp_ (unused)
  • Changes write_symbols to take CollectionType by value

3. faa552974 — Refactor the SL update map to store a more compacted version of the SL keys

What it does:

The main memory reduction commit. Replaces all SymbolEntryData / MapType / vector<AtomKey>
storage for journal entries with a single 32-byte JournalEntryData struct and unified
JournalMapType.

Key changes:

  • JournalEntryData (32 B): stores key_version_id, creation_ts, content_hash,
    action, is_new_style — enough to reconstruct the full AtomKey for deletion without
    holding the symbol string (the map key provides that)
  • Unified map: MapType (symbol → vector<SymbolEntryData>) replaced by JournalMapType
    (symbol → vector<JournalEntryData>) on all paths; removes the collect_keys boolean and
    its branching throughout load_journal_streaming and attempt_load
  • LoadResult: symbol_list_keys_: vector<VariantKey> (N×~160 B) replaced by
    update_map_: JournalMapType (N×32 B) + old_compaction_keys_: vector<VariantKey> (0–1 entries)
  • Lazy batch deletion in compact_internal: reconstructs AtomKeys from JournalEntryData
    in batches of 10 000, erasing each symbol's entry after dispatch to free memory incrementally
  • Removed dead code: key_sort_comparator, add_update_map_entry, sort_update_map_entries,
    and the old merge_existing_with_journal_map (MapType version); merge_existing_with_journal_map
    is now the single merge function operating on the JournalMapType

Result (measured with BM_symbol_list_compaction release benchmarks):

Scenario Before After Reduction
1 000 symbols × 1 000 versions (1 M entries) 68.4 MB 33.5 MB −51 %
10 000 symbols × 100 versions (1 M entries) 76.5 MB 43.7 MB −43 %

The 1K×1K peak of 33.5 MB is close to the theoretical N×32 B = 32 MB floor (small overhead
from unordered_map buckets and the seen_in_existing set in the merge path).


4. a3cff203a — Add C++ benchmarks

What it does:

Adds cpp/arcticdb/version/test/benchmark_symbol_list.cpp with the
BM_symbol_list_compaction Google Benchmark suite. The benchmark:

  • Writes N symbols × V versions of journal entries to an in-memory LMDB store
  • Runs list_symbols (which triggers compaction) and records wall time and peak RSS memory
  • Parameterised over (N_symbols, N_versions) pairs: (100×100), (1K×100), (1K×1K), (10K×10),
    (10K×100), plus an S3-mock variant for (1K×100) - most are commented to not interfere with the CI but still be usable for local benchmarking

Comment thread cpp/arcticdb/version/symbol_list.cpp Outdated
// (e.g. when a previous compaction round failed in the deletion step), we don't want to delete the former
if (atom_key != exclude)
variant_keys.emplace_back(atom_key);
to_remove.emplace_back(std::move(atom_key));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refactor broke delete_keys. Kept keys are now pushed into to_remove — the same vector being iterated by the range-for — instead of variant_keys. This is undefined behaviour (modifying the container during iteration can reallocate/invalidate the iterator) and, separately, variant_keys stays empty so remove_keys_sync(variant_keys) deletes nothing.

Suggested change
to_remove.emplace_back(std::move(atom_key));
if (atom_key != exclude)
variant_keys.emplace_back(std::move(atom_key));

delete_keys is still called from backwards_compat_compact (test helper), so the backwards-compat path no longer deletes old keys.

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

ArcticDB Code Review Summary

A well-structured memory optimization for symbol-list loading. The single-pass streaming load, the 32-byte JournalEntryData, and the lazy batched deletion are sound, and the AtomKey reconstruction for deletion correctly preserves version_id/content_hash/end_index across old- and new-style keys. The new DirectPathEquivalence tests are a good addition.

Latest commit (209250e) is test-only: test_recursive_normalizers.py now derives the nesting depth from DEFAULT_RECURSE_LIMIT (imported from msgpack.fallback) instead of hardcoded 255/256, mirroring the production check depth > DEFAULT_RECURSE_LIMIT // 2. This makes the test robust to msgpack-version differences in the recursion limit. No issues; all previously raised issues remain resolved.

Correctness

  • delete_keys fixed (symbol_list.cpp:802). Kept keys are now pushed into variant_keys (not the to_remove vector being iterated), so the UB / iterator-invalidation is gone and remove_keys_sync(variant_keys) deletes the right keys. A test in test_symbol_list.cpp covers the backwards_compat_compact path.

PR Title & Description

  • Label is patch; no no-release-notes label is set. This is an internal C++ refactor / peak-memory reduction with no public-API or on-disk-format change — confirm no-release-notes is intended, or add release notes if the peak-memory reduction should be announced.

Note

  • The benchmark file polls mallinfo2() from a sampler thread at 100µs intervals (benchmark_symbol_list.cpp). This is test-only / Linux-glibc-gated and non-blocking, so fine, but be aware peak figures are sampled, not exact.

Georgi Petrov added 3 commits June 11, 2026 11:25
JournalEntryData referenced ActionType before the enum was declared,
causing 'unknown type name' on GCC/Clang and cascading std::sort errors.
Move the enum above JournalEntryData.
The deep-nesting tests hardcoded 256/255 levels and a '255 levels'
message, all derived from msgpack's DEFAULT_RECURSE_LIMIT being 511.
msgpack 1.2.0 bumped it to 512, so the flattener's limit (LIMIT//2)
became 256 and the over-limit test stopped raising. Derive the bounds
from DEFAULT_RECURSE_LIMIT so the tests track msgpack's value.
@G-D-Petrov G-D-Petrov added the patch Small change, should increase patch version label Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

patch Small change, should increase patch version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant