mssmt: batched descent InsertMany#2188
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request optimizes the MS-SMT tree insertion process by introducing a batched approach for InsertMany. By decoupling the tree traversal from the storage write operations, the system can now perform atomic overflow checks and significantly reduce the number of database round-trips and memory allocations. This architectural change provides substantial performance improvements for both in-memory operations and persistent database stores. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
250b789 to
c9b377c
Compare
There was a problem hiding this comment.
Code Review
This pull request significantly improves the performance of batched MS-SMT insertions (InsertMany) for both FullTree and CompactedTree by splitting the insertion into a read-only descent phase and a deferred write-out phase. This change allows for atomic overflow checks on the effective delta before writing to storage. Additionally, memory allocations are reduced by caching node hashes and sums by value and using stack-resident buffers during hashing. Property tests and benchmarks have been added to verify these changes. The review feedback suggests adding a missing function comment to comply with the style guide, optimizing partitionByBit to run in-place to eliminate heap allocations, and reusing this helper in FullTree.batchInsert to reduce code duplication.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
InsertMany on both FullTree and CompactedTree was a transaction- batched loop over single-leaf insert: each leaf paid a fresh walkDown from the root, a fresh walkUp materialising one BranchNode per level, and an UpdateRoot — the same cost as N calls to Insert wrapped in one tx.Update. The form bore the name but not the act. Reform InsertMany on both trees to apply the whole batch in a single recursive descent: at each touched internal node, fetch children once, partition items by the next key bit, recurse into each non-empty side, build one new BranchNode for the level, write one delete/insert pair. For CompactedTree, descending into an empty subtree with items short-circuits via buildSubtree, which compacts to a single CompactedLeafNode when the batch reduces to one non-empty leaf — preserving the existing compaction semantics. Single-leaf Insert is untouched; the change is contained to InsertMany and one new pair of helpers per tree (batchInsert, buildSubtree on CompactedTree; batchInsert on FullTree). Benchmark (BenchmarkMssmtInsertMany on a fresh CompactedTree, -benchtime=3x): - leaves=100: 6.31ms / 219k allocs -> 3.24ms / 127k allocs - leaves=1000: 64.8ms / 2.17M allocs -> 32.0ms / 1.24M allocs - leaves=10000: 597ms / 21.6M allocs -> 326ms / 12.2M allocs The companion BenchmarkMssmtBulkInsert (Insert-in-loop, untouched) moves only within noise, confirming the single-insert path was not regressed. For random 256-bit keys, paths only begin to converge in the top ~log2(N) levels — the bottom ~240 levels see disjoint paths and pay the per-leaf cost regardless. The measured ~2x time and ~1.7x alloc reduction comes from amortised walkDown reads, CompactedTree empty-subtree shortcuts, and the single UpdateRoot call. Property suite in mssmt/insert_many_property_test.go (rapid): TestInsertManyEquivalence pins the new path to Insert-loop semantics across both Tree implementations; TestInsertManyRoundTrip verifies every inserted leaf is Get-able and its MerkleProof verifies against the resulting root; TestInsertManyOverPopulated exercises CompactedLeafNode-bearing trees by chaining two InsertMany calls; TestInsertManySumOverflow asserts cumulative- sum overflow surfaces as ErrIntegerOverflow rather than wrapping.
BranchNode.NodeHash allocated four times per call: the sha256 hasher
state, the digest slice from h.Sum(nil), the cached *NodeHash, and
the cached *uint64 sum. The hasher allocation is unnecessary — the
input is always exactly 72 bytes (left hash || right hash || sum)
— and the pointer cache fields escape to the heap on every branch
construction purely as "is computed yet" sentinels.
Replace the cache with value-typed fields plus ok flags, and compute
the hash via sha256.Sum256 over a stack-resident 72-byte buffer. The
same pointer-escape fix applies to LeafNode.nodeHash; its hasher
allocation stays because the leaf value is variable-length, but
h.Sum is rewritten to write into a stack buffer and the sum encoding
moves from binary.Write to PutUint64 + Write.
Benchmark deltas vs the prior InsertMany reform on the same fresh
CompactedTree workload, -benchtime=3x:
- geomean wall-clock: -22%
- geomean B/op: -63%
- geomean allocs/op: -80%
InsertMany/leaves=10000: 326ms / 539MB / 12.2M allocs ->
251ms / 205MB / 2.5M allocs
BulkInsert/leaves=10000: 592ms / 929MB / 21.6M allocs ->
452ms / 336MB / 4.4M allocs
The win hits BulkInsert (Insert-in-loop) and InsertMany identically
since both paths share the BranchNode hashing surface.
Insert and InsertMany historically checked overflow differently.
InsertMany used a final-state-correct check (currentRoot +
batchSum - existingBatchSum must fit uint64). Single Insert used
the more conservative currentRoot + leafSum, which rejects valid
replacements where the prior leaf's sum drops out of the root.
The asymmetry showed up on the seeded case: with existing
k=MaxUint64-5, Insert(k, 10) was rejected while InsertMany({k:10})
was accepted, even though both produce the same final state (root
sum = 10).
The reason Insert couldn't just adopt the correct check was
structural: both APIs interleaved reads (walkDown, GetChildren)
and writes (Insert/DeleteBranch, Insert/DeleteLeaf,
Insert/DeleteCompactedLeaf) in one recursive descent. To learn
the prior leaf's sum before touching storage — as the correct
check requires — the descent would have had to walk the tree
twice.
Split the descent into a read phase and a write phase:
1. The recursion takes a *[]mutation slice. Every tx.InsertX /
tx.DeleteX call is replaced with an append to that slice.
Reads still go through tx. The recursion returns the new
node and an existingSum — the sum of any leaves at batch
keys (or the single Insert key) being replaced within this
subtree — threaded up through the call stack.
2. The public Insert/InsertMany wrapper runs the descent, uses
the returned existingSum to compute the effective batch
delta, checks for uint64 overflow, and only then flushes the
mutation queue via applyAll(tx, muts). Overflow rejection
happens atomically: nothing has touched storage yet.
Net effects:
- FullTree.Insert and CompactedTree.Insert now share the
replacement-aware overflow check with InsertMany. The prior
leaf's sum rides on the existing walkDown's return value
(FullTree) or is captured at the CompactedLeafNode+matching-key
case (CompactedTree); no second walk needed.
- InsertMany no longer pre-walks the tree O(N) times to compute
existingBatchSum; the batchInsert descent accumulates it.
- sumExistingLeavesFull and the inline equivalent in
CompactedTree.InsertMany are removed.
- Eliminates the interleaved-read-and-write anti-pattern: the
recursion is now a pure (modulo reads) plan computation, and
applyAll is the explicit effectful flush.
Bench (BenchmarkMssmtBulkInsert/leaves=10000 on a fresh
CompactedTree, -benchtime=3x):
- before: 529ms / 7.00M allocs
- after: 458ms / 4.42M allocs (-13% time, -37% allocs)
InsertMany on fresh trees is unchanged within noise (was already
empty-tree fast-pathed). InsertMany on populated trees pays
strictly fewer reads now (the per-key pre-walk is gone), though
the existing bench doesn't exercise that shape.
Bundled in the same commit:
- New file mssmt/mutation.go for the mutation type and helpers
(insertBranch/deleteBranch/... wrappers + applyAll).
- FullTree.Insert switched from t.Root(ctx) (which opened a
separate View tx inside the Update closure) to tx.RootNode().
- CompactedTree.batchInsert fast-path for the no-op compacted-leaf
case: when a batch contains only deletes of absent keys and the
current node is a CompactedLeafNode none of them target, return
the leaf unchanged instead of deleting and re-inserting it
through buildSubtree.
- buildSubtree's items[:0] in-place compaction pinned with a
comment documenting the caller-owns-this-slice invariant.
- Property suite extensions:
- drawBatchLeaf draws EmptyLeafNode ~25% of the time, exercising
the deletion paths in batchInsert and buildSubtree.
- TestExistingSumGroundTruth: for any random batch on a random
tree, the existingSum derived from the observed root-sum
delta must equal sum of tree.Get(k).NodeSum() over k in the
batch — the descent's accounting must be honest, since the
overflow gate depends on it.
- TestInsertOverflowReplacementParity: pins Insert and
InsertMany to the same acceptance set on the seeded
replacement case.
The in-tree InsertMany bench so far only ran against a fresh
CompactedTree over an in-memory NewDefaultStore, so it didn't
exercise the two cases the branch's perf claims care about most:
populated-tree InsertMany, and InsertMany over a persistent
SQL-backed store where each fanned-out tx.InsertX / DeleteX pays
real I/O.
Add two benchmarks that cover both.
mssmt/bench_test.go: BenchmarkInsertManyPopulated seeds a
CompactedTree with N leaves and then measures a single InsertMany
call with M new leaves. On an Apple M4, -benchtime=1x:
seed=1000 / batch=100 -> 4.0 ms / 42k allocs
seed=1000 / batch=1000 -> 35.7 ms / 357k allocs
seed=10000 / batch=100 -> 4.4 ms / 44k allocs
seed=10000 / batch=1000 -> 44.3 ms / 426k allocs
Per-key cost rises only ~10% going from a 1k- to a 10k-populated
tree, consistent with the design: the bottom ~240 levels of a
256-bit-keyspace tree are per-leaf regardless of batch shape, so
amortization gains come only from the top ~log2(seed+batch) levels.
tapdb/mssmt_bench_test.go: BenchmarkTreeInsertMany contrasts N
single tree.Insert calls against one tree.InsertMany call, both
inside the same outer transaction and wrapped with the
treeStoreWrapperTx pattern that real callers (burn_tree,
ignore_tree, supply_tree) use — so the delta isolates the batching
win from any fresh-tx overhead.
SQLite (local file):
batch=100 -> InsertLoop 86 ms vs InsertMany 13 ms (~6.7x)
batch=1000 -> InsertLoop 4.5 s vs InsertMany 61 ms (~75x)
Postgres 15 (dockertest container, tcp/localhost):
batch=100 -> InsertLoop 609 ms vs InsertMany 44 ms (~13.7x)
batch=1000 -> InsertLoop 187 s vs InsertMany 351 ms (~533x)
Amplification grows with batch size because each per-key walkDown +
walkUp + UpdateRoot triple fans out into per-node DB calls, and the
socket latency of Postgres over dockertest amplifies the wall-clock
gap further at scale.
The Postgres bench runs only under -tags=test_db_postgres; the
default (no-tag) run uses SQLite via NewTestDB's existing build-tag
switch.
c9b377c to
76f659c
Compare
Partially resolves #451.
InsertMany previously performed N single-leaf Inserts wrapped in one transaction, so internal nodes shared by the batch were recomputed once per leaf rather than once per batch. Both Insert and InsertMany also interleaved reads and writes in the same recursive descent, which forced them onto asymmetric overflow checks (single Insert rejected valid replacements the batched path accepted) and prevented atomic rejection when a batch would have overflowed the root sum.
This PR reshapes InsertMany into a single recursive descent that partitions items by the next key bit at each touched internal node, so shared work is done once per batch. The descent is split into a read phase (walk the tree, queue mutations without applying them) and a write phase (flush the mutation queue). Both APIs get the same replacement-aware check and overflow rejections become atomic. Per-call heap allocations from BranchNode/LeafNode hashing are also cut, similar to what was done in #2183.
Net wins: on 10k-leaf in-memory batches, InsertMany is roughly 2× faster with ~8× fewer allocations end-to-end; against a live store InsertMany runs ~75× faster than the equivalent Insert-loop on SQLite, and ~533× on Postgres.
(N.b., the new InsertMany can be profitably used in more places than it currently is, but I'll make those changes in a separate PR.)