tapdb: parallelize universe proof ingest #2194
Inserting a sequence of leaves that reuses keys must produce the same tree as inserting only the final leaf observed per key. This is the invariant that permits coalescing consecutive updates to the same key into a single insert of the latest value, as the multiverse tree does for universe roots.
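The invariant can be illustrated with a toy model. The sketch below is not the mssmt property test and uses no mssmt types: `update`, `toyRoot`, `replayAll` and `coalesce` are hypothetical names, and the "tree" is a plain map whose root is a hash over its sorted entries. It only shows the shape of the check: replaying every update and replaying the coalesced updates must give the same root.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// update is a hypothetical (key, value) write; it is not an mssmt type.
type update struct {
	key   string
	value string
}

// toyRoot folds a key/value store into one digest by hashing the entries in
// sorted key order. It stands in for a real sparse Merkle tree root.
func toyRoot(store map[string]string) [32]byte {
	keys := make([]string, 0, len(store))
	for k := range store {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		v := store[k]
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(v), v)
	}

	var root [32]byte
	copy(root[:], h.Sum(nil))
	return root
}

// replayAll applies every update in order, overwriting reused keys.
func replayAll(updates []update) [32]byte {
	store := make(map[string]string)
	for _, u := range updates {
		store[u.key] = u.value
	}
	return toyRoot(store)
}

// coalesce keeps only the final update observed per key.
func coalesce(updates []update) []update {
	last := make(map[string]int)
	for i, u := range updates {
		last[u.key] = i
	}
	out := make([]update, 0, len(last))
	for i, u := range updates {
		if last[u.key] == i {
			out = append(out, u)
		}
	}
	return out
}

func main() {
	updates := []update{
		{"a", "1"}, {"b", "2"}, {"a", "3"}, {"b", "4"}, {"c", "5"},
	}
	// Same root whether we replay everything or only the last write per key.
	fmt.Println(replayAll(updates) == replayAll(coalesce(updates))) // true
}
```

A real property test would generate random update sequences and compare actual tree roots; the map-based model here only makes the last-write-wins argument concrete.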
Split upsertMultiverseLeafEntry into its write half and a separate multiverseRootAndProof read helper. The batch insert path previously called the combined function once per item, computing a multiverse root and inclusion proof each time only to discard them (the batch callers never read those fields), and rewriting the same universe's multiverse leaf once per item. UpsertProofLeafBatch now tracks the final universe root per universe and upserts each universe's multiverse leaf exactly once. This is sound because SMT insertion is last-write-wins per key (see the mssmt property test). DeleteProofLeaf similarly stops computing a discarded root and proof. Single-leaf paths keep their semantics: they call the write half and then fetch the root and inclusion proof explicitly.
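A minimal sketch of the batch behaviour described above, under stated assumptions: `store`, `batchItem`, `insertUniverseLeaf`, `upsertMultiverseLeaf` and `upsertBatch` are hypothetical stand-ins, roots are plain strings, and no database is involved. It only shows the bookkeeping: track the final root per universe during the batch, then perform one multiverse write per universe.

```go
package main

import "fmt"

// batchItem is a hypothetical stand-in for one proof leaf in a batch.
type batchItem struct {
	universe string
	leaf     string
}

// store is a toy model in which roots are strings and writes are counted.
type store struct {
	universeRoots    map[string]string
	multiverseLeaves map[string]string
	multiverseWrites int
}

func newStore() *store {
	return &store{
		universeRoots:    make(map[string]string),
		multiverseLeaves: make(map[string]string),
	}
}

// insertUniverseLeaf models the per-universe write and returns the new root.
func (s *store) insertUniverseLeaf(it batchItem) string {
	root := s.universeRoots[it.universe] + "|" + it.leaf
	s.universeRoots[it.universe] = root
	return root
}

// upsertMultiverseLeaf models the write half only: it computes neither a
// multiverse root nor an inclusion proof.
func (s *store) upsertMultiverseLeaf(universe, root string) {
	s.multiverseLeaves[universe] = root
	s.multiverseWrites++
}

// upsertBatch inserts every item, remembers the final root per universe, and
// writes each universe's multiverse leaf exactly once.
func (s *store) upsertBatch(items []batchItem) {
	finalRoots := make(map[string]string)
	for _, it := range items {
		finalRoots[it.universe] = s.insertUniverseLeaf(it)
	}
	for universe, root := range finalRoots {
		s.upsertMultiverseLeaf(universe, root)
	}
}

func main() {
	s := newStore()
	s.upsertBatch([]batchItem{
		{"u1", "a"}, {"u2", "b"}, {"u1", "c"}, {"u1", "d"},
	})
	fmt.Println(s.multiverseWrites) // 2: one write per universe, not per item
}
```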
Summary of Changes
Hello, I'm Gemini Code Assist! Here's a summary to help you and other reviewers quickly get up to speed. This pull request optimizes the proof ingestion process by disentangling the writing of universe-specific trees from the shared multiverse tree. By moving the shared tree updates to a batched, coalesced writer, the system avoids row-level contention that previously forced serial execution under Postgres's serializable isolation. This change yields a substantial performance gain in concurrent environments and includes a robust reconciliation process to ensure data integrity after potential crashes.
Code Review
This pull request introduces a multiverseRootCoalescer to batch and serialize writes to the shared multiverse trees, significantly improving concurrent universe proof ingest performance on Postgres by avoiding serialization conflicts. It also adds startup reconciliation to repair any diverged multiverse entries, along with comprehensive unit and property-based tests. Feedback on the changes highlights a potential deadlock issue in the coalescer's flush loop: if a panic occurs during a batch flush, the flushing state is never reset, permanently blocking future flushes. Resetting the flushing state in a deferred function is recommended, so that it is cleared on every exit path.
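The review's suggestion can be sketched as follows. This is not the PR's code: `coalescer`, `tryFlush` and `safeFlush` are hypothetical names, and the struct is reduced to the one flag under discussion. The point is only that a deferred reset runs even when the flush function panics.

```go
package main

import (
	"fmt"
	"sync"
)

// coalescer is reduced to the single piece of state the review talks about.
type coalescer struct {
	mu       sync.Mutex
	flushing bool
}

// tryFlush runs flushFn unless a flush is already in flight, and reports
// whether it ran.
func (c *coalescer) tryFlush(flushFn func()) bool {
	c.mu.Lock()
	if c.flushing {
		c.mu.Unlock()
		return false
	}
	c.flushing = true
	c.mu.Unlock()

	// Clear the flag on every exit path, including a panic in flushFn.
	defer func() {
		c.mu.Lock()
		c.flushing = false
		c.mu.Unlock()
	}()

	flushFn()
	return true
}

// safeFlush converts a panicking flush into an error for this demonstration.
func safeFlush(c *coalescer, flushFn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("flush panicked: %v", r)
		}
	}()
	c.tryFlush(flushFn)
	return nil
}

func main() {
	c := &coalescer{}
	fmt.Println(safeFlush(c, func() { panic("boom") })) // flush panicked: boom
	fmt.Println(c.tryFlush(func() {}))                  // true: not stuck
}
```

Without the deferred reset, the second call would return false forever, which is the stuck state the review describes.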
Every proof leaf insert must reflect its universe's new root in the shared multiverse tree for its proof type. Doing that write inside each insert's own transaction makes any two concurrent inserts collide on the multiverse root rows, regardless of which universes they touch: under Postgres serializable isolation one of them aborts and retries with backoff, effectively serializing ingest across universes (and making it slower than actually serial, since the retry backoff starts at 20-60ms). The coalescer funnels all multiverse writes through a single flusher using leader-based group commit: the first caller to find it idle flushes pending updates in rounds until none remain, while other callers just await their result. Updates accumulate while a flush is in flight and are applied together in one transaction, at most one leaf write per universe (last-write-wins per SMT key). Waiters receive the post-flush multiverse root and their universe's inclusion proof, preserving the response semantics of the insert paths. Not yet wired into the insert paths; this commit adds the component, its concurrency unit test, and a property test comparing flushed state against an in-memory oracle tree.
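A minimal sketch of leader-based group commit as described above, assuming a much simpler interface than the real component: `coalescer`, `flushFunc`, `submit` and `runRounds` are hypothetical names, roots are strings, the flush callback stands in for the single transaction, and inclusion proofs and panic handling are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// flushFunc applies one batch (universe ID -> latest root) in a single
// transaction and returns the post-flush multiverse root. Hypothetical.
type flushFunc func(batch map[string]string) (string, error)

// result is what each waiter receives once its update has been flushed.
type result struct {
	root string
	err  error
}

type coalescer struct {
	mu       sync.Mutex
	pending  map[string]string // last write wins per universe
	waiters  []chan result
	flushing bool
	flush    flushFunc
}

func newCoalescer(flush flushFunc) *coalescer {
	return &coalescer{pending: make(map[string]string), flush: flush}
}

// submit queues an update and blocks until a flush has carried it.
func (c *coalescer) submit(universe, root string) (string, error) {
	w := make(chan result, 1) // buffered so the leader can notify itself

	c.mu.Lock()
	c.pending[universe] = root
	c.waiters = append(c.waiters, w)
	leader := !c.flushing // the first caller to find it idle becomes leader
	if leader {
		c.flushing = true
	}
	c.mu.Unlock()

	if leader {
		c.runRounds()
	}
	res := <-w
	return res.root, res.err
}

// runRounds flushes pending updates in rounds until none remain.
func (c *coalescer) runRounds() {
	for {
		c.mu.Lock()
		if len(c.pending) == 0 {
			c.flushing = false
			c.mu.Unlock()
			return
		}
		batch, waiters := c.pending, c.waiters
		c.pending, c.waiters = make(map[string]string), nil
		c.mu.Unlock()

		root, err := c.flush(batch)
		for _, w := range waiters {
			w <- result{root: root, err: err}
		}
	}
}

func main() {
	state := make(map[string]string)
	c := newCoalescer(func(batch map[string]string) (string, error) {
		for u, r := range batch {
			state[u] = r
		}
		return fmt.Sprintf("root-over-%d-universes", len(state)), nil
	})

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_, _ = c.submit(fmt.Sprintf("universe-%d", i), "root")
		}(i)
	}
	wg.Wait()
	fmt.Println(len(state)) // 8: every submitted universe was flushed
}
```

Updates and their waiters are queued under the same lock, so an empty pending map implies there is nobody left to notify. How many flush rounds the eight submissions take depends on scheduling; only the final state is deterministic.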
MultiverseStore.UpsertProofLeaf now commits only the per-universe rows in its own transaction, then reflects the universe's new root in the shared multiverse tree via the root coalescer. Insert transactions for different universes no longer touch any shared rows, so they can commit in parallel on Postgres instead of aborting each other through serialization failures on the multiverse root. Response semantics are preserved: the caller still receives the multiverse root and inclusion proof, now from the flush that carried its update. Under concurrent inserts into the same universe, those fields may reflect a slightly newer universe root than the one in the same response, which is the accurate post-flush state. A failed flush leaves the universe leaf committed and the multiverse entry stale; the entry is healed by the universe's next successful update. BaseUniverseTree.UpsertProofLeaf keeps its inline multiverse write, as it has no production callers (tests and bench fixtures only).
UpsertProofLeafBatch now commits only the per-universe rows in its own transaction, then submits each universe's final root to the root coalescer, matching the single-leaf path. With this, every insert-path write to the shared multiverse trees flows through the coalescer, so insert transactions never contend on rows shared across universes. Batch callers never consume multiverse roots or inclusion proofs, so the new batch entry point registers waiters that skip generating them; a flush only computes the root and proof for universes with a single-leaf waiter attached. Deletion paths (DeleteProofLeaf, DeleteUniverse) keep their inline multiverse writes: they are rare administrative operations, and any collision with a flush is absorbed by the existing transaction retry.
Since multiverse updates are written outside the proof insert transaction, a daemon stopping between a universe commit and its multiverse flush leaves the shared tree committing to a stale root, or missing the universe's leaf entirely. The multiverse trees are fully derived data (each leaf commits to a universe root), so this is always repairable. ReconcileMultiverse compares every universe root against its multiverse leaf and rewrites diverged entries through the root coalescer. It runs during server construction, before the store serves concurrent traffic. The leaf construction rule is extracted into multiverseLeafNode so the insert path and the reconciliation check share it. Covered by a deterministic crash-window test and a property test mixing healthy, orphaned and tampered universes.
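The reconciliation check can be sketched on plain maps. This is an illustration, not ReconcileMultiverse itself: `diverged` and `reconcile` are hypothetical names, a leaf is modelled as the universe root string rather than built by multiverseLeafNode, and the repair writes directly instead of going through the root coalescer.

```go
package main

import (
	"fmt"
	"sort"
)

// diverged lists universes whose multiverse leaf is missing or does not
// match the current universe root, in sorted order.
func diverged(universeRoots, multiverseLeaves map[string]string) []string {
	var out []string
	for id, root := range universeRoots {
		leaf, ok := multiverseLeaves[id]
		if !ok || leaf != root {
			out = append(out, id)
		}
	}
	sort.Strings(out)
	return out
}

// reconcile rewrites every diverged entry from the universe root, which is
// the source of truth, and returns the IDs it repaired.
func reconcile(universeRoots, multiverseLeaves map[string]string) []string {
	ids := diverged(universeRoots, multiverseLeaves)
	for _, id := range ids {
		multiverseLeaves[id] = universeRoots[id]
	}
	return ids
}

func main() {
	universeRoots := map[string]string{
		"healthy":  "root-1",
		"orphaned": "root-2", // committed, but its flush never ran
		"tampered": "root-3", // multiverse entry points at a stale root
	}
	multiverseLeaves := map[string]string{
		"healthy":  "root-1",
		"tampered": "root-0",
	}

	fmt.Println(reconcile(universeRoots, multiverseLeaves)) // [orphaned tampered]
	fmt.Println(reconcile(universeRoots, multiverseLeaves)) // []
}
```

The second call finding nothing to repair reflects that the multiverse entries are derived data: once they match the universe roots, reconciliation is a no-op.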
With the multiverse root row moved out of the insert transactions, the remaining concurrency bottleneck is the flush transaction itself: under serializable isolation its SMT walks take page-level predicate locks (the recursive CTE plans as a bitmap index scan) that false-share index pages with every in-flight universe transaction, so flushes abort and retry with backoff, stalling all of their waiters. The flush is the sole writer of the multiverse namespaces, so serializable isolation buys it nothing: run it at read committed on Postgres. A non-serializable writer takes no predicate locks and does not flag conflicts on serializable readers, so flushes can neither abort nor be aborted by concurrent universe transactions. The single-writer invariant that makes this safe is enforced by the coalescer's single-flusher role plus a process-wide multiverse write mutex now shared with the deletion paths, the only other multiverse writers. Benchmarked on Postgres 15 (docker, 1k pre-populated universes, 8 workers inserting into distinct universes): concurrent ingest goes from 21.8 leaves/s at the merge base (worse than its 57.0 leaves/s serial, due to serialization-failure backoff) to 65.7 leaves/s, a 3x improvement, with no remaining inversion. SQLite is unaffected: the isolation override is Postgres-only.
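As a rough sketch of a Postgres-only isolation override, the snippet below uses the standard library's `database/sql` options. The `backend` type and the `flushTxOptions` and `isReadCommitted` helpers are hypothetical; how tapdb actually selects the isolation level is not shown in this PR text. In real use the returned options would be passed to `(*sql.DB).BeginTx`.

```go
package main

import (
	"database/sql"
	"fmt"
)

// backend is a hypothetical tag for the configured database.
type backend int

const (
	backendSQLite backend = iota
	backendPostgres
)

// flushTxOptions returns the transaction options for the multiverse flush.
// A nil result means "no override": keep whatever the store normally uses.
func flushTxOptions(b backend) *sql.TxOptions {
	if b == backendPostgres {
		// Safe only while the flush is the sole writer of the
		// multiverse namespaces.
		return &sql.TxOptions{Isolation: sql.LevelReadCommitted}
	}
	return nil
}

// isReadCommitted reports whether opts requests read committed isolation.
func isReadCommitted(opts *sql.TxOptions) bool {
	return opts != nil && opts.Isolation == sql.LevelReadCommitted
}

func main() {
	fmt.Println("postgres override:", isReadCommitted(flushTxOptions(backendPostgres)))
	fmt.Println("sqlite override:", flushTxOptions(backendSQLite) != nil)
}
```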
(TLDR, disentangle the universe and multiverse writers. Postgres go brrrrrrrr. Fable's summary follows.)
Previously, every proof insert updated two trees inside one transaction: its own universe's tree, and the shared multiverse tree that summarizes all universe roots. Because the shared tree's root was rewritten by every insert, any two concurrent inserts — even into completely unrelated universes — collided on the same rows. Under Postgres's serializable isolation, a collision means abort-and-retry with backoff, so concurrent ingest was effectively single-writer, and in practice slower than serial: measured at 8 workers, 21.8 leaves/s concurrent vs 57.0 serial.
The fix is to decouple the two updates. An insert's transaction now touches only its own universe's state; the shared tree is maintained by a single coalescing writer that collects updates from all inserts and applies them in batches, one transaction per batch. Since only the latest root per universe matters, concurrent updates merge rather than queue, and callers still receive the post-flush multiverse root and inclusion proof, so RPC semantics are unchanged.
The multiverse tree is purely derived data, so a crash between universe commit and flush is healed by startup reconciliation — and its single-writer discipline lets the flush run at read committed on Postgres, taking it out of serialization-conflict detection entirely.
Result: 65.7 leaves/s concurrent vs 61.8 serial on the same benchmark, a 3x improvement over the previous 21.8 leaves/s concurrent, with the inversion gone. Hooking #2188's batched-descent InsertMany into the flush should collapse each batch into a single descent per tree, which we expect to close much of the gap between the current ~65 leaves/s and the ~700 leaves/s ceiling measured for the universe transactions alone.