
Pgq perf experiments #23

Open

NikolayS wants to merge 13 commits into master from pgq-perf-experiments

Conversation

@NikolayS
Owner

No description provided.

Nik Samokhvalov and others added 13 commits April 22, 2026 11:18
Add wait__event__start and wait__event__end probes to the DTrace
provider definition and invoke them from the static inline functions
pgstat_report_wait_start() and pgstat_report_wait_end().

Because these functions are static inline, they get inlined at every
call site (~100 locations across 36 files), leaving no function symbol
for eBPF uprobes to attach to. USDT probes solve this: the compiler
emits a nop instruction at each inlined site with ELF .note.stapsdt
metadata, allowing eBPF tools to discover and attach to all call sites
with a single probe definition.

This enables full eBPF-based wait event tracing (e.g., with bpftrace)
without requiring hardware watchpoints or PostgreSQL source patches
beyond this change.

When built without --enable-dtrace, the probes compile to do {} while(0)
with zero overhead.

PoC: covers all wait events via the two central inline functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7 ideas ranked by confidence (9/10 to 1/10) based on profiling
PgQ insert_event() at ~148k ev/s on PG 18.3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With 10+ concurrent inserters (common in queue workloads like PgQ),
8 WAL insertion locks become a serialization bottleneck. Raising to
32 spreads inserters across more locks, reducing LWLock:WALInsert
wait time.

The tradeoff is that WAL flush must iterate all locks, but 32
iterations vs 8 is negligible compared to the I/O cost of flushing.
The additional shared memory is only ~3 KB (24 extra 128-byte
WALInsertLockPadded structs).

All usages of NUM_XLOGINSERT_LOCKS are via the macro (modulo for
lock selection, loop bounds, shared memory sizing) — no hardcoded
assumptions about the value being 8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detailed design analysis for spreading concurrent inserters across N
target blocks (smgr_targblock[4]) instead of one, hashing by
MyProcNumber. Documents data structure changes, code changes (~30 LOC),
all interaction risks (FSM, VACUUM, visibility map, COPY/bistate,
extension lock, smgr invalidation), and a gated prototype plan that
depends on benchmarking Ideas 3+4 first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a BulkInsertState to the ModifyTable executor node so that INSERT
statements going through the standard executor path get the same
bulk-extend and FSM-bypass optimizations that COPY uses.

Previously, only COPY created a BulkInsertState; the executor INSERT
path always passed bistate=NULL to table_tuple_insert(), meaning:
- No adaptive bulk extension (already_extended_by was never tracked)
- Pre-extended pages went into the FSM where other backends could
  grab them, wasting the extension effort
- No buffer pin caching across consecutive inserts

This change creates a BulkInsertState in ExecInitModifyTable() for
non-FDW INSERT operations and passes it through to table_tuple_insert()
in ExecInsert(). For partitioned tables, ReleaseBulkInsertStatePin()
is called when the target partition changes, following the same pattern
COPY uses.

The main beneficiary is INSERT ... SELECT (multi-row inserts within a
single plan execution). For single-row SPI INSERTs (e.g., PgQ), the
bistate is created/destroyed per plan execution, so adaptive scaling
does not accumulate across calls — a future Phase 2 optimization would
persist the bistate across SPI calls within an SPI connection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Change smgr_targblock from a single BlockNumber to an array of 4 slots.
Each backend hashes into its own slot via (MyProcNumber & 3), so concurrent
inserters naturally target different heap pages instead of all fighting for
exclusive BufferContent lock on the same 8KB page.

This addresses the ~23% LWLock:BufferContent contention seen in profiling
of append-heavy queue workloads at 10+ concurrent inserters.

Changes:
- smgr.h: Add SMGR_TARGBLOCK_SLOTS (4), change smgr_targblock to array
- rel.h: Add RelationTargetBlockSlot() macro, update RelationGetTargetBlock
  and RelationSetTargetBlock to index by backend slot
- smgr.c: Initialize all 4 slots to InvalidBlockNumber in smgropen() and
  smgrrelease()
- storage.c: Same initialization fix in RelationTruncate()

The existing Assert sites in createas.c, matview.c, and heapam_handler.c
work unchanged because they check the current backend's slot, which will
be InvalidBlockNumber when no writes have occurred to a freshly created
relation. BulkInsertState (COPY) bypasses target blocks entirely via
bistate->current_buf, so COPY is unaffected.

Memory overhead: 12 bytes extra per SMgrRelation entry (3 additional
BlockNumber slots). Full build passes with zero warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A/B test stock PG 18 vs patched PG 19dev with:
- NUM_XLOGINSERT_LOCKS 8→32
- BulkInsertState for executor INSERT
- Multi-slot target blocks (4 slots)

Peak: 193k ev/s with 100B payloads; 120k ev/s (= 235 MB/s) with 2KB payloads.
Biggest win: multi-slot target blocks at 16+ clients nearly double
small-payload throughput.

Also tested wal_compression (lz4, zstd) — hurts with random JSON.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lysis

- 8 target slots regresses 19.5% vs 4 — cache pressure, reverted
- PL/pgSQL mode: 95-125k ev/s, 73-82% of C at low concurrency
- Sequence contention not visible — Lock:extend is the real bottleneck
- AIO writes not available yet (reads only in PG19dev)
- IO:DataFileWrite at 51% is the ceiling — needs async writes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tested PG 19dev AIO settings:
- io_max_concurrency=128: +40% individually but unreliable in combo
- debug_io_direct='data': +26% individually, conflicts with other settings
- effective_io_concurrency=200: +35% individually
- Combined: worse than individual (settings fight each other)

Also tested:
- TOAST: non-factor (2KB stays inline)
- ev_txid index: 3% overhead (acceptable for consumer reads)
- Sequence cache=100: 0.9% (noise)

Full summary table of all 14 optimization attempts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key finding: PG 19dev unpatched is SLOWER than PG 18 at low concurrency
(likely debug/AIO overhead). Our 3 patches provide +98% to +264%
improvement on PG 19dev baseline.

commit_delay=50 with commit_siblings=3 gives +25% for free (config only).

Updated full optimization inventory (14 items tested).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
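The config-only group-commit change above amounts to two real postgresql.conf GUCs; the values are the ones from this benchmark run, not universal recommendations:

```
# postgresql.conf — group-commit tuning from the benchmark above
commit_delay = 50       # microseconds to wait for more committers
commit_siblings = 3     # only delay when >= 3 other xacts are active
```

`commit_delay` only kicks in when at least `commit_siblings` other transactions are open, so idle systems pay no latency penalty.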
Key findings:
- wal_compression=lz4 HELPS (+5.6%) with compressible JSON — earlier
  "hurts" finding was measurement noise. Now RECOMMENDED.
- Full cycle: 1K batch 53% faster on patched PG, consumer ops sub-ms
- commit_delay inconclusive on macOS (too much variance)
- 16 total optimizations tested across 5 rounds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Final results on stock PG 18.3 with proven config tuning:
- PL/pgSQL (pgq2): 88k ev/s (1KB JSON), 75k ev/s (2KB) = 147 MB/s
- C mode: 161k ev/s (100B), 135k ev/s (1KB) = 141 MB/s
- Full cycle: 145ms for 10K events end-to-end
- Consumer ops (tick/next/finish): 1-3ms constant
- Tuning guide: synchronous_commit=off, shared_buffers, max_wal_size, lz4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
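As a sketch, the tuning guide settings named above map to the following postgresql.conf fragment; the sizes for `shared_buffers` and `max_wal_size` are placeholder examples (the commit does not state exact values) and must be sized to the host:

```
# postgresql.conf — sketch of the tuning guide settings
synchronous_commit = off     # don't wait for WAL flush at commit
wal_compression = lz4        # helps with compressible JSON payloads
shared_buffers = '8GB'       # example value only, size to the machine
max_wal_size = '16GB'        # example value only, eases checkpoint pressure
```

Note `synchronous_commit = off` trades a small window of potential transaction loss on crash for throughput, which is usually acceptable for queue workloads.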
Final round: fill-then-drain test shows consumer is NOT the bottleneck
(33K ev/s for giant 4.87M-event batch, 302K ev/s for normal batches).

7 rounds, 16 optimizations tested. Practical limit reached for this
hardware. Final numbers and recommended tuning guide documented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@NikolayS force-pushed the pgq-perf-experiments branch from d45d100 to e8f8754 on April 22, 2026 at 18:19