
Pgq perf experiments #23

Open

NikolayS wants to merge 13 commits into master from pgq-perf-experiments

Conversation

@NikolayS
Owner

No description provided.

Nik Samokhvalov and others added 13 commits April 22, 2026 11:18
Add wait__event__start and wait__event__end probes to the DTrace
provider definition and invoke them from the static inline functions
pgstat_report_wait_start() and pgstat_report_wait_end().

Because these functions are static inline, they get inlined at every
call site (~100 locations across 36 files), leaving no function symbol
for eBPF uprobes to attach to. USDT probes solve this: the compiler
emits a nop instruction at each inlined site with ELF .note.stapsdt
metadata, allowing eBPF tools to discover and attach to all call sites
with a single probe definition.

This enables full eBPF-based wait event tracing (e.g., with bpftrace)
without requiring hardware watchpoints or PostgreSQL source patches
beyond this change.

When built without --enable-dtrace, the probes compile to do {} while(0)
with zero overhead.

PoC: covers all wait events via the two central inline functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7 ideas ranked by confidence (9/10 to 1/10) based on profiling
PgQ insert_event() at ~148k ev/s on PG 18.3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With 10+ concurrent inserters (common in queue workloads like PgQ),
8 WAL insertion locks become a serialization bottleneck. Raising to
32 spreads inserters across more locks, reducing LWLock:WALInsert
wait time.

The tradeoff is that WAL flush must iterate all locks, but 32
iterations vs 8 is negligible compared to the I/O cost of flushing.
The additional shared memory is only ~3 KB (24 extra 128-byte
WALInsertLockPadded structs).

All usages of NUM_XLOGINSERT_LOCKS are via the macro (modulo for
lock selection, loop bounds, shared memory sizing) — no hardcoded
assumptions about the value being 8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detailed design analysis for spreading concurrent inserters across N
target blocks (smgr_targblock[4]) instead of one, hashing by
MyProcNumber. Documents data structure changes, code changes (~30 LOC),
all interaction risks (FSM, VACUUM, visibility map, COPY/bistate,
extension lock, smgr invalidation), and a gated prototype plan that
depends on benchmarking Ideas 3+4 first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a BulkInsertState to the ModifyTable executor node so that INSERT
statements going through the standard executor path get the same
bulk-extend and FSM-bypass optimizations that COPY uses.

Previously, only COPY created a BulkInsertState; the executor INSERT
path always passed bistate=NULL to table_tuple_insert(), meaning:
- No adaptive bulk extension (already_extended_by was never tracked)
- Pre-extended pages went into the FSM where other backends could
  grab them, wasting the extension effort
- No buffer pin caching across consecutive inserts

This change creates a BulkInsertState in ExecInitModifyTable() for
non-FDW INSERT operations and passes it through to table_tuple_insert()
in ExecInsert(). For partitioned tables, ReleaseBulkInsertStatePin()
is called when the target partition changes, following the same pattern
COPY uses.

The main beneficiary is INSERT ... SELECT (multi-row inserts within a
single plan execution). For single-row SPI INSERTs (e.g., PgQ), the
bistate is created/destroyed per plan execution, so adaptive scaling
does not accumulate across calls — a future Phase 2 optimization would
persist the bistate across SPI calls within an SPI connection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Change smgr_targblock from a single BlockNumber to an array of 4 slots.
Each backend hashes into its own slot via (MyProcNumber & 3), so concurrent
inserters naturally target different heap pages instead of all fighting for
exclusive BufferContent lock on the same 8KB page.

This addresses the ~23% LWLock:BufferContent contention seen in profiling
of append-heavy queue workloads at 10+ concurrent inserters.

Changes:
- smgr.h: Add SMGR_TARGBLOCK_SLOTS (4), change smgr_targblock to array
- rel.h: Add RelationTargetBlockSlot() macro, update RelationGetTargetBlock
  and RelationSetTargetBlock to index by backend slot
- smgr.c: Initialize all 4 slots to InvalidBlockNumber in smgropen() and
  smgrrelease()
- storage.c: Same initialization fix in RelationTruncate()

The existing Assert sites in createas.c, matview.c, and heapam_handler.c
work unchanged because they check the current backend's slot, which will
be InvalidBlockNumber when no writes have occurred to a freshly created
relation. BulkInsertState (COPY) bypasses target blocks entirely via
bistate->current_buf, so COPY is unaffected.

Memory overhead: 12 bytes extra per SMgrRelation entry (3 additional
BlockNumber slots). Full build passes with zero warnings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A/B test stock PG 18 vs patched PG 19dev with:
- NUM_XLOGINSERT_LOCKS 8→32
- BulkInsertState for executor INSERT
- Multi-slot target blocks (4 slots)

Peak: 193k ev/s with 100B payloads; 120k ev/s (= 235 MB/s) with 2KB payloads.
Biggest win: multi-slot target blocks at 16+ clients nearly double
small-payload throughput.

Also tested wal_compression (lz4, zstd) — hurts with random JSON.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lysis

- 8 target slots regresses 19.5% vs 4 — cache pressure, reverted
- PL/pgSQL mode: 95-125k ev/s, 73-82% of C at low concurrency
- Sequence contention not visible — Lock:extend is the real bottleneck
- AIO writes not available yet (reads only in PG19dev)
- IO:DataFileWrite at 51% is the ceiling — needs async writes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tested PG 19dev AIO settings:
- io_max_concurrency=128: +40% individually but unreliable in combo
- debug_io_direct='data': +26% individually, conflicts with other settings
- effective_io_concurrency=200: +35% individually
- Combined: worse than individual (settings fight each other)

Also tested:
- TOAST: non-factor (2KB stays inline)
- ev_txid index: 3% overhead (acceptable for consumer reads)
- Sequence cache=100: 0.9% (noise)

Full summary table of all 14 optimization attempts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key finding: PG 19dev unpatched is SLOWER than PG 18 at low concurrency
(likely debug/AIO overhead). Our 3 patches provide +98% to +264%
improvement on PG 19dev baseline.

commit_delay=50 with commit_siblings=3 gives +25% for free (config only).

Updated full optimization inventory (14 items tested).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
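The config-only group-commit change above amounts to two real postgresql.conf GUCs; the values are the ones from this benchmark run, not universal recommendations:

```
# postgresql.conf — group-commit tuning from the benchmark above
commit_delay = 50       # microseconds to wait for more committers
commit_siblings = 3     # only delay when >= 3 other xacts are active
```

`commit_delay` only kicks in when at least `commit_siblings` other transactions are open, so idle systems pay no latency penalty.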
Key findings:
- wal_compression=lz4 HELPS (+5.6%) with compressible JSON — earlier
  "hurts" finding was measurement noise. Now RECOMMENDED.
- Full cycle: 1K batch 53% faster on patched PG, consumer ops sub-ms
- commit_delay inconclusive on macOS (too much variance)
- 16 total optimizations tested across 5 rounds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Final results on stock PG 18.3 with proven config tuning:
- PL/pgSQL (pgq2): 88k ev/s (1KB JSON), 75k ev/s (2KB) = 147 MB/s
- C mode: 161k ev/s (100B), 135k ev/s (1KB) = 141 MB/s
- Full cycle: 145ms for 10K events end-to-end
- Consumer ops (tick/next/finish): 1-3ms constant
- Tuning guide: synchronous_commit=off, shared_buffers, max_wal_size, lz4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
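As a sketch, the tuning guide settings named above map to the following postgresql.conf fragment; the sizes for `shared_buffers` and `max_wal_size` are placeholder examples (the commit does not state exact values) and must be sized to the host:

```
# postgresql.conf — sketch of the tuning guide settings
synchronous_commit = off     # don't wait for WAL flush at commit
wal_compression = lz4        # helps with compressible JSON payloads
shared_buffers = '8GB'       # example value only, size to the machine
max_wal_size = '16GB'        # example value only, eases checkpoint pressure
```

Note `synchronous_commit = off` trades a small window of potential transaction loss on crash for throughput, which is usually acceptable for queue workloads.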
Final round: fill-then-drain test shows consumer is NOT the bottleneck
(33K ev/s for giant 4.87M-event batch, 302K ev/s for normal batches).

7 rounds, 16 optimizations tested. Practical limit reached for this
hardware. Final numbers and recommended tuning guide documented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@NikolayS force-pushed the pgq-perf-experiments branch from d45d100 to e8f8754 on April 22, 2026 at 18:19