Add wait_event_timing: Oracle-style wait event instrumentation #1
DmitryNFomin wants to merge 7 commits into master
Add section to Performance documenting the proposed PostgreSQL patch (DmitryNFomin/postgres#1) that adds Oracle-style internal wait event instrumentation with near-zero overhead (<1%) vs hardware watchpoints (6-29%).

Benchmark results on same environment (Hetzner cx43, 8 vCPU):
- TPC-B: <1% vs 6% (hardware watchpoint)
- SELECT-only worst case: <1% vs 29%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7069864 to 7cadcee
Consolidated Fix Plan for PR #1
PHASE 1: Blockers (must fix before any commit consideration)
1. Shared memory footprint -- show-stopper (all 3 reviews)
2. Missing privilege checks on SQL functions (all 3 reviews)
3. Stale dead-backend data leaks (all 3 reviews)
4. Missing catversion bump (all 3 reviews)
5.
PHASE 2: Correctness (bugs and data integrity)
6. LWLock tranche IDs can exceed 256 (all 3 reviews)
7. Trace ring buffer reader/writer race (GPT + Opus reviews)
8. Query hash entry publication order race (GPT review)
9. No guard against negative duration (Opus review)
10.
PHASE 3: Missing features / dead code
11.
12. Query hash
PHASE 4: API / design improvements
13. Histogram returned as text, not int4[] (all 3 reviews)
14. Query attribution uses stale
15. Auxiliary processes excluded without documentation (GPT + Opus reviews)
16. Histogram uses while-loop instead of hardware CLZ (Opus review)
17. Histogram bucket counters are int32, may wrap (GPT review)
18. GUC extern declarations misplaced (Opus review)
19. Wasted array slots for class IDs 0x00 and 0x02 (Opus review)
20. Unsupported-build GUC UX (GPT + Opus reviews)
PHASE 5: Integration (PostgreSQL conventions)
21. Add system views (all 3 reviews)
22. Add SGML documentation (all 3 reviews)
23. Add regression / TAP tests (all 3 reviews)
24. Header comment claims direct shmem reading (GPT review)
Recommended patch split (strong consensus from all 3 reviews)
…ocation

The WaitEventTraceState array was 4 MB per backend, allocated unconditionally at startup via ShmemInitStruct (400 MB at max_connections=100). Replace it with DSA-backed lazy allocation: a small WaitEventTraceControl struct (~8 bytes per backend) lives in fixed shmem, and the 4 MB ring buffer is allocated via dsa_create/dsa_allocate only when a backend SETs wait_event_trace = on.

Key changes:
- Add WaitEventTraceControl with dsa_handle, LWLock, per-backend dsa_pointers
- Add LWTRANCHE_WAIT_EVENT_TRACE_DSA for the control LWLock
- Lazy DSA creation on first use (any backend), attach on subsequent backends
- GUC assign hook (assign_wait_event_trace) triggers allocation on enable
- Cleanup via before_shmem_exit callback (runs before dsm_backend_shutdown detaches DSA segments, avoiding the use-after-detach crash that would occur if dsa_free ran from ProcKill's on_shmem_exit callback)
- Fix pg_stat_get_wait_event_trace proisstrict: it was defaulting to true, causing a NULL argument (own backend) to skip the C function entirely

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Add wait_event_timing: Oracle-style wait event instrumentation
Per-event timing with histograms, per-query attribution, and a per-session trace ring buffer for PostgreSQL. Controlled by a compile-time flag (
The problem
PostgreSQL exposes the current wait event per backend (
External approaches have significant limitations:
Oracle solved this decades ago by instrumenting internally with
What this patch provides
Compile-time: Without the flag, the binary is identical to stock PostgreSQL (verified via
Two runtime GUCs:
Oracle equivalents:
Design decisions
Shared memory cost
Benchmark results
pgbench, PG 19devel, 60-second runs, 3 runs per data point, median with outlier cleaning.
Configs: A = stock PG (no flag), B = compiled with flag (GUCs off), C = timing ON, D = timing+trace ON
SELECT-only (worst case -- highest wait event transition rate):
TPC-B:
CPU-bound (SELECT 1+1):
All configurations within run-to-run variance. No measurable overhead.
Testing
Files changed (36 files)
New:
Key modified:
Related work
Address all 20 issues identified by cross-checking Gemini 3.1, Opus 4.6, and GPT 5.4 reviews against the actual PR code. Each fix follows the approved plan in crosscheck_review_fix_3.md.

Security / correctness fixes:

Issue #1 (UAF in trace ring reads):
- Remove pg_stat_get_wait_event_timing_by_query() entirely (OID 9951, view, docs, tests) -- cross-backend DSA ring scanning was the primary use-after-free surface.
- Restrict pg_stat_get_wait_event_trace() to own-session only; the backend_id argument is accepted for forward compat but ignored.
- Wrap DSA pointer mutations (attach/detach) in WaitEventTraceCtl->lock LW_EXCLUSIVE to protect external readers and fix 32-bit torn pointers (also resolves Issue #11).
- Document the ring buffer as an external lock-free transport with a seqlock reader contract in wait_event_timing.h and monitoring.sgml.

Issue #6 (cross-backend reset race):
- Add per-backend LWLock to WaitEventTimingState, following the pgstat_reset_entry() pattern from pgstat_shmem.c.
- Writers acquire LW_SHARED on the hot path; reset acquires LW_EXCLUSIVE.
- Init and exit paths zero fields individually instead of memset on the whole struct, preserving the LWLock.
- Register LWTRANCHE_WAIT_EVENT_TIMING in lwlocklist.h.

Build / startup fixes:

Issue #2: Replace non-existent wait_classes.h include with wait_event_types.h in wait_event_timing.h.

Issue #3 (config-file trace broken): Add explicit wait_event_trace_attach() calls after ResourceOwner init in auxprocess.c (after CreateAuxProcessResourceOwner) and postinit.c (after process_settings) to handle inherited GUC values that do not re-fire assign hooks after fork.

Issue #10 (non-timing build silent suppression): Emit WARNING before forcing wait_event_timing/wait_event_trace to off in check hooks for non-timing builds (source < PGC_S_INTERACTIVE path).

Issue #14 (EXEC_BACKEND init order): Add comment in pgstat_set_wait_event_timing_storage() confirming SubPostmasterMain calls CreateSharedMemoryAndSemaphores() before InitProcess().

Query lifecycle / phase separation:

Issue #4 (QUERY_END broken for multi-statement):
- Hybrid approach: emit QUERY_END for old query and QUERY_START for new query in pgstat_report_query_id() transitions (backend_status.c).
- Keep existing QUERY_END in send_ready_for_query (postgres.c) for the last-query-to-idle boundary.
- Covers simple single/multi-stmt, extended, prepared, and pipeline protocols without double-emit.

Issue #5 (phase separation): Add TRACE_EXEC_END marker (type=4) emitted from ExecutorEnd() in execMain.c, separating execution from post-exec cleanup (commit, WAL flush, lock release). Also resolves Issue #13 (extended protocol phase misattribution).

Overflow and dependency warnings:

Issue #7: One-time WARNING on first LWLock hash overflow (>192 tranches).

Issue #8: WARNING in assign_wait_event_trace when track_activities is off; document the dependency in config.sgml.

Issue #12: Add flat_overflow_count field + one-time WARNING when eventId exceeds per-class slot count.

Header cleanup and code generation:

Issue #15: Move static const arrays and internal inline functions from wait_event_timing.h to wait_event_timing.c as file-static. Convert trace marker inlines to extern functions with a shared wait_event_trace_write_marker() helper.

Issue #16 (parallel array validation): Extend generate-wait_event_types.pl to emit wait_event_timing_data.c (mapping arrays) and the three sizing #defines into wait_event_types.h, generated from wait_event_names.txt. Eliminates hand-maintained parallel arrays entirely. Update Makefile, meson.build, and pgindent exclusions.

Documentation:

Issue #17: Document InitMaterializedSRF usage as acceptable for both the timing SRF (bounded result set) and trace SRF (own-session only, deferred max_records parameter).

Issue #18: Document ring wrap limitation, polling contract, and st_query_id fallback in wait_event_timing.h and monitoring.sgml.

Issue #19 nits: Remove duplicate header include, remove install/ entries from .gitignore, add tranche_id 65535 academic comment, add eventId-equals-array-index comment, verify OID range (9950/9952/9953).

Testing:

Issue #20:
- Add trace lifecycle regression test (off/on/query/read/off cycle).
- Add auxiliary process timing test (verify checkpointer/bgwriter/walwriter accumulate non-zero wait event counts).
- Add test_lwlock_hash_overflow() to test_wait_event_stress module: registers >192 custom LWLock tranches to exercise hash overflow path and verify WARNING emission.
- Document that default Cirrus CI already exercises non-timing build path via wait_event_timing_1.out alternate expected output.

Issues resolved by other fixes (no separate action needed):

Issue #9: Resolved by Issue #1 (by_query view removed).
Issue #11: Resolved by Issue #1C (LWLock around DSA pointer mutations).
Issue #13: Resolved by Issue #5 (EXEC_START/EXEC_END markers).

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
given that upstream has structurally refactored two subsystems we depend on (ShmemInit callbacks in ipci.c, and instr_time ticks-to-ns conversion)?
Compile-time: ./configure --enable-wait-event-timing (autoconf)
-Dwait_event_timing=true (meson)
Without either, the binary is byte-identical to stock PostgreSQL on
the hot path; the feature is fully gated behind USE_WAIT_EVENT_TIMING.
Runtime: one enum GUC wait_event_capture = off | stats | trace
(PGC_SUSET). "stats" enables per-event count, total, max, and log2
histogram accumulation. "trace" additionally enables a per-session
ring buffer of individual events (Oracle 10046-style) for query
attribution and timeline analysis. In stub builds the GUC rejects
any value other than off with a WARNING or ERROR (depending on source),
matching the "feature is compile-time" expectation.
This bundle adds only the plumbing: the build-system option, the GUC
declaration and stub hooks, the postgresql.conf.sample entry, the
installation/config SGML sections, and the extern declarations in
guc_hooks.h and guc.h that the core instrumentation in the next bundle
will implement.
Originally shipped across 14 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
- ef4fade Add wait_event_timing: Oracle-style wait event instrumentation
- 1b95736 Add 10046-style per-session wait event trace ring buffer
- 7cadcee Make wait_event_timing a compile-time option (--enable-wait-event-timing)
- 937cd89 Fix #1: Replace unconditional trace ring allocation with DSA lazy allocation
- 8216ef5 Fix #5: Add --enable-wait-event-timing to configure, use pattern 1
- f927dd3 Fix #10: Warn when wait_event_trace enabled without wait_event_timing
- 8089e98 Fix #18: Replace bare extern GUC declarations with header include
- 4cc876d Fix #20: Add GUC check hooks that reject 'on' in unsupported builds
- bed120f Fix #22: Add SGML documentation for wait event timing
- c2d63f0 Fix review round 2: 17 issues from consolidated review
- 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
- 708b05a Close wait_event_trace DoS surface: PGC_USERSET -> PGC_SUSET
- 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
- 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
…ocol
Gated behind USE_WAIT_EVENT_TIMING. Adds:
* pgstat_report_wait_start() / _end() inline hot path (wait_event.h):
INSTR_TIME_SET_CURRENT() + per-backend write when capture level is
non-off. Out-of-line body pgstat_report_wait_end_timing() keeps
I-cache pressure low at the many inline call sites.
* Per-backend WaitEventTimingState (wait_event_timing.h,
wait_event_timing.c): a flat events[] array for bounded wait-event
classes + an open-addressing hash (512 slots, 192-entry cap) for
unbounded LWLock tranche IDs. Writer-exclusive single-writer slot
design, no locks on the hot path.
* Log2 histogram with 16 buckets on powers-of-two nanoseconds
(<1us, 1-2us, ... 8-16ms, >=16ms), computed via CLZ without a
runtime /1000. Overflow counters (lwlock_overflow_count,
flat_overflow_count) track events dropped when the per-backend
hash is saturated or eventIds fall outside the flat range.
* Lock-free cross-backend reset via pg_atomic_uint32
reset_generation: requester bumps the atomic + SetLatch; owner
clears its own counters on the next wait_end. Own-backend reset
is synchronous. reset_count is a monotonic observability counter
that external callers can poll to confirm a remote reset has been
observed.
* DSA-lazy allocation: the per-backend timing array
(~30KB/backend) lives in DSA and is created the first time any
backend enables capture. Clusters running with
wait_event_capture = off pay only the small control struct
(sizeof(WaitEventTimingControl)) in fixed shmem.
* Per-session trace ring buffer (WaitEventTraceState,
131072 x 32-byte records = 4 MB) allocated lazily via DSA when
wait_event_capture = trace. Seqlock (WaitEventTraceRecord.seq,
odd = write in progress) for torn-read detection. Released back
to DSA on step-down from trace to stats/off. Per-record layout
pinned with StaticAssertDecl.
* Generator wiring: generate-wait_event_types.pl emits
wait_event_timing_data.c (dense class mapping consumed by the
bucket lookup) and wait_event_types.h into src/backend/utils/.
Build rules in backend/utils/Makefile + activity/Makefile +
meson.build; .gitignore entries on both sides.
* Backend lifecycle: InitProcess (proc.c) and auxiliary processes
(auxprocess.c) call pgstat_set_wait_event_timing_storage() with
their procNumber; ProcKill and related exit paths call
pgstat_reset_wait_event_timing_storage().
backend_status.c exposes query_id pointer for query-attribution
markers. postinit.c hooks into client-backend startup.
* Two new LWLock tranches (lwlocklist.h) for DSA creation.
* ipci.c registers the two small control structs
(WaitEventTimingControl + WaitEventTraceControl) in fixed shmem;
everything else is lazy DSA.
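
The seqlock discipline described for the trace ring (odd seq = write in progress, readers retry or skip on a torn read) can be sketched as follows. This is a minimal illustration using C11 atomics, not the patch's code: the record layout, every field name other than seq, and the helper names trace_write/trace_read are assumptions, and the patch itself would use PostgreSQL's own barrier primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TRACE_RING_RECORDS 131072u	/* ring size stated in the description */

/* Hypothetical 32-byte record; only "seq" is named in the description. */
typedef struct TraceRecord
{
	_Atomic uint32_t seq;		/* odd while the writer is mid-update */
	_Atomic uint32_t event;		/* wait event identifier */
	_Atomic int64_t start_ns;	/* event start timestamp */
	_Atomic int64_t payload;	/* duration, or query_id for a marker */
	uint64_t	reserved;		/* padding up to 32 bytes */
} TraceRecord;

/* Pin the layout at compile time, in the spirit of StaticAssertDecl. */
static_assert(sizeof(TraceRecord) == 32, "record must stay 32 bytes");
static_assert((uint64_t) TRACE_RING_RECORDS * sizeof(TraceRecord) ==
			  4u * 1024u * 1024u, "ring must be 4 MB");

/* Single writer: the owning backend is the only caller. */
static void
trace_write(TraceRecord *rec, uint32_t event, int64_t start_ns, int64_t payload)
{
	uint32_t	seq = atomic_load_explicit(&rec->seq, memory_order_relaxed);

	/* Mark the slot busy (odd) before touching the payload. */
	atomic_store_explicit(&rec->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&rec->event, event, memory_order_relaxed);
	atomic_store_explicit(&rec->start_ns, start_ns, memory_order_relaxed);
	atomic_store_explicit(&rec->payload, payload, memory_order_relaxed);

	/* Publish: seq becomes even again after the payload stores. */
	atomic_store_explicit(&rec->seq, seq + 2, memory_order_release);
}

/* Any reader: returns false if the record was being written or changed. */
static bool
trace_read(TraceRecord *rec, uint32_t *event, int64_t *start_ns, int64_t *payload)
{
	uint32_t	before = atomic_load_explicit(&rec->seq, memory_order_acquire);

	if (before & 1)
		return false;			/* write in progress */

	*event = atomic_load_explicit(&rec->event, memory_order_relaxed);
	*start_ns = atomic_load_explicit(&rec->start_ns, memory_order_relaxed);
	*payload = atomic_load_explicit(&rec->payload, memory_order_relaxed);

	atomic_thread_fence(memory_order_acquire);
	return before == atomic_load_explicit(&rec->seq, memory_order_relaxed);
}
```

A reader that gets false simply drops or re-reads the record; the writer never waits, which is what keeps the hot path lock-free.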
Hot-path invariants (verified by regression tests in a later bundle):
- wait_start is zero between waits (set by wait_end_timing on exit)
- current_event cleared on reset to avoid stale-state readers
- reset_generation observation happens before accumulation, so the
completing wait credits to freshly-zeroed counters
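
The ordering in the last invariant (observe reset_generation first, then accumulate) can be illustrated with a small sketch. It assumes a simplified per-backend struct; BackendTiming, seen_generation, request_reset, on_wait_end, and the four-event array are hypothetical names, C11 atomics stand in for pg_atomic_uint32, and the SetLatch wake-up is reduced to a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define NUM_EVENTS 4			/* illustrative size only */

typedef struct BackendTiming
{
	_Atomic uint32_t reset_generation;	/* bumped by any requester */
	uint32_t	seen_generation;		/* private to the owning backend */
	_Atomic uint64_t reset_count;		/* monotonic, pollable by requesters */
	uint64_t	calls[NUM_EVENTS];
	uint64_t	total_ns[NUM_EVENTS];
} BackendTiming;

/* Requester side: ask another backend to reset; never touches its counters. */
static void
request_reset(BackendTiming *bt)
{
	atomic_fetch_add_explicit(&bt->reset_generation, 1, memory_order_release);
	/* The described protocol would also set the owner's latch here. */
}

/* Owner side: called when a wait ends. */
static void
on_wait_end(BackendTiming *bt, int event, uint64_t duration_ns)
{
	uint32_t	gen;

	assert(event >= 0 && event < NUM_EVENTS);

	/* Observe a pending reset BEFORE accumulating ... */
	gen = atomic_load_explicit(&bt->reset_generation, memory_order_acquire);
	if (gen != bt->seen_generation)
	{
		memset(bt->calls, 0, sizeof(bt->calls));
		memset(bt->total_ns, 0, sizeof(bt->total_ns));
		bt->seen_generation = gen;
		atomic_fetch_add_explicit(&bt->reset_count, 1, memory_order_release);
	}

	/* ... so the completing wait lands in the freshly zeroed counters. */
	bt->calls[event]++;
	bt->total_ns[event] += duration_ns;
}
```

Because only the owner writes the counters, the requester needs no lock; it can poll reset_count to learn that its request has been applied.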
Originally shipped across 48 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
- ef4fade Add wait_event_timing: Oracle-style wait event instrumentation
- 5120e80 Fix: remove pg_bitutils.h dependency, use simple bit shift for histogram bucket
- 6a81684 Fix segfault: handle auxiliary processes with procNumber >= MaxBackends
- 7cf09db Add query attribution: per (query_id, event) wait statistics
- e8a1717 Add query attribution: per (query_id, event) wait statistics
- 1b95736 Add 10046-style per-session wait event trace ring buffer
- 7cadcee Make wait_event_timing a compile-time option (--enable-wait-event-timing)
- 4ccf36a Increase trace ring buffer from 4K to 128K records per backend
- 0e04ef2 Fix: add missing tuplestore.h include for tuplestore_putvalues()
- 937cd89 Fix #1: Replace unconditional trace ring allocation with DSA lazy allocation
- e496cb2 Fix #2: Add privilege checks to wait event timing SQL functions
- 0fadd0a Fix #3: Zero shared memory on backend exit to prevent stale data leaks
- 8216ef5 Fix #5: Add --enable-wait-event-timing to configure, use pattern 1
- 5d821c8 Fix #6: LWLock tranche IDs can exceed 256 -- per-backend hash table
- 8999a5c Fix #7: Trace ring buffer reader/writer race -- seqlock protocol
- 37e2741 Fix #8: Query hash entry publication order race
- 1bea5c8 Fix #9: Clamp negative duration to zero
- f927dd3 Fix #10: Warn when wait_event_trace enabled without wait_event_timing
- 5a60203 Fix #11: Implement pg_stat_reset_wait_event_timing() SQL function
- 5aadd83 Fix #12: Redesign query attribution -- eliminate hash table, use trace ring
- a4e56f1 Fix #13: Return histogram as int4[] instead of text
- edb8d9b Fix #15: Include auxiliary processes in wait event timing
- 80c84f6 Fix #16: Replace histogram bucket loop with hardware CLZ
- 5a72b47 Fix #17: Promote histogram bucket counters from int32 to int64
- 4cc876d Fix #20: Add GUC check hooks that reject 'on' in unsupported builds
- c5abb1e Fix #21: Add system views and pid/backend_type to SRFs
- eb40cdd Fix #24: Rewrite header comment to point at SQL interface
- ebe34ba Add EXEC_START trace marker and phase column to by_query
- c2d63f0 Fix review round 2: 17 issues from consolidated review
- 6211659 Fix all compiler warnings in wait_event_timing.c
- 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
- 65479aa Fix round 3 regressions: remove locks on single-writer paths, fix tests
- 9bcf83c Fix cross-backend reset race: lock-free request-response (Option B)
- 72549e2 Remove phantom backend_id argument from pg_stat_get_wait_event_trace
- 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
- 9a359ab Fix stale wait state + assign-hook ereport contract violation
- 7478abb Expose per-backend wait event timing overflow counters
- 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
- d1ea636 Allocate per-backend wait event timing array lazily in DSA
- 379d650 Untrack generated wait_event_timing_data.c file
- 826df19 Switch wait_event_timing reset to PID-based, add _all() variant
- 46fe5d0 Release wait-event trace DSA ring on TRACE -> OFF/STATS
- 6641deb Zero current_event on reset (issue #9 purist fix)
- 835e22c Expose reset_count on wait_event_timing_overflow view (issue #18)
- 7e15c8a Reuse histogram ArrayType across SRF rows (issue #19)
- f75c9b7 Skip /1000 in histogram bucket computation (issue #20)
- 11ed373 Double LWLock timing hash slot count to 512 (issue #21)
- a4c133a Pin trace-ring invariants with StaticAssertDecl + barrier annotations (issue #22)
When wait_event_capture = trace and compute_query_id is on, insert
per-query markers into the per-session trace ring so external
consumers can attribute wait events back to their originating query
by scanning. Markers are tagged records (TRACE_QUERY_START,
TRACE_QUERY_END, TRACE_EXEC_START, TRACE_EXEC_END) sharing the same
WaitEventTraceRecord layout as wait-event entries; they carry a
query_id payload instead of a wait-event-info + duration payload.
The two hook points:
* tcop/postgres.c: QUERY_START / QUERY_END bracket the
exec_simple_query / exec_execute_message boundary, covering the
full statement lifecycle including parse/analyse/rewrite and
post-execution cleanup.
* executor/execMain.c: EXEC_START / EXEC_END bracket
standard_ExecutorRun / _ExecutorEnd so the planner-less execution
phase is distinguishable from query-level overhead on the
timeline.
Query attribution is read-time: pg_backend_wait_event_trace
materializes the ring and the caller partitions events by the
enclosing QUERY_START/END markers. This replaced an earlier design
with a per-backend query-id-to-stats hash in shared memory, which was
removed in favour of the trace-ring approach.
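
Read-time attribution amounts to a single pass over the materialized ring that tracks the enclosing QUERY_START/QUERY_END pair. The sketch below shows that pass under stated assumptions: the Rec struct, the RecKind enum, and wait_ns_for_query are hypothetical stand-ins for the real record layout, and a query_id of 0 is treated as "no enclosing query".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one trace-ring entry. */
typedef enum RecKind
{
	REC_WAIT,					/* ordinary wait event with a duration */
	REC_QUERY_START,			/* marker carrying a query_id */
	REC_QUERY_END				/* marker carrying a query_id */
} RecKind;

typedef struct Rec
{
	RecKind		kind;
	uint64_t	query_id;		/* meaningful for markers only */
	uint64_t	duration_ns;	/* meaningful for wait events only */
} Rec;

/*
 * Sum the wait time of all events enclosed by QUERY_START/QUERY_END
 * markers for the given query_id. Events outside any marker pair are
 * not attributed to any query.
 */
static uint64_t
wait_ns_for_query(const Rec *ring, size_t n, uint64_t query_id)
{
	uint64_t	current = 0;	/* 0 = not inside any query */
	uint64_t	total = 0;

	assert(ring != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		switch (ring[i].kind)
		{
			case REC_QUERY_START:
				current = ring[i].query_id;
				break;
			case REC_QUERY_END:
				current = 0;
				break;
			case REC_WAIT:
				if (current != 0 && current == query_id)
					total += ring[i].duration_ns;
				break;
		}
	}
	return total;
}
```

The same loop extends naturally to EXEC_START/EXEC_END if the consumer wants to separate executor time from statement-level overhead.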
Originally shipped across N commits, preserved at
origin/archive/wait-event-timing-review-rounds:
- ebe34ba Add EXEC_START trace marker and phase column to by_query
- 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
User-facing surface:
* pg_stat_get_wait_event_timing() SRF: one row per
(backend, wait_event) with non-zero calls. Columns: pid,
backend_type, backend_id, wait_event_type, wait_event, calls,
total_time_ms, avg_time_us, max_time_us, histogram (bigint[16]).
* pg_stat_get_wait_event_timing_overflow() SRF: one row per live
backend exposing lwlock_overflow_count, flat_overflow_count, and
reset_count. reset_count is the observability hook external
callers poll to confirm an asynchronous cross-backend reset has
landed.
* pg_get_backend_wait_event_trace() SRF: session-local dump of
the caller's own trace ring. Returns (seq, timestamp_ns,
wait_event_type, wait_event, duration_us, query_id). Name mirrors
pg_get_backend_memory_contexts to make the session-local scope
explicit at the API level; there is no cross-backend trace read.
* pg_stat_reset_wait_event_timing(pid int) / _all():
NULL/own pid resets own backend synchronously; other pid requires
superuser and goes through the async reset_generation protocol.
_all() is a dedicated function instead of a -1 overload, matching
pg_stat_reset_* conventions elsewhere.
System views over the functions:
* pg_stat_wait_event_timing, pg_stat_wait_event_timing_overflow
(REVOKE PUBLIC / GRANT pg_read_all_stats).
* pg_backend_wait_event_trace (PUBLIC; session-local anyway).
* pg_wait_event_timing_histogram_buckets: a constant 16-row
taxonomy view that translates histogram index -> (lower_ns,
upper_ns, label). Canonical join pattern documented in
monitoring.sgml is
FROM pg_stat_wait_event_timing w,
LATERAL unnest(w.histogram) WITH ORDINALITY AS h(count, idx)
JOIN pg_wait_event_timing_histogram_buckets b
ON b.bucket_idx = h.idx - 1.
SGML docs:
* A <sect2> per view in monitoring.sgml, mirroring existing
pg_stat_* view documentation.
* Columns typed, histogram bin edges described as powers of two
on nanoseconds with the approximate decimal-microsecond grid
called out (1024 ns ~ 1 us, ..., 2^24 ns ~ 16 ms).
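
The power-of-two bucket grid can be made concrete with a short sketch. The exact boundaries here are inferred from the description (bucket 0 below 1024 ns, bucket 15 at 2^24 ns and above), and the function names histogram_bucket and bucket_bounds are hypothetical; the patch's own edge handling may differ.

```c
#include <assert.h>
#include <stdint.h>

#define HISTOGRAM_BUCKETS 16

/* floor(log2(v)) for v > 0, using a count-leading-zeros builtin if present. */
static int
floor_log2_u64(uint64_t v)
{
	assert(v != 0);
#if defined(__GNUC__) || defined(__clang__)
	return 63 - __builtin_clzll(v);
#else
	int			r = 0;

	while (v >>= 1)
		r++;
	return r;
#endif
}

/*
 * Map a duration in nanoseconds to a bucket index without dividing by
 * 1000: bucket 0 is [0, 2^10), bucket k is [2^(k+9), 2^(k+10)) for
 * k = 1..14, and bucket 15 collects everything from 2^24 ns upward.
 */
static int
histogram_bucket(uint64_t duration_ns)
{
	int			b;

	if (duration_ns < 1024)
		return 0;
	b = floor_log2_u64(duration_ns) - 9;
	return b >= HISTOGRAM_BUCKETS ? HISTOGRAM_BUCKETS - 1 : b;
}

/* Inverse mapping: lower bound inclusive, upper bound exclusive. */
static void
bucket_bounds(int idx, uint64_t *lower_ns, uint64_t *upper_ns)
{
	assert(idx >= 0 && idx < HISTOGRAM_BUCKETS);
	*lower_ns = (idx == 0) ? 0 : (UINT64_C(1) << (idx + 9));
	*upper_ns = (idx == HISTOGRAM_BUCKETS - 1) ? UINT64_MAX
		: (UINT64_C(1) << (idx + 10));
}
```

Under this mapping a 1500 ns wait lands in bucket 1 (the "1-2us" bin), and the label "8-16ms" corresponds to the range from 8,388,608 ns up to 16,777,216 ns.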
Uses OIDs 9956-9960 (verified unclaimed upstream) and bumps
CATALOG_VERSION_NO. All functions have non-timing-build stubs
elsewhere that either return empty SRFs or raise
ERRCODE_FEATURE_NOT_SUPPORTED; SQL callers on stock PostgreSQL get
predictable behaviour.
Originally shipped across N commits, preserved at
origin/archive/wait-event-timing-review-rounds:
- ef4fade Add wait_event_timing: Oracle-style wait event instrumentation
- 7cf09db Add query attribution: per (query_id, event) wait statistics
- 1b95736 Add 10046-style per-session wait event trace ring buffer
- 937cd89 Fix #1: Replace unconditional trace ring allocation with DSA lazy allocation
- 281da3f Fix #4: Bump CATALOG_VERSION_NO for new pg_proc OIDs
- 5a60203 Fix #11: Implement pg_stat_reset_wait_event_timing() SQL function
- a4e56f1 Fix #13: Return histogram as int4[] instead of text
- 5a72b47 Fix #17: Promote histogram bucket counters from int32 to int64
- c5abb1e Fix #21: Add system views and pid/backend_type to SRFs
- bed120f Fix #22: Add SGML documentation for wait event timing
- ebe34ba Add EXEC_START trace marker and phase column to by_query
- c2d63f0 Fix review round 2: 17 issues from consolidated review
- 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
- 72549e2 Remove phantom backend_id argument from pg_stat_get_wait_event_trace
- 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
- 7478abb Expose per-backend wait event timing overflow counters
- 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
- 826df19 Switch wait_event_timing reset to PID-based, add _all() variant
- d412c87 Renumber wait_event_timing OIDs to 9956-9960 (issue #17)
- 835e22c Expose reset_count on wait_event_timing_overflow view (issue #18)
- f75c9b7 Skip /1000 in histogram bucket computation (issue #20)
- 7d14ecd Add pg_wait_event_timing_histogram_buckets taxonomy view (issue #23)
Two exported functions on a loadable extension in src/test/modules:
* stress_wait_events(n int) -> bigint
Calls pgstat_report_wait_start(WAIT_EVENT_PG_SLEEP) +
pgstat_report_wait_end() in a tight loop exactly n times.
Returns elapsed microseconds for sanity checks and lets the
regression test assert exact counts (count(*) = n, calls = n)
without the spurious-latch-wake non-determinism that pg_sleep()
suffers from on busy CI runners.
* test_lwlock_hash_overflow(n int) -> int
Registers n custom LWLock tranches via LWLockNewTrancheId() and
fires one wait-event cycle on each, exercising the per-backend
LWLock timing hash up to and past its 192-entry cap. The
regression test uses this to pin the overflow-counter semantics
(lwlock_overflow_count increments, one-shot WARNING fires, reset
clears the suppression).
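
The behaviour this test pins down (a 512-slot open-addressing table that stops accepting new keys at 192 entries and counts what it drops) can be sketched as below. The struct and function names, the hash function, and linear probing are assumptions made for illustration; only the slot count, the cap, and the overflow counter come from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_SLOTS		 512	/* power of two */
#define HASH_MAX_ENTRIES 192	/* cap well below the slot count */

static_assert((HASH_SLOTS & (HASH_SLOTS - 1)) == 0, "slots must be 2^n");

typedef struct Slot
{
	uint16_t	tranche_id;
	uint8_t		used;
	uint64_t	calls;
} Slot;

typedef struct TrancheHash
{
	Slot		slots[HASH_SLOTS];
	uint32_t	num_entries;
	uint64_t	overflow_count; /* events dropped because the cap was hit */
} TrancheHash;

/* Find the slot for tranche_id, inserting it if there is room under the cap. */
static Slot *
tranche_lookup(TrancheHash *h, uint16_t tranche_id)
{
	/* Multiplicative hash; keep the top 9 bits to index 512 slots. */
	uint32_t	pos = ((uint32_t) tranche_id * 2654435761u) >> 23;

	for (int probes = 0; probes < HASH_SLOTS; probes++)
	{
		Slot	   *s = &h->slots[pos];

		if (s->used && s->tranche_id == tranche_id)
			return s;
		if (!s->used)
		{
			if (h->num_entries >= HASH_MAX_ENTRIES)
			{
				h->overflow_count++;	/* saturated: drop and count */
				return NULL;
			}
			s->used = 1;
			s->tranche_id = tranche_id;
			h->num_entries++;
			return s;
		}
		pos = (pos + 1) & (HASH_SLOTS - 1); /* linear probing */
	}
	return NULL;				/* not reached while the cap < slot count */
}

/* Record one completed wait on the given tranche. */
static void
tranche_record(TrancheHash *h, uint16_t tranche_id)
{
	Slot	   *s = tranche_lookup(h, tranche_id);

	if (s != NULL)
		s->calls++;
}
```

Capping the load factor at 192/512 keeps probe sequences short, and because an empty slot always exists, a lookup for an unknown key terminates without scanning the whole table.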
Regression test covers:
* Basic "capture is wired" smoke.
* Deterministic exact-count assertions (count(*) = 5, calls = 5)
after stress_wait_events(5) under trace capture -- catches the
symmetric-duplication class of bugs that the pg_sleep-based
core test structurally can't detect.
* LWLock hash overflow at 192 entries: cap enforced, overflow
counter visible, WARNING fires once, reset clears counter.
Top-level pg_stat_reset_wait_event_timing(NULL) at start of the test
makes it idempotent against prior state in the same session.
Originally shipped across N commits, preserved at
origin/archive/wait-event-timing-review-rounds:
- ebe34ba Add EXEC_START trace marker and phase column to by_query
- c2d63f0 Fix review round 2: 17 issues from consolidated review
- 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
- 65479aa Fix round 3 regressions: remove locks on single-writer paths, fix tests
- 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
- 9a359ab Fix stale wait state + assign-hook ereport contract violation
- 5ad0dca Pin overflow-counter reset semantics with a regression assertion
- 46594e4 Reformulate ring-lifecycle test to be strict AND robust (issues #25, #26)
…utput)
Core regress test (wait_event_timing.sql / .out) exercises:
* GUC default + SET wait_event_capture = stats.
* Column types on pg_stat_wait_event_timing.
* A pg_sleep generates at least one PgSleep wait event, reflected
in calls/total/avg/max and summing correctly over the histogram.
* pg_stat_reset_wait_event_timing(NULL / own_pid / unknown_pid),
pg_stat_reset_wait_event_timing_all().
* Trace ring basic lifecycle: SET capture=trace, sleep, read the
ring, confirm query markers are present.
* Trace ring reallocation across a TRACE -> OFF/STATS -> TRACE
toggle. Strict-equal on count-agnostic invariants: phase 2's
max(seq) < phase 1's max(seq) iff the ring was freed; aggregated
calls exactly equals the sum of the two phase ring counts.
* Overflow counters visible (lwlock_overflow_count,
flat_overflow_count, reset_count) and the reset_count_bumped
invariant after resets earlier in the test.
The assertions avoid "pg_sleep emits exactly one PgSleep event" -- a
false premise under CPU contention on small CI VMs -- without
relaxing strictness: every equality is on a property that survives
spurious-wake variance. Symmetric-duplication coverage lives in the
stress module.
Alt expected (wait_event_timing_1.out) is selected by pg_regress when
the build lacks USE_WAIT_EVENT_TIMING: SET wait_event_capture !=
off fails with an ERROR, and the SRFs return empty.
parallel_schedule: wait_event_timing added to a group of 3 tests
small enough to not starve the single-backend timing assertions under
concurrent load.
rules.out: refreshed for the 4 new views in pg_catalog.
Originally shipped across N commits, preserved at
origin/archive/wait-event-timing-review-rounds:
- 995d99a Fix #23: Add regression tests for wait event timing
- ebe34ba Add EXEC_START trace marker and phase column to by_query
- c2d63f0 Fix review round 2: 17 issues from consolidated review
- 2f845dc Fix parallel_schedule: move wait_event_timing to smaller test group
- 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
- 65479aa Fix round 3 regressions: remove locks on single-writer paths, fix tests
- 72549e2 Remove phantom backend_id argument from pg_stat_get_wait_event_trace
- 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
- 7478abb Expose per-backend wait event timing overflow counters
- 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
- 826df19 Switch wait_event_timing reset to PID-based, add _all() variant
- 22ae8a5 Pin Option A invariants for trace DSA release (issue #15)
- 835e22c Expose reset_count on wait_event_timing_overflow view (issue #18)
- 7d14ecd Add pg_wait_event_timing_histogram_buckets taxonomy view (issue #23)
- 46594e4 Reformulate ring-lifecycle test to be strict AND robust (issues #25, #26)
Without this, upstream CI compiled the #ifdef USE_WAIT_EVENT_TIMING paths zero times on every PR, and pg_regress silently selected the non-timing alt expected output (wait_event_timing_1.out), leaving the timing-build expected file unused.

Append --enable-wait-event-timing to the existing "Linux - Debian Trixie - Autoconf" matrix entry's configure line. That entry already runs make check-world under --enable-cassert --enable-injection-points --enable-debug --enable-tap-tests, so adding one flag makes the timing path a first-class part of every PR's CI without adding a new task.

Originally shipped across N commits, preserved at
origin/archive/wait-event-timing-review-rounds:
- 43df2a0 ci: enable --enable-wait-event-timing on one Linux task (issue #24)
46594e4 to 6497681
Four small, independently-justified fixes from the consolidated review in review_5.md. All verified by `make check` (249/249 passing, including the wait_event_timing regression test).

#1 Stale docstring on pg_stat_reset_wait_event_timing (wait_event_timing.c)
Two stacked docstrings: the first described an obsolete backend_id-based API ("0 = own, -1 = all") that does not exist; the function takes a PID. Merge them into one block that uses the correct PID semantics and preserves the async-protocol + reset_count-polling guidance from the obsolete one.

#2 Spurious wait_event_* includes in ipci.c
Two #includes (wait_event.h, wait_event_timing.h) referenced no symbols in ipci.c. The WaitEvent*ShmemCallbacks externs come via storage/subsystems.h -> subsystemlist.h. Removing both leaves ipci.c building cleanly and shrinks its include graph.

#3 Missing WARNING for wait_event_capture=trace + compute_query_id=off (wait_event_timing.c, config.sgml)
assign_wait_event_capture already warns when track_activities is off; mirror the same warning for compute_query_id, since markers silently skip when query_id is 0. Verified live: setting compute_query_id=off then wait_event_capture=trace now produces the expected WARNING + HINT pointing at pg_stat_statements as a common enabler.

#4 Trace doc only described QUERY markers; EXEC markers were silent (config.sgml, monitoring.sgml)
The trace ring emits four marker types, not two. ExecStart/ExecEnd bracket every executor invocation and are NOT gated on track_activities, while QueryStart/QueryEnd cover inter-statement and idle-transition boundaries that the executor cannot. Restate the prereqs in config.sgml as an itemized list per family, and update monitoring.sgml's view description and consumer guidance to name ExecStart as the primary attribution signal and QueryStart as the fallback for boundaries the executor cannot mark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds Oracle-style wait event instrumentation to PostgreSQL: per-event timing,
histograms, query attribution, and per-session 10046-style tracing. Controlled
by a compile-time flag (--enable-wait-event-timing) with zero overhead when
not compiled in.
The problem
External BPF-based wait event profilers (like pg_wait_tracer) use CPU hardware watchpoints on PGPROC->wait_event_info, costing ~200-300 ns per debug exception, which amounts to 29% TPS overhead on high-transition workloads. Oracle solved this by instrumenting internally with clock_gettime() VDSO calls (~70-100 ns, no kernel trap).
What this patch provides
Compile-time: ./configure --enable-wait-event-timing (or meson setup -Dwait_event_timing=true).
Without the flag, the binary is identical to stock PostgreSQL.
One runtime GUC (PGC_SUSET): wait_event_capture
- off (default)
- stats: per-event statistics in pg_stat_wait_event_timing
- trace: stats plus a per-session ring of individual events + QUERY/EXEC markers in pg_backend_wait_event_trace, allowing per-query attribution by scanning the ring
Oracle equivalents:
- pg_stat_get_wait_event_timing()
- pg_stat_get_wait_event_timing(), histogram column
- pg_stat_get_wait_event_timing(), backend_id column
- pg_backend_wait_event_trace and group by query_id (QUERY_START/END markers delimit events per query)
- pg_get_backend_wait_event_trace() (session-local by design)
Benchmark results
Environment: Hetzner cx43 (8 vCPU, 16 GB RAM), Rocky 9.7, PG 19devel,
pgbench scale 100, shared_buffers=128MB, 8 clients, SELECT-only — worst case
at ~220K wait event transitions/sec, 60-second runs.
Test 1: Compile flag overhead
Stock PG vs patched (without --enable-wait-event-timing) on same data directory,
alternating A/B runs, 5 rounds:
No measurable difference. Hot-path object code is byte-identical
(verified via objcopy -O binary -j .text).
Test 2: GUC overhead (same binary + same data)
Same binary (WITH flag), same data directory, toggling wait_event_capture
between restarts, 5 rounds, for each of off, stats, and trace:
< 0.5% difference. All configs within run-to-run variance.
vs hardware watchpoints
All 244 PostgreSQL regression tests pass (make check).
Files changed
New files:
- src/include/utils/wait_event_timing.h: data structures, inline helpers
- src/backend/utils/activity/wait_event_timing.c: shmem, SQL functions, stubs
Modified files:
- src/include/utils/wait_event.h: timing code in inline functions (#ifdef)
- src/backend/storage/lmgr/proc.c: init/cleanup timing storage
- src/backend/storage/ipc/ipci.c: shared memory allocation
- src/backend/utils/activity/backend_status.c: query_id pointer setup
- src/backend/utils/misc/guc_parameters.dat: GUC definitions
- src/backend/utils/misc/guc_tables.c: GUC variable externs
- src/include/catalog/pg_proc.dat: SQL function registration
- configure.ac, meson.build, meson_options.txt: compile-time option
- src/include/pg_config.h.in: USE_WAIT_EVENT_TIMING define
- Makefile, meson.build in activity/
Related work