
Add wait_event_timing: Oracle-style wait event instrumentation #1

Open

DmitryNFomin wants to merge 7 commits into master from wait-event-timing

Conversation


@DmitryNFomin DmitryNFomin commented Mar 22, 2026

Summary

Adds Oracle-style wait event instrumentation to PostgreSQL — per-event timing,
histograms, query attribution, and per-session 10046-style tracing. Controlled
by a compile-time flag (--enable-wait-event-timing) with zero overhead when
not compiled in.

Branch history (2026-04-25 rebase): the patch series was squashed
from 60 review-round commits into the 7 logical bundles you see below
and rebased onto current upstream/master. The full pre-rebase
commit history (every individual review fix, with original commit
messages) is preserved at
origin/archive/wait-event-timing-review-rounds.
Each squashed commit message ends with a list of the originals it
absorbed, so any review-thread reference to an old commit SHA can be
resolved against that archive branch. Issue #12 (Windows
INSTR_TIME_GET_NANOSEC per-call QueryPerformanceFrequency) was
closed automatically by the rebase: upstream's recent
ticks_per_ns_scaled refactor (commits 0022622c93, 294520c444)
replaces the per-call frequency lookup with a one-time-initialized
scale factor.

The problem

External BPF-based wait event profilers (like pg_wait_tracer) use CPU hardware watchpoints on PGPROC->wait_event_info, costing ~200-300 ns per debug exception — 29% TPS overhead on high-transition workloads. Oracle solved this by instrumenting internally with clock_gettime() VDSO calls (~70-100 ns, no kernel trap).

What this patch provides

Compile-time: ./configure --enable-wait-event-timing (or meson setup -Dwait_event_timing=true).
Without the flag, the binary is identical to stock PostgreSQL.

One runtime GUC (PGC_SUSET):

| wait_event_capture | What it enables |
|---|---|
| off (default) | No instrumentation, no hot-path cost |
| stats | Per-event count, total ns, max, log2 histogram in pg_stat_wait_event_timing |
| trace | Everything in stats, plus a per-session ring of individual events + QUERY/EXEC markers in pg_backend_wait_event_trace, allowing per-query attribution by scanning the ring |

Oracle equivalents:

| Oracle | PostgreSQL |
|---|---|
| V$SYSTEM_EVENT | pg_stat_get_wait_event_timing() |
| V$EVENT_HISTOGRAM | pg_stat_get_wait_event_timing(), histogram column |
| V$SESSION_EVENT | pg_stat_get_wait_event_timing(), backend_id column |
| V$SQL wait stats | Scan pg_backend_wait_event_trace and group by query_id (QUERY_START/END markers delimit events per query) |
| 10046 trace | pg_get_backend_wait_event_trace() (session-local by design) |

Benchmark results

Environment: Hetzner cx43 (8 vCPU, 16 GB RAM), Rocky 9.7, PG 19devel,
pgbench scale 100, shared_buffers=128MB, 8 clients, SELECT-only — worst case
at ~220K wait event transitions/sec, 60-second runs.

Test 1: Compile flag overhead

Stock PG vs patched (without --enable-wait-event-timing) on same data directory,
alternating A/B runs, 5 rounds:

| Round | Stock TPS | Patched (no flag) TPS |
|---|---|---|
| 1 | 116,398 | 118,427 |
| 2 | 115,850 | 116,609 |
| 3 | 116,141 | 117,669 |
| 4 | 117,699 | 118,136 |
| 5 | 117,634 | 118,879 |

No measurable difference. Hot-path object code is byte-identical
(verified via objcopy -O binary -j .text).

Test 2: GUC overhead (same binary + same data)

Same binary (WITH flag), same data directory, toggling wait_event_capture between restarts, 5 rounds:

| Round | off | stats | trace |
|---|---|---|---|
| 1 | 114,768 | 116,058 | 115,911 |
| 2 | 116,726 | 115,789 | 116,502 |
| 3 | 115,972 | 115,977 | 115,170 |
| 4 | 116,405 | 113,924 | 115,242 |
| 5 | 115,781 | 116,556 | 114,603 |
| Mean | 115,930 | 115,661 | 115,486 |

< 0.5% difference. All configs within run-to-run variance.

vs hardware watchpoints

| Approach | Worst-case overhead |
|---|---|
| This patch (all features ON) | < 0.5% |
| Hardware watchpoint (pg_wait_tracer) | 29% |

All 244 PostgreSQL regression tests pass (make check).

Files changed

New files:

  • src/include/utils/wait_event_timing.h — data structures, inline helpers
  • src/backend/utils/activity/wait_event_timing.c — shmem, SQL functions, stubs

Modified files:

  • src/include/utils/wait_event.h — timing code in inline functions (#ifdef)
  • src/backend/storage/lmgr/proc.c — init/cleanup timing storage
  • src/backend/storage/ipc/ipci.c — shared memory allocation
  • src/backend/utils/activity/backend_status.c — query_id pointer setup
  • src/backend/utils/misc/guc_parameters.dat — GUC definitions
  • src/backend/utils/misc/guc_tables.c — GUC variable externs
  • src/include/catalog/pg_proc.dat — SQL function registration
  • configure.ac, meson.build, meson_options.txt — compile-time option
  • src/include/pg_config.h.in — USE_WAIT_EVENT_TIMING define
  • Build files: Makefile, meson.build in activity/

Related work

DmitryNFomin added a commit to DmitryNFomin/pg_wait_tracer that referenced this pull request Mar 23, 2026
Add section to Performance documenting the proposed PostgreSQL patch
(DmitryNFomin/postgres#1) that adds Oracle-style internal wait event
instrumentation with near-zero overhead (<1%) vs hardware watchpoints
(6-29%).

Benchmark results on same environment (Hetzner cx43, 8 vCPU):
- TPC-B: <1% vs 6% (hardware watchpoint)
- SELECT-only worst case: <1% vs 29%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DmitryNFomin force-pushed the wait-event-timing branch 4 times, most recently from 7069864 to 7cadcee on March 24, 2026 10:14
@DmitryNFomin (Owner, Author) commented:

Consolidated Fix Plan for PR #1

PHASE 1: Blockers (must fix before any commit consideration)

1. Shared memory footprint -- show-stopper (all 3 reviews)

  • WaitEventTraceState is 4 MB per backend, not "~128 KB" as the comment claims (131072 records * 32 bytes)
  • At max_connections=100, trace alone eats ~400 MB of shared memory, allocated unconditionally
  • Fix: either reduce ring size drastically (e.g. 4096 records = 128 KB), make it configurable via a sizing GUC, allocate lazily, or split trace into a separate patch entirely (all 3 reviews recommend splitting)
  • Also fix the wrong comment on WaitEventTraceState

2. Missing privilege checks on SQL functions (all 3 reviews)

  • All three SRFs expose all backends' data to any SQL user -- timing, query attribution, and trace
  • Fix: follow pg_stat_get_activity() pattern -- has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) for full access, restrict unprivileged users to own backend only

3. Stale dead-backend data leaks (all 3 reviews)

  • pgstat_reset_wait_event_timing_storage() only NULLs local pointers, does NOT zero shared memory
  • SRFs iterate all MaxBackends slots blindly, exposing old data from exited backends
  • Fix: either zero the shmem slot on backend exit, or iterate only live backend entries (preferred by GPT review)

4. Missing catversion bump (all 3 reviews)

  • 3 new OIDs (9950-9952) added to pg_proc.dat without updating catversion.h
  • Fix: bump CATALOG_VERSION_NO in src/include/catalog/catversion.h

5. configure script not regenerated (Opus + GPT reviews)

  • configure.ac updated but configure was not regenerated with autoconf
  • ./configure --enable-wait-event-timing silently ignores the flag
  • Fix: regenerate configure, or document meson as required build path

PHASE 2: Correctness (bugs and data integrity)

6. LWLock tranche IDs can exceed 256 (all 3 reviews)

  • wait_event_timing_index() caps eventId at 256, but LWLock tranches are dynamically allocated and can exceed this
  • Silently drops valid wait events with no user-visible indication
  • Fix: use a separate hash/mapping for LWLock class, or increase the per-class limit for class 0x01, or move to a hash keyed by full wait_event_info

7. Trace ring buffer reader/writer race (GPT + Opus reviews)

  • Writer fills record fields then increments write_pos -- no memory barrier between payload and position
  • Reader can observe torn/half-written records or mixed-generation data during wraparound
  • Fix: add per-record sequence/generation counter, or a seqlock, or explicit acquire/release barriers. At minimum document the best-effort approximation contract

8. Query hash entry publication order race (GPT review)

  • New entries write query_id first, readers treat query_id != 0 as valid
  • Concurrent reader can see partially initialized event_idx/count/total_ns
  • Fix: write payload fields first, barrier, then publish query_id last

9. No guard against negative duration (Opus review)

  • CLOCK_MONOTONIC can theoretically go backward under VM live migration
  • Negative duration_ns corrupts total_ns, max_ns, and causes undefined behavior in wait_event_timing_bucket()
  • Fix: add duration_ns = Max(duration_ns, 0) guard

10. wait_event_trace silently depends on wait_event_timing (GPT review)

  • Trace recording is nested inside if (wait_event_timing && ...) block
  • User sets wait_event_trace = on but gets no trace if wait_event_timing is off -- no warning
  • Fix: either document dependency + add GUC check hook to warn/reject, or make trace independent

PHASE 3: Missing features / dead code

11. reset_count is dead code / no reset function (all 3 reviews)

  • WaitEventTimingState.reset_count references nonexistent pg_stat_reset_wait_event_timing()
  • No way to reset cumulative counters without restarting the backend
  • Fix: implement a reset function (add to pg_proc.dat, write the C function), or remove the dead field

12. Query hash overflow_count not exposed (GPT + Opus reviews)

  • The hash table silently drops measurements when full (1024 slots)
  • overflow_count is tracked internally but never returned to the user
  • Fix: expose it as an extra column in pg_stat_get_wait_event_timing_by_query(), or add a separate status function

PHASE 4: API / design improvements

13. Histogram returned as text, not int4[] (all 3 reviews)

  • Current: "{0,0,1,...}" text string -- requires SQL string parsing
  • Fix: use construct_array() with INT4OID to return a proper int4[] array

14. Query attribution uses stale st_query_id (GPT review)

  • st_query_id persists after statement ends, so idle waits like ClientRead get charged to the previous query
  • Fix: add a separate "currently attributable query_id" field, or gate attribution on active execution state

15. Auxiliary processes excluded without documentation (GPT + Opus reviews)

  • bgwriter, checkpointer, walwriter, etc. have procNumber >= MaxBackends and are silently skipped
  • Fix: either extend storage to cover aux procs, or document the limitation explicitly

16. Histogram uses while-loop instead of hardware CLZ (Opus review)

  • wait_event_timing_bucket() uses manual bit-shifting loop
  • PostgreSQL provides pg_leftmost_one_pos64() in pg_bitutils.h which uses hardware __builtin_clzl
  • Fix: use the builtin for a single-instruction hot-path operation

17. Histogram bucket counters are int32, may wrap (GPT review)

  • WaitEventTimingEntry.histogram[] uses int32 while count/total_ns are int64
  • Long-lived busy backends can wrap bucket counters
  • Fix: promote to int64 (reassess memory impact), or define reset semantics tightly enough

18. GUC extern declarations misplaced (Opus review)

  • In guc_tables.c, externs are injected between #include lines (non-standard)
  • Fix: include wait_event_timing.h instead, or move externs to a proper location

19. Wasted array slots for class IDs 0x00 and 0x02 (Opus review)

  • These class IDs have no events, wasting 2 * 256 * 88 = ~44 KB per backend
  • Fix: use a class-to-dense-index mapping, or accept as minor waste

20. Unsupported-build GUC UX (GPT + Opus reviews)

  • Without --enable-wait-event-timing, GUCs exist and can be SET on with no effect until SQL functions error
  • Fix: reject SET wait_event_timing = on in unsupported builds via GUC check hook

PHASE 5: Integration (PostgreSQL conventions)

21. Add system views (all 3 reviews)

  • Shipping raw pg_stat_get_* functions without wrapping views is unfinished
  • Fix: add views in system_views.sql with useful context (pid, backend_type, datname, etc.)

22. Add SGML documentation (all 3 reviews)

  • No docs for GUCs, functions, or compile-time option
  • Fix: update config.sgml, monitoring.sgml, system-views.sgml

23. Add regression / TAP tests (all 3 reviews)

  • No test coverage for the new feature
  • Fix: basic aggregate timing rows, histogram bucket validation with pg_sleep(), permission tests, dead-backend cleanup test, build-path test

24. Header comment claims direct shmem reading (GPT review)

  • "External tools can read the accumulated stats from shared memory" implies an ABI promise
  • Fix: remove or tone down; the SQL/view layer is the supported interface

Recommended patch split (strong consensus from all 3 reviews)

| Patch | Scope |
|---|---|
| Patch 1 | Aggregate per-backend/per-event timing + histogram + views + permissions + docs + tests |
| Patch 2 | By-query attribution (after fixing query_id semantics and overflow visibility) |
| Patch 3 | Per-session trace ring buffer (after solving memory, concurrency, and permissions) |

@DmitryNFomin (Owner, Author) commented:

DmitryNFomin pushed a commit that referenced this pull request Apr 13, 2026
…ocation

The WaitEventTraceState array was 4 MB per backend allocated unconditionally
at startup via ShmemInitStruct (400 MB at max_connections=100). Replace with
DSA-backed lazy allocation: a small WaitEventTraceControl struct (~8 bytes
per backend) lives in fixed shmem, and the 4 MB ring buffer is allocated
via dsa_create/dsa_allocate only when a backend SETs wait_event_trace = on.

Key changes:
- Add WaitEventTraceControl with dsa_handle, LWLock, per-backend dsa_pointers
- Add LWTRANCHE_WAIT_EVENT_TRACE_DSA for the control LWLock
- Lazy DSA creation on first use (any backend), attach on subsequent backends
- GUC assign hook (assign_wait_event_trace) triggers allocation on enable
- Cleanup via before_shmem_exit callback (runs before dsm_backend_shutdown
  detaches DSA segments, avoiding the use-after-detach crash that would
  occur if dsa_free ran from ProcKill's on_shmem_exit callback)
- Fix pg_stat_get_wait_event_trace proisstrict: was defaulting to true,
  causing NULL argument (own backend) to skip the C function entirely

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@DmitryNFomin (Owner, Author) commented:

Add wait_event_timing: Oracle-style wait event instrumentation

Per-event timing with histograms, per-query attribution, and a per-session trace ring buffer for PostgreSQL. Controlled by a compile-time flag (--enable-wait-event-timing) with zero overhead when not compiled in.

The problem

PostgreSQL exposes the current wait event per backend (pg_stat_activity.wait_event), but provides no built-in way to answer:

  • How long did each wait take?
  • Which waits dominate this query?
  • What is the distribution of wait durations?

External approaches have significant limitations:

| Approach | Resolution | Overhead | Limitation |
|---|---|---|---|
| pg_wait_sampling | 10 ms polling | Low | Misses short waits; statistical |
| pg_wait_tracer (BPF) | Exact | 29% | Hardware watchpoint debug exceptions |

Oracle solved this decades ago by instrumenting internally with clock_gettime()-equivalent calls -- cheap VDSO operations (~40-100 ns) with no kernel trap.

What this patch provides

Compile-time: ./configure --enable-wait-event-timing (or meson setup -Dwait_event_timing=true)

Without the flag, the binary is identical to stock PostgreSQL (verified via objcopy -O binary -j .text).

Two runtime GUCs:

| GUC | Context | What it enables |
|---|---|---|
| wait_event_timing | PGC_SUSET | Per-event count, total ns, max, 16-bucket log2 histogram |
| wait_event_trace | PGC_USERSET | Per-session ring buffer of individual events (Oracle 10046 equivalent) |

Oracle equivalents:

| Oracle | PostgreSQL |
|---|---|
| V$SYSTEM_EVENT | pg_stat_wait_event_timing |
| V$EVENT_HISTOGRAM | pg_stat_wait_event_timing (histogram column) |
| V$SESSION_EVENT | pg_stat_wait_event_timing (per backend_id) |
| V$SQL wait stats | pg_stat_wait_event_timing_by_query |
| 10046 trace | pg_stat_wait_event_trace |

Design decisions

  1. Compile-time flag gates all instrumentation. Follows the PostgreSQL tradition of opt-in overhead (cf. --enable-dtrace, track_io_timing). Hot-path code is #ifdef'd away entirely when not compiled in.

  2. Per-backend flat array + LWLock hash. Each backend writes only to its own shared memory slot -- no locking needed. The flat array uses a compact per-class mapping (656 entries, ~97 KB) sized to actual event populations with headroom. LWLock tranches use a separate open-addressing hash table (192 entries) since their IDs are dynamically allocated.

  3. DSA-backed trace ring buffer. The ~4 MB per-backend ring is allocated lazily via DSA when wait_event_trace is enabled. Only tracing backends pay the cost. The ring is designed as a lock-free transport for external async readers; the SQL by_query view is a convenience interface.

  4. Signal-safe trace writes. Writers atomically claim a ring slot via pg_atomic_fetch_add_u64, then use a per-record seqlock for torn-read detection by concurrent readers.

  5. Out-of-line timing body. Only a bool check + conditional call is inlined at each pgstat_report_wait_end() call site. The full timing logic is in a separate function, keeping I-cache pressure low.

  6. Graceful degradation. When compiled without the flag, SQL functions return empty result sets (not errors), GUC SET ... = on is rejected with a helpful hint, and views exist but are empty. Monitoring tools work without modification.

Shared memory cost

| Component | Per backend | At max_connections=200 |
|---|---|---|
| Timing arrays (flat + LWLock hash) | ~127 KB | ~28 MB |
| Trace ring buffer (DSA, lazy) | ~4 MB | Only traced backends |

Benchmark results

pgbench, PG 19devel, 60-second runs, 3 runs per data point, median with outlier cleaning.

Configs: A = stock PG (no flag), B = compiled with flag (GUCs off), C = timing ON, D = timing+trace ON

SELECT-only (worst case -- highest wait event transition rate):

| Clients | A (base) | B (compiled) | C (timing ON) | D (all ON) |
|---|---|---|---|---|
| 1 | 23,385 | 23,571 | 23,563 | 22,988 |
| 4 | 72,019 | 72,895 | 72,643 | 73,081 |
| 64 | 93,588 | 95,887 | 95,463 | 91,915 |

TPC-B:

| Clients | A (base) | B (compiled) | C (timing ON) | D (all ON) |
|---|---|---|---|---|
| 1 | 4,350 | 4,433 | 4,242 | 4,340 |
| 16 | 11,225 | 11,186 | 11,251 | 11,009 |

CPU-bound (SELECT 1+1):

| Clients | A (base) | B (compiled) | C (timing ON) | D (all ON) |
|---|---|---|---|---|
| 4 | 88,199 | 87,978 | 88,251 | 87,915 |
| 64 | 96,922 | 96,398 | 96,172 | 97,357 |

All configurations within run-to-run variance. No measurable overhead.

Testing

  • All 244 regression tests pass (both timing and non-timing builds)
  • Zero compiler warnings under -Wall -Wmissing-prototypes -Wmissing-variable-declarations
  • Dedicated regression test covers: view structure, histogram invariants, trace records, query markers, reset, privilege checks, alternate expected output for non-timing builds

Files changed (36 files)

New:

  • src/include/utils/wait_event_timing.h -- data structures, inline helpers
  • src/backend/utils/activity/wait_event_timing.c -- shmem, SQL functions, stubs
  • src/test/modules/test_wait_event_stress/ -- overhead measurement extension

Key modified:

  • src/include/utils/wait_event.h -- timing hooks in inline functions
  • src/backend/storage/lmgr/proc.c -- init/cleanup timing storage
  • src/backend/storage/ipc/ipci.c -- shared memory allocation
  • src/backend/tcop/postgres.c -- QUERY_END marker on idle transition
  • src/include/catalog/pg_proc.dat -- 4 new SQL functions (OIDs 9950-9953)
  • Build system: configure.ac, meson.build, meson_options.txt
  • Documentation: config.sgml, monitoring.sgml, installation.sgml

Related work

DmitryNFomin pushed a commit that referenced this pull request Apr 17, 2026
Address all 20 issues identified by cross-checking Gemini 3.1, Opus 4.6,
and GPT 5.4 reviews against the actual PR code.  Each fix follows the
approved plan in crosscheck_review_fix_3.md.

Security / correctness fixes:

  Issue #1 (UAF in trace ring reads):
  - Remove pg_stat_get_wait_event_timing_by_query() entirely (OID 9951,
    view, docs, tests) -- cross-backend DSA ring scanning was the primary
    use-after-free surface.
  - Restrict pg_stat_get_wait_event_trace() to own-session only; the
    backend_id argument is accepted for forward compat but ignored.
  - Wrap DSA pointer mutations (attach/detach) in WaitEventTraceCtl->lock
    LW_EXCLUSIVE to protect external readers and fix 32-bit torn pointers
    (also resolves Issue #11).
  - Document the ring buffer as an external lock-free transport with a
    seqlock reader contract in wait_event_timing.h and monitoring.sgml.

  Issue #6 (cross-backend reset race):
  - Add per-backend LWLock to WaitEventTimingState, following the
    pgstat_reset_entry() pattern from pgstat_shmem.c.
  - Writers acquire LW_SHARED on the hot path; reset acquires LW_EXCLUSIVE.
  - Init and exit paths individually zero fields instead of memset on the
    whole struct, preserving the LWLock.
  - Register LWTRANCHE_WAIT_EVENT_TIMING in lwlocklist.h.

Build / startup fixes:

  Issue #2: Replace non-existent wait_classes.h include with
  wait_event_types.h in wait_event_timing.h.

  Issue #3 (config-file trace broken): Add explicit
  wait_event_trace_attach() calls after ResourceOwner init in
  auxprocess.c (after CreateAuxProcessResourceOwner) and postinit.c
  (after process_settings) to handle inherited GUC values that do not
  re-fire assign hooks after fork.

  Issue #10 (non-timing build silent suppression): Emit WARNING before
  forcing wait_event_timing/wait_event_trace to off in check hooks for
  non-timing builds (source < PGC_S_INTERACTIVE path).

  Issue #14 (EXEC_BACKEND init order): Add comment in
  pgstat_set_wait_event_timing_storage() confirming SubPostmasterMain
  calls CreateSharedMemoryAndSemaphores() before InitProcess().

Query lifecycle / phase separation:

  Issue #4 (QUERY_END broken for multi-statement):
  - Hybrid approach: emit QUERY_END for old query and QUERY_START for new
    query in pgstat_report_query_id() transitions (backend_status.c).
  - Keep existing QUERY_END in send_ready_for_query (postgres.c) for the
    last-query-to-idle boundary.
  - Covers simple single/multi-stmt, extended, prepared, and pipeline
    protocols without double-emit.

  Issue #5 (phase separation): Add TRACE_EXEC_END marker (type=4) emitted
  from ExecutorEnd() in execMain.c, separating execution from post-exec
  cleanup (commit, WAL flush, lock release).  Also resolves Issue #13
  (extended protocol phase misattribution).

Overflow and dependency warnings:

  Issue #7: One-time WARNING on first LWLock hash overflow (>192 tranches).
  Issue #8: WARNING in assign_wait_event_trace when track_activities is
  off; document the dependency in config.sgml.
  Issue #12: Add flat_overflow_count field + one-time WARNING when eventId
  exceeds per-class slot count.

Header cleanup and code generation:

  Issue #15: Move static const arrays and internal inline functions from
  wait_event_timing.h to wait_event_timing.c as file-static.  Convert
  trace marker inlines to extern functions with a shared
  wait_event_trace_write_marker() helper.

  Issue #16 (parallel array validation): Extend generate-wait_event_types.pl
  to emit wait_event_timing_data.c (mapping arrays) and the three sizing
  #defines into wait_event_types.h, generated from wait_event_names.txt.
  Eliminates hand-maintained parallel arrays entirely.  Update Makefile,
  meson.build, and pgindent exclusions.

Documentation:

  Issue #17: Document InitMaterializedSRF usage as acceptable for both
  the timing SRF (bounded result set) and trace SRF (own-session only,
  deferred max_records parameter).

  Issue #18: Document ring wrap limitation, polling contract, and
  st_query_id fallback in wait_event_timing.h and monitoring.sgml.

  Issue #19 nits: Remove duplicate header include, remove install/
  entries from .gitignore, add tranche_id 65535 academic comment, add
  eventId-equals-array-index comment, verify OID range (9950/9952/9953).

Testing:

  Issue #20:
  - Add trace lifecycle regression test (off/on/query/read/off cycle).
  - Add auxiliary process timing test (verify checkpointer/bgwriter/
    walwriter accumulate non-zero wait event counts).
  - Add test_lwlock_hash_overflow() to test_wait_event_stress module:
    registers >192 custom LWLock tranches to exercise hash overflow
    path and verify WARNING emission.
  - Document that default Cirrus CI already exercises non-timing build
    path via wait_event_timing_1.out alternate expected output.

Issues resolved by other fixes (no separate action needed):
  Issue #9: Resolved by Issue #1 (by_query view removed).
  Issue #11: Resolved by Issue #1C (LWLock around DSA pointer mutations).
  Issue #13: Resolved by Issue #5 (EXEC_START/EXEC_END markers).

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@DmitryNFomin (Owner, Author) commented:

crosscheck_review_4.md

@DmitryNFomin (Owner, Author) commented:

given that upstream has structurally refactored two subsystems we depend on (ShmemInit callbacks in ipci.c, and instr_time ticks-to-ns conversion)?

Compile-time: ./configure --enable-wait-event-timing  (autoconf)
              -Dwait_event_timing=true                (meson)
Without either, the binary is byte-identical to stock PostgreSQL on
the hot path; the feature is fully gated behind USE_WAIT_EVENT_TIMING.

Runtime: one enum GUC wait_event_capture = off | stats | trace
(PGC_SUSET).  "stats" enables per-event count, total, max, and log2
histogram accumulation.  "trace" additionally enables a per-session
ring buffer of individual events (Oracle 10046-style) for query
attribution and timeline analysis.  In stub builds the GUC rejects
any value other than off with a WARNING or ERROR (depending on source),
matching the "feature is compile-time" expectation.

This bundle adds only the plumbing: the build-system option, the GUC
declaration and stub hooks, the postgresql.conf.sample entry, the
installation/config SGML sections, and the extern declarations in
guc_hooks.h and guc.h that the core instrumentation in the next bundle
will implement.

Originally shipped across 14 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ef4fade Add wait_event_timing: Oracle-style wait event instrumentation
  - 1b95736 Add 10046-style per-session wait event trace ring buffer
  - 7cadcee Make wait_event_timing a compile-time option (--enable-wait-event-timing)
  - 937cd89 Fix #1: Replace unconditional trace ring allocation with DSA lazy allocation
  - 8216ef5 Fix #5: Add --enable-wait-event-timing to configure, use pattern 1
  - f927dd3 Fix #10: Warn when wait_event_trace enabled without wait_event_timing
  - 8089e98 Fix #18: Replace bare extern GUC declarations with header include
  - 4cc876d Fix #20: Add GUC check hooks that reject 'on' in unsupported builds
  - bed120f Fix #22: Add SGML documentation for wait event timing
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 708b05a Close wait_event_trace DoS surface: PGC_USERSET -> PGC_SUSET
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
…ocol

Gated behind USE_WAIT_EVENT_TIMING.  Adds:

  * pgstat_report_wait_start() / _end() inline hot path (wait_event.h):
    INSTR_TIME_SET_CURRENT() + per-backend write when capture level is
    non-off.  Out-of-line body pgstat_report_wait_end_timing() keeps
    I-cache pressure low at the many inline call sites.

  * Per-backend WaitEventTimingState (wait_event_timing.h,
    wait_event_timing.c): a flat events[] array for bounded wait-event
    classes + an open-addressing hash (512 slots, 192-entry cap) for
    unbounded LWLock tranche IDs.  Writer-exclusive single-writer slot
    design, no locks on the hot path.

  * Log2 histogram with 16 buckets on powers-of-two nanoseconds
    (~1us, 1-2us, ... 8-16ms, >=16ms), computed via CLZ without a
    runtime /1000.  Overflow counters (lwlock_overflow_count,
    flat_overflow_count) track events dropped when the per-backend
    hash is saturated or eventIds fall outside the flat range.

  * Lock-free cross-backend reset via pg_atomic_uint32
    reset_generation: requester bumps the atomic + SetLatch; owner
    clears its own counters on the next wait_end.  Own-backend reset
    is synchronous.  reset_count is a monotonic observability counter
    that external callers can poll to confirm a remote reset has been
    observed.

  * DSA-lazy allocation: the per-backend timing array
    (~30KB/backend) lives in DSA and is created the first time any
    backend enables capture.  Clusters running with
    wait_event_capture = off pay only the small control struct
    (sizeof(WaitEventTimingControl)) in fixed shmem.

  * Per-session trace ring buffer (WaitEventTraceState,
    131072 x 32-byte records = 4 MB) allocated lazily via DSA when
    wait_event_capture = trace.  Seqlock (WaitEventTraceRecord.seq,
    odd = write in progress) for torn-read detection.  Released back
    to DSA on step-down from trace to stats/off.  Per-record layout
    pinned with StaticAssertDecl.

  * Generator wiring: generate-wait_event_types.pl emits
    wait_event_timing_data.c (dense class mapping consumed by the
    bucket lookup) and wait_event_types.h into src/backend/utils/.
    Build rules in backend/utils/Makefile + activity/Makefile +
    meson.build; .gitignore entries on both sides.

  * Backend lifecycle: InitProcess (proc.c) and auxiliary processes
    (auxprocess.c) call pgstat_set_wait_event_timing_storage() with
    their procNumber; ProcKill and related exit paths call
    pgstat_reset_wait_event_timing_storage().
    backend_status.c exposes query_id pointer for query-attribution
    markers.  postinit.c hooks into client-backend startup.

  * Two new LWLock tranches (lwlocklist.h) for DSA creation.

  * ipci.c registers the two small control structs
    (WaitEventTimingControl + WaitEventTraceControl) in fixed shmem;
    everything else is lazy DSA.

Hot-path invariants (verified by regression tests in a later bundle):
- wait_start is zero between waits (set by wait_end_timing on exit)
- current_event cleared on reset to avoid stale-state readers
- reset_generation observation happens before accumulation, so the
  completing wait credits to freshly-zeroed counters

Originally shipped across 48 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ef4fade Add wait_event_timing: Oracle-style wait event instrumentation
  - 5120e80 Fix: remove pg_bitutils.h dependency, use simple bit shift for histogram bucket
  - 6a81684 Fix segfault: handle auxiliary processes with procNumber >= MaxBackends
  - 7cf09db Add query attribution: per (query_id, event) wait statistics
  - e8a1717 Add query attribution: per (query_id, event) wait statistics
  - 1b95736 Add 10046-style per-session wait event trace ring buffer
  - 7cadcee Make wait_event_timing a compile-time option (--enable-wait-event-timing)
  - 4ccf36a Increase trace ring buffer from 4K to 128K records per backend
  - 0e04ef2 Fix: add missing tuplestore.h include for tuplestore_putvalues()
  - 937cd89 Fix #1: Replace unconditional trace ring allocation with DSA lazy allocation
  - e496cb2 Fix #2: Add privilege checks to wait event timing SQL functions
  - 0fadd0a Fix #3: Zero shared memory on backend exit to prevent stale data leaks
  - 8216ef5 Fix #5: Add --enable-wait-event-timing to configure, use pattern 1
  - 5d821c8 Fix #6: LWLock tranche IDs can exceed 256 -- per-backend hash table
  - 8999a5c Fix #7: Trace ring buffer reader/writer race -- seqlock protocol
  - 37e2741 Fix #8: Query hash entry publication order race
  - 1bea5c8 Fix #9: Clamp negative duration to zero
  - f927dd3 Fix #10: Warn when wait_event_trace enabled without wait_event_timing
  - 5a60203 Fix #11: Implement pg_stat_reset_wait_event_timing() SQL function
  - 5aadd83 Fix #12: Redesign query attribution -- eliminate hash table, use trace ring
  - a4e56f1 Fix #13: Return histogram as int4[] instead of text
  - edb8d9b Fix #15: Include auxiliary processes in wait event timing
  - 80c84f6 Fix #16: Replace histogram bucket loop with hardware CLZ
  - 5a72b47 Fix #17: Promote histogram bucket counters from int32 to int64
  - 4cc876d Fix #20: Add GUC check hooks that reject 'on' in unsupported builds
  - c5abb1e Fix #21: Add system views and pid/backend_type to SRFs
  - eb40cdd Fix #24: Rewrite header comment to point at SQL interface
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 6211659 Fix all compiler warnings in wait_event_timing.c
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 65479aa Fix round 3 regressions: remove locks on single-writer paths, fix tests
  - 9bcf83c Fix cross-backend reset race: lock-free request-response (Option B)
  - 72549e2 Remove phantom backend_id argument from pg_stat_get_wait_event_trace
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 9a359ab Fix stale wait state + assign-hook ereport contract violation
  - 7478abb Expose per-backend wait event timing overflow counters
  - 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
  - d1ea636 Allocate per-backend wait event timing array lazily in DSA
  - 379d650 Untrack generated wait_event_timing_data.c file
  - 826df19 Switch wait_event_timing reset to PID-based, add _all() variant
  - 46fe5d0 Release wait-event trace DSA ring on TRACE -> OFF/STATS
  - 6641deb Zero current_event on reset (issue #9 purist fix)
  - 835e22c Expose reset_count on wait_event_timing_overflow view (issue #18)
  - 7e15c8a Reuse histogram ArrayType across SRF rows (issue #19)
  - f75c9b7 Skip /1000 in histogram bucket computation (issue #20)
  - 11ed373 Double LWLock timing hash slot count to 512 (issue #21)
  - a4c133a Pin trace-ring invariants with StaticAssertDecl + barrier annotations (issue #22)
When wait_event_capture = trace and compute_query_id is on, insert
per-query markers into the per-session trace ring so external
consumers can attribute wait events back to their originating query
by scanning.  Markers are tagged records (TRACE_QUERY_START,
TRACE_QUERY_END, TRACE_EXEC_START, TRACE_EXEC_END) sharing the same
WaitEventTraceRecord layout as wait-event entries; they carry a
query_id payload instead of a wait-event-info + duration payload.

The two hook points:

  * tcop/postgres.c: QUERY_START / QUERY_END bracket the
    exec_simple_query / exec_execute_message boundary, covering the
    full statement lifecycle including parse/analyse/rewrite and
    post-execution cleanup.

  * executor/execMain.c: EXEC_START / EXEC_END bracket
    standard_ExecutorRun / _ExecutorEnd so the planner-less execution
    phase is distinguishable from query-level overhead on the
    timeline.

Query attribution is read-time: pg_backend_wait_event_trace
materializes the ring and the caller partitions events by the
enclosing QUERY_START/END markers.  This replaces an earlier design
that kept a per-backend query-id-to-stats hash in shared memory.

Originally shipped across 2 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
User-facing surface:

  * pg_stat_get_wait_event_timing() SRF: one row per
    (backend, wait_event) with non-zero calls.  Columns: pid,
    backend_type, backend_id, wait_event_type, wait_event, calls,
    total_time_ms, avg_time_us, max_time_us, histogram (bigint[16]).

  * pg_stat_get_wait_event_timing_overflow() SRF: one row per live
    backend exposing lwlock_overflow_count, flat_overflow_count, and
    reset_count.  reset_count is the observability hook external
    callers poll to confirm an asynchronous cross-backend reset has
    landed.

  * pg_get_backend_wait_event_trace() SRF: session-local dump of
    the caller's own trace ring.  Returns (seq, timestamp_ns,
    wait_event_type, wait_event, duration_us, query_id).  Name mirrors
    pg_get_backend_memory_contexts to make the session-local scope
    explicit at the API level; there is no cross-backend trace read.

  * pg_stat_reset_wait_event_timing(pid int) / _all():
    NULL/own pid resets own backend synchronously; other pid requires
    superuser and goes through the async reset_generation protocol.
    _all() is a dedicated function instead of a -1 overload, matching
    pg_stat_reset_* conventions elsewhere.

System views over the functions:

  * pg_stat_wait_event_timing, pg_stat_wait_event_timing_overflow
    (REVOKE PUBLIC / GRANT pg_read_all_stats).
  * pg_backend_wait_event_trace (PUBLIC; session-local anyway).
  * pg_wait_event_timing_histogram_buckets: a constant 16-row
    taxonomy view that translates histogram index -> (lower_ns,
    upper_ns, label).  Canonical join pattern documented in
    monitoring.sgml is
        FROM pg_stat_wait_event_timing w,
             LATERAL unnest(w.histogram) WITH ORDINALITY AS h(count, idx)
        JOIN pg_wait_event_timing_histogram_buckets b
             ON b.bucket_idx = h.idx - 1.

SGML docs:

  * A <sect2> per view in monitoring.sgml, mirroring existing
    pg_stat_* view documentation.
  * Columns typed, histogram bin edges described as powers of two
    on nanoseconds with the approximate decimal-microsecond grid
    called out (1024 ns ~ 1 us, ..., 2^24 ns ~ 16 ms).

Uses OIDs 9956-9960 (verified unclaimed upstream) and bumps
CATALOG_VERSION_NO.  In non-timing builds every function has a stub
that either returns an empty SRF or raises
ERRCODE_FEATURE_NOT_SUPPORTED, so SQL callers on stock PostgreSQL get
predictable behaviour.

Originally shipped across 22 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ef4fade Add wait_event_timing: Oracle-style wait event instrumentation
  - 7cf09db Add query attribution: per (query_id, event) wait statistics
  - 1b95736 Add 10046-style per-session wait event trace ring buffer
  - 937cd89 Fix #1: Replace unconditional trace ring allocation with DSA lazy allocation
  - 281da3f Fix #4: Bump CATALOG_VERSION_NO for new pg_proc OIDs
  - 5a60203 Fix #11: Implement pg_stat_reset_wait_event_timing() SQL function
  - a4e56f1 Fix #13: Return histogram as int4[] instead of text
  - 5a72b47 Fix #17: Promote histogram bucket counters from int32 to int64
  - c5abb1e Fix #21: Add system views and pid/backend_type to SRFs
  - bed120f Fix #22: Add SGML documentation for wait event timing
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 72549e2 Remove phantom backend_id argument from pg_stat_get_wait_event_trace
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 7478abb Expose per-backend wait event timing overflow counters
  - 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
  - 826df19 Switch wait_event_timing reset to PID-based, add _all() variant
  - d412c87 Renumber wait_event_timing OIDs to 9956-9960 (issue #17)
  - 835e22c Expose reset_count on wait_event_timing_overflow view (issue #18)
  - f75c9b7 Skip /1000 in histogram bucket computation (issue #20)
  - 7d14ecd Add pg_wait_event_timing_histogram_buckets taxonomy view (issue #23)
Two exported functions in a loadable test module under src/test/modules:

  * stress_wait_events(n int) -> bigint
    Calls pgstat_report_wait_start(WAIT_EVENT_PG_SLEEP) +
    pgstat_report_wait_end() in a tight loop exactly n times.
    Returns elapsed microseconds for sanity checks and lets the
    regression test assert exact counts (count(*) = n, calls = n)
    without the spurious-latch-wake non-determinism that pg_sleep()
    suffers from on busy CI runners.

  * test_lwlock_hash_overflow(n int) -> int
    Registers n custom LWLock tranches via LWLockNewTrancheId() and
    fires one wait-event cycle on each, exercising the per-backend
    LWLock timing hash up to and past its 192-entry cap.  The
    regression test uses this to pin the overflow-counter semantics
    (lwlock_overflow_count increments, one-shot WARNING fires, reset
    clears the suppression).

Regression test covers:
  * Basic "capture is wired" smoke.
  * Deterministic exact-count assertions (count(*) = 5, calls = 5)
    after stress_wait_events(5) under trace capture -- catches the
    symmetric-duplication class of bugs that the pg_sleep-based
    core test structurally can't detect.
  * LWLock hash overflow at 192 entries: cap enforced, overflow
    counter visible, WARNING fires once, reset clears counter.

Top-level pg_stat_reset_wait_event_timing(NULL) at start of the test
makes it idempotent against prior state in the same session.

Originally shipped across 8 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 65479aa Fix round 3 regressions: remove locks on single-writer paths, fix tests
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 9a359ab Fix stale wait state + assign-hook ereport contract violation
  - 5ad0dca Pin overflow-counter reset semantics with a regression assertion
  - 46594e4 Reformulate ring-lifecycle test to be strict AND robust (issues #25, #26)

Core regress test (wait_event_timing.sql / .out) exercises:

  * GUC default + SET wait_event_capture = stats.
  * Column types on pg_stat_wait_event_timing.
  * A pg_sleep generates at least one PgSleep wait event, reflected
    in calls/total/avg/max and summing correctly over the histogram.
  * pg_stat_reset_wait_event_timing(NULL / own_pid / unknown_pid),
    pg_stat_reset_wait_event_timing_all().
  * Trace ring basic lifecycle: SET capture=trace, sleep, read the
    ring, confirm query markers are present.
  * Trace ring reallocation across a TRACE -> OFF/STATS -> TRACE
    toggle.  Assertions are strict but count-agnostic: phase 2's
    max(seq) must be below phase 1's max(seq) iff the ring was freed,
    and aggregated calls must exactly equal the sum of the two phase
    ring counts.
  * Overflow counters visible (lwlock_overflow_count,
    flat_overflow_count, reset_count) and the reset_count_bumped
    invariant after resets earlier in the test.

The assertions avoid "pg_sleep emits exactly one PgSleep event" -- a
false premise under CPU contention on small CI VMs -- without
relaxing strictness: every equality is on a property that survives
spurious-wake variance.  Symmetric-duplication coverage lives in the
stress module.

The alternate expected file (wait_event_timing_1.out) is selected by
pg_regress when the build lacks USE_WAIT_EVENT_TIMING: setting
wait_event_capture to anything other than off fails with an ERROR,
and the SRFs return no rows.

parallel_schedule: wait_event_timing added to a group of 3 tests
small enough to not starve the single-backend timing assertions under
concurrent load.

rules.out: refreshed for the 4 new views in pg_catalog.

Originally shipped across 15 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - 995d99a Fix #23: Add regression tests for wait event timing
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 2f845dc Fix parallel_schedule: move wait_event_timing to smaller test group
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 65479aa Fix round 3 regressions: remove locks on single-writer paths, fix tests
  - 72549e2 Remove phantom backend_id argument from pg_stat_get_wait_event_trace
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 7478abb Expose per-backend wait event timing overflow counters
  - 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
  - 826df19 Switch wait_event_timing reset to PID-based, add _all() variant
  - 22ae8a5 Pin Option A invariants for trace DSA release (issue #15)
  - 835e22c Expose reset_count on wait_event_timing_overflow view (issue #18)
  - 7d14ecd Add pg_wait_event_timing_histogram_buckets taxonomy view (issue #23)
  - 46594e4 Reformulate ring-lifecycle test to be strict AND robust (issues #25, #26)
Without this, upstream CI never compiled the #ifdef
USE_WAIT_EVENT_TIMING paths on any PR, and pg_regress silently
selected the non-timing alternate expected output
(wait_event_timing_1.out), leaving the timing-build expected file
unused.

Append --enable-wait-event-timing to the existing
"Linux - Debian Trixie - Autoconf" matrix entry's configure line.
That entry already runs make check-world under
--enable-cassert --enable-injection-points --enable-debug
--enable-tap-tests, so adding one flag makes the timing path a
first-class part of every PR's CI without adding a new task.

Originally shipped as a single commit, preserved at
origin/archive/wait-event-timing-review-rounds:
  - 43df2a0 ci: enable --enable-wait-event-timing on one Linux task (issue #24)
@DmitryNFomin

review_5.md

DmitryNFomin added a commit that referenced this pull request Apr 25, 2026
Four small, independently-justified fixes from the consolidated review
in review_5.md.  All verified by `make check` (249/249 passing,
including the wait_event_timing regression test).

#1  Stale docstring on pg_stat_reset_wait_event_timing
    (wait_event_timing.c)
    Two stacked docstrings: the first described an obsolete
    backend_id-based API ("0 = own, -1 = all") that does not exist;
    the function takes a PID.  Merge them into one block that uses
    the correct PID semantics and preserves the async-protocol +
    reset_count-polling guidance from the obsolete one.

#2  Spurious wait_event_* includes in ipci.c
    Two #includes (wait_event.h, wait_event_timing.h) referenced no
    symbols in ipci.c.  The WaitEvent*ShmemCallbacks externs come via
    storage/subsystems.h -> subsystemlist.h.  Removing both leaves
    ipci.c building cleanly and shrinks its include graph.

#3  Missing WARNING for wait_event_capture=trace + compute_query_id=off
    (wait_event_timing.c, config.sgml)
    assign_wait_event_capture already warns when track_activities is
    off; mirror the same warning for compute_query_id, since markers
    silently skip when query_id is 0.  Verified live: setting
    compute_query_id=off then wait_event_capture=trace now produces
    the expected WARNING + HINT pointing at pg_stat_statements as a
    common enabler.

#4  Trace doc only described QUERY markers; EXEC markers were silent
    (config.sgml, monitoring.sgml)
    The trace ring emits four marker types, not two.  ExecStart/
    ExecEnd bracket every executor invocation and are NOT gated on
    track_activities, while QueryStart/QueryEnd cover inter-statement
    and idle-transition boundaries that the executor cannot.  Restate
    the prereqs in config.sgml as an itemized list per family, update
    monitoring.sgml's view description and consumer guidance to name
    ExecStart as the primary attribution signal and QueryStart as the
    fallback for boundaries the executor cannot mark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>