
Add wait_event_timing: Oracle-style wait event instrumentation #1

Open

DmitryNFomin wants to merge 7 commits into master from wait-event-timing

Conversation


@DmitryNFomin DmitryNFomin commented Mar 22, 2026

Summary

Adds Oracle-style wait event instrumentation to PostgreSQL — per-event timing,
histograms, query attribution, and per-session 10046-style tracing. Controlled
by a compile-time flag (--enable-wait-event-timing) with zero overhead when
not compiled in.

Branch history (2026-04-25 rebase): the patch series was squashed
from 60 review-round commits into the 7 logical bundles you see below
and rebased onto current upstream/master. The full pre-rebase
commit history (every individual review fix, with original commit
messages) is preserved at
origin/archive/wait-event-timing-review-rounds.
Each squashed commit message ends with a list of the originals it
absorbed, so any review-thread reference to an old commit SHA can be
resolved against that archive branch. Issue #12 (Windows
INSTR_TIME_GET_NANOSEC per-call QueryPerformanceFrequency) was
closed automatically by the rebase: upstream's recent
ticks_per_ns_scaled refactor (commits 0022622c93, 294520c444)
replaces the per-call frequency lookup with a one-time-initialized
scale factor.

The problem

External BPF-based wait event profilers (like pg_wait_tracer) use CPU hardware watchpoints on PGPROC->wait_event_info, costing ~200-300 ns per debug exception — 29% TPS overhead on high-transition workloads. Oracle solved this by instrumenting internally with clock_gettime() VDSO calls (~70-100 ns, no kernel trap).

What this patch provides

Compile-time: ./configure --enable-wait-event-timing (or meson setup -Dwait_event_timing=true).
Without the flag, the binary is identical to stock PostgreSQL.

One runtime GUC (PGC_SUSET):

| wait_event_capture | What it enables |
|---|---|
| off (default) | No instrumentation, no hot-path cost |
| stats | Per-event count, total ns, max, log2 histogram in pg_stat_wait_event_timing |
| trace | Everything in stats, plus a per-session ring of individual events + QUERY/EXEC markers in pg_backend_wait_event_trace, allowing per-query attribution by scanning the ring |

Oracle equivalents:

| Oracle | PostgreSQL |
|---|---|
| V$SYSTEM_EVENT | pg_stat_get_wait_event_timing() |
| V$EVENT_HISTOGRAM | pg_stat_get_wait_event_timing(), histogram column |
| V$SESSION_EVENT | pg_stat_get_wait_event_timing(), backend_id column |
| V$SQL wait stats | Scan pg_backend_wait_event_trace and group by query_id (QUERY_START/END markers delimit events per query) |
| 10046 trace | pg_get_backend_wait_event_trace() (session-local by design) |

Benchmark results

Environment: Hetzner cx43 (8 vCPU, 16 GB RAM), Rocky 9.7, PG 19devel,
pgbench scale 100, shared_buffers=128MB, 8 clients, SELECT-only — worst case
at ~220K wait event transitions/sec, 60-second runs.

Test 1: Compile flag overhead

Stock PG vs patched (without --enable-wait-event-timing) on same data directory,
alternating A/B runs, 5 rounds:

| Round | Stock TPS | Patched (no flag) TPS |
|---|---|---|
| 1 | 116,398 | 118,427 |
| 2 | 115,850 | 116,609 |
| 3 | 116,141 | 117,669 |
| 4 | 117,699 | 118,136 |
| 5 | 117,634 | 118,879 |

No measurable difference. Hot-path object code is byte-identical
(verified via objcopy -O binary -j .text).

Test 2: GUC overhead (same binary + same data)

Same binary (WITH flag), same data directory, toggling wait_event_capture between restarts, 5 rounds:

| Round | off | stats | trace |
|---|---|---|---|
| 1 | 114,768 | 116,058 | 115,911 |
| 2 | 116,726 | 115,789 | 116,502 |
| 3 | 115,972 | 115,977 | 115,170 |
| 4 | 116,405 | 113,924 | 115,242 |
| 5 | 115,781 | 116,556 | 114,603 |
| Mean | 115,930 | 115,661 | 115,486 |

< 0.5% difference. All configs within run-to-run variance.

vs hardware watchpoints

| Approach | Worst-case overhead |
|---|---|
| This patch (all features ON) | < 0.5% |
| Hardware watchpoint (pg_wait_tracer) | 29% |

All 244 PostgreSQL regression tests pass (make check).

Files changed

New files:

  • src/include/utils/wait_event_timing.h — data structures, inline helpers
  • src/backend/utils/activity/wait_event_timing.c — shmem, SQL functions, stubs

Modified files:

  • src/include/utils/wait_event.h — timing code in inline functions (#ifdef)
  • src/backend/storage/lmgr/proc.c — init/cleanup timing storage
  • src/backend/storage/ipc/ipci.c — shared memory allocation
  • src/backend/utils/activity/backend_status.c — query_id pointer setup
  • src/backend/utils/misc/guc_parameters.dat — GUC definitions
  • src/backend/utils/misc/guc_tables.c — GUC variable externs
  • src/include/catalog/pg_proc.dat — SQL function registration
  • configure.ac, meson.build, meson_options.txt — compile-time option
  • src/include/pg_config.h.in — USE_WAIT_EVENT_TIMING define
  • Build files: Makefile, meson.build in activity/

Related work

DmitryNFomin added a commit to DmitryNFomin/pg_wait_tracer that referenced this pull request Mar 23, 2026
Add section to Performance documenting the proposed PostgreSQL patch
(DmitryNFomin/postgres#1) that adds Oracle-style internal wait event
instrumentation with near-zero overhead (<1%) vs hardware watchpoints
(6-29%).

Benchmark results on same environment (Hetzner cx43, 8 vCPU):
- TPC-B: <1% vs 6% (hardware watchpoint)
- SELECT-only worst case: <1% vs 29%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DmitryNFomin force-pushed the wait-event-timing branch 4 times, most recently from 7069864 to 7cadcee on March 24, 2026 10:14
@DmitryNFomin (Owner, Author) commented:

Consolidated Fix Plan for PR #1

PHASE 1: Blockers (must fix before any commit consideration)

1. Shared memory footprint -- show-stopper (all 3 reviews)

  • WaitEventTraceState is 4 MB per backend, not "~128 KB" as the comment claims (131072 records * 32 bytes)
  • At max_connections=100, trace alone eats ~400 MB of shared memory, allocated unconditionally
  • Fix: either reduce ring size drastically (e.g. 4096 records = 128 KB), make it configurable via a sizing GUC, allocate lazily, or split trace into a separate patch entirely (all 3 reviews recommend splitting)
  • Also fix the wrong comment on WaitEventTraceState

2. Missing privilege checks on SQL functions (all 3 reviews)

  • All three SRFs expose all backends' data to any SQL user -- timing, query attribution, and trace
  • Fix: follow pg_stat_get_activity() pattern -- has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) for full access, restrict unprivileged users to own backend only

3. Stale dead-backend data leaks (all 3 reviews)

  • pgstat_reset_wait_event_timing_storage() only NULLs local pointers, does NOT zero shared memory
  • SRFs iterate all MaxBackends slots blindly, exposing old data from exited backends
  • Fix: either zero the shmem slot on backend exit, or iterate only live backend entries (preferred by GPT review)

4. Missing catversion bump (all 3 reviews)

  • 3 new OIDs (9950-9952) added to pg_proc.dat without updating catversion.h
  • Fix: bump CATALOG_VERSION_NO in src/include/catalog/catversion.h

5. configure script not regenerated (Opus + GPT reviews)

  • configure.ac updated but configure was not regenerated with autoconf
  • ./configure --enable-wait-event-timing silently ignores the flag
  • Fix: regenerate configure, or document meson as required build path

PHASE 2: Correctness (bugs and data integrity)

6. LWLock tranche IDs can exceed 256 (all 3 reviews)

  • wait_event_timing_index() caps eventId at 256, but LWLock tranches are dynamically allocated and can exceed this
  • Silently drops valid wait events with no user-visible indication
  • Fix: use a separate hash/mapping for LWLock class, or increase the per-class limit for class 0x01, or move to a hash keyed by full wait_event_info

7. Trace ring buffer reader/writer race (GPT + Opus reviews)

  • Writer fills record fields then increments write_pos -- no memory barrier between payload and position
  • Reader can observe torn/half-written records or mixed-generation data during wraparound
  • Fix: add per-record sequence/generation counter, or a seqlock, or explicit acquire/release barriers. At minimum document the best-effort approximation contract

8. Query hash entry publication order race (GPT review)

  • New entries write query_id first, readers treat query_id != 0 as valid
  • Concurrent reader can see partially initialized event_idx/count/total_ns
  • Fix: write payload fields first, barrier, then publish query_id last

9. No guard against negative duration (Opus review)

  • CLOCK_MONOTONIC can theoretically go backward under VM live migration
  • Negative duration_ns corrupts total_ns, max_ns, and causes undefined behavior in wait_event_timing_bucket()
  • Fix: add duration_ns = Max(duration_ns, 0) guard

10. wait_event_trace silently depends on wait_event_timing (GPT review)

  • Trace recording is nested inside if (wait_event_timing && ...) block
  • User sets wait_event_trace = on but gets no trace if wait_event_timing is off -- no warning
  • Fix: either document dependency + add GUC check hook to warn/reject, or make trace independent

PHASE 3: Missing features / dead code

11. reset_count is dead code / no reset function (all 3 reviews)

  • WaitEventTimingState.reset_count references nonexistent pg_stat_reset_wait_event_timing()
  • No way to reset cumulative counters without restarting the backend
  • Fix: implement a reset function (add to pg_proc.dat, write the C function), or remove the dead field

12. Query hash overflow_count not exposed (GPT + Opus reviews)

  • The hash table silently drops measurements when full (1024 slots)
  • overflow_count is tracked internally but never returned to the user
  • Fix: expose it as an extra column in pg_stat_get_wait_event_timing_by_query(), or add a separate status function

PHASE 4: API / design improvements

13. Histogram returned as text, not int4[] (all 3 reviews)

  • Current: "{0,0,1,...}" text string -- requires SQL string parsing
  • Fix: use construct_array() with INT4OID to return a proper int4[] array

14. Query attribution uses stale st_query_id (GPT review)

  • st_query_id persists after statement ends, so idle waits like ClientRead get charged to the previous query
  • Fix: add a separate "currently attributable query_id" field, or gate attribution on active execution state

15. Auxiliary processes excluded without documentation (GPT + Opus reviews)

  • bgwriter, checkpointer, walwriter, etc. have procNumber >= MaxBackends and are silently skipped
  • Fix: either extend storage to cover aux procs, or document the limitation explicitly

16. Histogram uses while-loop instead of hardware CLZ (Opus review)

  • wait_event_timing_bucket() uses manual bit-shifting loop
  • PostgreSQL provides pg_leftmost_one_pos64() in pg_bitutils.h which uses hardware __builtin_clzl
  • Fix: use the builtin for a single-instruction hot-path operation

17. Histogram bucket counters are int32, may wrap (GPT review)

  • WaitEventTimingEntry.histogram[] uses int32 while count/total_ns are int64
  • Long-lived busy backends can wrap bucket counters
  • Fix: promote to int64 (reassess memory impact), or define reset semantics tightly enough

18. GUC extern declarations misplaced (Opus review)

  • In guc_tables.c, externs are injected between #include lines (non-standard)
  • Fix: include wait_event_timing.h instead, or move externs to a proper location

19. Wasted array slots for class IDs 0x00 and 0x02 (Opus review)

  • These class IDs have no events, wasting 2 * 256 * 88 = ~44 KB per backend
  • Fix: use a class-to-dense-index mapping, or accept as minor waste

20. Unsupported-build GUC UX (GPT + Opus reviews)

  • Without --enable-wait-event-timing, GUCs exist and can be SET on with no effect until SQL functions error
  • Fix: reject SET wait_event_timing = on in unsupported builds via GUC check hook

PHASE 5: Integration (PostgreSQL conventions)

21. Add system views (all 3 reviews)

  • Shipping raw pg_stat_get_* functions without wrapping views is unfinished
  • Fix: add views in system_views.sql with useful context (pid, backend_type, datname, etc.)

22. Add SGML documentation (all 3 reviews)

  • No docs for GUCs, functions, or compile-time option
  • Fix: update config.sgml, monitoring.sgml, system-views.sgml

23. Add regression / TAP tests (all 3 reviews)

  • No test coverage for the new feature
  • Fix: basic aggregate timing rows, histogram bucket validation with pg_sleep(), permission tests, dead-backend cleanup test, build-path test

24. Header comment claims direct shmem reading (GPT review)

  • "External tools can read the accumulated stats from shared memory" implies an ABI promise
  • Fix: remove or tone down; the SQL/view layer is the supported interface

Recommended patch split (strong consensus from all 3 reviews)

| Patch | Scope |
|---|---|
| Patch 1 | Aggregate per-backend/per-event timing + histogram + views + permissions + docs + tests |
| Patch 2 | By-query attribution (after fixing query_id semantics and overflow visibility) |
| Patch 3 | Per-session trace ring buffer (after solving memory, concurrency, and permissions) |

@DmitryNFomin (Owner, Author) commented:

DmitryNFomin pushed a commit that referenced this pull request Apr 13, 2026
…ocation

The WaitEventTraceState array was 4 MB per backend allocated unconditionally
at startup via ShmemInitStruct (400 MB at max_connections=100). Replace with
DSA-backed lazy allocation: a small WaitEventTraceControl struct (~8 bytes
per backend) lives in fixed shmem, and the 4 MB ring buffer is allocated
via dsa_create/dsa_allocate only when a backend SETs wait_event_trace = on.

Key changes:
- Add WaitEventTraceControl with dsa_handle, LWLock, per-backend dsa_pointers
- Add LWTRANCHE_WAIT_EVENT_TRACE_DSA for the control LWLock
- Lazy DSA creation on first use (any backend), attach on subsequent backends
- GUC assign hook (assign_wait_event_trace) triggers allocation on enable
- Cleanup via before_shmem_exit callback (runs before dsm_backend_shutdown
  detaches DSA segments, avoiding the use-after-detach crash that would
  occur if dsa_free ran from ProcKill's on_shmem_exit callback)
- Fix pg_stat_get_wait_event_trace proisstrict: was defaulting to true,
  causing NULL argument (own backend) to skip the C function entirely

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@DmitryNFomin (Owner, Author) commented:

Add wait_event_timing: Oracle-style wait event instrumentation

Per-event timing with histograms, per-query attribution, and a per-session trace ring buffer for PostgreSQL. Controlled by a compile-time flag (--enable-wait-event-timing) with zero overhead when not compiled in.

The problem

PostgreSQL exposes the current wait event per backend (pg_stat_activity.wait_event), but provides no built-in way to answer:

  • How long did each wait take?
  • Which waits dominate this query?
  • What is the distribution of wait durations?

External approaches have significant limitations:

| Approach | Resolution | Overhead | Limitation |
|---|---|---|---|
| pg_wait_sampling | 10 ms polling | Low | Misses short waits; statistical |
| pg_wait_tracer (BPF) | Exact | 29% | Hardware watchpoint debug exceptions |

Oracle solved this decades ago by instrumenting internally with clock_gettime()-equivalent calls -- cheap VDSO operations (~40-100 ns) with no kernel trap.

What this patch provides

Compile-time: ./configure --enable-wait-event-timing (or meson setup -Dwait_event_timing=true)

Without the flag, the binary is identical to stock PostgreSQL (verified via objcopy -O binary -j .text).

Two runtime GUCs:

| GUC | Context | What it enables |
|---|---|---|
| wait_event_timing | PGC_SUSET | Per-event count, total ns, max, 16-bucket log2 histogram |
| wait_event_trace | PGC_USERSET | Per-session ring buffer of individual events (Oracle 10046 equivalent) |

Oracle equivalents:

| Oracle | PostgreSQL |
|---|---|
| V$SYSTEM_EVENT | pg_stat_wait_event_timing |
| V$EVENT_HISTOGRAM | pg_stat_wait_event_timing (histogram column) |
| V$SESSION_EVENT | pg_stat_wait_event_timing (per backend_id) |
| V$SQL wait stats | pg_stat_wait_event_timing_by_query |
| 10046 trace | pg_stat_wait_event_trace |

Design decisions

  1. Compile-time flag gates all instrumentation. Follows the PostgreSQL tradition of opt-in overhead (cf. --enable-dtrace, track_io_timing). Hot-path code is #ifdef'd away entirely when not compiled in.

  2. Per-backend flat array + LWLock hash. Each backend writes only to its own shared memory slot -- no locking needed. The flat array uses a compact per-class mapping (656 entries, ~97 KB) sized to actual event populations with headroom. LWLock tranches use a separate open-addressing hash table (192 entries) since their IDs are dynamically allocated.

  3. DSA-backed trace ring buffer. The ~4 MB per-backend ring is allocated lazily via DSA when wait_event_trace is enabled. Only tracing backends pay the cost. The ring is designed as a lock-free transport for external async readers; the SQL by_query view is a convenience interface.

  4. Signal-safe trace writes. Writers atomically claim a ring slot via pg_atomic_fetch_add_u64, then use a per-record seqlock for torn-read detection by concurrent readers.

  5. Out-of-line timing body. Only a bool check + conditional call is inlined at each pgstat_report_wait_end() call site. The full timing logic is in a separate function, keeping I-cache pressure low.

  6. Graceful degradation. When compiled without the flag, SQL functions return empty result sets (not errors), GUC SET ... = on is rejected with a helpful hint, and views exist but are empty. Monitoring tools work without modification.

Shared memory cost

| Component | Per backend | At max_connections=200 |
|---|---|---|
| Timing arrays (flat + LWLock hash) | ~127 KB | ~28 MB |
| Trace ring buffer (DSA, lazy) | ~4 MB | Only traced backends |

Benchmark results

pgbench, PG 19devel, 60-second runs, 3 runs per data point, median with outlier cleaning.

Configs: A = stock PG (no flag), B = compiled with flag (GUCs off), C = timing ON, D = timing+trace ON

SELECT-only (worst case -- highest wait event transition rate):

| Clients | A (base) | B (compiled) | C (timing ON) | D (all ON) |
|---|---|---|---|---|
| 1 | 23,385 | 23,571 | 23,563 | 22,988 |
| 4 | 72,019 | 72,895 | 72,643 | 73,081 |
| 64 | 93,588 | 95,887 | 95,463 | 91,915 |

TPC-B:

| Clients | A (base) | B (compiled) | C (timing ON) | D (all ON) |
|---|---|---|---|---|
| 1 | 4,350 | 4,433 | 4,242 | 4,340 |
| 16 | 11,225 | 11,186 | 11,251 | 11,009 |

CPU-bound (SELECT 1+1):

| Clients | A (base) | B (compiled) | C (timing ON) | D (all ON) |
|---|---|---|---|---|
| 4 | 88,199 | 87,978 | 88,251 | 87,915 |
| 64 | 96,922 | 96,398 | 96,172 | 97,357 |

All configurations within run-to-run variance. No measurable overhead.

Testing

  • All 244 regression tests pass (both timing and non-timing builds)
  • Zero compiler warnings under -Wall -Wmissing-prototypes -Wmissing-variable-declarations
  • Dedicated regression test covers: view structure, histogram invariants, trace records, query markers, reset, privilege checks, alternate expected output for non-timing builds

Files changed (36 files)

New:

  • src/include/utils/wait_event_timing.h -- data structures, inline helpers
  • src/backend/utils/activity/wait_event_timing.c -- shmem, SQL functions, stubs
  • src/test/modules/test_wait_event_stress/ -- overhead measurement extension

Key modified:

  • src/include/utils/wait_event.h -- timing hooks in inline functions
  • src/backend/storage/lmgr/proc.c -- init/cleanup timing storage
  • src/backend/storage/ipc/ipci.c -- shared memory allocation
  • src/backend/tcop/postgres.c -- QUERY_END marker on idle transition
  • src/include/catalog/pg_proc.dat -- 4 new SQL functions (OIDs 9950-9953)
  • Build system: configure.ac, meson.build, meson_options.txt
  • Documentation: config.sgml, monitoring.sgml, installation.sgml

Related work

DmitryNFomin pushed a commit that referenced this pull request Apr 17, 2026
Address all 20 issues identified by cross-checking Gemini 3.1, Opus 4.6,
and GPT 5.4 reviews against the actual PR code.  Each fix follows the
approved plan in crosscheck_review_fix_3.md.

Security / correctness fixes:

  Issue #1 (UAF in trace ring reads):
  - Remove pg_stat_get_wait_event_timing_by_query() entirely (OID 9951,
    view, docs, tests) -- cross-backend DSA ring scanning was the primary
    use-after-free surface.
  - Restrict pg_stat_get_wait_event_trace() to own-session only; the
    backend_id argument is accepted for forward compat but ignored.
  - Wrap DSA pointer mutations (attach/detach) in WaitEventTraceCtl->lock
    LW_EXCLUSIVE to protect external readers and fix 32-bit torn pointers
    (also resolves Issue #11).
  - Document the ring buffer as an external lock-free transport with a
    seqlock reader contract in wait_event_timing.h and monitoring.sgml.

  Issue #6 (cross-backend reset race):
  - Add per-backend LWLock to WaitEventTimingState, following the
    pgstat_reset_entry() pattern from pgstat_shmem.c.
  - Writers acquire LW_SHARED on the hot path; reset acquires LW_EXCLUSIVE.
  - Init and exit paths individually zero fields instead of memset on the
    whole struct, preserving the LWLock.
  - Register LWTRANCHE_WAIT_EVENT_TIMING in lwlocklist.h.

Build / startup fixes:

  Issue #2: Replace non-existent wait_classes.h include with
  wait_event_types.h in wait_event_timing.h.

  Issue #3 (config-file trace broken): Add explicit
  wait_event_trace_attach() calls after ResourceOwner init in
  auxprocess.c (after CreateAuxProcessResourceOwner) and postinit.c
  (after process_settings) to handle inherited GUC values that do not
  re-fire assign hooks after fork.

  Issue #10 (non-timing build silent suppression): Emit WARNING before
  forcing wait_event_timing/wait_event_trace to off in check hooks for
  non-timing builds (source < PGC_S_INTERACTIVE path).

  Issue #14 (EXEC_BACKEND init order): Add comment in
  pgstat_set_wait_event_timing_storage() confirming SubPostmasterMain
  calls CreateSharedMemoryAndSemaphores() before InitProcess().

Query lifecycle / phase separation:

  Issue #4 (QUERY_END broken for multi-statement):
  - Hybrid approach: emit QUERY_END for old query and QUERY_START for new
    query in pgstat_report_query_id() transitions (backend_status.c).
  - Keep existing QUERY_END in send_ready_for_query (postgres.c) for the
    last-query-to-idle boundary.
  - Covers simple single/multi-stmt, extended, prepared, and pipeline
    protocols without double-emit.

  Issue #5 (phase separation): Add TRACE_EXEC_END marker (type=4) emitted
  from ExecutorEnd() in execMain.c, separating execution from post-exec
  cleanup (commit, WAL flush, lock release).  Also resolves Issue #13
  (extended protocol phase misattribution).

Overflow and dependency warnings:

  Issue #7: One-time WARNING on first LWLock hash overflow (>192 tranches).
  Issue #8: WARNING in assign_wait_event_trace when track_activities is
  off; document the dependency in config.sgml.
  Issue #12: Add flat_overflow_count field + one-time WARNING when eventId
  exceeds per-class slot count.

Header cleanup and code generation:

  Issue #15: Move static const arrays and internal inline functions from
  wait_event_timing.h to wait_event_timing.c as file-static.  Convert
  trace marker inlines to extern functions with a shared
  wait_event_trace_write_marker() helper.

  Issue #16 (parallel array validation): Extend generate-wait_event_types.pl
  to emit wait_event_timing_data.c (mapping arrays) and the three sizing
  #defines into wait_event_types.h, generated from wait_event_names.txt.
  Eliminates hand-maintained parallel arrays entirely.  Update Makefile,
  meson.build, and pgindent exclusions.

Documentation:

  Issue #17: Document InitMaterializedSRF usage as acceptable for both
  the timing SRF (bounded result set) and trace SRF (own-session only,
  deferred max_records parameter).

  Issue #18: Document ring wrap limitation, polling contract, and
  st_query_id fallback in wait_event_timing.h and monitoring.sgml.

  Issue #19 nits: Remove duplicate header include, remove install/
  entries from .gitignore, add tranche_id 65535 academic comment, add
  eventId-equals-array-index comment, verify OID range (9950/9952/9953).

Testing:

  Issue #20:
  - Add trace lifecycle regression test (off/on/query/read/off cycle).
  - Add auxiliary process timing test (verify checkpointer/bgwriter/
    walwriter accumulate non-zero wait event counts).
  - Add test_lwlock_hash_overflow() to test_wait_event_stress module:
    registers >192 custom LWLock tranches to exercise hash overflow
    path and verify WARNING emission.
  - Document that default Cirrus CI already exercises non-timing build
    path via wait_event_timing_1.out alternate expected output.

Issues resolved by other fixes (no separate action needed):
  Issue #9: Resolved by Issue #1 (by_query view removed).
  Issue #11: Resolved by Issue #1C (LWLock around DSA pointer mutations).
  Issue #13: Resolved by Issue #5 (EXEC_START/EXEC_END markers).

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@DmitryNFomin (Owner, Author) commented:

crosscheck_review_4.md

@DmitryNFomin (Owner, Author) commented:

given that upstream has structurally refactored two subsystems we depend on (ShmemInit callbacks in ipci.c, and instr_time ticks-to-ns conversion)?

Compile-time: ./configure --enable-wait-event-timing  (autoconf)
              -Dwait_event_timing=true                (meson)
Without either, the binary is byte-identical to stock PostgreSQL on
the hot path; the feature is fully gated behind USE_WAIT_EVENT_TIMING.

Runtime: one enum GUC wait_event_capture = off | stats | trace
(PGC_SUSET).  "stats" enables per-event count, total, max, and log2
histogram accumulation.  "trace" additionally enables a per-session
ring buffer of individual events (Oracle 10046-style) for query
attribution and timeline analysis.  In stub builds the GUC rejects
any value other than off with a WARNING or ERROR (depending on source),
matching the "feature is compile-time" expectation.

This bundle adds only the plumbing: the build-system option, the GUC
declaration and stub hooks, the postgresql.conf.sample entry, the
installation/config SGML sections, and the extern declarations in
guc_hooks.h and guc.h that the core instrumentation in the next bundle
will implement.

Originally shipped across 14 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ef4fade Add wait_event_timing: Oracle-style wait event instrumentation
  - 1b95736 Add 10046-style per-session wait event trace ring buffer
  - 7cadcee Make wait_event_timing a compile-time option (--enable-wait-event-timing)
  - 937cd89 Fix #1: Replace unconditional trace ring allocation with DSA lazy allocation
  - 8216ef5 Fix #5: Add --enable-wait-event-timing to configure, use pattern 1
  - f927dd3 Fix #10: Warn when wait_event_trace enabled without wait_event_timing
  - 8089e98 Fix #18: Replace bare extern GUC declarations with header include
  - 4cc876d Fix #20: Add GUC check hooks that reject 'on' in unsupported builds
  - bed120f Fix #22: Add SGML documentation for wait event timing
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 708b05a Close wait_event_trace DoS surface: PGC_USERSET -> PGC_SUSET
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
…ocol

Gated behind USE_WAIT_EVENT_TIMING.  Adds:

  * pgstat_report_wait_start() / _end() inline hot path (wait_event.h):
    INSTR_TIME_SET_CURRENT() + per-backend write when capture level is
    non-off.  Out-of-line body pgstat_report_wait_end_timing() keeps
    I-cache pressure low at the many inline call sites.

  * Per-backend WaitEventTimingState (wait_event_timing.h,
    wait_event_timing.c): a flat events[] array for bounded wait-event
    classes + an open-addressing hash (512 slots, 192-entry cap) for
    unbounded LWLock tranche IDs.  Writer-exclusive single-writer slot
    design, no locks on the hot path.

  * Log2 histogram with 16 buckets on powers-of-two nanoseconds
    (~1us, 1-2us, ... 8-16ms, >=16ms), computed via CLZ without a
    runtime /1000.  Overflow counters (lwlock_overflow_count,
    flat_overflow_count) track events dropped when the per-backend
    hash is saturated or eventIds fall outside the flat range.

  * Lock-free cross-backend reset via pg_atomic_uint32
    reset_generation: requester bumps the atomic + SetLatch; owner
    clears its own counters on the next wait_end.  Own-backend reset
    is synchronous.  reset_count is a monotonic observability counter
    that external callers can poll to confirm a remote reset has been
    observed.

  * DSA-lazy allocation: the per-backend timing array
    (~30KB/backend) lives in DSA and is created the first time any
    backend enables capture.  Clusters running with
    wait_event_capture = off pay only the small control struct
    (sizeof(WaitEventTimingControl)) in fixed shmem.

  * Per-session trace ring buffer (WaitEventTraceState,
    131072 x 32-byte records = 4 MB) allocated lazily via DSA when
    wait_event_capture = trace.  Seqlock (WaitEventTraceRecord.seq,
    odd = write in progress) for torn-read detection.  Released back
    to DSA on step-down from trace to stats/off.  Per-record layout
    pinned with StaticAssertDecl.

  * Generator wiring: generate-wait_event_types.pl emits
    wait_event_timing_data.c (dense class mapping consumed by the
    bucket lookup) and wait_event_types.h into src/backend/utils/.
    Build rules in backend/utils/Makefile + activity/Makefile +
    meson.build; .gitignore entries on both sides.

  * Backend lifecycle: InitProcess (proc.c) and auxiliary processes
    (auxprocess.c) call pgstat_set_wait_event_timing_storage() with
    their procNumber; ProcKill and related exit paths call
    pgstat_reset_wait_event_timing_storage().
    backend_status.c exposes query_id pointer for query-attribution
    markers.  postinit.c hooks into client-backend startup.

  * Two new LWLock tranches (lwlocklist.h) for DSA creation.

  * ipci.c registers the two small control structs
    (WaitEventTimingControl + WaitEventTraceControl) in fixed shmem;
    everything else is lazy DSA.

Hot-path invariants (verified by regression tests in a later bundle):
- wait_start is zero between waits (set by wait_end_timing on exit)
- current_event cleared on reset to avoid stale-state readers
- reset_generation observation happens before accumulation, so the
  completing wait credits to freshly-zeroed counters

Originally shipped across 48 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ef4fade Add wait_event_timing: Oracle-style wait event instrumentation
  - 5120e80 Fix: remove pg_bitutils.h dependency, use simple bit shift for histogram bucket
  - 6a81684 Fix segfault: handle auxiliary processes with procNumber >= MaxBackends
  - 7cf09db Add query attribution: per (query_id, event) wait statistics
  - e8a1717 Add query attribution: per (query_id, event) wait statistics
  - 1b95736 Add 10046-style per-session wait event trace ring buffer
  - 7cadcee Make wait_event_timing a compile-time option (--enable-wait-event-timing)
  - 4ccf36a Increase trace ring buffer from 4K to 128K records per backend
  - 0e04ef2 Fix: add missing tuplestore.h include for tuplestore_putvalues()
  - 937cd89 Fix #1: Replace unconditional trace ring allocation with DSA lazy allocation
  - e496cb2 Fix #2: Add privilege checks to wait event timing SQL functions
  - 0fadd0a Fix #3: Zero shared memory on backend exit to prevent stale data leaks
  - 8216ef5 Fix #5: Add --enable-wait-event-timing to configure, use pattern 1
  - 5d821c8 Fix #6: LWLock tranche IDs can exceed 256 -- per-backend hash table
  - 8999a5c Fix #7: Trace ring buffer reader/writer race -- seqlock protocol
  - 37e2741 Fix #8: Query hash entry publication order race
  - 1bea5c8 Fix #9: Clamp negative duration to zero
  - f927dd3 Fix #10: Warn when wait_event_trace enabled without wait_event_timing
  - 5a60203 Fix #11: Implement pg_stat_reset_wait_event_timing() SQL function
  - 5aadd83 Fix #12: Redesign query attribution -- eliminate hash table, use trace ring
  - a4e56f1 Fix #13: Return histogram as int4[] instead of text
  - edb8d9b Fix #15: Include auxiliary processes in wait event timing
  - 80c84f6 Fix #16: Replace histogram bucket loop with hardware CLZ
  - 5a72b47 Fix #17: Promote histogram bucket counters from int32 to int64
  - 4cc876d Fix #20: Add GUC check hooks that reject 'on' in unsupported builds
  - c5abb1e Fix #21: Add system views and pid/backend_type to SRFs
  - eb40cdd Fix #24: Rewrite header comment to point at SQL interface
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 6211659 Fix all compiler warnings in wait_event_timing.c
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 65479aa Fix round 3 regressions: remove locks on single-writer paths, fix tests
  - 9bcf83c Fix cross-backend reset race: lock-free request-response (Option B)
  - 72549e2 Remove phantom backend_id argument from pg_stat_get_wait_event_trace
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 9a359ab Fix stale wait state + assign-hook ereport contract violation
  - 7478abb Expose per-backend wait event timing overflow counters
  - 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
  - d1ea636 Allocate per-backend wait event timing array lazily in DSA
  - 379d650 Untrack generated wait_event_timing_data.c file
  - 826df19 Switch wait_event_timing reset to PID-based, add _all() variant
  - 46fe5d0 Release wait-event trace DSA ring on TRACE -> OFF/STATS
  - 6641deb Zero current_event on reset (issue #9 purist fix)
  - 835e22c Expose reset_count on wait_event_timing_overflow view (issue #18)
  - 7e15c8a Reuse histogram ArrayType across SRF rows (issue #19)
  - f75c9b7 Skip /1000 in histogram bucket computation (issue #20)
  - 11ed373 Double LWLock timing hash slot count to 512 (issue #21)
  - a4c133a Pin trace-ring invariants with StaticAssertDecl + barrier annotations (issue #22)
When wait_event_capture = trace and compute_query_id is on, insert
per-query markers into the per-session trace ring so external
consumers can attribute wait events back to their originating query
by scanning.  Markers are tagged records (TRACE_QUERY_START,
TRACE_QUERY_END, TRACE_EXEC_START, TRACE_EXEC_END) sharing the same
WaitEventTraceRecord layout as wait-event entries; they carry a
query_id payload instead of a wait-event-info + duration payload.

The two hook points:

  * tcop/postgres.c: QUERY_START / QUERY_END bracket the
    exec_simple_query / exec_execute_message boundary, covering the
    full statement lifecycle including parse/analyse/rewrite and
    post-execution cleanup.

  * executor/execMain.c: EXEC_START / EXEC_END bracket
    standard_ExecutorRun / _ExecutorEnd so the planner-less execution
    phase is distinguishable from query-level overhead on the
    timeline.

Query attribution is read-time: pg_backend_wait_event_trace
materializes the ring and the caller partitions events by the
enclosing QUERY_START/END markers.  This replaces an earlier design
that kept a per-backend query-id-to-stats hash in shared memory.

Originally shipped across 2 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
User-facing surface:

  * pg_stat_get_wait_event_timing() SRF: one row per
    (backend, wait_event) with non-zero calls.  Columns: pid,
    backend_type, backend_id, wait_event_type, wait_event, calls,
    total_time_ms, avg_time_us, max_time_us, histogram (bigint[16]).

  * pg_stat_get_wait_event_timing_overflow() SRF: one row per live
    backend exposing lwlock_overflow_count, flat_overflow_count, and
    reset_count.  reset_count is the observability hook external
    callers poll to confirm an asynchronous cross-backend reset has
    landed.

  * pg_get_backend_wait_event_trace() SRF: session-local dump of
    the caller's own trace ring.  Returns (seq, timestamp_ns,
    wait_event_type, wait_event, duration_us, query_id).  Name mirrors
    pg_get_backend_memory_contexts to make the session-local scope
    explicit at the API level; there is no cross-backend trace read.

  * pg_stat_reset_wait_event_timing(pid int) / _all():
    NULL/own pid resets own backend synchronously; other pid requires
    superuser and goes through the async reset_generation protocol.
    _all() is a dedicated function instead of a -1 overload, matching
    pg_stat_reset_* conventions elsewhere.

System views over the functions:

  * pg_stat_wait_event_timing, pg_stat_wait_event_timing_overflow
    (REVOKE PUBLIC / GRANT pg_read_all_stats).
  * pg_backend_wait_event_trace (PUBLIC; session-local anyway).
  * pg_wait_event_timing_histogram_buckets: a constant 16-row
    taxonomy view that translates histogram index -> (lower_ns,
    upper_ns, label).  Canonical join pattern documented in
    monitoring.sgml is
        FROM pg_stat_wait_event_timing w,
             LATERAL unnest(w.histogram) WITH ORDINALITY AS h(count, idx)
        JOIN pg_wait_event_timing_histogram_buckets b
             ON b.bucket_idx = h.idx - 1.

SGML docs:

  * A <sect2> per view in monitoring.sgml, mirroring existing
    pg_stat_* view documentation.
  * Columns typed, histogram bin edges described as powers of two
    on nanoseconds with the approximate decimal-microsecond grid
    called out (1024 ns ~ 1 us, ..., 2^24 ns ~ 16 ms).

Uses OIDs 9956-9960 (verified unclaimed upstream) and bumps
CATALOG_VERSION_NO.  In non-timing builds every function has a stub
that either returns an empty SRF or raises
ERRCODE_FEATURE_NOT_SUPPORTED, so SQL callers on stock PostgreSQL get
predictable behaviour.

Originally shipped across 22 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ef4fade Add wait_event_timing: Oracle-style wait event instrumentation
  - 7cf09db Add query attribution: per (query_id, event) wait statistics
  - 1b95736 Add 10046-style per-session wait event trace ring buffer
  - 937cd89 Fix #1: Replace unconditional trace ring allocation with DSA lazy allocation
  - 281da3f Fix #4: Bump CATALOG_VERSION_NO for new pg_proc OIDs
  - 5a60203 Fix #11: Implement pg_stat_reset_wait_event_timing() SQL function
  - a4e56f1 Fix #13: Return histogram as int4[] instead of text
  - 5a72b47 Fix #17: Promote histogram bucket counters from int32 to int64
  - c5abb1e Fix #21: Add system views and pid/backend_type to SRFs
  - bed120f Fix #22: Add SGML documentation for wait event timing
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 72549e2 Remove phantom backend_id argument from pg_stat_get_wait_event_trace
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 7478abb Expose per-backend wait event timing overflow counters
  - 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
  - 826df19 Switch wait_event_timing reset to PID-based, add _all() variant
  - d412c87 Renumber wait_event_timing OIDs to 9956-9960 (issue #17)
  - 835e22c Expose reset_count on wait_event_timing_overflow view (issue #18)
  - f75c9b7 Skip /1000 in histogram bucket computation (issue #20)
  - 7d14ecd Add pg_wait_event_timing_histogram_buckets taxonomy view (issue #23)
Two exported functions in a loadable test module under src/test/modules:

  * stress_wait_events(n int) -> bigint
    Calls pgstat_report_wait_start(WAIT_EVENT_PG_SLEEP) +
    pgstat_report_wait_end() in a tight loop exactly n times.
    Returns elapsed microseconds for sanity checks and lets the
    regression test assert exact counts (count(*) = n, calls = n)
    without the spurious-latch-wake non-determinism that pg_sleep()
    suffers from on busy CI runners.

  * test_lwlock_hash_overflow(n int) -> int
    Registers n custom LWLock tranches via LWLockNewTrancheId() and
    fires one wait-event cycle on each, exercising the per-backend
    LWLock timing hash up to and past its 192-entry cap.  The
    regression test uses this to pin the overflow-counter semantics
    (lwlock_overflow_count increments, one-shot WARNING fires, reset
    clears the suppression).

Regression test covers:
  * Basic "capture is wired" smoke.
  * Deterministic exact-count assertions (count(*) = 5, calls = 5)
    after stress_wait_events(5) under trace capture -- catches the
    symmetric-duplication class of bugs that the pg_sleep-based
    core test structurally can't detect.
  * LWLock hash overflow at 192 entries: cap enforced, overflow
    counter visible, WARNING fires once, reset clears counter.

Top-level pg_stat_reset_wait_event_timing(NULL) at start of the test
makes it idempotent against prior state in the same session.

Originally shipped across 8 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 65479aa Fix round 3 regressions: remove locks on single-writer paths, fix tests
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 9a359ab Fix stale wait state + assign-hook ereport contract violation
  - 5ad0dca Pin overflow-counter reset semantics with a regression assertion
  - 46594e4 Reformulate ring-lifecycle test to be strict AND robust (issues #25, #26)

Core regress test (wait_event_timing.sql / .out) exercises:

  * GUC default + SET wait_event_capture = stats.
  * Column types on pg_stat_wait_event_timing.
  * A pg_sleep generates at least one PgSleep wait event, reflected
    in calls/total/avg/max and summing correctly over the histogram.
  * pg_stat_reset_wait_event_timing(NULL / own_pid / unknown_pid),
    pg_stat_reset_wait_event_timing_all().
  * Trace ring basic lifecycle: SET capture=trace, sleep, read the
    ring, confirm query markers are present.
  * Trace ring reallocation across a TRACE -> OFF/STATS -> TRACE
    toggle.  Assertions are strict but count-agnostic: phase 2's
    max(seq) must be below phase 1's max(seq) iff the ring was freed,
    and aggregated calls must exactly equal the sum of the two phase
    ring counts.
  * Overflow counters visible (lwlock_overflow_count,
    flat_overflow_count, reset_count) and the reset_count_bumped
    invariant after resets earlier in the test.

The assertions avoid "pg_sleep emits exactly one PgSleep event" -- a
false premise under CPU contention on small CI VMs -- without
relaxing strictness: every equality is on a property that survives
spurious-wake variance.  Symmetric-duplication coverage lives in the
stress module.

The alternate expected file (wait_event_timing_1.out) is selected by
pg_regress when the build lacks USE_WAIT_EVENT_TIMING: setting
wait_event_capture to anything other than off fails with an ERROR,
and the SRFs return no rows.

parallel_schedule: wait_event_timing added to a group of 3 tests
small enough to not starve the single-backend timing assertions under
concurrent load.

rules.out: refreshed for the 4 new views in pg_catalog.

Originally shipped across 15 commits, preserved at
origin/archive/wait-event-timing-review-rounds:
  - 995d99a Fix #23: Add regression tests for wait event timing
  - ebe34ba Add EXEC_START trace marker and phase column to by_query
  - c2d63f0 Fix review round 2: 17 issues from consolidated review
  - 2f845dc Fix parallel_schedule: move wait_event_timing to smaller test group
  - 4ca0ec4 Fix review round 3: 20 issues from triple AI cross-check
  - 65479aa Fix round 3 regressions: remove locks on single-writer paths, fix tests
  - 72549e2 Remove phantom backend_id argument from pg_stat_get_wait_event_trace
  - 12cc637 Collapse wait_event_timing + wait_event_trace booleans into single enum
  - 7478abb Expose per-backend wait event timing overflow counters
  - 3dc8592 Rename trace view/function to pg_backend_* to match session-local scope
  - 826df19 Switch wait_event_timing reset to PID-based, add _all() variant
  - 22ae8a5 Pin Option A invariants for trace DSA release (issue #15)
  - 835e22c Expose reset_count on wait_event_timing_overflow view (issue #18)
  - 7d14ecd Add pg_wait_event_timing_histogram_buckets taxonomy view (issue #23)
  - 46594e4 Reformulate ring-lifecycle test to be strict AND robust (issues #25, #26)
Without this, upstream CI never compiled the #ifdef
USE_WAIT_EVENT_TIMING paths on any PR, and pg_regress silently
selected the non-timing alternate expected output
(wait_event_timing_1.out), leaving the timing-build expected file
unused.

Append --enable-wait-event-timing to the existing
"Linux - Debian Trixie - Autoconf" matrix entry's configure line.
That entry already runs make check-world under
--enable-cassert --enable-injection-points --enable-debug
--enable-tap-tests, so adding one flag makes the timing path a
first-class part of every PR's CI without adding a new task.

Originally shipped as a single commit, preserved at
origin/archive/wait-event-timing-review-rounds:
  - 43df2a0 ci: enable --enable-wait-event-timing on one Linux task (issue #24)
@DmitryNFomin

review_5.md

DmitryNFomin added a commit that referenced this pull request Apr 25, 2026
Four small, independently-justified fixes from the consolidated review
in review_5.md.  All verified by `make check` (249/249 passing,
including the wait_event_timing regression test).

#1  Stale docstring on pg_stat_reset_wait_event_timing
    (wait_event_timing.c)
    Two stacked docstrings: the first described an obsolete
    backend_id-based API ("0 = own, -1 = all") that does not exist;
    the function takes a PID.  Merge them into one block that uses
    the correct PID semantics and preserves the async-protocol +
    reset_count-polling guidance from the obsolete one.

#2  Spurious wait_event_* includes in ipci.c
    Two #includes (wait_event.h, wait_event_timing.h) referenced no
    symbols in ipci.c.  The WaitEvent*ShmemCallbacks externs come via
    storage/subsystems.h -> subsystemlist.h.  Removing both leaves
    ipci.c building cleanly and shrinks its include graph.

#3  Missing WARNING for wait_event_capture=trace + compute_query_id=off
    (wait_event_timing.c, config.sgml)
    assign_wait_event_capture already warns when track_activities is
    off; mirror the same warning for compute_query_id, since markers
    silently skip when query_id is 0.  Verified live: setting
    compute_query_id=off then wait_event_capture=trace now produces
    the expected WARNING + HINT pointing at pg_stat_statements as a
    common enabler.

#4  Trace doc only described QUERY markers; EXEC markers were silent
    (config.sgml, monitoring.sgml)
    The trace ring emits four marker types, not two.  ExecStart/
    ExecEnd bracket every executor invocation and are NOT gated on
    track_activities, while QueryStart/QueryEnd cover inter-statement
    and idle-transition boundaries that the executor cannot.  Restate
    the prereqs in config.sgml as an itemized list per family, update
    monitoring.sgml's view description and consumer guidance to name
    ExecStart as the primary attribution signal and QueryStart as the
    fallback for boundaries the executor cannot mark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>