wait_event_timing: reviewable patch series (stats / DSA / trace) by DmitryNFomin · Pull Request #2 · DmitryNFomin/postgres

DmitryNFomin · 2026-06-07T20:35:53Z

Purpose

This PR re-implements the wait_event_timing work (originally #1, ~8k lines in one
patch) as a reviewable patch series, following Andrey Borodin's review feedback:
split into chewable, independently-committable parts so a committer can see value
from the first commit without venturing into the whole thing.

This is a working/tracking PR on the fork — not an upstream submission. Upstream
goes via a git format-patch series to pgsql-hackers + one Commitfest entry once the
series is ready. Base branch is wet-series-base (pinned at the upstream baseline
commit 89eafad297a) so this PR's diff shows only our series commits.

The split — three projects

A — stats mode on eager shared memory (anchor; small, review-friendly).
Deliberately uses plain ShmemAlloc, not the lazy-DSA design, so all the DSA
re-entrancy/guard complexity is isolated into Project B.

B — lazy-DSA refactor (high-risk; isolates the DSA machinery and the bugs found
across the original review rounds). Pure refactor: A's tests pass unchanged.

C — trace mode (ring buffer, position-encoded seqlock, orphan lifecycle, query
attribution markers, cross-backend reader). Depends on B.

Each commit builds and passes check-world on its own (bisectable); docs and tests
travel with their feature commit.

Progress

Project A — stats (eager shmem)

A1 — --enable-wait-event-timing flag (both build systems) + wait_event_capture GUC scaffold (off|stats), check/assign hooks, stub-build path. Verified: 3 build configs + runtime GUC behaviour.
A2 — recording hot path (inline gate, no unlikely()) + eager per-backend storage + generated class mapping + LWLock tranche hash + 32-bucket histogram + pg_stat_get_wait_event_timing SRF/view + histogram-buckets view + docs + tests. Verified: 3 build configs + regression test (feature & stub) + live capture; adversarial review passed (incl. EXEC_BACKEND).
A3 — overflow counters + pg_stat_reset_wait_event_timing(pid)/_all() (atomic reset_generation + latch) + docs + tests. Review: reset concurrency model sound; fixed a proargdefaults catalog/doc mismatch (inherited from Add wait_event_timing: Oracle-style wait event instrumentation #1).
A4 — enable --enable-wait-event-timing on one CI task (GitHub Actions Linux - Autoconf; upstream replaced Cirrus). Verified UBSan-clean. Project A complete.

Project B — lazy-DSA refactor

B1 — convert eager shmem → lazy DSA (control struct, ensure_dsa, attach_array, re-entrancy guard, crit-section/lwWaiting attach guards, before_shmem_exit writes-disabled gate). Same SQL surface; A's tests pass unchanged.

Project C — trace

C1 — trace GUC level + ring writer (seqlock) + ring-size GUC + lazy ring attach.
C2 — slot lifecycle (FREE/OWNED/ORPHANED, before_shmem_exit transition, clear_orphan_at_init, release_slot, trace writes-disabled gate).
C3 — trace SQL surface (pg_get_backend_wait_event_trace, pg_get_wait_event_trace, pg_stat_clear_orphaned_wait_event_rings) + privileges (incl. the PUBLIC-execute security fix).
C4 — query attribution markers (execMain.c, postgres.c pipelined flush, backend_status.c).
C5 — trace docs + tests (incl. injection-point seqlock test).

Carry-over fixes folded into the series

Drop the inline-gate unlikely() (A2) — A/B benchmark showed byte-identical codegen and ±0.09% TPS (within noise) on gcc 13 -O2.
Fix pg_get_backend_wait_event_trace PUBLIC-execute (C3; was a live security finding in Add wait_event_timing: Oracle-style wait event instrumentation #1).
Resolve the "bgwriter periodic sweep" claim in Add wait_event_timing: Oracle-style wait event instrumentation #1's description (C2) — implement or strike.

Original (monolithic) PR

See #1 for the full prior implementation, the benchmark battery, and Andrey's review thread.

Validation (Project A, CCX43 16 vCPU)

Correctness: full meson test (365 groups — regression + isolation + recovery/subscription/pg_upgrade TAP) passes with 0 failures in both the timing build and the stub build. Targeted tests pass incl. a two-session cross-backend reset. UBSan-clean. (The full suite caught a missing rules.out update for the new views — now folded into the A2/A3 commits so each stays bisectable.)

Performance (pgbench; BASELINE=no patch vs WET-off vs WET-stats):

workload	WET/off vs baseline	WET/stats vs off
-S read-only (cached)	−0.57%	−0.30%
TPC-B	+0.18% (8-run)	−0.51%
-S wait-saturated (32MB)	−0.42%	+0.29%

Off-mode overhead is within noise. A TPC-B run initially showed −3% for WET/off; a stub-wet control (patch with the gate #ifdef'd out) proved it a binary-layout artifact — the gate-removed build was −2.39% while adding the gate was +2.64%, impossible for a real cost. Gate cost is below the measurement floor.

…_capture GUC Introduce the compile-time option --enable-wait-event-timing (meson: -Dwait_event_timing=true) defining USE_WAIT_EVENT_TIMING, and the runtime GUC wait_event_capture (PGC_SUSET, enum off|stats, default off). This commit is scaffolding only: it wires the flag through both build systems and registers the GUC with check/assign hooks, but adds no instrumentation yet -- later commits in this series add the recording hot path, the SQL surface, and the trace level. In builds compiled without --enable-wait-event-timing, the GUC's check hook rejects any value other than off (downgrading to off with a warning for non-interactive sources), so the variable exists uniformly for tooling but cannot be enabled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…vel) Implement wait_event_capture = stats. When enabled, every transition through pgstat_report_wait_start()/pgstat_report_wait_end() records the wait duration and accumulates per-(backend, event) statistics -- count, total and maximum duration, and a 32-bucket log2 duration histogram -- in shared memory. Storage is a per-backend slot array in the main shared memory segment, sized at postmaster start from wait_event_timing_max_tranches. Because each slot lives for the entire life of its backend, the hot path needs no lazy attach and no teardown gating: pgstat_report_wait_start_timing() records a timestamp and the current event, and pgstat_report_wait_end_timing() computes the duration and accumulates. Each backend writes only to its own slot, so no locking is required and the SRF reader is lock-free. The inline gate in pgstat_report_wait_start()/_end() is a single load of wait_event_capture plus a branch, with the bodies kept out-of-line so the many inlined call sites stay compact; while capture is off the hot path adds only that branch. Non-LWLock wait events map to a flat array via a class table generated from wait_event_names.txt by generate-wait_event_types.pl. LWLock events, whose tranche ids are unbounded, use a per-backend open-addressing hash capped by wait_event_timing_max_tranches (PGC_POSTMASTER, default 192). SQL surface: - pg_stat_get_wait_event_timing(pid) and the pg_stat_wait_event_timing view, one row per backend per event with a non-zero count; - pg_wait_event_timing_histogram_buckets, naming the 32 histogram bins. Builds compiled without --enable-wait-event-timing keep the GUC (its check hook rejects any non-off value) and empty-result SQL stubs, so tooling sees a uniform surface. A later commit in the series exposes the per-backend overflow counters (maintained here) and adds the reset functions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Surface the per-backend truncation counters maintained by the recording path, and add the ability to reset wait-event-timing statistics. pg_stat_get_wait_event_timing_overflow(pid) and the pg_stat_wait_event_timing_overflow view report, per backend, lwlock_overflow_count (LWLock waits dropped because the per-backend tranche hash was full), flat_overflow_count (events whose class index was out of range), and reset_count. Resets use a lock-free request/response so the hot path stays single writer: each slot carries an atomic reset_generation, bumped by the resetter; the owning backend compares it against a backend-local last-seen value at its next wait_end and clears its own counters, incrementing reset_count. pg_stat_reset_wait_event_timing(pid) resets one backend -- synchronously when it targets the caller's own session (any user; pid defaults to NULL), or asynchronously for another backend (requiring pg_signal_backend, matching pg_stat_reset_backend_stats). pg_stat_reset_wait_event_timing_all() resets every backend and is superuser-only. Builds without --enable-wait-event-timing keep empty-result/feature-not- supported stubs for the new functions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add --enable-wait-event-timing to the "Linux - Autoconf" GitHub Actions task so the wait-event-timing build path -- including the expected output src/test/regress/expected/wait_event_timing.out -- is exercised on every push. That task already runs check-world under the undefined/alignment sanitizers with a small segment size, giving the timing code meaningful coverage. Every other CI task keeps building without the flag, so the stub path and its alternate output wait_event_timing_1.out remain covered as well. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

DmitryNFomin · 2026-06-10T08:15:28Z

Project A — full validation (correctness + performance)

Run on a dedicated Hetzner CCX43 (16 dedicated AMD EPYC-Milan vCPU, 64 GB), gcc 13 -O2. Branch wet-series @ 74bd22c (commits: flag+GUC → stats engine → overflow+reset → CI).

Correctness

Full meson test — the entire suite (main regression, isolation, and the recovery / subscription / pg_upgrade TAP tests), run as a non-root user:

build	result
`--enable-wait-event-timing` (timing)	365 Ok / 0 Fail
feature off (stub)	365 Ok / 0 Fail

The inline gate touches every wait-event call site in the backend; a clean full suite in both build configs confirms nothing regressed. Also UBSan-clean (undefined + alignment sanitizers, halt_on_error=1).

Targeted functional tests — all pass:

LWLock / IO / IPC / Client capture; histogram-sum == calls invariant holds across all captured events
wait_event_capture toggle: stats persist across off → stats (cleared only by reset)
wait_event_timing_max_tranches = 16: overflow counters present and correct
own-reset clears + reset_count 0→1
parallel query runs; workers register their own slots
cross-backend reset, two concurrent sessions: session B requests a reset of session A; A services it asynchronously at its next wait_end (reset_count N→N+1)

The full suite also caught a real gap — A2/A3 added three system views without updating expected/rules.out (the test that dumps every pg_catalog view definition). Fixed by folding the rules.out delta into the A2/A3 commits so each stays independently green.

Performance — pgbench, 5 interleaved runs/cell

BASELINE = series base (no patch); WET = this branch with the flag. off/stats via GUC.

workload	BASELINE	WET / off	WET / stats
`-S` read-only (4 GB SB, cached)	169,513	168,553 (−0.57 %)	168,079 (−0.85 %)
TPC-B (4 GB SB, read-write)	20,538	19,923 (−3.00 %)	19,881 (−3.20 %)
`-S` wait-saturated (32 MB SB, ~98 % buffer miss)	145,112	144,498 (−0.42 %)	144,914 (−0.14 %)

Read-only and saturated are within run-to-run noise. The TPC-B −3 % looked alarming, so it was investigated rather than reported as-is.

The TPC-B −3 % is a binary-layout artifact, not the gate

The gate is one global load + a predicted branch per wait event: at ~20 k TPS × ~15 wait events/txn that is ~0.001 % of the cycle budget — it cannot cost 3 %. A controlled re-run (8 interleaved runs each) adds a stub-wet config: the patch built with the gate #ifdef'd out (so its inlined wait_event functions are byte-identical to baseline):

config	mean TPS	stdev	vs BASELINE
baseline (no patch)	20,343	123	—
stub_wet (patch, gate compiled out)	19,857	140	−2.39 %
wet_off (patch, flag on, GUC off)	20,380	123	+0.18 %
wet_stats (flag on, GUC stats)	20,277	114	−0.32 %

Two tells, both impossible if the cost were real:

wet_off is +0.18 % — indistinguishable from baseline. The same WET binary measured −3.00 % then +0.18 % across two sessions → the −3 % was not reproducible.
The control is upside-down: the gate-removed build (stub_wet) is the slowest (−2.39 %), and adding the gate back makes it +2.64 % faster. Adding code cannot speed up execution — this only happens when the ±2–3 % band is driven by code layout (alignment / I-cache placement of unrelated hot functions like XLogInsert), the classic "Producing Wrong Data Without Doing Anything Obviously Wrong" hazard.

Verdict

Off-mode overhead is within noise — compiling the feature in costs nothing when it is off (gate cost below the measurement floor), corroborated across three workloads with a gate-compiled-out control.
Stats-mode runtime cost ≤ ~0.5 % vs off, including the wait-event-saturated worst case.

🤖 Generated with Claude Code

DmitryNFomin and others added 4 commits June 7, 2026 20:31

DmitryNFomin force-pushed the wet-series branch from 79775b0 to 74bd22c Compare June 9, 2026 20:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wait_event_timing: reviewable patch series (stats / DSA / trace)#2

wait_event_timing: reviewable patch series (stats / DSA / trace)#2
DmitryNFomin wants to merge 4 commits into
wet-series-basefrom
wet-series

DmitryNFomin commented Jun 7, 2026 •

edited

Loading

Uh oh!

DmitryNFomin commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

DmitryNFomin commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

The split — three projects

Progress

Project A — stats (eager shmem)

Project B — lazy-DSA refactor

Project C — trace

Carry-over fixes folded into the series

Original (monolithic) PR

Validation (Project A, CCX43 16 vCPU)

Uh oh!

DmitryNFomin commented Jun 10, 2026

Project A — full validation (correctness + performance)

Correctness

Performance — pgbench, 5 interleaved runs/cell

The TPC-B −3 % is a binary-layout artifact, not the gate

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

DmitryNFomin commented Jun 7, 2026 •

edited

Loading