wait_event_timing: reviewable patch series (stats / DSA / trace)#2
wait_event_timing: reviewable patch series (stats / DSA / trace)#2DmitryNFomin wants to merge 4 commits into
Conversation
…_capture GUC Introduce the compile-time option --enable-wait-event-timing (meson: -Dwait_event_timing=true) defining USE_WAIT_EVENT_TIMING, and the runtime GUC wait_event_capture (PGC_SUSET, enum off|stats, default off). This commit is scaffolding only: it wires the flag through both build systems and registers the GUC with check/assign hooks, but adds no instrumentation yet -- later commits in this series add the recording hot path, the SQL surface, and the trace level. In builds compiled without --enable-wait-event-timing, the GUC's check hook rejects any value other than off (downgrading to off with a warning for non-interactive sources), so the variable exists uniformly for tooling but cannot be enabled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vel)
Implement wait_event_capture = stats. When enabled, every transition
through pgstat_report_wait_start()/pgstat_report_wait_end() records the
wait duration and accumulates per-(backend, event) statistics -- count,
total and maximum duration, and a 32-bucket log2 duration histogram -- in
shared memory.
Storage is a per-backend slot array in the main shared memory segment,
sized at postmaster start from wait_event_timing_max_tranches. Because
each slot lives for the entire life of its backend, the hot path needs no
lazy attach and no teardown gating: pgstat_report_wait_start_timing()
records a timestamp and the current event, and
pgstat_report_wait_end_timing() computes the duration and accumulates.
Each backend writes only to its own slot, so no locking is required and
the SRF reader is lock-free.
The inline gate in pgstat_report_wait_start()/_end() is a single load of
wait_event_capture plus a branch, with the bodies kept out-of-line so the
many inlined call sites stay compact; while capture is off the hot path
adds only that branch.
Non-LWLock wait events map to a flat array via a class table generated
from wait_event_names.txt by generate-wait_event_types.pl. LWLock events,
whose tranche ids are unbounded, use a per-backend open-addressing hash
capped by wait_event_timing_max_tranches (PGC_POSTMASTER, default 192).
SQL surface:
- pg_stat_get_wait_event_timing(pid) and the pg_stat_wait_event_timing
view, one row per backend per event with a non-zero count;
- pg_wait_event_timing_histogram_buckets, naming the 32 histogram bins.
Builds compiled without --enable-wait-event-timing keep the GUC (its check
hook rejects any non-off value) and empty-result SQL stubs, so tooling
sees a uniform surface.
A later commit in the series exposes the per-backend overflow counters
(maintained here) and adds the reset functions.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surface the per-backend truncation counters maintained by the recording path, and add the ability to reset wait-event-timing statistics. pg_stat_get_wait_event_timing_overflow(pid) and the pg_stat_wait_event_timing_overflow view report, per backend, lwlock_overflow_count (LWLock waits dropped because the per-backend tranche hash was full), flat_overflow_count (events whose class index was out of range), and reset_count. Resets use a lock-free request/response so the hot path stays single writer: each slot carries an atomic reset_generation, bumped by the resetter; the owning backend compares it against a backend-local last-seen value at its next wait_end and clears its own counters, incrementing reset_count. pg_stat_reset_wait_event_timing(pid) resets one backend -- synchronously when it targets the caller's own session (any user; pid defaults to NULL), or asynchronously for another backend (requiring pg_signal_backend, matching pg_stat_reset_backend_stats). pg_stat_reset_wait_event_timing_all() resets every backend and is superuser-only. Builds without --enable-wait-event-timing keep empty-result/feature-not- supported stubs for the new functions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add --enable-wait-event-timing to the "Linux - Autoconf" GitHub Actions task so the wait-event-timing build path -- including the expected output src/test/regress/expected/wait_event_timing.out -- is exercised on every push. That task already runs check-world under the undefined/alignment sanitizers with a small segment size, giving the timing code meaningful coverage. Every other CI task keeps building without the flag, so the stub path and its alternate output wait_event_timing_1.out remain covered as well. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Project A — full validation (correctness + performance)Run on a dedicated Hetzner CCX43 (16 dedicated AMD EPYC-Milan vCPU, 64 GB), gcc 13 CorrectnessFull
The inline gate touches every wait-event call site in the backend; a clean full suite in both build configs confirms nothing regressed. Also UBSan-clean (undefined + alignment sanitizers, Targeted functional tests — all pass:
Performance — pgbench, 5 interleaved runs/cellBASELINE = series base (no patch); WET = this branch with the flag. off/stats via GUC.
Read-only and saturated are within run-to-run noise. The TPC-B The TPC-B −3 % is a binary-layout artifact, not the gateThe gate is one global load + a predicted branch per wait event: at ~20 k TPS × ~15 wait events/txn that is ~0.001 % of the cycle budget — it cannot cost 3 %. A controlled re-run (8 interleaved runs each) adds a stub-wet config: the patch built with the gate
Two tells, both impossible if the cost were real:
Verdict
🤖 Generated with Claude Code |
Purpose
This PR re-implements the wait_event_timing work (originally #1, ~8k lines in one
patch) as a reviewable patch series, following Andrey Borodin's review feedback:
split into chewable, independently-committable parts so a committer can see value
from the first commit without venturing into the whole thing.
This is a working/tracking PR on the fork — not an upstream submission. Upstream
goes via a
git format-patchseries to pgsql-hackers + one Commitfest entry once theseries is ready. Base branch is
wet-series-base(pinned at the upstream baselinecommit
89eafad297a) so this PR's diff shows only our series commits.The split — three projects
A —
statsmode on eager shared memory (anchor; small, review-friendly).Deliberately uses plain
ShmemAlloc, not the lazy-DSA design, so all the DSAre-entrancy/guard complexity is isolated into Project B.
B — lazy-DSA refactor (high-risk; isolates the DSA machinery and the bugs found
across the original review rounds). Pure refactor: A's tests pass unchanged.
C —
tracemode (ring buffer, position-encoded seqlock, orphan lifecycle, queryattribution markers, cross-backend reader). Depends on B.
Each commit builds and passes
check-worldon its own (bisectable); docs and teststravel with their feature commit.
Progress
Project A — stats (eager shmem)
--enable-wait-event-timingflag (both build systems) +wait_event_captureGUC scaffold (off|stats), check/assign hooks, stub-build path. Verified: 3 build configs + runtime GUC behaviour.unlikely()) + eager per-backend storage + generated class mapping + LWLock tranche hash + 32-bucket histogram +pg_stat_get_wait_event_timingSRF/view + histogram-buckets view + docs + tests. Verified: 3 build configs + regression test (feature & stub) + live capture; adversarial review passed (incl. EXEC_BACKEND).pg_stat_reset_wait_event_timing(pid)/_all()(atomic reset_generation + latch) + docs + tests. Review: reset concurrency model sound; fixed aproargdefaultscatalog/doc mismatch (inherited from Add wait_event_timing: Oracle-style wait event instrumentation #1).--enable-wait-event-timingon one CI task (GitHub ActionsLinux - Autoconf; upstream replaced Cirrus). Verified UBSan-clean. Project A complete.Project B — lazy-DSA refactor
Project C — trace
traceGUC level + ring writer (seqlock) + ring-size GUC + lazy ring attach.pg_get_backend_wait_event_trace,pg_get_wait_event_trace,pg_stat_clear_orphaned_wait_event_rings) + privileges (incl. the PUBLIC-execute security fix).Carry-over fixes folded into the series
unlikely()(A2) — A/B benchmark showed byte-identical codegen and ±0.09% TPS (within noise) on gcc 13 -O2.pg_get_backend_wait_event_tracePUBLIC-execute (C3; was a live security finding in Add wait_event_timing: Oracle-style wait event instrumentation #1).Original (monolithic) PR
See #1 for the full prior implementation, the benchmark battery, and Andrey's review thread.
Validation (Project A, CCX43 16 vCPU)
Correctness: full
meson test(365 groups — regression + isolation + recovery/subscription/pg_upgrade TAP) passes with 0 failures in both the timing build and the stub build. Targeted tests pass incl. a two-session cross-backend reset. UBSan-clean. (The full suite caught a missingrules.outupdate for the new views — now folded into the A2/A3 commits so each stays bisectable.)Performance (pgbench; BASELINE=no patch vs WET-off vs WET-stats):
Off-mode overhead is within noise. A TPC-B run initially showed −3% for WET/off; a stub-wet control (patch with the gate
#ifdef'd out) proved it a binary-layout artifact — the gate-removed build was −2.39% while adding the gate was +2.64%, impossible for a real cost. Gate cost is below the measurement floor.