Skip to content

wait_event_timing: reviewable patch series (stats / DSA / trace)#2

Open
DmitryNFomin wants to merge 4 commits into
wet-series-basefrom
wet-series
Open

wait_event_timing: reviewable patch series (stats / DSA / trace)#2
DmitryNFomin wants to merge 4 commits into
wet-series-basefrom
wet-series

Conversation

@DmitryNFomin

@DmitryNFomin DmitryNFomin commented Jun 7, 2026

Copy link
Copy Markdown
Owner

Purpose

This PR re-implements the wait_event_timing work (originally #1, ~8k lines in one
patch) as a reviewable patch series, following Andrey Borodin's review feedback:
split into chewable, independently-committable parts so a committer can see value
from the first commit without venturing into the whole thing.

This is a working/tracking PR on the fork — not an upstream submission. Upstream
goes via a git format-patch series to pgsql-hackers + one Commitfest entry once the
series is ready. Base branch is wet-series-base (pinned at the upstream baseline
commit 89eafad297a) so this PR's diff shows only our series commits.

The split — three projects

A — stats mode on eager shared memory (anchor; small, review-friendly).
Deliberately uses plain ShmemAlloc, not the lazy-DSA design, so all the DSA
re-entrancy/guard complexity is isolated into Project B.

B — lazy-DSA refactor (high-risk; isolates the DSA machinery and the bugs found
across the original review rounds). Pure refactor: A's tests pass unchanged.

C — trace mode (ring buffer, position-encoded seqlock, orphan lifecycle, query
attribution markers, cross-backend reader). Depends on B.

Each commit builds and passes check-world on its own (bisectable); docs and tests
travel with their feature commit.

Progress

Project A — stats (eager shmem)

  • A1--enable-wait-event-timing flag (both build systems) + wait_event_capture GUC scaffold (off|stats), check/assign hooks, stub-build path. Verified: 3 build configs + runtime GUC behaviour.
  • A2 — recording hot path (inline gate, no unlikely()) + eager per-backend storage + generated class mapping + LWLock tranche hash + 32-bucket histogram + pg_stat_get_wait_event_timing SRF/view + histogram-buckets view + docs + tests. Verified: 3 build configs + regression test (feature & stub) + live capture; adversarial review passed (incl. EXEC_BACKEND).
  • A3 — overflow counters + pg_stat_reset_wait_event_timing(pid)/_all() (atomic reset_generation + latch) + docs + tests. Review: reset concurrency model sound; fixed a proargdefaults catalog/doc mismatch (inherited from Add wait_event_timing: Oracle-style wait event instrumentation #1).
  • A4 — enable --enable-wait-event-timing on one CI task (GitHub Actions Linux - Autoconf; upstream replaced Cirrus). Verified UBSan-clean. Project A complete.

Project B — lazy-DSA refactor

  • B1 — convert eager shmem → lazy DSA (control struct, ensure_dsa, attach_array, re-entrancy guard, crit-section/lwWaiting attach guards, before_shmem_exit writes-disabled gate). Same SQL surface; A's tests pass unchanged.

Project C — trace

  • C1trace GUC level + ring writer (seqlock) + ring-size GUC + lazy ring attach.
  • C2 — slot lifecycle (FREE/OWNED/ORPHANED, before_shmem_exit transition, clear_orphan_at_init, release_slot, trace writes-disabled gate).
  • C3 — trace SQL surface (pg_get_backend_wait_event_trace, pg_get_wait_event_trace, pg_stat_clear_orphaned_wait_event_rings) + privileges (incl. the PUBLIC-execute security fix).
  • C4 — query attribution markers (execMain.c, postgres.c pipelined flush, backend_status.c).
  • C5 — trace docs + tests (incl. injection-point seqlock test).

Carry-over fixes folded into the series

Original (monolithic) PR

See #1 for the full prior implementation, the benchmark battery, and Andrey's review thread.

Validation (Project A, CCX43 16 vCPU)

Correctness: full meson test (365 groups — regression + isolation + recovery/subscription/pg_upgrade TAP) passes with 0 failures in both the timing build and the stub build. Targeted tests pass incl. a two-session cross-backend reset. UBSan-clean. (The full suite caught a missing rules.out update for the new views — now folded into the A2/A3 commits so each stays bisectable.)

Performance (pgbench; BASELINE=no patch vs WET-off vs WET-stats):

workload WET/off vs baseline WET/stats vs off
-S read-only (cached) −0.57% −0.30%
TPC-B +0.18% (8-run) −0.51%
-S wait-saturated (32MB) −0.42% +0.29%

Off-mode overhead is within noise. A TPC-B run initially showed −3% for WET/off; a stub-wet control (patch with the gate #ifdef'd out) proved it a binary-layout artifact — the gate-removed build was −2.39% while adding the gate was +2.64%, impossible for a real cost. Gate cost is below the measurement floor.

DmitryNFomin and others added 4 commits June 7, 2026 20:31
…_capture GUC

Introduce the compile-time option --enable-wait-event-timing
(meson: -Dwait_event_timing=true) defining USE_WAIT_EVENT_TIMING, and
the runtime GUC wait_event_capture (PGC_SUSET, enum off|stats, default
off).  This commit is scaffolding only: it wires the flag through both
build systems and registers the GUC with check/assign hooks, but adds
no instrumentation yet -- later commits in this series add the
recording hot path, the SQL surface, and the trace level.

In builds compiled without --enable-wait-event-timing, the GUC's check
hook rejects any value other than off (downgrading to off with a
warning for non-interactive sources), so the variable exists uniformly
for tooling but cannot be enabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vel)

Implement wait_event_capture = stats.  When enabled, every transition
through pgstat_report_wait_start()/pgstat_report_wait_end() records the
wait duration and accumulates per-(backend, event) statistics -- count,
total and maximum duration, and a 32-bucket log2 duration histogram -- in
shared memory.

Storage is a per-backend slot array in the main shared memory segment,
sized at postmaster start from wait_event_timing_max_tranches.  Because
each slot lives for the entire life of its backend, the hot path needs no
lazy attach and no teardown gating: pgstat_report_wait_start_timing()
records a timestamp and the current event, and
pgstat_report_wait_end_timing() computes the duration and accumulates.
Each backend writes only to its own slot, so no locking is required and
the SRF reader is lock-free.

The inline gate in pgstat_report_wait_start()/_end() is a single load of
wait_event_capture plus a branch, with the bodies kept out-of-line so the
many inlined call sites stay compact; while capture is off the hot path
adds only that branch.

Non-LWLock wait events map to a flat array via a class table generated
from wait_event_names.txt by generate-wait_event_types.pl.  LWLock events,
whose tranche ids are unbounded, use a per-backend open-addressing hash
capped by wait_event_timing_max_tranches (PGC_POSTMASTER, default 192).

SQL surface:
  - pg_stat_get_wait_event_timing(pid) and the pg_stat_wait_event_timing
    view, one row per backend per event with a non-zero count;
  - pg_wait_event_timing_histogram_buckets, naming the 32 histogram bins.

Builds compiled without --enable-wait-event-timing keep the GUC (its check
hook rejects any non-off value) and empty-result SQL stubs, so tooling
sees a uniform surface.

A later commit in the series exposes the per-backend overflow counters
(maintained here) and adds the reset functions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surface the per-backend truncation counters maintained by the recording
path, and add the ability to reset wait-event-timing statistics.

pg_stat_get_wait_event_timing_overflow(pid) and the
pg_stat_wait_event_timing_overflow view report, per backend,
lwlock_overflow_count (LWLock waits dropped because the per-backend
tranche hash was full), flat_overflow_count (events whose class index was
out of range), and reset_count.

Resets use a lock-free request/response so the hot path stays single
writer: each slot carries an atomic reset_generation, bumped by the
resetter; the owning backend compares it against a backend-local
last-seen value at its next wait_end and clears its own counters,
incrementing reset_count.  pg_stat_reset_wait_event_timing(pid) resets one
backend -- synchronously when it targets the caller's own session (any
user; pid defaults to NULL), or asynchronously for another backend
(requiring pg_signal_backend, matching pg_stat_reset_backend_stats).
pg_stat_reset_wait_event_timing_all() resets every backend and is
superuser-only.

Builds without --enable-wait-event-timing keep empty-result/feature-not-
supported stubs for the new functions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add --enable-wait-event-timing to the "Linux - Autoconf" GitHub Actions
task so the wait-event-timing build path -- including the expected output
src/test/regress/expected/wait_event_timing.out -- is exercised on every
push.  That task already runs check-world under the undefined/alignment
sanitizers with a small segment size, giving the timing code meaningful
coverage.  Every other CI task keeps building without the flag, so the
stub path and its alternate output wait_event_timing_1.out remain covered
as well.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@DmitryNFomin

Copy link
Copy Markdown
Owner Author

Project A — full validation (correctness + performance)

Run on a dedicated Hetzner CCX43 (16 dedicated AMD EPYC-Milan vCPU, 64 GB), gcc 13 -O2. Branch wet-series @ 74bd22c (commits: flag+GUC → stats engine → overflow+reset → CI).

Correctness

Full meson test — the entire suite (main regression, isolation, and the recovery / subscription / pg_upgrade TAP tests), run as a non-root user:

build result
--enable-wait-event-timing (timing) 365 Ok / 0 Fail
feature off (stub) 365 Ok / 0 Fail

The inline gate touches every wait-event call site in the backend; a clean full suite in both build configs confirms nothing regressed. Also UBSan-clean (undefined + alignment sanitizers, halt_on_error=1).

Targeted functional tests — all pass:

  • LWLock / IO / IPC / Client capture; histogram-sum == calls invariant holds across all captured events
  • wait_event_capture toggle: stats persist across off → stats (cleared only by reset)
  • wait_event_timing_max_tranches = 16: overflow counters present and correct
  • own-reset clears + reset_count 0→1
  • parallel query runs; workers register their own slots
  • cross-backend reset, two concurrent sessions: session B requests a reset of session A; A services it asynchronously at its next wait_end (reset_count N→N+1)

The full suite also caught a real gap — A2/A3 added three system views without updating expected/rules.out (the test that dumps every pg_catalog view definition). Fixed by folding the rules.out delta into the A2/A3 commits so each stays independently green.

Performance — pgbench, 5 interleaved runs/cell

BASELINE = series base (no patch); WET = this branch with the flag. off/stats via GUC.

workload BASELINE WET / off WET / stats
-S read-only (4 GB SB, cached) 169,513 168,553 (−0.57 %) 168,079 (−0.85 %)
TPC-B (4 GB SB, read-write) 20,538 19,923 (−3.00 %) 19,881 (−3.20 %)
-S wait-saturated (32 MB SB, ~98 % buffer miss) 145,112 144,498 (−0.42 %) 144,914 (−0.14 %)

Read-only and saturated are within run-to-run noise. The TPC-B −3 % looked alarming, so it was investigated rather than reported as-is.

The TPC-B −3 % is a binary-layout artifact, not the gate

The gate is one global load + a predicted branch per wait event: at ~20 k TPS × ~15 wait events/txn that is ~0.001 % of the cycle budget — it cannot cost 3 %. A controlled re-run (8 interleaved runs each) adds a stub-wet config: the patch built with the gate #ifdef'd out (so its inlined wait_event functions are byte-identical to baseline):

config mean TPS stdev vs BASELINE
baseline (no patch) 20,343 123
stub_wet (patch, gate compiled out) 19,857 140 −2.39 %
wet_off (patch, flag on, GUC off) 20,380 123 +0.18 %
wet_stats (flag on, GUC stats) 20,277 114 −0.32 %

Two tells, both impossible if the cost were real:

  1. wet_off is +0.18 % — indistinguishable from baseline. The same WET binary measured −3.00 % then +0.18 % across two sessions → the −3 % was not reproducible.
  2. The control is upside-down: the gate-removed build (stub_wet) is the slowest (−2.39 %), and adding the gate back makes it +2.64 % faster. Adding code cannot speed up execution — this only happens when the ±2–3 % band is driven by code layout (alignment / I-cache placement of unrelated hot functions like XLogInsert), the classic "Producing Wrong Data Without Doing Anything Obviously Wrong" hazard.

Verdict

  • Off-mode overhead is within noise — compiling the feature in costs nothing when it is off (gate cost below the measurement floor), corroborated across three workloads with a gate-compiled-out control.
  • Stats-mode runtime cost ≤ ~0.5 % vs off, including the wait-event-saturated worst case.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant