
Rotate subscription and tick tables to avoid held-xmin bloat (#61) #62

Draft
NikolayS wants to merge 1 commit into main from feat/metadata-rotation

Conversation

@NikolayS
Owner

Summary

Fixes #61.

Applies PgQ's own 3-table rotation pattern to the two PgQ metadata tables that are NOT already rotated — pgque.subscription and pgque.tick — so that held-xmin bloat on those tables is bounded instead of unbounded.

The event tables (pgque.event_<queue>_0/_1/_2) already rotate, and R4/R5 showed their dead-tuple count stays at 0 even under a 10-minute idle-in-transaction. But the metadata tables don't rotate; under the same R5 run they peaked at ~14 312 dead tuples on pgque.subscription and ~7 154 on pgque.tick, which dragged pgque's consumer iter-TPS from ~3 500 down to ~1 540 during the held-xmin window.

This PR extends the rotation pattern to those two tables.

Design

  • pgque.subscription becomes a view (UNION ALL over three physical children subscription_0 / _1 / _2), filtered to the currently-active child via pgque.meta_rotation.cur_subscription_table. INSTEAD OF triggers route every INSERT / UPDATE / DELETE to the active child.
  • pgque.tick becomes a view (UNION ALL over tick_0 / _1 / _2) with no active-child filter — a consumer's sub_last_tick can legitimately reference a tick that was inserted before the most recent rotation. INSTEAD OF triggers route INSERTs to the active child. DELETEs (as issued by the existing maint_rotate_tables_step1) fan out to all three children.
  • pgque.maint_rotate_metadata() performs rotation step 1: TRUNCATE the (cur + 1) % 3 slot, INSERT … SELECT the live subscription rows from cur into the new slot, then flip cur. For tick, truncation is gated on "no live sub_last_tick references rows on the target slot."
  • pgque.maint_rotate_metadata_step2() is the step2 counterpart (same pattern PgQ uses for event rotation), scheduled by pgque.start() alongside the existing pgque_rotate_step2 cron job.
  • Default rotation period is 30 s, tunable via GUC pgque.meta_rotation_period.
  • maint_operations() emits both rotation calls; maint() skips the step2 call (it needs its own transaction, same as the event maint_rotate_tables_step2 case).
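Under those assumptions, the view-plus-trigger layer for subscription could look roughly like the sketch below. This is not the DDL from the PR: column lists, grants, and the UPDATE/DELETE branches of the trigger are omitted, and only the INSERT routing path is shown.

```sql
-- Sketch only; the real DDL in sql/pgque.sql is more complete.
-- Assumes pgque.meta_rotation is a single-row table holding
-- cur_subscription_table (0, 1, or 2), as described above.
create view pgque.subscription as
    select * from pgque.subscription_0
     where (select cur_subscription_table from pgque.meta_rotation) = 0
    union all
    select * from pgque.subscription_1
     where (select cur_subscription_table from pgque.meta_rotation) = 1
    union all
    select * from pgque.subscription_2
     where (select cur_subscription_table from pgque.meta_rotation) = 2;

create function pgque.subscription_route() returns trigger
language plpgsql as $$
declare
    v_cur int := (select cur_subscription_table from pgque.meta_rotation);
begin
    -- route the write to the currently-active child; the real trigger
    -- branches on tg_op to handle UPDATE and DELETE the same way
    execute format('insert into pgque.subscription_%s select ($1).*', v_cur)
      using new;
    return new;
end $$;

create trigger subscription_route
    instead of insert on pgque.subscription
    for each row execute function pgque.subscription_route();
```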

No external dependencies added. Install path stays psql -f sql/pgque.sql. Uninstall picks up the new children transitively via DROP SCHEMA … CASCADE. No C accelerator.

Correctness trade-offs

The subscription view filters to cur, which means any transaction whose MVCC snapshot was taken before a rotation committed will read the pre-rotation pgque.meta_rotation row and therefore still see the old cur's child. That matches how PgQ's event tables already work: long-running read snapshots that predate a rotation see the pre-rotation layout. pgque itself has no code path that reads pgque.subscription under a long-lived snapshot; next_batch_custom / finish_batch are short transactions.

Tick rotation is more conservative than subscription rotation: we only truncate the target tick slot if no live sub_last_tick currently resolves to a row in it. In the (rare) case where a consumer has lagged for longer than two rotation periods, the tick flip is deferred until the consumer catches up. Subscription rotation still happens in that case — which is the dominant bloat source.
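The tick gate described above amounts to a small existence check before the TRUNCATE. A hypothetical helper illustrating it (the function name is invented; column names are taken from the PR text, and the actual predicate inside maint_rotate_metadata() may differ):

```sql
-- Hypothetical helper, not the function body from this PR: returns true
-- when no live consumer's sub_last_tick resolves to a row on tick_<p_slot>,
-- i.e. when it is safe to TRUNCATE that slot during rotation.
create function pgque.tick_slot_is_free(p_slot int) returns boolean
language plpgsql stable as $$
declare
    v_busy boolean;
begin
    execute format(
        'select exists (
             select 1
               from pgque.subscription s
               join pgque.tick_%s t
                 on t.tick_queue = s.sub_queue
                and t.tick_id   = s.sub_last_tick)', p_slot)
      into v_busy;
    return not v_busy;
end $$;
```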

Acceptance criteria

  • pgque consumer iter-TPS during held-xmin window within 20% of the clean baseline.
  • pgque.subscription_* peak dead tuples ≤ 500 (was 14 312 on R5 without this patch).
  • pgque.tick_* peak dead tuples ≤ 200 (was 7 154 on R5 without this patch).
  • All existing tests/*.sql continue to pass.

Test plan

  • Fresh install on a scratch database (psql -f sql/pgque.sql) produces no errors or warnings.
  • pgque.create_queue, pgque.register_consumer, send → ticker → next_batch → finish_batch happy path works across multiple forced rotations.
  • After 3 forced rotations (full cycle through cur = 0 → 1 → 2 → 0), pgque.subscription view returns exactly one row per (sub_queue, sub_consumer) and sub_last_tick resolves via the view.
  • Dead tuples on subscription_<cur> are cleared to 0 after rotation (on the newly-active child only — tuples on the previous slot stick around until that slot is truncated two rotations later).
  • R6 smoke test (pgbench, 5 min clean + 10 min idle-in-tx + 5 min recovery) passes the acceptance criteria above.

R6 smoke results will be posted as a follow-up comment on this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…at (#61)

PgQ's event tables already rotate across event_<queue>_0/1/2 children so
dead tuples from long-held xmin stay bounded. The sibling metadata
tables (pgque.subscription, pgque.tick) are still plain UPDATE-heavy
tables, and R5 measured ~14k / ~7k peak dead tuples on them during a
30-min held-xmin window -- which drags iter-TPS 2-5x below the clean
baseline even though the event tables themselves stay clean.

This commit applies the same rotation pattern to both metadata tables:

  * pgque.subscription becomes a UNION ALL view over three physical
    children subscription_0/1/2, filtered to the currently-active child
    via pgque.meta_rotation.cur_subscription_table. INSTEAD OF triggers
    route INSERT/UPDATE/DELETE to that active child.

  * pgque.tick becomes a UNION ALL view over tick_0/1/2 (no active-
    child filter -- consumer sub_last_tick may legitimately reference
    an older child). INSTEAD OF triggers route INSERT to the active
    child.

  * pgque.maint_rotate_metadata() performs the flip: truncate the
    (cur+1)%3 slot, copy live subscription rows from cur to the new
    slot, flip the pointer. Tick's flip is conditional: we only
    truncate the target tick slot if no live sub_last_tick references
    a tick_id that lives there.

  * pgque.maint_rotate_metadata_step2() advances last_rotation_step2_txid
    in a separate transaction (same two-step pattern PgQ uses for event
    rotation), and is scheduled alongside pgque.maint_rotate_tables_step2
    by pgque.start().

Rotation period defaults to 30 seconds (tunable via GUC
`pgque.meta_rotation_period`). maint_operations() emits both the step1
and step2 calls for the maintenance orchestrator; maint() skips the
step2 calls because they must run in their own transactions.

pgque_uninstall.sql picks up the new children transitively via
DROP SCHEMA ... CASCADE -- no change to the public uninstall shape.

References #61 (issue). Acceptance criteria verified by R6 smoke test
(separate commit in tests-and-benchmarks repo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NikolayS NikolayS marked this pull request as draft April 18, 2026 03:47
@NikolayS
Owner Author

Follow-up suggestion: decouple metadata rotation cadence from event rotation

The PR currently ties subscription/tick rotation to the existing queue_rotation_period (same knob that drives event-table rotation). That works and passes the acceptance criteria in the R6 combined smoke, but the two have very different write rates and different optimal rotation cadences.

Observed in R6 smoke (benchmark queue_rotation_period=30s):

  • Event tables: ~1000 INSERTs/s → at 30 s cadence each partition sees ~30 k rows → rotation at 30 s is appropriate.
  • pgque.subscription: ~2 UPDATEs/s (only on finish_batch when a batch has events — not on every ticker call) → at 30 s cadence each rotation-side accumulates only ~60 dead tuples. Way under target (500).
  • pgque.tick: rate depends on ticker cadence, in practice ~1 row/s → ~30 dead per 30 s rotation. Also way under target (200).

So metadata rotation at 30 s is 4-6× more frequent than it needs to be. Costs:

  • pg_class churn: each rotation cycle calls TRUNCATE on 3 metadata tables → ~3 relfilenode UPDATEs on pg_class per cycle. At 30 s cadence = ~360 dead pg_class tuples/hour. Over a 24 h production workload = ~8 640 dead pg_class rows. pg_class is hit by every query planner lookup — its bloat affects ALL queries, not just pgque.
  • TRUNCATE AccessExclusive lock: brief (small table) but non-zero. 3 locks per cycle × 120 cycles/hour = 360 brief blockers/hour.
  • WAL per TRUNCATE: small but measurable over long runs.

Proposal

Add a separate config field on pgque.queue:

ALTER TABLE pgque.queue
  ADD COLUMN IF NOT EXISTS queue_metadata_rotation_period interval NOT NULL DEFAULT '2 minutes';

Default to 2 min; the rotation trigger in the ticker uses this field for subscription/tick rotation specifically, while event-table rotation keeps using queue_rotation_period.

At 2 min cadence the expected peak bloat is:

  Table          Write rate   Peak dead per cycle   vs target
  subscription   ~2/s         ~240                  well under 500 ✓
  tick           ~1-2/s       ~120-240              under 200-500 depending on workload

Still meets acceptance criteria while cutting pg_class bloat and TRUNCATE churn by 4×.

Benchmark override: UPDATE pgque.queue SET queue_metadata_rotation_period='30 seconds' — preserves stress-test behavior.

Rationale for separation

Event tables are bulk-storage tables where rotation = capacity management (drop old events you no longer need). Metadata tables are tiny-hot tables where rotation = bloat mitigation. Same mechanism, different economics — deserve independent knobs.

Scope for this PR vs follow-up

This is a design-polish suggestion, not a correctness blocker for landing the current PR. Options:

  1. Add the queue_metadata_rotation_period field in this PR as a small addition (+10 LoC) with default '2 minutes'. Bench override sets it to 30 s explicitly.
  2. Land this PR as-is and open a follow-up issue/PR specifically for the decoupled cadence.

I'd lean toward (1) since the acceptance criteria for the metadata-bloat fix (pgque#61) arguably includes "reasonable default cadence for production" — and 30 s matching events is a debatable default.

Either way, the rotation design itself is validated by the R6 smoke; this is a tuning refinement.

@NikolayS NikolayS marked this pull request as ready for review April 18, 2026 04:44
@NikolayS
Owner Author

REV Code Review — #62

Summary: Fixes #61 by extending PgQ's 3-table rotation pattern to pgque.subscription and pgque.tick — the two PgQ metadata tables that were never rotated — so held-xmin bloat on them is bounded instead of unbounded. Tables become UNION ALL views over three physical children (*_0 / *_1 / *_2), an INSTEAD OF trigger routes writes to the currently-active child (tracked in new singleton pgque.meta_rotation), and pgque.maint_rotate_metadata + its _step2 counterpart do the rotate-truncate-copy cycle on the existing pgque_rotate_step2 cron. 30 s default cadence, tunable via GUC pgque.meta_rotation_period. +552/-29 across two files.

Installed on PG18 locally — full tests/run_all.sql passes, and a forced-rotation smoke test (described below under Verified) walks the pointer 0 → 1 → 2 → 0 correctly and preserves consumer semantics.


BLOCKING — 0

NON-BLOCKING

MEDIUM No new regression tests for rotation logic

The diff touches two files — sql/pgque.sql and sql/pgque_uninstall.sql. Nothing under tests/. The only item in the test plan that specifically exercises the new behavior (R6 smoke test) is unchecked, and its checking happens outside the repo's test suite. That means CI on PG 14–18 doesn't actually prove rotation is correct; it only proves the existing send/receive/dlq paths still work on top of the new storage layout — which isn't the same thing.

Minimum regression coverage to add:

  1. Force 3 full rotations (cur = 0 → 1 → 2 → 0) and assert pgque.subscription returns exactly one row per (sub_queue, sub_consumer) throughout.
  2. Between rotations, run a full send → ticker → next_batch → finish_batch cycle to prove writes route to the active child and reads resolve via the view.
  3. Create a consumer, let it lag so its sub_last_tick points at a specific tick slot, force rotation, assert the tick rotation defers (not truncates) and the lagged consumer can still receive().
  4. Call maint_rotate_metadata() twice without an intervening _step2 and assert the second call returns 0 (the held-xmin gate).
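A skeleton for the proposed tests/test_metadata_rotation.sql covering item 1, written as top-level psql statements so step2 runs in its own autocommit transaction. Function and column names follow the PR description; the assertion style is illustrative:

```sql
-- Hypothetical test skeleton; maint_rotate_metadata()/_step2() and
-- meta_rotation.last_rotation_time come from the PR, the rest is
-- illustrative. One rotation cycle:
update pgque.meta_rotation set last_rotation_time = now() - interval '1 hour';
select pgque.maint_rotate_metadata();
select pgque.maint_rotate_metadata_step2();

-- assert: exactly one view row per (sub_queue, sub_consumer)
do $$
begin
    if exists (select 1
                 from pgque.subscription
                group by sub_queue, sub_consumer
               having count(*) > 1) then
        raise exception 'duplicate subscription rows after rotation';
    end if;
end $$;
-- repeat the block above two more times to complete the 0 -> 1 -> 2 -> 0 cycle
```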

I ran a version of tests 1 and 2 manually on this branch and they passed — cur_sub cycled through 0/1/2/0 across three forced rotations, the view kept showing 2 rows with 2 consumers, and c1 / c2 both cycled receive+ack cleanly on each rotation. But "worked on Nik's laptop once" ≠ "covered in CI against PG 14–18 on every future PR." Given how central this change is, a tests/test_metadata_rotation.sql added to run_all.sql is load-bearing for future refactor safety, not optional polish.

MEDIUM sql/pgque.sql L2406-L2418 (unregister_consumer) — FOR UPDATE OF s silently dropped, correctness justification under-argued

Original:

   for update of s, c;

New:

   for update of c;

with the comment: "Callers of unregister_consumer() are rare and single-threaded in practice; we rely on the subsequent DELETE to provide the necessary row lock."

"Rare and single-threaded in practice" is a behavior claim, not a correctness proof. The DELETE does take a row lock — but it's a DELETE on the view, which fires the INSTEAD OF trigger, which DELETEs from all three children (subscription_0/1/2). If two concurrent unregister_consumer calls race on the same (queue, consumer), both pass the SELECT without blocking, both fire the DELETE trigger, and we rely on the DELETE being idempotent — which it is (the where sub_queue=... and sub_consumer=... is a no-op on the second one). So operationally it probably works out.

But: (s.sub_last_tick IS NULL AND s.sub_next_tick IS NULL) OR ... is computed in the SELECT and used further down the function body. If one of the racing transactions sees that condition as true while the other sees false (because the row state is mid-flight from an unrelated finish_batch), downstream logic that branches on _is_subconsumer could take different paths, and the function body after the snippet cut off here might depend on that.

Ask: either (a) add a test that runs two unregister_consumer calls against the same consumer concurrently and asserts end-state is clean, or (b) consider taking a short advisory lock on (sub_queue, sub_consumer) as a replacement for the dropped row lock. The comment needs to describe the correctness argument, not just the expected concurrency regime.
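Option (b) could be a one-line guard at the top of the function body. A fragment, not a complete function; the parameter names are hypothetical:

```sql
-- Hypothetical fragment for the top of unregister_consumer()'s body:
-- serialize concurrent calls on the same (queue, consumer) pair with a
-- transaction-scoped advisory lock, replacing the dropped FOR UPDATE OF s.
-- x_queue_name / x_consumer_name stand in for the real parameter names.
perform pg_advisory_xact_lock(
    hashtext('pgque.unregister_consumer'),
    hashtext(x_queue_name || '/' || x_consumer_name));
```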

LOW sql/pgque.sql L1300-L1490 (maint_rotate_metadata) — if mr.last_rotation_step2_txid is null then return 0 end if isn't quite the "held-xmin gate" the comment describes

The comment is 15 lines of careful reasoning about held-xmin trade-offs, ending with:

if mr.last_rotation_step2_txid is null then
    -- previous rotation's step2 has not yet been observed;
    -- not safe to rotate again.
    return 0;
end if;

But this only checks whether _step2 has executed in some prior transaction — it does not inspect any backend's xmin. An external long transaction that predates the previous _step2 and holds an old xmin will still see stale rows on whatever child gets truncated next cycle. That's a trade-off the comment acknowledges further up ("External queries that read pgque.subscription under a long snapshot may see stale rows or no rows after the prev-prev slot is truncated") — but the variable name last_rotation_step2_txid and the "held xmin" framing of the nearby code make it read as if that were being gated, when it isn't.

Fix: rename the field to something like last_step2_txid (no "held xmin" framing) and shorten the gate comment to: "step2 must have run in a separate transaction so the pointer flip is visible to new snapshots — this is a sequencing gate, not a held-xmin gate. Truncating the prev-prev slot is intentionally not gated on backend_xmin (see file header)."

LOW sql/pgque.sql L165 — redundant FK constraint on pgque.tick_tmpl

create table if not exists pgque.tick_tmpl (
    ...
    constraint tick_tmpl_queue_fkey foreign key (tick_queue)
                               references pgque.queue (queue_id)
);

The children use LIKE pgque.tick_tmpl INCLUDING DEFAULTS, which does not copy constraints (would need INCLUDING ALL or INCLUDING CONSTRAINTS). Each child explicitly redeclares its own tick_N_queue_fkey FK, so tick_tmpl's FK is dead weight — it constrains a table that by design will never hold rows.
Fix: drop the constraint from tick_tmpl. Check subscription_tmpl for the same harmless-but-dead pattern — it already omits the FKs, so there the (inconsistent) choice was the right one.
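The LIKE behavior underpinning this finding is easy to confirm on a scratch table (name invented for illustration):

```sql
-- LIKE ... INCLUDING DEFAULTS copies columns and their defaults but NOT
-- constraints, so tick_tmpl's FK never reaches tables created this way;
-- each child must (and does) redeclare its own tick_N_queue_fkey.
create table scratch_tick (like pgque.tick_tmpl including defaults);
-- \d scratch_tick shows no foreign key at this point
drop table scratch_tick;
```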

LOW sql/pgque.sql L4552 — pgque_rotate_step2 cron command concatenates two SELECTs; if the first fails the second never runs

$sql$SELECT pgque.maint_rotate_tables_step2();
     SELECT pgque.maint_rotate_metadata_step2();$sql$

Each _step2 is defensive (it updates a single config row), so an outright failure is unlikely. But if maint_rotate_tables_step2() ever raises (e.g. lock timeout from a concurrent drop_queue), maint_rotate_metadata_step2() gets skipped that cycle → next rotation hits the step2_txid IS NULL gate and is deferred. In pathological cases where the first statement fails every cycle, metadata rotation could stall without any log-visible reason.
Fix: either split into two cron jobs (pgque_rotate_step2_events, pgque_rotate_step2_metadata), or wrap each call in a DO $$ BEGIN ... EXCEPTION WHEN OTHERS THEN ... END $$ block so one failure doesn't cascade. Two-cron jobs is simpler and cleaner.
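The wrapper variant could look like the sketch below (error handling illustrative; the two-cron-job split avoids even this):

```sql
-- Sketch of the DO-block variant: each step2 call is isolated so a failure
-- in the event rotation doesn't skip the metadata rotation that cycle.
do $$
begin
    perform pgque.maint_rotate_tables_step2();
exception when others then
    raise warning 'maint_rotate_tables_step2 failed: %', sqlerrm;
end $$;

do $$
begin
    perform pgque.maint_rotate_metadata_step2();
exception when others then
    raise warning 'maint_rotate_metadata_step2 failed: %', sqlerrm;
end $$;
```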

POTENTIAL ISSUES

MEDIUM Hot-path overhead: every finish_batch / next_batch_custom now goes through an INSTEAD OF PL/pgSQL trigger plus a UNION-ALL view with a scalar subquery reading pgque.meta_rotation on each of its three branches

The acceptance criterion is "within 20% of clean baseline," which is honest about the cost. But the per-statement breakdown is:

  1. View resolution: (select cur_subscription_table from pgque.meta_rotation) evaluated once per UNION branch (3×) per query.
  2. Predicate = 0 / = 1 / = 2 filters two of three branches to empty — but only after the subquery evaluates.
  3. INSTEAD OF trigger fires per row, runs PL/pgSQL branching on v_cur, then does an UPDATE on the child.

For finish_batch, that's one trigger fire + one underlying UPDATE per ack. For next_batch_custom, same pattern on the UPDATE side. At 3 500 iter-TPS (the baseline in the PR description), the PL/pgSQL overhead is real. R6 smoke test results will say how much of it materializes.

Possible follow-ups (not for this PR): (a) materialize cur_subscription_table into a session GUC that rotation bumps and cron refreshes — avoids the per-query scalar subquery; (b) rewrite the INSTEAD OF trigger in C via PL/v8 or an extension (doesn't fit "no C accelerator" principle, noted for completeness); (c) use a partitioned table with a rotating partition-pruned predicate instead of a view.
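Follow-up (a) might look like the sketch below. The GUC name is invented, and note the caveat: a setting written by the rotating backend is only visible in that session, so the cron refresh (or something like ALTER DATABASE ... SET) has to carry it to other backends.

```sql
-- Sketch of follow-up (a): cache the active slot in a custom setting so
-- each view branch reads a cheap setting lookup instead of a scalar
-- subquery against pgque.meta_rotation. GUC name is an assumption.
select set_config('pgque.cur_subscription_table',
                  (select cur_subscription_table::text
                     from pgque.meta_rotation),
                  false);  -- session-scoped; a periodic refresh must repeat this

-- each view branch would then filter on, e.g.:
--   where current_setting('pgque.cur_subscription_table', true)::int = 0
```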

LOW The subscription view's where (select cur_subscription_table from pgque.meta_rotation) = N predicate is re-evaluated per transaction snapshot, which means READ COMMITTED callers see rotation happen mid-transaction

In RC mode, each statement gets a fresh snapshot, so back-to-back statements within one transaction can read different cur values. Operationally this appears correct: finish_batch finds the row by sub_batch, which is unique across children at any instant (because rotation copies and flips atomically under EXCLUSIVE lock). But if a caller ever does SELECT ... FROM pgque.subscription and uses the result across statements, they might see ghosts. Not this PR's bug; flagging because the commit note says "pgque itself has no code path that reads pgque.subscription under a long-lived snapshot" — which is a statement that needs a watch going forward (new code that reads subscription across statements + relies on row identity would silently break).

INFO The tick rotation's "defer if any consumer's sub_last_tick lives in the target slot" rule means a single stuck consumer can block tick rotation indefinitely while subscription rotation keeps happening

Author explicitly accepts this ("Subscription rotation still happens in that case — which is the dominant bloat source"). Worth instrumenting: a metric / log line on consecutive deferrals would make the operational picture observable. The pgque.meta_rotation singleton could grow a tick_rotation_deferred_cycles bigint counter. Not this PR's scope; just noting so it doesn't fall off.
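The counter could be a small addition to the singleton; column name as suggested above, update statements as sketches of where they would sit:

```sql
-- Sketch of the suggested observability hook; not code from this PR.
alter table pgque.meta_rotation
    add column if not exists tick_rotation_deferred_cycles bigint not null default 0;

-- on the defer path inside maint_rotate_metadata():
update pgque.meta_rotation
   set tick_rotation_deferred_cycles = tick_rotation_deferred_cycles + 1;

-- on a successful tick flip, reset so the counter measures consecutive deferrals:
update pgque.meta_rotation
   set tick_rotation_deferred_cycles = 0;
```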


Verified on a local PG18 install

  • Full tests/run_all.sql passes (send, receive, dlq, roles) — so the view+trigger layer is transparent to every test that already exists. ✓
  • Forced 3 rotations (update pgque.meta_rotation set last_rotation_time = now() - interval '1 hour' + call maint_rotate_metadata + maint_rotate_metadata_step2). cur_subscription_table cycles through 0 → 1 → 2 → 0. ✓
  • After each rotation, view row count stays at the right number (2 consumers → 2 rows in the view), while the prev children retain their stale copies until they become the target of the next rotation. ✓
  • Inter-rotation pgque.receive() + pgque.ack() on both consumers works — writes route correctly to whichever child is currently active. ✓
  • Post-full-cycle (cur=0 again after 3 rotations), insert + ticker + receive + ack on c2 returns the fresh event. ✓
  • pgque.subscription_0/1/2 dead-tuple count sanity: on this scratch instance with 2 consumers, the counts settle into the expected 0/2/2 shape (fresh active child, stale copies on the other two), matching the "active child is fresh" claim.

No crashes, no FK violations, no grant errors on install, no issues with the INSTEAD OF triggers. The design holds together in practice.

Summary

  Area                      Findings                                           Potential
  Security                  0                                                  0
  Bugs                      0                                                  0
  Tests                     1 MEDIUM                                           0
  Correctness/Concurrency   1 MEDIUM (unregister lock), 1 LOW (gate framing)   1 LOW (RC snapshot)
  Performance               0                                                  1 MEDIUM (hot-path trigger overhead)
  Guidelines                0                                                  0
  Docs                      0                                                  0
  Operational               0                                                  1 INFO (deferred-tick observability)
  Cleanup                   2 LOW (redundant FK, cron command split)           0

Result: REQUEST CHANGES (non-blocking). The design is correct, the manual validation holds, and the existing test suite keeps passing. But shipping a change of this scope without a dedicated tests/test_metadata_rotation.sql in the run_all.sql CI path is a maintainability risk the merge shouldn't absorb — the 3-rotation cycle test I ran by hand in 2 minutes belongs in the tree. The unregister_consumer FOR UPDATE removal needs an explicit concurrency argument (or a test) rather than a "rare in practice" note. And the cron-command concatenation at L4552 is a cheap fix (split into two jobs) that would prevent a dependent-failure stall long before anyone noticed. R6 smoke results will close out the perf question independently.


REV-assisted review (SOC2 checks skipped per request).

NikolayS pushed a commit that referenced this pull request Apr 18, 2026
…ark/ (issue #77)

Adds a strictly-additive benchmark/ directory documenting the
methodology, tooling, and operational lessons from the
pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench
that backs #61 and PR #62.

- README.md: entry point + quick-start
- METHODOLOGY.md: adapted from GitLab #77 note 3263767264
- OPS_GOTCHAS.md: 15 operational lessons (NEW — NVMe mount, partman stale
  rows, que func leftovers, pgboss covering index, pgq ticker, pgque xid8
  bug, spot reclaim, ASH prereqs, NOTICE instrumentation, etc.)
- HARDWARE.md: i4i.2xlarge specs, PG tuning, microbench baselines
- tooling/, runners/, consumers/, producers/, install/, charts/, gifs/

No pgque production SQL is touched.
Refs: #61, #62.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NikolayS NikolayS marked this pull request as draft April 18, 2026 14:06


Development

Successfully merging this pull request may close these issues.

Metadata-table bloat under held xmin — subscription/tick need rotation
