
Reap expired and deleted streams from the object store #15

Draft
kristof-siket wants to merge 2 commits into main from feat/retention-reaps-r2


Why

Nothing ever deletes objects from R2 today. Stream expiry (Stream-TTL) is enforced only at read time, the expiry sweeper only soft-deletes the local SQLite row, and DELETE explicitly documents that remote objects survive. For workloads that create many short-lived streams (Prisma Compute build logs: one stream per build), storage grows without bound — and since eager --bootstrap-from-r2 restores every manifest before serving, boot time grows with it. This is the first of two PRs from the discussion on #14; the second is the lazy stream-resolution layer. Retention bounds the stream population, which both caps storage and keeps eager boot tractable.

What

One reapable state. Expiry and DELETE converge on the same thing: a local row with STREAM_FLAG_DELETED. The sweeper flags expired rows (unchanged); DELETE already flags the row and publishes a tombstone manifest. The new StreamReaper (src/retention.ts) consumes flagged rows, and the local row serves as the durable resume token: it survives until the reap has fully completed.

Crash-safe reap ordering (the core of the PR):

| Step | Action | Crash here → |
|---|---|---|
| 1 | Verify the R2 manifest is terminal (deleted flag or past expires_at); republish tombstone if not | row survives → re-run |
| 2 | Delete data objects first (bounded concurrency, ≤5 list passes) | half-reaped prefix is behind a terminal manifest → safe |
| 3 | Delete manifest.json strictly last | re-list is empty → re-delete is a no-op |
| 4 | Verify the prefix lists empty (absorbs late-landing uploads) | row survives → re-run |
| 5 | Local commit: drop segment disk-cache entries + local segment files → hardDeleteStream | done |

The manifest is the remote commit point: while it exists (terminal), restore-from-R2 sees a tombstone; once it's gone, the prefix is invisible. Because data objects are deleted first, none of them outlives the manifest.
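The ordering above can be sketched roughly as follows. This is an illustration, not the code in src/retention.ts: `MemoryStore`, `reapStream`, the `Manifest` shape, and the segment key layout are all hypothetical, the store is a synchronous in-memory stand-in, and bounded delete concurrency is left out to keep the ordering readable.

```typescript
// Sketch only: a synchronous in-memory stand-in for R2. All names are hypothetical.
interface Manifest {
  deleted: boolean;
  expiresAtMs: number | null;
}

class MemoryStore {
  objects = new Map<string, string>();
  deleteLog: string[] = [];

  list(prefix: string): string[] {
    return Array.from(this.objects.keys()).filter((k) => k.startsWith(prefix)).sort();
  }

  delete(key: string): void {
    this.objects.delete(key); // deleting a missing key is a no-op
    this.deleteLog.push(key);
  }
}

type ReapResult = "reaped" | "prefix_not_stable";

function isTerminal(m: Manifest, nowMs: number): boolean {
  return m.deleted || (m.expiresAtMs !== null && m.expiresAtMs <= nowMs);
}

function reapStream(store: MemoryStore, stream: string, nowMs: number, maxListPasses = 5): ReapResult {
  const prefix = `streams/${stream}/`;
  const manifestKey = `${prefix}manifest.json`;

  // Step 1: the remote manifest must be terminal before any data disappears.
  // Assumption: an absent manifest means an earlier run already passed step 3.
  const raw = store.objects.get(manifestKey);
  if (raw !== undefined) {
    const manifest: Manifest = JSON.parse(raw);
    if (!isTerminal(manifest, nowMs)) {
      store.objects.set(manifestKey, JSON.stringify({ deleted: true, expiresAtMs: null }));
    }
  }

  // Step 2: delete data objects first, re-listing a bounded number of times.
  for (let pass = 0; pass < maxListPasses; pass++) {
    const data = store.list(prefix).filter((k) => k !== manifestKey);
    if (data.length === 0) break;
    for (const key of data) store.delete(key);
  }
  if (store.list(prefix).some((k) => k !== manifestKey)) return "prefix_not_stable";

  // Step 3: the manifest goes strictly last.
  if (store.objects.has(manifestKey)) store.delete(manifestKey);

  // Step 4: sweep anything that landed late, then require an empty prefix.
  for (const key of store.list(prefix)) store.delete(key);
  if (store.list(prefix).length > 0) return "prefix_not_stable";

  // Step 5 (local commit) is the caller's job once this returns "reaped".
  return "reaped";
}
```

A caller would hard-delete the local row only on `"reaped"`; any other outcome leaves the row in place so the next sweep tick re-runs the same steps.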

Two companion fixes this surfaced (both load-bearing):

  • Bootstrap tombstone short-circuit (src/bootstrap.ts): tombstoned/expired manifests restore as a row-only tombstone — no segment head() checks. Without this, restoring mid-reap throws missing segment and bricks boot. This also fixes the pre-existing latent version of that bug (bootstrap of a deleted manifest fully head-checked segments).
  • Uploader leak-stop (db.pendingUploadHeads): segments of DELETED-flagged streams are no longer picked for upload, so an in-flight reap can't be re-populated (or have its manifest republished) by late uploads.
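A minimal sketch of the two companion fixes, under assumed shapes. `RemoteManifest`, `restoreFromManifest`, and `filterUploadCandidates` are hypothetical names and signatures chosen for illustration; they are not the ones in src/bootstrap.ts or db.pendingUploadHeads.

```typescript
// Hypothetical shapes; the real manifest and row types are not shown in this PR text.
interface RemoteManifest {
  stream: string;
  deleted: boolean;
  expiresAtMs: number | null;
  segments: string[];
}

interface RestoredRow {
  stream: string;
  deleted: boolean;
  segments: string[];
}

// Bootstrap short-circuit: a doomed manifest restores as a row-only tombstone
// and never issues a head() check, so a half-reaped prefix cannot abort boot.
function restoreFromManifest(
  m: RemoteManifest,
  nowMs: number,
  headSegment: (key: string) => boolean,
): RestoredRow {
  const doomed = m.deleted || (m.expiresAtMs !== null && m.expiresAtMs <= nowMs);
  if (doomed) {
    return { stream: m.stream, deleted: true, segments: [] };
  }
  for (const key of m.segments) {
    if (!headSegment(key)) throw new Error(`missing segment ${key}`);
  }
  return { stream: m.stream, deleted: false, segments: m.segments.slice() };
}

interface PendingSegment {
  stream: string;
  key: string;
}

// Uploader leak-stop: segments of DELETED-flagged streams are never offered for upload.
function filterUploadCandidates(pending: PendingSegment[], deletedStreams: Set<string>): PendingSegment[] {
  return pending.filter((s) => !deletedStreams.has(s.stream));
}
```

The point of the first function is that the tombstone path returns before the loop, so the number of head() calls for a deleted or expired stream is zero regardless of how many segments its manifest lists.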

The durable driver — periodic R2 scan (src/retention_sweeper.ts): local-row-driven reaping alone breaks on ephemeral-disk deployments (Compute redeploys start with empty SQLite): a stream that expires after its creating node is gone has no row anywhere. A low-frequency scan (DS_RETENTION_SCAN_MS, default 6h) lists streams/*/manifest.json, and restores doomed manifests without a local row as row-only tombstones — re-arming the normal pipeline. Retention keeps the manifest population bounded, so the scan stays cheap. This is the same shape as JetStream/Kafka periodic retention enforcement.
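The selection step of the scan could look something like the sketch below. `ScannedManifest`, `streamNameFromKey`, and `findOrphanedDoomed` are hypothetical; the listing itself, the timer, and the actual restore of each tombstone row are omitted.

```typescript
// Hypothetical summary of one listed manifest.
interface ScannedManifest {
  key: string; // e.g. "streams/<name>/manifest.json"
  deleted: boolean;
  expiresAtMs: number | null;
}

// Extract the stream name from a manifest key; returns null for any other key.
function streamNameFromKey(key: string): string | null {
  const match = /^streams\/([^/]+)\/manifest\.json$/.exec(key);
  return match?.[1] ?? null;
}

// Return the streams whose manifest is doomed (deleted or expired) and that have
// no local row. These are the ones to restore as row-only tombstones so the
// normal flag -> reap pipeline picks them up.
function findOrphanedDoomed(
  manifests: ScannedManifest[],
  localRows: Set<string>,
  nowMs: number,
): string[] {
  const orphans: string[] = [];
  for (const m of manifests) {
    const name = streamNameFromKey(m.key);
    if (name === null) continue;
    const doomed = m.deleted || (m.expiresAtMs !== null && m.expiresAtMs <= nowMs);
    if (doomed && !localRows.has(name)) orphans.push(name);
  }
  return orphans;
}
```

Streams that already have a local row are skipped on purpose: the sweeper's flag and reap phases cover them, so the scan only has to handle the redeploy case.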

Recreate is now safe. PUT on a deleted/expired name reaps the old incarnation inline before creating (reapForRecreate, serialized per stream with any background reap via a keyed lock). This fixes two pre-existing holes: recreates silently orphaned old R2 objects, and the segment disk cache could serve the old incarnation's bytes for the new stream (same object keys). On reap failure the PUT returns 503 retention_in_progress + retry-after: 1.
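One way to picture the per-stream serialization is a promise-chained keyed lock, sketched below. `KeyedLock` and `putRecreate` are hypothetical stand-ins for withStreamLock and reapForRecreate, whose real signatures are not shown here, and the 201 success status is an assumption; only the 503 retention_in_progress response with retry-after: 1 comes from this description.

```typescript
// Sketch of a keyed async lock: callers with the same key run one at a time, in order.
class KeyedLock {
  private tails = new Map<string, Promise<void>>();

  async run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const prev = this.tails.get(key) ?? Promise.resolve();
    let release!: () => void;
    const gate = new Promise<void>((resolve) => {
      release = resolve;
    });
    const tail = prev.then(() => gate);
    this.tails.set(key, tail);
    await prev; // wait for every earlier holder of this key
    try {
      return await fn();
    } finally {
      release();
      // Drop the entry if nobody queued behind us.
      if (this.tails.get(key) === tail) this.tails.delete(key);
    }
  }
}

interface PutResult {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Reap the old incarnation under the stream's lock, then create the new one.
async function putRecreate(
  lock: KeyedLock,
  stream: string,
  reapOld: () => Promise<boolean>,
  create: () => Promise<void>,
): Promise<PutResult> {
  return lock.run(stream, async (): Promise<PutResult> => {
    const reaped = await reapOld();
    if (!reaped) {
      return { status: 503, headers: { "retry-after": "1" }, body: "retention_in_progress" };
    }
    await create();
    return { status: 201, headers: {}, body: "created" }; // success status is assumed
  });
}
```

Because a background reap would take the same key, a PUT that arrives mid-reap queues behind it instead of racing it, and creation only starts once the old prefix is gone.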

Renames: ExpirySweeper → RetentionSweeper (src/retention_sweeper.ts); it now owns three phases (flag expired → reap flagged → scan). No compat alias, per repo rules.

Local mode

With NullObjectStore there are no remote objects: the reaper goes straight to local cleanup (preserving today's recreate semantics) and the scan is disabled.

Config

| Env | Default | Meaning |
|---|---|---|
| DS_RETENTION_REAP_LIMIT | 20 | streams reaped per sweep tick |
| DS_RETENTION_DELETE_CONCURRENCY | 4 | parallel object deletes within one reap |
| DS_RETENTION_SCAN_MS | 6h | doomed-manifest scan interval (0 disables) |

Cadence/disable reuse DS_EXPIRY_SWEEP_MS (0 still disables everything; tests rely on it).
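For illustration, the knobs could be read along these lines. This is a sketch under assumptions: `loadRetentionConfig` and `RetentionConfig` are hypothetical, values are assumed to be plain non-negative integers, and the 6h default is written as 21,600,000 ms on the assumption that the _MS suffix means milliseconds. The default for DS_EXPIRY_SWEEP_MS is not stated here, so the sketch leaves it as null when unset.

```typescript
interface RetentionConfig {
  reapLimit: number;
  deleteConcurrency: number;
  scanMs: number;
  sweepMs: number | null; // null: fall back to the existing sweep default (not stated here)
  retentionEnabled: boolean;
  scanEnabled: boolean;
}

type Env = Record<string, string | undefined>;

// Parse a non-negative integer from the environment, or use the fallback when unset.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return n;
}

function loadRetentionConfig(env: Env): RetentionConfig {
  const SIX_HOURS_MS = 6 * 60 * 60 * 1000;
  const sweepRaw = env["DS_EXPIRY_SWEEP_MS"];
  const sweepMs =
    sweepRaw === undefined || sweepRaw.trim() === "" ? null : intFromEnv(env, "DS_EXPIRY_SWEEP_MS", 0);
  const scanMs = intFromEnv(env, "DS_RETENTION_SCAN_MS", SIX_HOURS_MS);
  // DS_EXPIRY_SWEEP_MS=0 disables everything; DS_RETENTION_SCAN_MS=0 disables only the scan.
  const retentionEnabled = sweepMs !== 0;
  return {
    reapLimit: intFromEnv(env, "DS_RETENTION_REAP_LIMIT", 20),
    deleteConcurrency: intFromEnv(env, "DS_RETENTION_DELETE_CONCURRENCY", 4),
    scanMs,
    sweepMs,
    retentionEnabled,
    scanEnabled: retentionEnabled && scanMs > 0,
  };
}
```

Passing the environment in as an argument, rather than reading a global, keeps the sketch testable with a plain object.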

Observability

tieredstore.retention.{streams_reaped, objects_deleted, list_passes, reap.latency, reap.failures, scan.*}; per-artifact delete counts/latency come free via the accounting wrapper. Errors are better-result typed (tombstone_publish_failed | list_failed | delete_failed | prefix_not_stable | stopped) with FailureTracker backoff in the sweep loop.
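The error kinds listed above map naturally onto a discriminated union. The sketch below hand-rolls the result type rather than using better-result, and `BackoffTracker` is a hypothetical stand-in for FailureTracker: its doubling schedule and its 1 s / 60 s bounds are assumptions, not values from this PR.

```typescript
type ReapError =
  | "tombstone_publish_failed"
  | "list_failed"
  | "delete_failed"
  | "prefix_not_stable"
  | "stopped";

type ReapOutcome =
  | { ok: true; objectsDeleted: number }
  | { ok: false; error: ReapError };

// Tracks consecutive failures per stream and returns how long to wait before retrying.
class BackoffTracker {
  private failures = new Map<string, number>();
  private baseMs: number;
  private maxMs: number;

  constructor(baseMs = 1000, maxMs = 60000) {
    this.baseMs = baseMs;
    this.maxMs = maxMs;
  }

  // Returns 0 after a success; otherwise an exponentially growing, capped delay.
  record(stream: string, outcome: ReapOutcome): number {
    if (outcome.ok) {
      this.failures.delete(stream);
      return 0;
    }
    const count = (this.failures.get(stream) ?? 0) + 1;
    this.failures.set(stream, count);
    return Math.min(this.maxMs, this.baseMs * 2 ** (count - 1));
  }
}
```

With a tracker like this, a stream whose deletes keep failing is retried less and less often, and a single successful reap resets it.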

Tests

test/retention_reaper.test.ts, 13 cases:

  • expiry reaps everything & hard-deletes
  • DELETE → tombstone → reap
  • manifest outlives every data object (delay-injected ordering probe)
  • crash mid-reap resumes after restore-from-R2 on a fresh root
  • bootstrap restores tombstones with zero segment heads
  • PUT-recreate on deleted and on expired names (clean prefix, fresh epoch)
  • recreate serializes with an in-flight reap
  • late-landing object swept by the verify pass
  • delete failures back off then converge
  • deleted streams stop uploading pending segments (+ never-uploaded local files reclaimed)
  • the scan reaps a doomed manifest with no local row (the redeploy case)
  • the internal metrics stream survives sweeping
  • accounting records delete ops

MockR2 gained failDeleteEvery/deleteDelayMs faults + delete counters.

test/assumptions.test.ts "delete is tombstone" updated to the new contract (tombstone until reap completes, then fully gone).

Docs (same change)

  • spec §10 DELETE (async remote deletion, data-first/manifest-last, recreate waits) + §3.2 TTL reclamation note
  • recovery runbook: new "Deletion commit points" (tombstone publish = delete-intent commit; manifest deletion = remote invisibility; local hard delete = final commit)
  • architecture: "Stream Deletion Enforcement and Retention"
  • operational-notes (env knobs), metrics.md (retention series), sqlite-schema.md (stream_flags lifecycle)
  • CHANGELOG "Upcoming"

Verification

  • bun run typecheck, bun run check:result-policy, bun run test:ci — green
  • bun run test:conformance + test:conformance:local — green (spec DELETE section changed)

Relationship to #14

Independent of and preparatory to the lazy-restore work: with retention bounding the stream population, eager boot stays tractable, and the follow-up resolution-layer PR (superseding #14 per review feedback there) makes boot O(active) regardless. The reaper exports withStreamLock + the tombstone-restore invariants that PR builds on.

A stream row with STREAM_FLAG_DELETED is the single reapable state: the
expiry sweeper flags expired rows, HTTP DELETE flags explicitly, and the
StreamReaper deletes the stream's objects data-first with manifest.json
strictly last, hard-deleting local rows only once the prefix is verifiably
empty. The manifest is kept terminal (deleted flag or past expiry) in the
object store before any data object disappears, so restore-from-R2 mid-reap
restores a row-only tombstone that re-arms the reap instead of aborting boot
on a missing segment head.

A periodic scan restores row-only tombstones for doomed manifests that no
longer have a local row (local SQLite is ephemeral on redeploy), making the
object store the durable retention driver. PUT-recreate now reaps the old
incarnation before creating, closing the orphaned-objects and stale
segment-cache holes; the uploader stops picking segments of deleted streams
so a reap cannot be re-populated mid-flight.

Thirteen lifecycle tests cover expiry and DELETE reaps, manifest-last
ordering, crash-resume across restore-from-R2, tombstone-lite bootstrap
(zero segment head checks), recreate on deleted and expired names,
recreate racing an in-flight reap, verify-pass stragglers, failure
backoff, the upload filter for deleted streams, the no-local-row scan
path, and internal-stream safety. MockR2 gains delete fault injection
and counters. The assumptions suite's delete test now asserts the new
contract: tombstone until the reap completes, then fully gone.

Docs: spec DELETE/TTL semantics, recovery-runbook deletion commit
points, architecture retention section, operational env knobs, metrics
series, sqlite-schema lifecycle note, CHANGELOG.