Reap expired and deleted streams from the object store #15
Draft
kristof-siket wants to merge 2 commits into
A stream row with STREAM_FLAG_DELETED is the single reapable state: the expiry sweeper flags expired rows, HTTP DELETE flags explicitly, and the StreamReaper deletes the stream's objects data-first with manifest.json strictly last, hard-deleting local rows only once the prefix is verifiably empty. The manifest is kept terminal (deleted flag or past expiry) in the object store before any data object disappears, so restore-from-R2 mid-reap restores a row-only tombstone that re-arms the reap instead of aborting boot on a missing segment head. A periodic scan restores row-only tombstones for doomed manifests that no longer have a local row (local SQLite is ephemeral on redeploy), making the object store the durable retention driver. PUT-recreate now reaps the old incarnation before creating, closing the orphaned-objects and stale segment-cache holes; the uploader stops picking segments of deleted streams so a reap cannot be re-populated mid-flight.
Thirteen lifecycle tests cover expiry and DELETE reaps, manifest-last ordering, crash-resume across restore-from-R2, tombstone-lite bootstrap (zero segment head checks), recreate on deleted and expired names, recreate racing an in-flight reap, verify-pass stragglers, failure backoff, the upload filter for deleted streams, the no-local-row scan path, and internal-stream safety. MockR2 gains delete fault injection and counters. The assumptions suite's delete test now asserts the new contract: tombstone until the reap completes, then fully gone. Docs: spec DELETE/TTL semantics, recovery-runbook deletion commit points, architecture retention section, operational env knobs, metrics series, sqlite-schema lifecycle note, CHANGELOG.
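The selection step of the periodic scan described above (restore a row-only tombstone for every doomed manifest that has no local row) can be sketched as a pure function. This is an illustration only: `RemoteManifest`, `isDoomed`, and `selectTombstonesToRestore` are hypothetical names, and the manifest fields are assumed rather than taken from the PR's actual schema.

```typescript
// Hypothetical, simplified view of a manifest listed from the object store.
interface RemoteManifest {
  stream: string;
  deleted: boolean;           // tombstone flag (assumed field name)
  expiresAtMs: number | null; // null means no TTL (assumed field name)
}

// A manifest is "doomed" when it is tombstoned or already past its expiry.
function isDoomed(m: RemoteManifest, nowMs: number): boolean {
  return m.deleted || (m.expiresAtMs !== null && m.expiresAtMs <= nowMs);
}

// Returns the streams that need a row-only tombstone restored locally so the
// normal flag -> reap pipeline can pick them up again.
function selectTombstonesToRestore(
  manifests: RemoteManifest[],
  localRows: ReadonlySet<string>,
  nowMs: number,
): string[] {
  return manifests
    .filter((m) => isDoomed(m, nowMs) && !localRows.has(m.stream))
    .map((m) => m.stream);
}
```

Doomed manifests that still have a local row are skipped here because the row itself already drives the reap.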
Why
Nothing ever deletes objects from R2 today. Stream expiry (Stream-TTL) is enforced only at read time, the expiry sweeper only soft-deletes the local SQLite row, and DELETE explicitly documents that remote objects survive. For workloads that create many short-lived streams (Prisma Compute build logs: one stream per build), storage grows without bound, and since eager --bootstrap-from-r2 restores every manifest before serving, boot time grows with it. This is the first of two PRs from the discussion on #14; the second is the lazy stream-resolution layer. Retention bounds the stream population, which both caps storage and keeps eager boot tractable.
What
One reapable state. Expiry and DELETE converge on the same thing: a local row with STREAM_FLAG_DELETED. The sweeper flags expired rows (unchanged); DELETE already flags the row and publishes a tombstone manifest. The new StreamReaper (src/retention.ts) consumes flagged rows, and the local row is the durable resume token.
Crash-safe reap ordering (the core of the PR):
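1. Ensure the remote manifest is terminal (deleted flag or past expires_at); republish tombstone if not
2. Delete the stream's data objects first, manifest.json strictly last
3. Once the prefix is verifiably empty, hardDeleteStream the local rows
The manifest is the remote commit point: while it exists (terminal), restore-from-R2 sees a tombstone; once it's gone the prefix is invisible. Nothing survives it.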
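A minimal sketch of that ordering against an in-memory store, for illustration. The `ObjectStore` interface, `MemoryStore`, `reapStream`, the manifest fields, and the data-object key names are all assumptions made for this example; only the `streams/<name>/manifest.json` layout and the data-first, manifest-last order come from the description.

```typescript
// Hypothetical minimal object-store interface; not the PR's actual API.
interface ObjectStore {
  list(prefix: string): Promise<string[]>;
  get(key: string): Promise<string | undefined>;
  put(key: string, body: string): Promise<void>;
  delete(key: string): Promise<void>;
}

// Assumed manifest shape for the sketch.
interface Manifest {
  deleted: boolean;
  expiresAtMs: number | null;
}

// In-memory stand-in that records delete order so it can be inspected.
class MemoryStore implements ObjectStore {
  readonly objects = new Map<string, string>();
  readonly deleteLog: string[] = [];
  async list(prefix: string): Promise<string[]> {
    return Array.from(this.objects.keys()).filter((k) => k.startsWith(prefix)).sort();
  }
  async get(key: string): Promise<string | undefined> {
    return this.objects.get(key);
  }
  async put(key: string, body: string): Promise<void> {
    this.objects.set(key, body);
  }
  async delete(key: string): Promise<void> {
    this.objects.delete(key);
    this.deleteLog.push(key);
  }
}

function isTerminal(m: Manifest, nowMs: number): boolean {
  return m.deleted || (m.expiresAtMs !== null && m.expiresAtMs <= nowMs);
}

// Returns true once the prefix is empty, i.e. when the caller may hard-delete
// the local rows. Returns false if data objects keep appearing (retry later).
async function reapStream(store: ObjectStore, stream: string, nowMs: number): Promise<boolean> {
  const prefix = `streams/${stream}/`;
  const manifestKey = `${prefix}manifest.json`;
  const listData = async () => (await store.list(prefix)).filter((k) => k !== manifestKey);

  // 1. The remote manifest must be terminal before any data object disappears.
  const raw = await store.get(manifestKey);
  if (raw !== undefined) {
    const manifest = JSON.parse(raw) as Manifest;
    if (!isTerminal(manifest, nowMs)) {
      await store.put(manifestKey, JSON.stringify({ ...manifest, deleted: true }));
    }
  }

  // 2. Data objects first; re-list after each pass to catch stragglers.
  let data = await listData();
  for (let pass = 0; pass < 3 && data.length > 0; pass++) {
    for (const key of data) await store.delete(key);
    data = await listData();
  }
  if (data.length > 0) return false;

  // 3. Manifest strictly last, then confirm the prefix is empty.
  await store.delete(manifestKey);
  return (await store.list(prefix)).length === 0;
}
```

A crash anywhere before step 3 leaves a terminal manifest behind, so a restore sees a tombstone and the reap can simply run again.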
Two companion fixes this surfaced (both load-bearing):
- src/bootstrap.ts: tombstoned/expired manifests restore as a row-only tombstone, with no segment head() checks. Without this, restoring mid-reap throws missing segment and bricks boot. This also fixes the pre-existing latent version of that bug (bootstrap of a deleted manifest fully head-checked segments).
- db.pendingUploadHeads: segments of DELETED-flagged streams are no longer picked for upload, so an in-flight reap can't be re-populated (or have its manifest republished) by late uploads.
The durable driver: periodic R2 scan (src/retention_sweeper.ts). Local-row-driven reaping alone breaks on ephemeral-disk deployments (Compute redeploys start with empty SQLite): a stream that expires after its creating node is gone has no row anywhere. A low-frequency scan (DS_RETENTION_SCAN_MS, default 6h) lists streams/*/manifest.json and restores doomed manifests without a local row as row-only tombstones, re-arming the normal pipeline. Retention keeps the manifest population bounded, so the scan stays cheap. This is the same shape as JetStream/Kafka periodic retention enforcement.
Recreate is now safe. PUT on a deleted/expired name reaps the old incarnation inline before creating (reapForRecreate, serialized per stream with any background reap via a keyed lock). This fixes two pre-existing holes: recreates silently orphaned old R2 objects, and the segment disk cache could serve the old incarnation's bytes for the new stream (same object keys). On reap failure the PUT returns 503 retention_in_progress + retry-after: 1.
Renames: ExpirySweeper → RetentionSweeper (src/retention_sweeper.ts). It now owns three phases (flag expired → reap flagged → scan); no compat alias, per repo rules.
Local mode
With NullObjectStore there are no remote objects: the reaper goes straight to local cleanup (preserving today's recreate semantics) and the scan is disabled.
Config
- DS_RETENTION_REAP_LIMIT
- DS_RETENTION_DELETE_CONCURRENCY
- DS_RETENTION_SCAN_MS
Cadence/disable reuse DS_EXPIRY_SWEEP_MS (0 still disables everything; tests rely on it).
Observability
tieredstore.retention.{streams_reaped, objects_deleted, list_passes, reap.latency, reap.failures, scan.*}; per-artifact delete counts/latency come free via the accounting wrapper. Errors are better-result typed (tombstone_publish_failed | list_failed | delete_failed | prefix_not_stable | stopped) with FailureTracker backoff in the sweep loop.
Tests
test/retention_reaper.test.ts, 13 cases: expiry reaps everything & hard-deletes; DELETE → tombstone → reap; manifest outlives every data object (delay-injected ordering probe); crash mid-reap resumes after restore-from-R2 on a fresh root; bootstrap restores tombstones with zero segment heads; PUT-recreate on deleted and on expired names (clean prefix, fresh epoch); recreate serializes with an in-flight reap; late-landing object swept by the verify pass; delete failures back off then converge; deleted streams stop uploading pending segments (+ never-uploaded local files reclaimed); the scan reaps a doomed manifest with no local row (the redeploy case); the internal metrics stream survives sweeping; accounting records delete ops. MockR2 gained failDeleteEvery/deleteDelayMs faults + delete counters. test/assumptions.test.ts "delete is tombstone" updated to the new contract (tombstone until reap completes, then fully gone).
Docs (same change)
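Spec DELETE/TTL semantics, recovery-runbook deletion commit points, architecture retention section, operational env knobs, metrics series, sqlite-schema (stream_flags lifecycle), CHANGELOG.
Verification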
- bun run typecheck, bun run check:result-policy, bun run test:ci: green
- bun run test:conformance + test:conformance:local: green (spec DELETE section changed)
Relationship to #14
Independent of and preparatory to the lazy-restore work: with retention bounding the stream population, eager boot stays tractable, and the follow-up resolution-layer PR (superseding #14 per review feedback there) makes boot O(active) regardless. The reaper exports withStreamLock + the tombstone-restore invariants that PR builds on.