fix(pds): prevent relay desync after failed write #162

Merged
ascorbic merged 1 commit into main from fix/relay-desync-after-failed-write on May 10, 2026

@ascorbic
Owner

Summary

  • After a write that failed mid-flight (most often an image post), the relay would silently stop tracking the PDS until a manual requestCrawl re-established it.
  • Root cause: applyWrites updated this.repo in memory before sequencing the firehose event. If anything threw between the in-memory assignment and a successful sequenceCommit, Cloudflare rolled back the SQLite writes but JS state stayed advanced. The next successful write then emitted a firehose commit whose since rev the relay had never seen, so it marked the repo desynced.
  • Fix: in all four write paths (rpcCreateRecord, rpcDeleteRecord, rpcPutRecord, rpcApplyWrites), only assign this.repo = updatedRepo after sequenceCommit and broadcastCommit succeed. Wrap the post-applyWrites block in try/catch that calls a new invalidateRepoCache() helper on failure (defense in depth — forces a reload from storage on the next access).
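
The ordering described in the fix can be sketched as follows. This is a minimal, synchronous model and not the PR's actual code: `AccountSketch`, `storedRev`, `failNextSequence`, `emitted`, and `getRepo` are hypothetical names introduced for illustration. Only `this.repo`, `updatedRepo`, `sequenceCommit`, `broadcastCommit`, `invalidateRepoCache`, and `rpcCreateRecord` come from the description above. Storage rollback is modelled by only advancing `storedRev` when sequencing succeeds.

```typescript
// Hypothetical stand-in for the real Repo type; only the rev matters here.
interface Repo {
  rev: string;
}

class AccountSketch {
  // In-memory cache of the repo (the `this.repo` from the PR description).
  private repo: Repo | null = null;
  // Stand-in for durable storage (SQLite in the real PDS).
  private storedRev = "rev-0";
  // Test hook (hypothetical): makes the next sequenceCommit throw.
  failNextSequence = false;
  // Commit events that reached the firehose in this sketch.
  readonly emitted: { since: string; rev: string }[] = [];

  // Lazily reload the repo from storage when the cache is empty.
  private getRepo(): Repo {
    if (this.repo === null) {
      this.repo = { rev: this.storedRev };
    }
    return this.repo;
  }

  // Drop the cached repo so the next access reloads from storage.
  private invalidateRepoCache(): void {
    this.repo = null;
  }

  // Persist and sequence the commit; a throw leaves storage untouched,
  // which models the rollback of the SQLite writes.
  private sequenceCommit(since: string, rev: string): void {
    if (this.failNextSequence) {
      this.failNextSequence = false;
      throw new Error("sequencer unavailable");
    }
    this.storedRev = rev;
    this.emitted.push({ since, rev });
  }

  // No-op placeholder for notifying connected firehose subscribers.
  private broadcastCommit(_rev: string): void {}

  rpcCreateRecord(nextRev: string): void {
    const current = this.getRepo();
    const updatedRepo: Repo = { rev: nextRev };
    try {
      this.sequenceCommit(current.rev, updatedRepo.rev);
      this.broadcastCommit(updatedRepo.rev);
      // Only now is it safe to advance in-memory state.
      this.repo = updatedRepo;
    } catch (err) {
      // Defense in depth: force a reload from storage on the next access.
      this.invalidateRepoCache();
      throw err;
    }
  }
}
```

In this model, a write that fails during sequencing leaves both storage and the cache at the last sequenced rev, so the next successful write emits a `since` value the relay has already seen.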

Test plan

  • pnpm --filter @getcirrus/pds test:unit — 273 tests pass
  • In production: induce a failed image post (or a transient sequencer error) and confirm the relay continues to track the PDS without a manual requestCrawl

`applyWrites` was assigning the new `Repo` to in-memory state before sequencing the firehose event. If anything threw between then and `sequenceCommit` succeeding (e.g. mid-flight failure on an image post), Cloudflare rolled back the SQLite writes but the in-memory `Repo` stayed advanced. The next successful write then emitted a firehose commit whose `since` rev the relay had never seen, and the relay marked the repo desynced — requiring a manual `requestCrawl` to recover.

`this.repo` is now only assigned after the sequence + broadcast succeed in all four write paths (`rpcCreateRecord`, `rpcDeleteRecord`, `rpcPutRecord`, `rpcApplyWrites`), and any failure in that window invalidates the in-memory cache so the next access reloads from storage.
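
The desync described in this commit message can be illustrated from the relay's side. The relay's real logic is not shown in this PR, so the sketch below is an assumption: a hypothetical `RelayCursorSketch` that remembers the last rev it saw for a repo and flags the repo as desynced when an incoming commit's `since` does not match it.

```typescript
// Hypothetical shape of a firehose commit event; only `since` and `rev`
// are taken from the description above.
interface CommitEvent {
  since: string | null;
  rev: string;
}

// Assumed model of a relay's per-repo continuity check (not real relay code).
class RelayCursorSketch {
  private lastRev: string | null = null;
  desynced = false;

  // Returns true if the event was accepted, false if it was rejected.
  ingest(evt: CommitEvent): boolean {
    if (this.desynced) {
      // Once desynced, events are ignored until tracking is re-established.
      return false;
    }
    if (this.lastRev !== null && evt.since !== this.lastRev) {
      // The commit chains from a rev this relay never saw.
      this.desynced = true;
      return false;
    }
    this.lastRev = evt.rev;
    return true;
  }
}

// Before the fix: a failed write advanced in-memory state to "rev-2" without
// emitting an event, so the next commit claimed since = "rev-2".
const relay = new RelayCursorSketch();
const acceptedFirst = relay.ingest({ since: null, rev: "rev-1" });
const acceptedAfterFailedWrite = relay.ingest({ since: "rev-2", rev: "rev-3" });
```

With the fix, the second event would carry `since: "rev-1"` instead, which this model accepts.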
@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Deployment successful! | atproto-pds | 159948e | May 10 2026, 02:51 PM |

@pkg-pr-new

pkg-pr-new Bot commented May 10, 2026

npm i https://pkg.pr.new/create-pds@162
npm i https://pkg.pr.new/@getcirrus/oauth-provider@162
npm i https://pkg.pr.new/@getcirrus/pds@162

commit: 159948e

@ascorbic ascorbic merged commit 5920074 into main May 10, 2026
5 checks passed
@ascorbic ascorbic deleted the fix/relay-desync-after-failed-write branch May 10, 2026 14:52
@mixie-bot mixie-bot Bot mentioned this pull request May 10, 2026