fix(chat-recovery): bound Durable Object memory-limit (OOM) crash loops (#1825) #1826

Merged: threepointone merged 2 commits into main from fix/chat-recovery-oom-alarm-breaker on Jun 28, 2026

Conversation

threepointone commented Jun 28, 2026

Summary

Fixes #1825: a chat-recovery turn whose Durable Object isolate exceeds its 128 MB memory limit could loop forever, re-running the (billable) turn on every platform alarm retry.

Why it looped

A reset isolate has usually already streamed a little content, which bumps the durable progress counter. On the next wake, recovery reads that as forward progress and resets both progress-keyed bounds — the attempt cap (maxAttempts) and the no-progress window (noProgressTimeoutMs). Because each crash lands inside the alarm-debounce window, the attempt counter is pinned too. With maxRecoveryWork defaulting to Infinity, no instrument could ever seal the turn, so the model re-ran indefinitely.
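
The interaction described above can be sketched as a small simulation. This is a simplified, hypothetical model rather than the library's implementation: the names `RecoveryState`, `onWake`, and `simulate` are invented for illustration, and the 10 work units per crashed run is an arbitrary assumption. It only shows why a progress-keyed attempt cap never trips while a monotonic work meter eventually does.

```typescript
// Hypothetical model of the recovery bounds. Only maxAttempts and
// maxRecoveryWork are names taken from the PR text.
interface RecoveryState {
  attempts: number; // progress-keyed: reset whenever progress advances
  lastProgress: number; // durable progress counter seen on the previous wake
  work: number; // work meter: only ever increases
}

type Verdict = "continue" | "attempts_exhausted" | "work_budget_exceeded";

function onWake(
  state: RecoveryState,
  progressNow: number,
  workDone: number,
  limits: { maxAttempts: number; maxRecoveryWork: number }
): Verdict {
  state.work += workDone;
  if (progressNow > state.lastProgress) {
    // A little streamed content counts as forward progress and resets the cap.
    state.attempts = 0;
    state.lastProgress = progressNow;
  } else {
    state.attempts += 1;
  }
  if (state.work >= limits.maxRecoveryWork) return "work_budget_exceeded";
  if (state.attempts >= limits.maxAttempts) return "attempts_exhausted";
  return "continue";
}

function simulate(maxRecoveryWork: number, wakes: number): Verdict {
  const state: RecoveryState = { attempts: 0, lastProgress: 0, work: 0 };
  let verdict: Verdict = "continue";
  for (let i = 1; i <= wakes && verdict === "continue"; i++) {
    // Assumption: every crashed run streams a little (progress +1) and
    // costs 10 work units before the isolate is reset.
    verdict = onWake(state, i, 10, { maxAttempts: 3, maxRecoveryWork });
  }
  return verdict;
}
```

In this model, `simulate(Infinity, 10000)` never leaves `"continue"`, while `simulate(1000, 10000)` stops with `"work_budget_exceeded"`, which is the motivation for making the work budget finite.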

This matches the customer's logs exactly: OOM during boot/hydration → "failed to read recovery incident during give-up" (the give-up read itself OOMing mid-turn) → "~4 min later" platform alarm retry → "error executing callback _chatRecoveryContinue after 3 attempts" → repeat. Their workaround was an override alarm() that caught "exceeded its memory limit" and called deleteAlarm(); this PR builds that behavior into the base class — but surgical, bounded, attributable, and observable.

The fix (layered)

  1. Finite maxRecoveryWork default (1000, was Infinity). The work meter is the one signal that keeps climbing across the loop, so a finite default seals a runaway with reason="work_budget_exceeded". A normal interrupted turn never approaches it.

  2. OOM-specific in-DO budget (chatRecovery.maxOomRetries, default 3). A memory reset re-OOMs on re-run (the turn's working set, not the platform, is the cause), so it's classified as a distinct deterministic failure rather than a deploy-style transient — it is not deferred and retried forever. Each crash bumps a durable per-incident oomAttempts counter; after a small number of tries it seals with reason="out_of_memory". Fast and attributable.

  3. Alarm-boundary circuit breaker (Agent.alarm()) — the universal backstop for OOMs that bypass the in-DO budgets entirely: thrown before the budget code runs (boot-time state hydration), or whose own small writes also OOM under memory pressure. Left unhandled, such an error propagates out of alarm() and the platform auto-retries forever. alarm() now intercepts only Durable Object memory-limit resets at the outermost frame — where the heavy turn has unwound and GC has reclaimed its footprint, so the seal/purge writes can land where mid-turn ones OOMed. A durable strike counter (static maxAlarmMemoryLimitStrikes, default 3) tolerates a few resets (a transient spike may clear), backing off the looping rows so the retry isn't a hot loop, then seals the recovery and surgically purges only the looping schedule rows, leaving unrelated scheduled tasks intact. Emits a new alarm:memory_limit_reset observability event. Everything except memory-limit resets re-throws exactly as before.
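
As a rough illustration of the third layer, the decision the breaker makes at the alarm boundary might look like the sketch below. The types, the function name `onAlarmError`, and the `dailyReport` callback are hypothetical; only `maxAlarmMemoryLimitStrikes`, `_chatRecoveryContinue`, and the "exceeded its memory limit" fragment come from this PR. The exact threshold semantics (seal when the strike count reaches the budget) are an assumption based on the test plan wording.

```typescript
// Hypothetical shapes for illustration; not the agents package's types.
interface ScheduleRow {
  id: string;
  callback: string;
}

type BreakerDecision =
  | { action: "rethrow" }
  | { action: "backoff"; strikes: number }
  | { action: "seal_and_purge"; strikes: number; remaining: ScheduleRow[] };

function onAlarmError(
  error: unknown,
  priorStrikes: number,
  rows: ScheduleRow[],
  loopingRowIds: Set<string>,
  maxAlarmMemoryLimitStrikes = 3
): BreakerDecision {
  const message = error instanceof Error ? error.message : String(error);
  // Anything that is not a memory-limit reset propagates unchanged.
  if (!message.includes("exceeded its memory limit")) {
    return { action: "rethrow" };
  }
  const strikes = priorStrikes + 1;
  if (strikes < maxAlarmMemoryLimitStrikes) {
    // Under budget: keep the rows and back them off so the retry is not hot.
    return { action: "backoff", strikes };
  }
  // At budget: drop only the looping rows; unrelated schedules survive.
  const remaining = rows.filter((row) => !loopingRowIds.has(row.id));
  return { action: "seal_and_purge", strikes, remaining };
}
```

The purge filters by row id rather than clearing the whole schedule, which is the difference between this breaker and a blanket `deleteAlarm()` workaround.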

Supporting changes

  • Broaden + export isDurableObjectMemoryLimitReset(error): it now matches the shared "exceeded its memory limit" fragment, so truncated/reworded surfacings (observed in real #1825 logs) still classify. Sibling to isDurableObjectCodeUpdateReset / isPlatformTransientError.
  • _executeScheduleCallback now defers (re-throws) memory-limit resets for one-shot rows instead of swallowing them after in-process retries, so the error reaches the alarm-boundary breaker. Tracks the executing row id so the breaker can purge the exact looping row.
  • think / ai-chat override _cf_recoveryAlarmCallbacks() + _cf_sealMemoryLimitedRecovery() to target their recovery continuation callbacks and terminalize active incidents (banner + onExhausted + seal).
  • Remove the redundant result-path OOM handling in continueLastTurn: those turns are already terminalized, so it only risked wasteful reschedules and duplicate terminal signals.
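
A fragment-based classifier in the spirit of the broadened predicate could be sketched as follows. This is not the exported `isDurableObjectMemoryLimitReset`; the function name `looksLikeMemoryLimitReset`, the case-insensitive comparison, and the sample messages in the usage note are assumptions made for illustration. Only the matched fragment is taken from this PR.

```typescript
// Fragment taken from the PR text; the real predicate may check more.
const MEMORY_LIMIT_FRAGMENT = "exceeded its memory limit";

// Hypothetical stand-in for the library predicate.
function looksLikeMemoryLimitReset(error: unknown): boolean {
  const message =
    error instanceof Error
      ? error.message
      : typeof error === "string"
        ? error
        : "";
  // Matching on the shared fragment tolerates truncated or reworded messages.
  return message.toLowerCase().includes(MEMORY_LIMIT_FRAGMENT);
}
```

Under these assumptions, both a full message such as "Durable Object's isolate exceeded its memory limit and was reset." and a truncated "isolate exceeded its memory limit" classify as memory-limit resets, while a code-update reset message does not.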

Configuration

| Option | Default | Scope |
| --- | --- | --- |
| chatRecovery.maxRecoveryWork | 1000 (was Infinity) | chat recovery work backstop |
| chatRecovery.maxOomRetries | 3 | in-DO OOM budget |
| maxAlarmMemoryLimitStrikes | 3 | base-agent alarm circuit breaker |
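
To make the defaults concrete, here is a minimal sketch of merging caller overrides onto the three budgets. The flat `RecoveryBudgets` shape and `resolveBudgets` helper are hypothetical: in the PR, two of the options live under `chatRecovery` and the third is a static on the agent class, so this does not reflect how they are actually wired.

```typescript
// Hypothetical flattened view of the three budgets described in this PR.
interface RecoveryBudgets {
  maxRecoveryWork: number;
  maxOomRetries: number;
  maxAlarmMemoryLimitStrikes: number;
}

// Default values as stated in the configuration table.
const DEFAULT_BUDGETS: RecoveryBudgets = {
  maxRecoveryWork: 1000,
  maxOomRetries: 3,
  maxAlarmMemoryLimitStrikes: 3
};

// Merge overrides onto the defaults and reject non-positive budgets.
function resolveBudgets(
  overrides: Partial<RecoveryBudgets> = {}
): RecoveryBudgets {
  const merged: RecoveryBudgets = { ...DEFAULT_BUDGETS, ...overrides };
  for (const [name, value] of Object.entries(merged)) {
    if (!(value > 0)) {
      throw new RangeError(`${name} must be positive, got ${value}`);
    }
  }
  return merged;
}
```

For example, `resolveBudgets({ maxOomRetries: 5 })` keeps the other two defaults and raises only the in-DO OOM budget.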

Test plan

  • pnpm run check (sherif + exports + oxfmt + oxlint + typecheck, 113 projects) — green
  • agents / think / ai-chat test suites — green
  • New unit coverage: broadened predicate, listActiveChatRecoveryIncidents
  • New integration coverage: alarm memory-limit circuit breaker (under budget → backoff/row preserved; at budget → seal/purge; truncated message match; non-memory errors pass through unchanged)
  • Reviewer: confirm default budgets (1000 / 3 / 3) feel right for shipped behavior

Notes

  • Two changesets (work-budget default flip + OOM budget/breaker), both patch on agents / @cloudflare/think / @cloudflare/ai-chat.
  • Does not close #1285 (zero-signal hard OOM kills during non-alarm requests): a true hard kill runs no in-isolate code, so nothing can emit. This PR does add a new signal (alarm:memory_limit_reset + downstream recovery exhaustion) for the catchable alarm-loop class that was previously also silent. A natural follow-up for #1285 is a boot-time "interrupted run" breadcrumb detector.
  • RFC follow-up section + user-facing docs updated.

Made with Cursor


…ps (#1825)

A chat-recovery turn whose Durable Object isolate exceeds its 128 MB
memory limit could loop forever, re-running the (billable) turn on every
platform alarm retry. The isolate streams a little content before the
reset, which bumps the durable progress counter; on the next wake
recovery reads that as forward progress and resets both progress-keyed
bounds (maxAttempts, noProgressTimeoutMs), and because each crash lands
inside the alarm-debounce window the attempt counter is pinned too. With
maxRecoveryWork defaulting to Infinity, no instrument could ever seal the
turn, so the model ran forever.

This lands a layered fix:

1. Finite maxRecoveryWork default (1000, was Infinity). The work meter is
   the one signal that keeps climbing across the loop, so a finite default
   seals a runaway with reason="work_budget_exceeded".

2. OOM-specific in-DO budget (chatRecovery.maxOomRetries, default 3). A
   memory reset re-OOMs on re-run (the turn's working set, not the
   platform, is the cause), so it is classified as a distinct deterministic
   failure rather than a deploy-style transient: it is NOT deferred and
   retried forever. Each crash bumps a durable per-incident oomAttempts
   counter; after a small number of tries it seals with
   reason="out_of_memory". Fast and attributable.

3. Alarm-boundary circuit breaker (Agent.alarm()) as the universal
   backstop for OOMs that bypass the in-DO budgets entirely - thrown
   before the budget code runs (boot-time state hydration), or whose own
   small writes also OOM under memory pressure. Left unhandled such an
   error propagates out of alarm() and the platform auto-retries forever.
   alarm() now intercepts ONLY Durable Object memory-limit resets at the
   outermost frame, where the heavy turn has unwound and GC has reclaimed
   its footprint, so the seal/purge writes can land where mid-turn ones
   OOMed. A durable strike counter (static maxAlarmMemoryLimitStrikes,
   default 3) tolerates a few resets - backing off the looping rows so the
   retry is not a hot loop - then seals the recovery (out_of_memory) and
   surgically purges ONLY the looping schedule rows, leaving unrelated
   scheduled tasks intact. Emits a new alarm:memory_limit_reset event.
   Everything except memory-limit resets re-throws exactly as before.

Supporting changes:

- Broaden + export isDurableObjectMemoryLimitReset(error): matches the
  shared "exceeded its memory limit" fragment so truncated/reworded
  surfacings observed in real #1825 logs still classify. Sibling to
  isDurableObjectCodeUpdateReset / isPlatformTransientError.
- _executeScheduleCallback now DEFERS (re-throws) memory-limit resets for
  one-shot rows instead of swallowing them after in-process retries, so the
  error reaches the alarm-boundary breaker; track the executing row id so
  the breaker can purge the exact looping row.
- think/ai-chat override _cf_recoveryAlarmCallbacks() and
  _cf_sealMemoryLimitedRecovery() to target their recovery continuation
  callbacks and terminalize active incidents (banner + onExhausted + seal).
- Remove the redundant result-path OOM handling in continueLastTurn: those
  turns are already terminalized, so it only risked wasteful reschedules
  and duplicate terminal signals.

Adds unit + integration coverage (predicate, listActiveChatRecoveryIncidents,
alarm circuit breaker), an RFC follow-up section, docs, and changesets.

Co-authored-by: Cursor <cursoragent@cursor.com>

changeset-bot Bot commented Jun 28, 2026

🦋 Changeset detected

Latest commit: c69cfd5

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

| Name | Type |
| --- | --- |
| @cloudflare/ai-chat | Patch |
| @cloudflare/think | Patch |
| agents | Patch |

devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

pkg-pr-new Bot commented Jun 28, 2026

agents

npm i https://pkg.pr.new/agents@1826

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1826

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1826

create-think

npm i https://pkg.pr.new/create-think@1826

hono-agents

npm i https://pkg.pr.new/hono-agents@1826

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1826

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1826

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1826

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1826

commit: c69cfd5

The alarm-boundary memory-limit strike counter (maxAlarmMemoryLimitStrikes,
#1825) is documented as counting CONSECUTIVE alarm OOM resets, but it was
only ever deleted when the breaker sealed — never after a clean alarm — so
it actually tracked LIFETIME resets. A Durable Object hitting rare,
non-consecutive transient spikes (e.g. one a month) would eventually reach
the strike budget and wrongly seal healthy recovery work.

alarm() now best-effort clears cf_agents:oom_alarm_strikes after a clean
_cf_runAlarmBody() so strikes must be consecutive to seal. The clear reads
first and only writes when a strike is recorded, so the common no-strike
path costs no write. Adds a regression test (strike recorded -> clean alarm
resets to 0 -> next OOM starts at strike 1).

Co-authored-by: Cursor <cursoragent@cursor.com>
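
The consecutive-versus-lifetime distinction in the commit above can be sketched with an in-memory stand-in for durable storage. The storage key `cf_agents:oom_alarm_strikes` comes from the commit message; the `StrikeStore` class and both helper functions are hypothetical and only illustrate the read-before-write clear.

```typescript
// Key name from the commit message; everything else here is hypothetical.
const STRIKES_KEY = "cf_agents:oom_alarm_strikes";

// In-memory stand-in for durable storage that counts writes.
class StrikeStore {
  private data = new Map<string, number>();
  writes = 0;

  get(key: string): number {
    return this.data.get(key) ?? 0;
  }

  put(key: string, value: number): void {
    this.data.set(key, value);
    this.writes += 1;
  }

  delete(key: string): void {
    this.data.delete(key);
    this.writes += 1;
  }
}

// Called when an alarm ends in a memory-limit reset.
function recordOomStrike(store: StrikeStore): number {
  const next = store.get(STRIKES_KEY) + 1;
  store.put(STRIKES_KEY, next);
  return next;
}

// Called after a clean alarm body: read first, write only if needed,
// so the common no-strike path costs no write.
function clearAfterCleanAlarm(store: StrikeStore): void {
  if (store.get(STRIKES_KEY) > 0) {
    store.delete(STRIKES_KEY);
  }
}
```

With this shape, a strike followed by a clean alarm returns the counter to 0, so the next OOM starts again at strike 1 instead of accumulating over the lifetime of the object.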
threepointone merged commit 1bbd9bc into main on Jun 28, 2026 (7 checks passed)
threepointone deleted the fix/chat-recovery-oom-alarm-breaker branch on June 28, 2026 10:49
github-actions Bot mentioned this pull request on Jun 28, 2026

Development

Successfully merging this pull request may close these issues.

  • Neverending retries during recovery
  • Durable Object OOM kills produce zero observability signal