Skip to content

feat(codemode): add durable execution retries#1769

Draft
mattzcarey wants to merge 1 commit into
cloudflare:mainfrom
mattzcarey:feat/codemode-runtime-transient-retries
Draft

feat(codemode): add durable execution retries#1769
mattzcarey wants to merge 1 commit into
cloudflare:mainfrom
mattzcarey:feat/codemode-runtime-transient-retries

Conversation

@mattzcarey

@mattzcarey mattzcarey commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds durable execution retries to @cloudflare/codemode's runtime.tool() path.

A connector can now throw RetryableError to explicitly tell Code Mode that a failed tool boundary is safe to retry. The runtime retries inside the same durable execution, replays already-applied calls from the durable log, and re-executes only the failed boundary. This gives callers resilient Code Mode executions without making every application build its own replay/retry runtime.

This also adds attempt fencing and cooperative cancellation so stale work from timed-out or superseded sandbox passes cannot mutate the durable log.

Default retry policy

By default, runtime.tool() retries only failures that are explicitly marked retryable with RetryableError.

Default behavior:

  • 3 total attempts.
  • If the RetryableError includes retryAfterMs, the runtime honors it.
  • Otherwise the runtime uses bounded exponential backoff.
  • retry: false disables retries completely.
  • retry: { ... } allows callers to provide a custom retry policy.

The default policy intentionally does not retry arbitrary thrown errors, executor failures, or timeouts. Those may represent bugs, validation errors, non-idempotent writes, or ambiguous remote state.

How the runtime knows an error is retryable

The SDK does not infer retryability from tool names, HTTP methods, response codes, or exception messages.

A failure is retryable by default only when connector code explicitly throws RetryableError.

That keeps the semantic decision at the connector boundary, where the implementation actually understands the protocol. For example, a connector may know that a server explicitly rejected a request and asked the client to retry, but the generic Code Mode runtime cannot safely infer that from the outside.

Normal errors remain terminal unless the application opts into a custom retry policy.

Durable retry behavior

Retries happen inside the same Code Mode execution:

  • The execution ID stays stable.
  • Calls already recorded as applied are replayed from the durable log.
  • The failed boundary is re-executed.
  • Later calls are not run until the failed boundary succeeds.
  • If retries are exhausted, the execution becomes error and the durable log is preserved for audit/debugging.

For example, if this code runs:

await api.first();
await api.flaky();
await api.after();

and api.flaky() throws RetryableError once, the runtime behaves as follows:

  1. first runs and is logged.
  2. flaky fails with RetryableError.
  3. The next attempt replays first from the log.
  4. flaky runs again.
  5. after runs only after flaky succeeds.

What happens on execution timeout

Executor timeouts are surfaced as structured failures, but they are not retried by default.

Timeouts are ambiguous: the sandbox stopped waiting, but the runtime cannot prove whether an in-flight connector operation committed externally. Retrying a timed-out write automatically could duplicate side effects.

Applications that know a timeout is safe may opt into retrying it with a custom retry policy, but the SDK default remains conservative.

To make timeout/custom-retry behavior safe, this PR adds durable attempt fencing:

  • Every connector decision/result is tied to an execution attempt.
  • The runtime only records results for the currently active attempt.
  • Late results from a timed-out or superseded pass are atomically rejected.
  • A stale sandbox cannot corrupt the replay log or overwrite the final result.

Cooperative cancellation

Connector execute contexts now include an optional pass-scoped AbortSignal:

execute: async (args, ctx) => {
  return fetch(url, { signal: ctx?.signal });
}

The runtime aborts the signal when a pass completes, pauses, errors, times out, or moves to a retry. This lets connectors cancel in-flight work from old passes before the runtime continues.

Cancellation is cooperative only. It reduces overlap and wasted work, but it does not prove that a remote write was rolled back or never committed. Timeout retry policy remains conservative for that reason.

Storage changes

Adds a durable attempts store/table used to fence retries by execution attempt.

The migration is additive. Existing released Code Mode execution state is backfilled into the attempt table with INSERT OR IGNORE.

Public API

Adds:

  • RetryableError
  • retry configuration on Code Mode runtime creation
  • optional signal on connector execute context

No retry configuration is required for normal RetryableError usage.

Tests

Added coverage for:

  • default retry policy
  • custom retry policy
  • disabled retries
  • declined retries
  • throwing retry policies
  • invalid attempt limits
  • retryAfterMs handling
  • bounded backoff behavior
  • executor success/error/retryable boundaries
  • durable same-execution retry
  • replay of prior applied calls
  • re-execution of failed boundary only
  • exhausted retry log preservation
  • callback failure log preservation
  • terminal late-result fencing after completion/error
  • timeout stale-result fencing
  • caught RetryableError not allowing codemode.step() to persist
  • attempt-store lifecycle and backfill
  • pass-scoped abort signals on completion/error/pause/timeout/retry
  • rollback/revert signal behavior

Local validation:

pnpm --filter @cloudflare/codemode test
pnpm --filter @cloudflare/codemode build
pnpm run check

@changeset-bot

changeset-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: f82430e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/codemode Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@pkg-pr-new

pkg-pr-new Bot commented Jun 16, 2026

Copy link
Copy Markdown

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1769

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1769

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1769

create-think

npm i https://pkg.pr.new/create-think@1769

hono-agents

npm i https://pkg.pr.new/hono-agents@1769

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1769

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1769

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1769

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1769

commit: f82430e

@mattzcarey mattzcarey force-pushed the feat/codemode-runtime-transient-retries branch 2 times, most recently from 6bb77f6 to 4d8e3d9 Compare June 22, 2026 09:14
@mattzcarey

Copy link
Copy Markdown
Contributor Author

Keeping this PR in draft until #1791 lands.

#1791 fixes the root runtime.tool() issue by returning a framework-independent, AI SDK-compatible tool object. Once it merges, I’ll rebase this PR onto main, remove the overlapping runtime.execute() API, preserve #1791’s Standard Schema/build guard changes, and keep this PR focused on durable retries, structured transient failures, attempt fencing, and retry lifecycle behavior.

@mattzcarey mattzcarey force-pushed the feat/codemode-runtime-transient-retries branch from 4d8e3d9 to 4dab36b Compare June 22, 2026 10:27
@mattzcarey

Copy link
Copy Markdown
Contributor Author

#1791 is now merged. I rebased this PR onto its merge commit and resolved the overlap in favor of #1791:

  • retained the framework-free CodemodeTool and Standard Schema input validator
  • retained the root-entry build guard and setup memoization
  • removed the overlapping runtime.execute() API and related docs/tests
  • retries now run exclusively through runtime.tool().execute()

Post-rebase verification: 297 unit tests, 43 durable-runtime tests, 33 browser tests, full 24-project build, and all 113 projects typecheck. The cloudflare-mcp consumer was also updated to runtime.tool().execute() and no longer needs an explicit ai dependency.

@mattzcarey mattzcarey force-pushed the feat/codemode-runtime-transient-retries branch from 4dab36b to 8c04b8a Compare June 24, 2026 10:46
@suzunn

suzunn commented Jun 24, 2026

Copy link
Copy Markdown

I noticed one retry-fencing edge that would be worth locking down in the tests. When a superseded pass finishes late, recordResult() now returns false and the connector binding maps that to the pause control path. The timeout retry test proves the newer result wins, but it does not assert that the stale pass cannot leave a paused/pass-end artifact or create a pending-looking state after the retry has already completed.

I would add an assertion around the late slow_once() case that the final execution has only the expected applied log entry, no pending actions, and no extra paused lifecycle status after the old sandbox returns. That would make the attempt fence contract explicit: stale work is inert, not converted into a recoverable pause.

@mattzcarey mattzcarey marked this pull request as ready for review June 24, 2026 11:05
devin-ai-integration[bot]

This comment was marked as resolved.

@mattzcarey mattzcarey marked this pull request as draft June 24, 2026 12:26
@mattzcarey

Copy link
Copy Markdown
Contributor Author

Taking this back to draft for a strict code-quality pass. I found two correctness gaps (late results after terminalization and optional attempt fencing), plus avoidable orchestration/SQL growth in already-large runtime files and the open durable-log finding. I’m addressing these in order with explicit regression coverage before requesting review again.

@mattzcarey mattzcarey force-pushed the feat/codemode-runtime-transient-retries branch from 8c04b8a to 5b51060 Compare June 24, 2026 12:57
@mattzcarey

Copy link
Copy Markdown
Contributor Author

Addressed the stale-pass review note in 5b510603. The attempt token is now mandatory on decide()/recordResult(), and result persistence is atomically conditional on both the matching attempt and execution status running. Added regressions for late results after both completed and terminal error; both return false and leave the log unchanged. The timeout test also still proves the winning retry result, and the final execution has no pending actions or extra paused lifecycle event.

The strict quality pass also extracted execution/retry orchestration from proxy-tool.ts (1,199 → 550 lines), isolated attempt SQL behind ExecutionAttemptStore, tested released-schema backfill and every attempt-store lifecycle branch, and tightened executor outcomes into a mutually exclusive success/error union.

@mattzcarey mattzcarey force-pushed the feat/codemode-runtime-transient-retries branch from 5b51060 to 4eb6c43 Compare June 24, 2026 13:01
@mattzcarey

Copy link
Copy Markdown
Contributor Author

Strict code-quality pass complete in 4eb6c43f:

  1. terminal late results are atomically rejected for both completed and errored executions;
  2. attempt tokens are mandatory on decide() and recordResult();
  3. execution/retry orchestration moved to runtime-execution.ts, reducing proxy-tool.ts from 1,199 to 550 lines;
  4. additive attempt SQL has one owner (ExecutionAttemptStore) with released-0.4.1 backfill and full lifecycle tests—no migration for unreleased retry schema;
  5. exhausted retry and callback-failure logs persist in both immediate output and durable audit history;
  6. executor outcomes are a mutually exclusive success/error union with direct boundary tests.

Coverage now includes default/custom/disabled/declined/throwing policies, invalid attempt limits, Retry-After and bounded backoff, caught retry signals, timeout fencing, stale terminal results, attempt-store lifecycle/backfill, and executor success/unclassified/structured failures. Verification: 310 unit, 51 real runtime, 33 browser, package build, full repo check (113 projects), cloudflare-mcp 256 tests, and staging dry-run.

@mattzcarey mattzcarey marked this pull request as ready for review June 24, 2026 13:11

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

Open in Devin Review

Comment on lines +420 to +433
__stepDecide: async (name: unknown) =>
runtime.decide(
executionId,
cursor.next(),
STEP_CONNECTOR,
String(name),
undefined,
false,
false,
attempt
),

__stepRecord: async (seq: unknown, value: unknown) =>
runtime.recordResult(executionId, Number(seq), value, attempt)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Steps executed after a caught RetryableError persist in the durable log and replay on retry

The control.failure guard in buildConnectorBindings (runtime-execution.ts:270) blocks subsequent connector calls after a RetryableError, but codemode.step() calls go through the platform provider's __stepDecide/__stepRecord handlers (runtime-execution.ts:420-433), which do NOT check control.failure. If model code catches the retryable error and then calls codemode.step(), the step executes and its result is recorded durably. On retry, the step is replayed from the log, potentially returning a value from the failed pass's context.

In practice, this is unlikely to cause issues: (1) model code catching connector errors and continuing with steps is not the normal pattern; (2) divergence detection (runtime.ts:473-491) catches cases where the retry takes a different code path at the same seq; (3) the replay semantic of codemode.step (record once, replay thereafter) is intentionally sticky. But for completeness, the __stepDecide handler could check control.failure and return a pause decision, consistent with the connector guard's behavior.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a4217dd. createPlatformProvider now receives the same per-pass control object as connector bindings, and __stepDecide returns a pause decision immediately once a retry signal exists, before executing or recording the step closure. Added a real runtime regression where generated code catches RetryableError and calls codemode.step; the retry succeeds, the closure never runs, and the final durable log contains only the connector entry (no __step entry). Verification remains green: 310 unit, 52 runtime, 33 browser, full build/check.

@mattzcarey mattzcarey force-pushed the feat/codemode-runtime-transient-retries branch from 4eb6c43 to a4217dd Compare June 24, 2026 14:10
@mattzcarey

Copy link
Copy Markdown
Contributor Author

Production smoke completed against local a4217dd4 build on Cloudflare's network using a throwaway Worker + real Durable Object facet + loader.load(). Assertions passed:

  • checkpoint replay: connector call counts before=1, flaky=2, after=1, all three final log entries applied;
  • caught retry + codemode.step: retry completed, step closure/log entry did not leak, final log contains only smoke.flaky;
  • timeout retry: first slow call finished late, second returned {call:2}, final durable log retained only {call:2} applied.

The throwaway Worker codemode-retry-smoke-1769 was deleted immediately after assertions. Deployment also confirmed facet classes must be exported through ctx.exports, not configured as Durable Object namespace bindings; the cloudflare-mcp config was corrected accordingly, with its facet class registered only in test Miniflare.

@mattzcarey mattzcarey marked this pull request as draft June 24, 2026 14:27
@mattzcarey

Copy link
Copy Markdown
Contributor Author

Taking this briefly back to draft to add cooperative cancellation. Attempt fencing prevents stale results from being recorded, but it does not stop an in-flight connector operation. Each pass will own an AbortController; connector ToolExecuteContext will receive its signal, and the pass will abort outstanding work before retry/terminalization. Consumers still must not assume abort rolls back a remotely committed write.

@mattzcarey mattzcarey force-pushed the feat/codemode-runtime-transient-retries branch from a4217dd to f82430e Compare June 24, 2026 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants