fix(webapp): recover from ClickHouse JSON parse failures in runs replication#3708
Conversation
…ication On a `Cannot parse JSON object` rejection, sanitize lone UTF-16 surrogates across the batch via the existing `sanitizeRows` helper and retry once. Drop the batch loudly if the sanitizer found nothing or the retry also fails, so the surrounding `#insertWithRetry` layer doesn't spin on a deterministic failure. Non-parse errors propagate unchanged. Mirrors the pattern shipped for ClickhouseEventRepository in #3659 — same root cause (lone surrogates in user-provided JSON), same recovery shape, same shared helpers. Fixes the customer-facing symptom from TRI-9755: one row's bad output JSON used to kill the COMPLETED updates for its 50+ batch-mates, stranding them in EXECUTING in ClickHouse forever and inflating "Running" counts on the Tasks page. Co-Authored-By: Claude Opus 4.7 <[email protected]>
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📜 Recent review details⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
🧰 Additional context used📓 Path-based instructions (7)**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
{packages/core,apps/webapp}/**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.ts📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
Files:
apps/webapp/**/*.{ts,tsx}📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Files:
apps/webapp/**/*.server.ts📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
Files:
**/*.{js,jsx,ts,tsx,json,md,yml,yaml}📄 CodeRabbit inference engine (AGENTS.md)
Files:
🧠 Learnings (9)📚 Learning: 2026-03-22T13:26:12.060ZApplied to files:
📚 Learning: 2026-03-22T19:24:14.403ZApplied to files:
📚 Learning: 2026-05-18T08:21:27.694ZApplied to files:
📚 Learning: 2026-05-18T08:21:27.694ZApplied to files:
📚 Learning: 2026-03-26T09:02:07.973ZApplied to files:
📚 Learning: 2026-04-20T14:50:16.440ZApplied to files:
📚 Learning: 2026-04-20T15:08:49.959ZApplied to files:
📚 Learning: 2026-05-05T09:38:02.512ZApplied to files:
📚 Learning: 2026-05-12T21:04:05.815ZApplied to files:
🔇 Additional comments (5)
WalkthroughThis PR adds UTF-16 JSON parse error recovery to Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
apps/webapp/app/services/runsReplicationService.server.ts (1)
958-961: 💤 Low valueConsider extracting error message extraction to a small helper.
The same pattern for safely extracting the message from an unknown error appears twice. A tiny helper would reduce duplication and make the intent clearer.
♻️ Optional: Extract helper
+function getErrorMessage(err: unknown): string { + if (typeof err === "object" && err !== null && "message" in err) { + return String((err as { message?: unknown }).message ?? ""); + } + return String(err); +} + async `#insertWithJsonParseRecovery`<T extends object>( ... ): Promise<...> { try { return { kind: "inserted", insertResult: await doInsert() }; } catch (firstError) { if (!isClickHouseJsonParseError(firstError)) throw firstError; - const firstMessage = - typeof firstError === "object" && firstError !== null && "message" in firstError - ? String((firstError as { message?: unknown }).message ?? "") - : String(firstError); + const firstMessage = getErrorMessage(firstError); ... try { return { kind: "sanitized", insertResult: await doInsert() }; } catch (retryError) { if (!isClickHouseJsonParseError(retryError)) throw retryError; this._permanentlyDroppedBatches += 1; - const retryMessage = - typeof retryError === "object" && retryError !== null && "message" in retryError - ? String((retryError as { message?: unknown }).message ?? "") - : String(retryError); + const retryMessage = getErrorMessage(retryError);Also applies to: 999-1002
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/webapp/app/services/runsReplicationService.server.ts` around lines 958 - 961, Extract the repeated safe-error-to-string logic into a small helper (e.g., function getErrorMessage(err: unknown): string) and replace the inline extraction used for firstMessage (which inspects firstError) and the similar block at the other occurrence (lines ~999-1002) with calls to this helper; the helper should handle null/undefined, check for object with a message property, and return a string fallback so callers like firstMessage = getErrorMessage(firstError) remain concise and type-safe.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@apps/webapp/app/services/runsReplicationService.server.ts`:
- Around line 958-961: Extract the repeated safe-error-to-string logic into a
small helper (e.g., function getErrorMessage(err: unknown): string) and replace
the inline extraction used for firstMessage (which inspects firstError) and the
similar block at the other occurrence (lines ~999-1002) with calls to this
helper; the helper should handle null/undefined, check for object with a message
property, and return a string fallback so callers like firstMessage =
getErrorMessage(firstError) remain concise and type-safe.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: ee3836b1-9dd1-48b3-b10d-31d3b24f2389
📒 Files selected for processing (2)
.server-changes/runs-replication-utf16-recovery.mdapps/webapp/app/services/runsReplicationService.server.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: typecheck / typecheck
- GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
Files:
apps/webapp/app/services/runsReplicationService.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
apps/webapp/app/services/runsReplicationService.server.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamicimport()when circular dependencies cannot be resolved otherwise, code splitting is needed for performance, or the module must be loaded conditionally at runtime.
Import from@trigger.dev/coreusing subpaths only - never import from the root.
When writing Trigger.dev tasks, always import from@trigger.dev/sdk. Never use@trigger.dev/sdk/v3or deprecatedclient.defineJob.
Add agentcrumbs markers (//@Crumbsor `#region `@crumbs) as you write code, not just when debugging. They stay on the branch throughout development and are stripped byagentcrumbs stripbefore merge.
Files:
apps/webapp/app/services/runsReplicationService.server.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: Access environment variables through theenvexport ofenv.server.tsinstead of directly accessingprocess.env
Use subpath exports from@trigger.dev/corepackage instead of importing from the root@trigger.dev/corepathUse named constants for sentinel/placeholder values (e.g.
const UNSET_VALUE = '__unset__') instead of raw string literals scattered across comparisons
Files:
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/**/*.server.ts
📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
apps/webapp/**/*.server.ts: Never userequest.signalfor detecting client disconnects. UsegetRequestAbortSignal()fromapp/services/httpAsyncStorage.server.tsinstead, which is wired directly to Expressres.on('close')and fires reliably
Access environment variables viaenvexport fromapp/env.server.ts. Never useprocess.envdirectly
Always usefindFirstinstead offindUniquein Prisma queries.findUniquehas an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance).findFirstis never batched and avoids this entire class of issues
Files:
apps/webapp/app/services/runsReplicationService.server.ts
**/*.{js,jsx,ts,tsx,json,md,yml,yaml}
📄 CodeRabbit inference engine (AGENTS.md)
Code formatting must be enforced using Prettier before committing
Files:
apps/webapp/app/services/runsReplicationService.server.ts
🧠 Learnings (10)
📚 Learning: 2026-05-14T14:54:39.095Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3545
File: .server-changes/agent-view-sessions.md:10-10
Timestamp: 2026-05-14T14:54:39.095Z
Learning: In the `trigger.dev` repository, do not flag inconsistent dot vs slash notation in route/path strings inside `.server-changes/*.md` files. These markdown files are consumed verbatim into the changelog, so the mixed notation (e.g., `resources.orgs.../runs.$runParam/...`) is intentional and should be preserved as-is.
Applied to files:
.server-changes/runs-replication-utf16-recovery.md
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-04-20T14:50:16.440Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:224-231
Timestamp: 2026-04-20T14:50:16.440Z
Learning: In Trigger.dev’s replication services (e.g., sessionsReplicationService.server.ts and runsReplicationService.server.ts), the “acknowledge-before-flush” behavior is intentional. The `_latestCommitEndLsn` should be updated at Postgres commit time and acknowledged on a periodic interval (via methods like `#acknowledgeLatestTransaction`) without waiting for ClickHouse batch flush to complete. Reviewers should not flag this as a durability/ordering bug; it is an established project-wide at-least-once delivery trade-off used across both runs and sessions replication services.
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-04-20T15:08:49.959Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3417
File: apps/webapp/app/services/sessionsReplicationService.server.ts:204-215
Timestamp: 2026-04-20T15:08:49.959Z
Learning: For replication services in `apps/webapp/app/services/*ReplicationService.server.ts`, keep the `ConcurrentFlushScheduler` deduplication key shape consistent across the related services (e.g., sessions vs runs) by using the same `${item.event}_${item.session.id}` / `${item.event}_${item.run.id}` pattern. If the key format ever needs to change (such as keying only by session/run id), make the update in all related replication services together—never in just one—so deduplication behavior stays aligned across services.
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-05-05T09:38:02.512Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2026-05-12T21:04:05.815Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3542
File: apps/webapp/app/components/sessions/v1/SessionStatus.tsx:1-3
Timestamp: 2026-05-12T21:04:05.815Z
Learning: In this Remix + TypeScript codebase, do not flag a server/client boundary violation when a file imports only types from a module matching `*.server`.
Specifically, it’s safe to import types using `import type { Foo } from "*.server"` or `import { type Foo } from "*.server"` because TypeScript erases type-only imports at compile time and they emit no JavaScript, so they won’t cross the Remix server/client bundle boundary.
Only raise the boundary concern for value imports (e.g., `import { Foo }` without `type`, or `import Foo`), since those produce JavaScript output.
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
🔇 Additional comments (8)
.server-changes/runs-replication-utf16-recovery.md (2)
6-20: LGTM!
1-21: ⚡ Quick winPrettier formatting compliant for
.server-changes/runs-replication-utf16-recovery.md
npx prettier --check .server-changes/runs-replication-utf16-recovery.mdreports “All matched files use Prettier code style!”, and the changelog content is clear and comprehensive.apps/webapp/app/services/runsReplicationService.server.ts (6)
41-45: LGTM!
137-144: LGTM!
300-303: LGTM!
859-882: LGTM!
894-918: LGTM!
920-955: LGTM!Also applies to: 966-1018
…rics
Devin caught this: when #insertWithJsonParseRecovery drops a batch
(sanitizer no-op, or sanitize-retry still hit a parse error),
#insertTaskRunInserts was previously converting `{kind: "dropped"}` to
`undefined`, so #insertWithRetry saw `[null, undefined]` (no error) and
#flushBatch ticked `_taskRunsInsertedCounter` / `_payloadsInsertedCounter`
for rows that never landed in ClickHouse.
Fix: return the recovery wrapper's outcome straight through. #flushBatch
now reads the outcome and only increments the success counter when both
`!err` AND `outcome?.kind !== "dropped"`. Matches the pattern in
ClickhouseEventRepository where the caller explicitly bails on
`outcome.kind === "dropped"` before downstream success work.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
Summary
On a ClickHouse
Cannot parse JSON objectrejection,RunsReplicationServicenow sanitizes lone UTF-16 surrogates across the failing batch via the existingsanitizeRowshelper and retries once. If the sanitizer found nothing or the retry also fails, the batch is dropped loudly with a counter increment, so the surrounding#insertWithRetrylayer doesn't spin three more times on a deterministic failure. Non-parse errors propagate unchanged.Mirrors the pattern from #3659 (for
ClickhouseEventRepository) — same root cause (lone UTF-16 surrogates in user-provided JSON), same recovery shape, reusing the same shared helpers (sanitizeRows,isClickHouseJsonParseError,parseRowNumberFromError).Fixes the customer-facing symptom from TRI-9755: a single row's poisoned
outputJSON used to take down theCOMPLETED_SUCCESSFULLYUPDATE events for its 50+ batch-mates, stranding them inEXECUTINGin ClickHouse forever and inflating "Running" counts on the Tasks page. Confirmed in production this is ongoing — ~120k stale rows accumulated in a single 5-hour burst on 2026-05-18; smaller continuous leak before and after.What changed
apps/webapp/app/services/runsReplicationService.server.ts:~/v3/eventRepository/sanitizeRowsOnParseError.server(no duplication; no move).#insertWithJsonParseRecovery<T>(rows, doInsert, contextLabel, attempt)— generic overTaskRunInsertArray[]andPayloadInsertArray[], structurally identical toClickhouseEventRepository.#insertWithJsonParseRecovery. Try → on parse error sanitize the whole batch (theat row Nhint is logged but not used to slice — semantics underinput_format_parallel_parsingaren't stable) → retry once → drop with loud log if sanitizer found nothing OR retry still fails.#insertTaskRunInsertsand#insertPayloadInsertsextract adoInsertclosure and hand it to the wrapper. Existing error logging, span recording, andrecordSpanErrorare preserved inside the closure.private _permanentlyDroppedBatches = 0counter with a public getter, for ops dashboards and tests (matches the events-repo convention). One shared counter for both insert sites — granularity comes from thecontextLabel(task_runs_v2/raw_task_runs_payload_v1) on every log line..server-changes/runs-replication-utf16-recovery.md— release notes entry.Why no new tests
The shared helpers already have full unit + real-ClickHouse contract coverage from #3659 (
apps/webapp/test/sanitizeRowsOnParseError.test.ts,apps/webapp/test/otlpUtf16Sanitization.integration.test.ts). The new wrapper is a line-for-line structural port. Adding a parallel integration test would require synthesizing bad data that escapes the preemptivedetectBadJsonStringscheck in#prepareJsonbut still trips ClickHouse — non-trivial without hand-crafted fixtures and wouldn't cover any new logic.What this does NOT do
EXECUTINGrows in production. That needs a reconciliation/backfill sweep (separate ticket — TRI-9755 fix [Docs] Add initial files and configuration #3).errorcolumn path (runsReplicationService.server.ts:932 const errorData = { data: run.error };). Reactive recovery will catch it if it ever poisons a batch, but feeding it through#prepareJsonlikeoutputis a cheap follow-up.Test plan
pnpm run typecheck --filter webapp— cleanpermanentlyDroppedBatchescounter stays at zero (or near-zero) in/stp/trigger-app-prod/ecs/replication/service-container/process-logs, and watch forSanitizing batch after ClickHouse JSON parse errorwarns to confirm recovery is firing on real traffic🤖 Generated with Claude Code