perf(run-engine,webapp): look up PENDING_VERSION runs via ClickHouse #3707
When a background worker registers, the engine resolves runs that were queued before the worker was ready. That lookup used to scan a Postgres status index. Move it to ClickHouse: query candidate run ids from `task_runs_v2`, then refetch the actual rows from Postgres by primary key with a `status = 'PENDING_VERSION'` guard for idempotency. The lookup is a pluggable interface on the run engine (`PendingVersionRunIdLookup`); the webapp wires a ClickHouse-backed implementation through the org-scoped `clickhouseFactory` using a new "engine" client type, configured by `RUN_ENGINE_CLICKHOUSE_*` env vars. When the lookup returns no candidates, one bounded retry is scheduled ~5s later to cover ClickHouse replication lag. The Postgres status guard prevents double-promotion when retries race with concurrent deploys.
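The two-step flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `CandidateLookup`, `refetchPending`, `resolvePendingVersionRuns`, the in-memory `Map` standing in for Postgres, and the `"PENDING"` promotion target are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real engine types may differ.
type RunStatus = "PENDING_VERSION" | "PENDING" | "EXECUTING";
type RunRow = { id: string; status: RunStatus };

// Stand-in for the pluggable lookup (ClickHouse-backed in the webapp).
type CandidateLookup = {
  name: string;
  lookupPendingVersionRunIds(opts: { limit: number }): Promise<string[]>;
};

// Stand-in for the Postgres refetch: by primary key, guarded by status.
async function refetchPending(db: Map<string, RunRow>, ids: string[]): Promise<RunRow[]> {
  return ids
    .map((id) => db.get(id))
    .filter((row): row is RunRow => row !== undefined && row.status === "PENDING_VERSION");
}

type ResolveResult = { promoted: string[]; retryScheduled: boolean };

async function resolvePendingVersionRuns(
  lookup: CandidateLookup,
  db: Map<string, RunRow>,
  attempt: number,
  maxCount = 100
): Promise<ResolveResult> {
  // Step 1: candidate ids from the lookup (may be stale or lagging).
  const candidateIds = await lookup.lookupPendingVersionRunIds({ limit: maxCount + 1 });

  if (candidateIds.length === 0) {
    // Only the first attempt schedules a retry, so the retry is bounded to one.
    return { promoted: [], retryScheduled: attempt === 0 };
  }

  // Step 2: re-validate against the source of truth before promoting.
  const rows = await refetchPending(db, candidateIds.slice(0, maxCount));
  for (const row of rows) {
    row.status = "PENDING"; // assumed promotion target, for illustration only
  }
  return { promoted: rows.map((row) => row.id), retryScheduled: false };
}
```

Because promotion only happens for rows that still carry `PENDING_VERSION`, running the same resolution twice (for example when a lag retry races a concurrent deploy) promotes nothing the second time.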
No actionable comments were generated in the recent review. 🎉
Walkthrough
This PR implements a pluggable ClickHouse-backed lookup for the Run Engine's pending-version discovery, reducing Postgres index reads by offloading candidate discovery to ClickHouse while re-validating by primary key in Postgres. It introduces a `PendingVersionRunIdLookup` contract (with noop and ClickHouse implementations), a dedicated run-engine ClickHouse client, a lightweight `run_id` query builder, and updates `PendingVersionSystem` and `RunEngine` wiring to perform two-step lookup, idempotent status promotion, and bounded lag-retries. Environment variables and test support are included.
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts (1)
176-178: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Continuation check should use lookup page saturation, not filtered row count.
`pendingRuns.length` is after Postgres re-validation, so stale candidates can hide that the lookup already returned a full `maxCount + 1` page. That can stop pagination early and leave remaining `PENDING_VERSION` runs for later.
🐛 Suggested fix
```diff
-    //enqueue more if needed
-    if (pendingRuns.length > maxCount) {
-      await this.scheduleResolvePendingVersionRuns(backgroundWorkerId);
+    // Enqueue more if lookup page was saturated (maxCount + 1 sentinel)
+    if (candidateIds.length > maxCount) {
+      await this.scheduleResolvePendingVersionRuns(backgroundWorkerId, { attempt });
     }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts` around lines 176 - 178, The pagination early-stop check should use the original lookup saturation count rather than the post-DB-revalidation length; change the if condition in pendingVersionSystem.ts to trigger scheduleResolvePendingVersionRuns(backgroundWorkerId) when the lookup returned more than maxCount (e.g., use the variable that holds the raw lookup result count or a boolean like lookupPageSaturated) instead of checking pendingRuns.length, so that revalidated/stale filtering does not prevent scheduling additional pages.
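To make the difference between the two checks concrete, here is a small sketch under stated assumptions: `decideNextPage` and its parameters are hypothetical names, and the page is assumed to be fetched with a `maxCount + 1` sentinel and trimmed to `maxCount` before re-validation.

```typescript
type PageDecision = { processed: string[]; scheduleNextPage: boolean };

// candidateIds: raw ids from the lookup, fetched with limit = maxCount + 1.
// stillPendingIds: ids that survived the Postgres status re-validation.
function decideNextPage(
  candidateIds: string[],
  stillPendingIds: Set<string>,
  maxCount: number
): PageDecision {
  const page = candidateIds.slice(0, maxCount);
  const processed = page.filter((id) => stillPendingIds.has(id));
  // Saturation is judged on the raw lookup page, not on the filtered rows.
  return { processed, scheduleNextPage: candidateIds.length > maxCount };
}

// Four candidates came back for maxCount = 3, but two of the first three are stale.
const decision = decideNextPage(["a", "b", "c", "d"], new Set(["a", "d"]), 3);
// A check on the filtered rows would not ask for another page here:
const filteredRowCheck = decision.processed.length > 3;
```

In this scenario the filtered-row check evaluates to `false` even though run `d` is still waiting, while the saturation check requests the next page.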
🧹 Nitpick comments (1)
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts (1)
28-35: ⚡ Quick win
Use a type alias instead of an interface for the lookup contract.
This contract should follow the repo rule to prefer `type` over `interface` in TypeScript.
♻️ Suggested change
```diff
-export interface PendingVersionRunIdLookup {
+export type PendingVersionRunIdLookup = {
   /** Stable identifier for logs and metrics, e.g. "clickhouse", "test-noop". */
   readonly name: string;
   lookupPendingVersionRunIds(
     options: PendingVersionRunIdLookupOptions
   ): Promise<PendingVersionRunIdLookupResult>;
-}
+};
```
As per coding guidelines, `**/*.{ts,tsx}`: Use types over interfaces for TypeScript.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts` around lines 28 - 35, Replace the PendingVersionRunIdLookup interface with a type alias named PendingVersionRunIdLookup that models the same shape (readonly name: string and the lookupPendingVersionRunIds method returning Promise<PendingVersionRunIdLookupResult>) to follow the repo rule preferring type over interface; keep the existing PendingVersionRunIdLookupOptions and PendingVersionRunIdLookupResult references and the exact method signature so callers remain unchanged.
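As a rough check that callers are unaffected by the switch, the sketch below declares the contract as a type alias and implements it both as an object literal and as a class. The option and result shapes (`backgroundWorkerId`, `limit`, `runIds`) are hypothetical placeholders, since the review does not show them.

```typescript
// Hypothetical option/result shapes, for illustration only.
type PendingVersionRunIdLookupOptions = { backgroundWorkerId: string; limit: number };
type PendingVersionRunIdLookupResult = { runIds: string[] };

type PendingVersionRunIdLookup = {
  /** Stable identifier for logs and metrics, e.g. "clickhouse", "test-noop". */
  readonly name: string;
  lookupPendingVersionRunIds(
    options: PendingVersionRunIdLookupOptions
  ): Promise<PendingVersionRunIdLookupResult>;
};

// Object-literal implementation: always reports no candidates.
const noopLookup: PendingVersionRunIdLookup = {
  name: "test-noop",
  async lookupPendingVersionRunIds() {
    return { runIds: [] };
  },
};

// A class can still `implements` an object type alias, just as with an interface.
class FixedLookup implements PendingVersionRunIdLookup {
  readonly name = "test-fixed";
  constructor(private readonly ids: string[]) {}

  async lookupPendingVersionRunIds(
    options: PendingVersionRunIdLookupOptions
  ): Promise<PendingVersionRunIdLookupResult> {
    return { runIds: this.ids.slice(0, options.limit) };
  }
}
```

Both forms type-check against the alias, so the change is limited to the declaration itself.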
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 81b91b7c-1246-4f22-832b-90bad66f0a8d
📒 Files selected for processing (17)
.server-changes/pending-version-clickhouse-lookup.md
apps/webapp/app/env.server.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
internal-packages/clickhouse/src/index.ts
internal-packages/clickhouse/src/taskRuns.ts
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/run-engine/src/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
- GitHub Check: internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (11)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
Files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/index.ts
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/clickhouse/src/taskRuns.ts
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/clickhouse/src/index.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
apps/webapp/app/env.server.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamic `import()` when circular dependencies cannot be resolved otherwise, code splitting is needed for performance, or the module must be loaded conditionally at runtime.
Import from `@trigger.dev/core` using subpaths only - never import from the root.
When writing Trigger.dev tasks, always import from `@trigger.dev/sdk`. Never use `@trigger.dev/sdk/v3` or deprecated `client.defineJob`.
Add agentcrumbs markers (`// @crumbs` or `#region @crumbs`) as you write code, not just when debugging. They stay on the branch throughout development and are stripped by `agentcrumbs strip` before merge.
Files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/index.ts
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/clickhouse/src/taskRuns.ts
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/clickhouse/src/index.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/index.ts
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/clickhouse/src/taskRuns.ts
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/clickhouse/src/index.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: Access environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env`
Use subpath exports from `@trigger.dev/core` package instead of importing from the root `@trigger.dev/core` path
Use named constants for sentinel/placeholder values (e.g. `const UNSET_VALUE = '__unset__'`) instead of raw string literals scattered across comparisons
Files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
apps/webapp/app/env.server.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
apps/webapp/**/*.server.ts
📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)
apps/webapp/**/*.server.ts: Never use `request.signal` for detecting client disconnects. Use `getRequestAbortSignal()` from `app/services/httpAsyncStorage.server.ts` instead, which is wired directly to Express `res.on('close')` and fires reliably
Access environment variables via `env` export from `app/env.server.ts`. Never use `process.env` directly
Always use `findFirst` instead of `findUnique` in Prisma queries. `findUnique` has an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance). `findFirst` is never batched and avoids this entire class of issues
Files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
apps/webapp/app/env.server.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
**/*.{js,jsx,ts,tsx,json,md,yml,yaml}
📄 CodeRabbit inference engine (AGENTS.md)
Code formatting must be enforced using Prettier before committing
Files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/index.ts
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/clickhouse/src/taskRuns.ts
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/clickhouse/src/index.ts
internal-packages/run-engine/src/engine/systems/**/*.ts
📄 CodeRabbit inference engine (internal-packages/run-engine/CLAUDE.md)
Integrate OpenTelemetry tracer and meter instrumentation in RunEngine systems for observability
Files:
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
**/*.{test,spec}.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use vitest for all tests in the Trigger.dev repository
Files:
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
internal-packages/run-engine/src/engine/tests/**/*.test.ts
📄 CodeRabbit inference engine (internal-packages/run-engine/CLAUDE.md)
Implement tests for RunEngine in `src/engine/tests/` using testcontainers for Redis and PostgreSQL containerization
Files:
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
**/*.test.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Unit tests should use vitest framework
Tests should avoid mocks or stubs and use helpers from `@internal/testcontainers` when Redis or Postgres are needed
**/*.test.{ts,tsx,js,jsx}: Never mock anything in tests - use testcontainers instead.
Test files should be placed next to source files (e.g., `MyService.ts` -> `MyService.test.ts`).
Files:
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
🧠 Learnings (13)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).
Applied to files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/index.ts
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/clickhouse/src/taskRuns.ts
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/clickhouse/src/index.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.
Applied to files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/index.ts
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/clickhouse/src/taskRuns.ts
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/clickhouse/src/index.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.
Applied to files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/index.ts
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/clickhouse/src/taskRuns.ts
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/clickhouse/src/index.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.
Applied to files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
internal-packages/run-engine/src/index.ts
internal-packages/run-engine/src/engine/systems/systems.ts
internal-packages/clickhouse/src/taskRuns.ts
internal-packages/run-engine/src/engine/types.ts
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts
apps/webapp/app/env.server.ts
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
internal-packages/run-engine/src/engine/workerCatalog.ts
internal-packages/clickhouse/src/index.ts
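The P1001 predicate from the learning above could look roughly like this. It is a sketch that duck-types on the two fields rather than importing Prisma's error classes; `isDatabaseUnreachable` is a hypothetical helper name, not something from the repository.

```typescript
// Returns true when an unknown error carries Prisma's P1001 code on either field.
function isDatabaseUnreachable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as { code?: unknown; errorCode?: unknown };
  // `code` covers the known-request-error shape, `errorCode` the initialization-error shape.
  return candidate.code === "P1001" || candidate.errorCode === "P1001";
}
```

Checking both fields keeps the predicate correct for mid-query connection drops and for client startup failures alike.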
📚 Learning: 2026-03-29T19:16:28.864Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3291
File: apps/webapp/app/v3/featureFlags.ts:53-65
Timestamp: 2026-03-29T19:16:28.864Z
Learning: When reviewing TypeScript code that uses Zod v3, treat `z.coerce.*()` schemas as their direct Zod type (e.g., `z.coerce.boolean()` returns a `ZodBoolean` with `_def.typeName === "ZodBoolean"`) rather than a `ZodEffects`. Only `.preprocess()`, `.refine()`/`.superRefine()`, and `.transform()` are expected to wrap schemas in `ZodEffects`. Therefore, in reviewers’ logic like `getFlagControlType`, do not flag/unblock failures that require unwrapping `ZodEffects` when the input schema is a `z.coerce.*` schema.
Applied to files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
📚 Learning: 2026-05-05T09:38:02.512Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.
Applied to files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
apps/webapp/app/env.server.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
📚 Learning: 2026-05-12T21:04:05.815Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3542
File: apps/webapp/app/components/sessions/v1/SessionStatus.tsx:1-3
Timestamp: 2026-05-12T21:04:05.815Z
Learning: In this Remix + TypeScript codebase, do not flag a server/client boundary violation when a file imports only types from a module matching `*.server`.
Specifically, it’s safe to import types using `import type { Foo } from "*.server"` or `import { type Foo } from "*.server"` because TypeScript erases type-only imports at compile time and they emit no JavaScript, so they won’t cross the Remix server/client bundle boundary.
Only raise the boundary concern for value imports (e.g., `import { Foo }` without `type`, or `import Foo`), since those produce JavaScript output.
Applied to files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
apps/webapp/app/env.server.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
📚 Learning: 2026-05-14T08:21:07.614Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3614
File: apps/webapp/app/v3/mollifier/mollifierGate.server.ts:48-52
Timestamp: 2026-05-14T08:21:07.614Z
Learning: When using Trigger.dev v3 feature flags in the webapp, prefer the existing per-org gating mechanism supported by `flag()` via the `overrides` argument. Pass `Organization.featureFlags` (from `environment.organization.featureFlags`) as the `overrides` value; overrides must take precedence over the global `featureFlag` row. Do not require schema changes or add an `orgId` field to `FlagsOptions` for per-org gating—use the overrides pattern consistently (e.g., in gate flows like `resolveOrgFlag` and any server code that threads `environment.organization.featureFlags` into the gate call).
Applied to files:
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts
apps/webapp/app/v3/runEngine.server.ts
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
📚 Learning: 2026-05-18T14:40:02.173Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3658
File: packages/core/src/v3/realtimeStreams/manager.test.ts:1-147
Timestamp: 2026-05-18T14:40:02.173Z
Learning: In the triggerdotdev/trigger.dev repo, the policy “Never mock anything — use testcontainers instead” should only be enforced for integration tests that interact with real external services (e.g., Redis, Postgres) via actual infrastructure. For unit tests that exercise pure in-memory logic (e.g., cache semantics) it is OK to stub collaborators such as `ApiClient` using Vitest (`vi.fn()`) to assert call counts or control behavior. Do not flag `vi.fn()`-based `ApiClient` stubs in unit tests as violations of the testcontainers policy.
Applied to files:
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts
📚 Learning: 2026-05-14T14:54:39.095Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3545
File: .server-changes/agent-view-sessions.md:10-10
Timestamp: 2026-05-14T14:54:39.095Z
Learning: In the `trigger.dev` repository, do not flag inconsistent dot vs slash notation in route/path strings inside `.server-changes/*.md` files. These markdown files are consumed verbatim into the changelog, so the mixed notation (e.g., `resources.orgs.../runs.$runParam/...`) is intentional and should be preserved as-is.
Applied to files:
.server-changes/pending-version-clickhouse-lookup.md
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.
Applied to files:
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts
📚 Learning: 2026-05-20T17:21:18.543Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3678
File: apps/webapp/app/entry.server.tsx:0-0
Timestamp: 2026-05-20T17:21:18.543Z
Learning: In env.server.ts (Zod env schema), any environment variable you plan to access via the typed `env` export (e.g., `env.SENTRY_DSN`) must be explicitly declared in the schema. For `SENTRY_DSN`, include `SENTRY_DSN: z.string().optional()`; otherwise switching from `process.env.SENTRY_DSN` to `env.SENTRY_DSN` will fail TypeScript typechecking.
Applied to files:
apps/webapp/app/env.server.ts
📚 Learning: 2026-03-10T17:56:20.938Z
Learnt from: samejr
Repo: triggerdotdev/trigger.dev PR: 3201
File: apps/webapp/app/v3/services/setSeatsAddOn.server.ts:25-29
Timestamp: 2026-03-10T17:56:20.938Z
Learning: Do not implement local userId-to-organizationId authorization checks inside org-scoped service classes (e.g., SetSeatsAddOnService, SetBranchesAddOnService) in the web app. Rely on route-layer authentication (requireUserId(request)) and org membership enforcement via the _app.orgs.$organizationSlug layout route. Any userId/organizationId that reaches these services from org-scoped routes has already been validated. Apply this pattern across all org-scoped services to avoid redundant auth checks and maintain consistency.
Applied to files:
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts
🔇 Additional comments (17)
apps/webapp/app/env.server.ts (1)
1474-1488: LGTM!
.server-changes/pending-version-clickhouse-lookup.md (1)
1-6: LGTM!
apps/webapp/app/services/clickhouse/clickhouseFactory.server.ts (1)
184-213: LGTM! Also applies to: 259-260, 319-332, 399-400
internal-packages/clickhouse/src/taskRuns.ts (1)
382-399: LGTM!
internal-packages/clickhouse/src/index.ts (1)
16-17: LGTM! Also applies to: 228-229
apps/webapp/app/v3/services/clickhousePendingVersionLookup.server.ts (1)
1-92: LGTM!
internal-packages/run-engine/src/engine/index.ts (1)
68-68: LGTM! Also applies to: 248-250, 300-302, 339-340
apps/webapp/app/v3/runEngine.server.ts (1)
9-9: LGTM! Also applies to: 135-135
apps/webapp/app/v3/runEnginePendingVersionLookup.server.ts (1)
1-24: LGTM!
internal-packages/run-engine/src/engine/tests/postgresPendingVersionLookup.ts (1)
1-43: LGTM!
internal-packages/run-engine/src/engine/tests/pendingVersion.test.ts (1)
7-7: LGTM! Also applies to: 48-49, 196-197, 361-362
internal-packages/run-engine/src/engine/workerCatalog.ts (1)
44-51: LGTM!
internal-packages/run-engine/src/engine/services/pendingVersionLookup.ts (1)
14-27: LGTM! Also applies to: 37-48
internal-packages/run-engine/src/engine/types.ts (1)
22-23: LGTM! Also applies to: 182-202
internal-packages/run-engine/src/engine/systems/systems.ts (1)
7-7: LGTM! Also applies to: 22-22
internal-packages/run-engine/src/index.ts (1)
9-14: LGTM!
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts (1)
8-175: LGTM! Also applies to: 181-223
When the idempotency guard fires (concurrent worker already promoted the
run, updateMany returns count=0), the transaction returned early — but
the eventBus.emit('runStatusChanged') call was outside the transaction
and fired unconditionally. Have the transaction return a boolean and
guard the emit on it.
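A minimal sketch of the fix described in this commit message, assuming simplified stand-ins for the Prisma transaction and the event bus. The `Tx`, `EventBus`, and `promoteRun` names are hypothetical and only illustrate the pattern of returning a boolean from the transaction and emitting conditionally.

```typescript
// Hypothetical stand-ins; the real engine uses Prisma and its own event bus.
type Tx = {
  updateMany: (where: { id: string; status: string }) => Promise<{ count: number }>;
};
type EventBus = { emit: (event: string, payload: { runId: string }) => void };

async function promoteRun(
  runId: string,
  transaction: <T>(fn: (tx: Tx) => Promise<T>) => Promise<T>,
  eventBus: EventBus
): Promise<boolean> {
  // The transaction reports whether this caller actually promoted the run.
  const promoted = await transaction(async (tx) => {
    const result = await tx.updateMany({ id: runId, status: "PENDING_VERSION" });
    if (result.count === 0) {
      // Idempotency guard fired: a concurrent worker already promoted the run.
      return false;
    }
    // ... remaining promotion work would happen here ...
    return true;
  });

  // Emit only when the status change really happened in this call.
  if (promoted) {
    eventBus.emit("runStatusChanged", { runId });
  }
  return promoted;
}
```

With this shape, a second concurrent caller sees `count === 0`, gets `false` back, and emits nothing.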
…rsionSystem
RunEngineOptions has carried `queueRunsWaitingForWorkerBatchSize` for a while, but it was never threaded through to PendingVersionSystem's `queueRunsPendingVersionBatchSize`, so the option was silently a no-op and the system always used the default of 200.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts (1)
76-84: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Keep the continuation sentinel from the lookup page, not the filtered Postgres rows.
`maxCount + 1` is acting as the "has more" probe, but Line 180 derives that from `pendingRuns.length`. If ClickHouse returns `maxCount + 1` IDs and one of them is already past `PENDING_VERSION`, the Postgres guard drops it, `pendingRuns.length` falls back to `maxCount`, and the follow-up job is never scheduled. That can leave later `PENDING_VERSION` runs stranded until the next worker registration.
💡 Suggested fix
```diff
 const { runIds: candidateIds } = await this.$.pendingVersionRunIdLookup
   .lookupPendingVersionRunIds({
     organizationId: backgroundWorker.runtimeEnvironment.organizationId,
     projectId: backgroundWorker.projectId,
     environmentId: backgroundWorker.runtimeEnvironmentId,
     taskIdentifiers,
     queues,
     limit: maxCount + 1,
   });
+const hasMoreCandidates = candidateIds.length > maxCount;

 if (!candidateIds.length) {
   await this.#maybeScheduleLagRetry(backgroundWorkerId, attempt, "lookup_empty");
   return;
 }

 const pendingRuns = await this.$.prisma.taskRun.findMany({
   where: {
     id: { in: candidateIds },
     status: "PENDING_VERSION",
   },
   orderBy: {
     createdAt: "asc",
   },
 });
+const runsToProcess = pendingRuns.slice(0, maxCount);

-for (const run of pendingRuns) {
+for (const run of runsToProcess) {
   // ...
 }

-if (pendingRuns.length > maxCount) {
+if (hasMoreCandidates) {
   await this.scheduleResolvePendingVersionRuns(backgroundWorkerId);
 }
```
Also applies to: 96-104, 127-127, 179-181
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts` around lines 76 - 84, The lookup uses maxCount + 1 as a continuation probe but the code later computes the "has more" sentinel from the filtered Postgres rows (pendingRuns), which can drop IDs and mistakenly clear continuation; update the logic in the pendingVersionSystem flow around pendingVersionRunIdLookup.lookupPendingVersionRunIds so that any "hasMore"/continuation decision is derived from the raw candidateIds length returned by lookup (the runIds field) and not from the filtered pendingRuns list produced after checking Postgres state (adjust code paths that compute hasMore/next-page using pendingRuns to use candidateIds instead: the lookup call site and subsequent continuation checks). Ensure you still cap processing to maxCount but preserve the continuation flag based on candidateIds.length > maxCount.
🧹 Nitpick comments (1)
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts (1)
74-89: ⚡ Quick win
Instrument the new lookup / lag-retry path.
This new cross-database branch only logs today, so rollout/debugging will be guesswork when runs stay in `PENDING_VERSION`. A span plus a low-cardinality metric for outcomes like `empty | matched | filtered_out` and whether a lag retry was scheduled would make this path observable without blowing up cardinality.
As per coding guidelines, `internal-packages/run-engine/src/engine/systems/**/*.ts`: Integrate OpenTelemetry tracer and meter instrumentation in RunEngine systems for observability, and `**/*.ts`: ensure OTEL metric attributes have low cardinality.
Also applies to: 113-125, 202-227
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts` around lines 74 - 89, Wrap the lookupPendingVersionRunIds call in an OpenTelemetry span and record a low-cardinality meter metric for the outcome (values: "empty" | "matched" | "filtered_out") plus a boolean attribute indicating whether a lag retry was scheduled; specifically, start a span before calling this.$.pendingVersionRunIdLookup.lookupPendingVersionRunIds and set span attributes for taskIdentifiers/queues size (not full contents), then after getting candidateIds: emit a metric via a tracer/meter with attributes outcome="empty" when candidateIds.length === 0 (or "matched" / "filtered_out" for other branches), and when calling this.#maybeScheduleLagRetry include an attribute lag_retry_scheduled=true (otherwise false); follow existing OTEL helper usage in other RunEngine systems, keep attributes low-cardinality (counts/booleans/enums only), and ensure the span is ended in all code paths.
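One way the low-cardinality attributes suggested above could be derived, as a sketch only. The function name and the exact rule for `filtered_out` (candidates were returned but none survived the Postgres guard) are assumptions, not something the review specifies.

```typescript
// Attribute values are limited to a small string union and a boolean,
// so the metric stays low-cardinality.
type LookupOutcome = "empty" | "matched" | "filtered_out";
type LookupMetricAttributes = { outcome: LookupOutcome; lag_retry_scheduled: boolean };

// Hypothetical helper: maps raw counts to bounded attributes.
// The counts themselves are deliberately NOT used as attribute values.
function lookupMetricAttributes(
  candidateCount: number,
  pendingCount: number,
  lagRetryScheduled: boolean
): LookupMetricAttributes {
  let outcome: LookupOutcome;
  if (candidateCount === 0) {
    outcome = "empty"; // the lookup returned nothing
  } else if (pendingCount === 0) {
    outcome = "filtered_out"; // assumed meaning: the status guard dropped every candidate
  } else {
    outcome = "matched";
  }
  return { outcome, lag_retry_scheduled: lagRetryScheduled };
}
```

The returned object could then be passed as the attributes argument when incrementing an OpenTelemetry counter, following whatever helper the other RunEngine systems already use.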
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 57c8f989-118f-4b2d-a9c1-4ec763d412fb
📒 Files selected for processing (2)
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
Files:
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamic `import()` when circular dependencies cannot be resolved otherwise, code splitting is needed for performance, or the module must be loaded conditionally at runtime.
Import from `@trigger.dev/core` using subpaths only - never import from the root.
When writing Trigger.dev tasks, always import from `@trigger.dev/sdk`. Never use `@trigger.dev/sdk/v3` or deprecated `client.defineJob`.
Add agentcrumbs markers (`// @crumbs` or `#region @crumbs`) as you write code, not just when debugging. They stay on the branch throughout development and are stripped by `agentcrumbs strip` before merge.
Files:
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
**/*.{js,jsx,ts,tsx,json,md,yml,yaml}
📄 CodeRabbit inference engine (AGENTS.md)
Code formatting must be enforced using Prettier before committing
Files:
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
internal-packages/run-engine/src/engine/systems/**/*.ts
📄 CodeRabbit inference engine (internal-packages/run-engine/CLAUDE.md)
Integrate OpenTelemetry tracer and meter instrumentation in RunEngine systems for observability
Files:
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
🧠 Learnings (4)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).
Applied to files:
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.
Applied to files:
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.
Applied to files:
internal-packages/run-engine/src/engine/index.ts
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts
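The predicate from this learning can be written without importing Prisma's error classes, as in the sketch below. The function name `isDatabaseUnreachable` is hypothetical; only the two property checks come from the learning itself.

```typescript
// Checks both shapes under which Prisma may surface P1001:
// `code` (known request error) and `errorCode` (initialization error).
function isDatabaseUnreachable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) {
    return false;
  }
  const candidate = err as { code?: unknown; errorCode?: unknown };
  return candidate.code === "P1001" || candidate.errorCode === "P1001";
}
```

Duck-typing on the two fields avoids depending on which error class was thrown.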
🔇 Additional comments (2)
internal-packages/run-engine/src/engine/systems/pendingVersionSystem.ts (1)
128-157: LGTM!
internal-packages/run-engine/src/engine/index.ts (1)
246-250: LGTM! Also applies to: 300-301, 336-341
…ndingRuns
After the ClickHouse migration, `pendingRuns.length` is post-status-guard; runs that have already left PENDING_VERSION between the CH lookup and the Postgres refetch get filtered out. Using it as the more-work signal under-reports when more candidates remain for the worker, so processing stops short. Switch to `candidateIds.length`, which is the raw lookup result.
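A small worked example of why the raw candidate count is the right "has more" signal. `planBatch` and its parameters are hypothetical names for illustration; the real system refetches rows from Postgres rather than filtering against a set.

```typescript
type BatchPlan = { runsToProcess: string[]; hasMore: boolean };

// Hypothetical helper: candidateIds comes from a lookup called with limit = maxCount + 1.
function planBatch(
  candidateIds: string[],
  stillPendingIds: Set<string>,
  maxCount: number
): BatchPlan {
  // The extra id is only a probe: its presence means another page exists.
  const hasMore = candidateIds.length > maxCount;
  // Stand-in for the Postgres status guard, which may drop candidates.
  const pending = candidateIds.filter((id) => stillPendingIds.has(id));
  // Processing is still capped at maxCount.
  return { runsToProcess: pending.slice(0, maxCount), hasMore };
}

// maxCount = 2, the lookup returned 3 ids, and "b" already left PENDING_VERSION.
const plan = planBatch(["a", "b", "c"], new Set(["a", "c"]), 2);
```

Here two rows survive the guard, so a check of the filtered length against `maxCount` (2 > 2) would be false, while the candidate-based check (3 > 2) correctly schedules a follow-up.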
Factory resolution failures (registry misload, missing data store, ClientType mismatch) are configuration problems, not transient blips, and ops loses observability if they only surface as warnings. Query-level errors stay at warn since those are expected to be transient.
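The split described in this commit message could look roughly like the following. The `LookupFailure` type and `logLevelFor` function are hypothetical; the actual code may distinguish the two cases differently, for example with separate try/catch blocks.

```typescript
// Hypothetical classification of where a lookup failure came from.
type LookupFailure = { kind: "factory_resolution" | "query"; message: string };
type LogLevel = "error" | "warn";

function logLevelFor(failure: LookupFailure): LogLevel {
  // Configuration problems need attention from ops, so they log at error.
  // Query-level failures are expected to be transient and stay at warn.
  return failure.kind === "factory_resolution" ? "error" : "warn";
}
```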
Summary
When a background worker registers, the engine resolves runs that were queued before the worker was ready (status `PENDING_VERSION`). That lookup used to scan a Postgres status index on `TaskRun`. Move it to ClickHouse: query candidate run ids from `task_runs_v2`, then refetch the actual rows from Postgres by primary key with a `status = 'PENDING_VERSION'` guard for idempotency.
Design
The lookup is a pluggable interface on the run engine (`PendingVersionRunIdLookup`). The webapp wires a ClickHouse-backed implementation through the org-scoped `clickhouseFactory` using a new `"engine"` client type, configured by `RUN_ENGINE_CLICKHOUSE_*` env vars. The URL falls back to `CLICKHOUSE_URL` when unset, so self-hosted deployments don't need new config to keep working.
When the lookup returns no candidates, one bounded retry is scheduled ~5s later to cover ClickHouse replication lag against `task_runs_v2`. The Postgres status guard on both the candidate refetch and the inner `updateMany` prevents double-promotion when a retry races with a concurrent deploy.
Tests cover three existing PENDING_VERSION cases via a small Postgres-backed test adapter; new ClickHouse-backed integration tests will follow.
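The overall flow can be sketched as below, under simplifying assumptions. `PendingVersionRunIdLookup`, `lookupPendingVersionRunIds`, and `runIds` appear in the PR; the trimmed parameter set, the `Deps` type, `findPendingByIds`, `scheduleRetry`, and the two constants are hypothetical stand-ins for the Prisma refetch and the worker queue.

```typescript
// Trimmed parameter set; the real call also passes organization, project, and queues.
type LookupParams = { environmentId: string; taskIdentifiers: string[]; limit: number };
type PendingVersionRunIdLookup = {
  lookupPendingVersionRunIds: (params: LookupParams) => Promise<{ runIds: string[] }>;
};
type Deps = {
  lookup: PendingVersionRunIdLookup;
  // Stand-in for the Postgres refetch by primary key with the status guard.
  findPendingByIds: (ids: string[]) => Promise<string[]>;
  // Stand-in for enqueueing a delayed job.
  scheduleRetry: (attempt: number, delayMs: number) => void;
};

const MAX_LAG_RETRIES = 1; // assumed constant for "one bounded retry"
const LAG_RETRY_DELAY_MS = 5_000; // assumed constant for "~5s"

async function resolvePendingVersionRuns(
  deps: Deps,
  params: LookupParams,
  attempt = 0
): Promise<string[]> {
  const { runIds } = await deps.lookup.lookupPendingVersionRunIds(params);
  if (runIds.length === 0) {
    // An empty result may just be replication lag, so retry once and then stop.
    if (attempt < MAX_LAG_RETRIES) {
      deps.scheduleRetry(attempt + 1, LAG_RETRY_DELAY_MS);
    }
    return [];
  }
  // Only rows still in PENDING_VERSION come back, which keeps retries idempotent.
  return deps.findPendingByIds(runIds);
}
```

Because the lookup is just a function-shaped dependency, a Postgres-backed adapter can be swapped in for tests without touching the resolution logic.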