Skip to content

feat(supervisor): compute workload manager#3114

Draft
nicktrn wants to merge 30 commits intomainfrom
feat/compute-workload-manager
Draft

feat(supervisor): compute workload manager#3114
nicktrn wants to merge 30 commits intomainfrom
feat/compute-workload-manager

Conversation

@nicktrn
Copy link
Collaborator

@nicktrn nicktrn commented Feb 23, 2026

Adds the ComputeWorkloadManager for routing task execution through the compute gateway, including full checkpoint/restore support.

Changes

Compute workload manager (apps/supervisor/src/workloadManager/compute.ts)

  • Routes VM create, snapshot, delete, and restore through the compute gateway API
  • Wide event logging on create with full timing and context
  • Configurable gateway timeout, auth token, image digest stripping
  • Restore sends name, env override metadata, CPU and memory so the agent can inject them before the VM resumes

Supervisor wiring (apps/supervisor/src/index.ts)

  • Compute mode activated when gateway URL is configured
  • Restore branch derives a unique runnerId per restore cycle, matching iceman's convention
  • Suspend/restore gated behind snapshots enabled flag

Workload server (apps/supervisor/src/workloadServer/index.ts)

  • Suspend handler triggers a compute snapshot (fire-and-forget) when in compute mode with snapshots enabled
  • Snapshot-complete callback endpoint receives the snapshot ID and calls submitSuspendCompletion

Env validation (apps/supervisor/src/env.ts)

  • Compute gateway URL, auth token, and timeout settings
  • Snapshots enabled flag (defaults off — compute mode can run without checkpoints)
  • Metadata URL required when snapshots enabled (validated at startup)

Add a third WorkloadManager implementation that creates sandboxes via
the compute gateway HTTP API (POST /api/sandboxes). Uses native fetch
with no new dependencies. Enabled by setting COMPUTE_GATEWAY_URL, which
takes priority over Kubernetes and Docker providers.
The fetch() call had no timeout, causing infinite hangs when the gateway
accepted requests but never returned responses. Adds AbortSignal.timeout
(30s) and consolidates all logging into a single structured event per
create() call with timing, status, and error context.
Emit a single canonical log line in a finally block instead of scattered
log calls at each early return. Adds business context (envId, envType,
orgId, projectId, deploymentVersion, machine) and instanceName to the
event. Always emits at info level with ok=true/false for queryability.
Pass business context (runId, envId, orgId, projectId, machine, etc.)
as metadata on CreateSandboxRequest instead of relying on env vars.
This enables wide event logging in the compute stack without parsing
env or leaking secrets.
Passes machine preset cpu and memory as top-level fields on the
CreateSandboxRequest so the compute stack can use them for admission
control and resource allocation.
Thread timing context from queue consumer through to the compute
workload manager's wide event:

- dequeueResponseMs: platform dequeue HTTP round-trip
- pollingIntervalMs: which polling interval was active (idle vs active)
- warmStartCheckMs: warm start check duration

All fields are optional to avoid breaking existing consumers.
- Fix instance creation URL from /api/sandboxes to /api/instances
- Pass name: runnerId when creating compute instances
- Add snapshot(), deleteInstance(), and restore() methods to ComputeWorkloadManager
- Add /api/v1/compute/snapshot-complete callback endpoint to WorkloadServer
- Handle suspend requests in compute mode via fire-and-forget snapshot with callback
- Handle restore in compute mode by calling gateway restore API directly
- Wire computeManager into WorkloadServer for compute mode suspend/restore
…re request

Restore calls now send a request body with the runner name, env override metadata,
cpu, and memory so the agent can inject them before the VM resumes. The runner
fetches these overrides from TRIGGER_METADATA_URL at restore time.

runnerId is derived per restore cycle as runner-{runIdShort}-{checkpointSuffix},
matching iceman's pattern.
Gates snapshot/restore behaviour independently of compute mode.
When disabled, VMs won't receive the metadata URL and suspend/restore
are no-ops. Defaults to off so compute mode can be used without snapshots.
@changeset-bot
Copy link

changeset-bot bot commented Feb 23, 2026

🦋 Changeset detected

Latest commit: 7349acb

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages
Name Type
trigger.dev Patch
d3-chat Patch
references-d3-openai-agents Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
references-telemetry Patch
@trigger.dev/build Patch
@trigger.dev/core Patch
@trigger.dev/python Patch
@trigger.dev/react-hooks Patch
@trigger.dev/redis-worker Patch
@trigger.dev/rsc Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-pricing Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
@internal/sdk-compat-tests Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a ComputeWorkloadManager and integrates compute mode across the supervisor and workload server (create/snapshot/restore/delete flows, compute snapshot callback endpoint, delayed snapshot scheduling, and trace propagation). Extends environment schema with multiple COMPUTE_ variables and conditional validation. Introduces TimerWheel utility and tests. Propagates dequeue/polling/warm-start timing fields through consumer/queue/session/events/types into workload creation and run events. Adds OTLP trace payload builder and sender with tests. Adjusts CLI image output options for local builds, lowers several log levels, adds documentation and a changeset file.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description lacks a Testing section and the required Closes # issue link, though it includes thorough technical details about the changes. The Checklist section is also missing. Add the missing Closes # issue reference, a Testing section describing test steps, and the complete checklist from the template before merging.
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(supervisor): compute workload manager' clearly and concisely describes the main feature addition—the ComputeWorkloadManager for the supervisor—aligning well with the changeset's primary purpose.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/compute-workload-manager

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (6)
packages/cli-v3/src/deploy/buildImage.ts (1)

1128-1153: Consider tightening the load check to load === true for contract clarity.

load is typed as boolean | undefined, so the truthy check isLocalBuild && load && !push silently falls through to type=image when load is undefined. In current call sites this is fine (callers always pass a resolved boolean), but the implicit contract is easy to violate if a future caller passes isLocalBuild: true without resolving load first, unexpectedly getting type=image output.

♻️ Proposed clarification
-  if (isLocalBuild && load && !push) {
+  if (isLocalBuild && load === true && !push) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cli-v3/src/deploy/buildImage.ts` around lines 1128 - 1153, The
conditional that chooses "type=docker" relies on a truthy check (isLocalBuild &&
load && !push) which allows load to be undefined; change the guard to explicitly
test load === true so the branch only triggers when load is explicitly true.
Locate the build function parameters (isLocalBuild, load, push, imageTag) and
update the if condition (and any related comment if helpful) so it reads
isLocalBuild && load === true && !push, keeping the imageTag handling and
returned outputOptions logic unchanged.
apps/supervisor/src/env.ts (1)

81-85: Consider validating COMPUTE_GATEWAY_AUTH_TOKEN when COMPUTE_GATEWAY_URL is set.

If the compute gateway always requires authentication, add a superRefine check similar to the snapshots/metadata validation. If unauthenticated gateways are a valid use case (e.g., mTLS or network-level auth), this is fine as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/env.ts` around lines 81 - 85, Add a zod superRefine to
the environment schema to enforce that when COMPUTE_GATEWAY_URL is defined,
COMPUTE_GATEWAY_AUTH_TOKEN must also be provided: update the env schema around
the COMPUTE_GATEWAY_URL and COMPUTE_GATEWAY_AUTH_TOKEN entries to call
superRefine (similar to the snapshots/metadata validation) and add an error on
the token field when URL is set but token is missing; keep behavior unchanged if
URL is undefined to support unauthenticated gateways.
packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (1)

18-18: Consider extracting the timing type to avoid inline duplication.

The { dequeueResponseMs: number; pollingIntervalMs: number } shape is repeated at Line 18 and Line 26 (and similarly in consumerPool.ts and session.ts). A shared named type would reduce repetition across the dequeue chain.

Example: shared type in types.ts
// In types.ts
export type DequeueTiming = {
  dequeueResponseMs: number;
  pollingIntervalMs: number;
};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts` at line 18,
Extract the inline timing shape into a shared exported type (e.g., export type
DequeueTiming = { dequeueResponseMs: number; pollingIntervalMs: number; }) in a
central types file, then replace the inline occurrences with that type in the
function signature for onDequeue (currently onDequeue: (messages:
WorkerApiDequeueResponseBody, timing?: { dequeueResponseMs: number;
pollingIntervalMs: number }) => Promise<void>), and update the other files
referencing the same shape (consumerPool.ts and session.ts) to import and use
DequeueTiming to avoid duplication across the dequeue chain.
apps/supervisor/src/index.ts (1)

229-231: Runner ID derivation looks reasonable but is fragile on format assumptions.

message.run.friendlyId.replace("run_", "") uses a plain string replace (only first occurrence), and checkpoint.id.slice(-8) assumes IDs are always ≥8 characters. Both assumptions are likely safe given the platform's ID format, but consider adding a brief comment documenting the expected format, or using a more defensive approach.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/index.ts` around lines 229 - 231, The runner ID
construction is brittle: replace("run_", "") and slice(-8) assume formats;
update code around runIdShort, checkpointSuffix, and runnerId to be defensive by
(1) deriving runIdShort using a check like
message.run.friendlyId.startsWith("run_") ? message.run.friendlyId.substring(4)
: message.run.friendlyId (or use a regex to extract the suffix) and (2)
computing checkpointSuffix by using checkpoint.id.length >= 8 ?
checkpoint.id.slice(-8) : checkpoint.id (or another safe fallback) so you never
slice past bounds; also add a short comment above the logic documenting the
expected friendlyId and checkpoint.id formats.
apps/supervisor/src/workloadManager/compute.ts (1)

72-78: create() duplicates the authHeaders getter logic — use this.authHeaders instead.

The inline header construction at Lines 72–78 is identical to the private get authHeaders getter defined at Lines 168–176. All other methods (snapshot, deleteInstance, restore) correctly use this.authHeaders.

♻️ Proposed fix
-    const headers: Record<string, string> = {
-      "Content-Type": "application/json",
-    };
-
-    if (this.opts.gatewayAuthToken) {
-      headers["Authorization"] = `Bearer ${this.opts.gatewayAuthToken}`;
-    }
-
     // Strip image digest — resolve by tag, not digest
     const imageRef = opts.image.split("@")[0]!;
 
     const url = `${this.opts.gatewayUrl}/api/instances`;
 
     ...
 
       const [fetchError, response] = await tryCatch(
         fetch(url, {
           method: "POST",
-          headers,
+          headers: this.authHeaders,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadManager/compute.ts` around lines 72 - 78, The
create() method duplicates header construction that already exists in the
private getter authHeaders; replace the inline headers block in create() with a
single reference to this.authHeaders so it reuses the centralized logic (similar
to snapshot, deleteInstance, restore). Locate create() in compute.ts, remove the
manual headers Record<string,string> and the conditional Authorization
assignment, and use this.authHeaders wherever headers are needed to avoid
duplication.
apps/supervisor/src/workloadServer/index.ts (1)

454-499: 200 response is sent only after all network calls complete — gateway retries on slow ops could cause duplicate submissions.

submitSuspendCompletion and deleteInstance are both awaited before reply.empty(200) at Line 499. If either call is slow, the gateway's outbound callback will stall waiting for the acknowledgement; if it has its own timeout/retry policy it will re-deliver the callback, potentially triggering duplicate submitSuspendCompletion calls and a second deleteInstance attempt on the same (possibly already-deleted) instance.

Consider responding 200 immediately after validating the body and processing the callback asynchronously, relying on idempotency guarantees in submitSuspendCompletion, or ensuring submitSuspendCompletion is idempotent for the same (runId, snapshotId) pair.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 454 - 499, The
handler currently awaits submitSuspendCompletion and deleteInstance before
calling reply.empty(200), which can block the gateway and cause duplicate
callbacks; change flow to validate the incoming body, call reply.empty(200)
immediately, then process the suspend completion asynchronously (e.g. spawn a
background task/Promise) that calls
this.workerClient.submitSuspendCompletion(runId, snapshotFriendlyId, ...) and
this.computeManager.deleteInstance(body.instance_id) without blocking the
response; ensure the background task wraps calls in try/catch, logs errors,
makes submitSuspendCompletion idempotent for the same (runId, snapshotId) pair,
and makes deleteInstance tolerant of already-deleted instances (or check
existence before deleting) so retries are safe.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/supervisor/src/index.ts`:
- Around line 226-251: The handler currently logs failures from
computeManager.restore (didRestore === false or caught exceptions in the
try/catch around computeManager.restore) and returns without remediation; update
it to notify the platform before returning by calling the worker session HTTP
API (use this.workerSession.httpClient) to either mark the run failed or request
rescheduling, sending run id (message.run.id/friendlyId), runnerId,
checkpoint.id/location and the error details (or a failure reason when
didRestore is false); invoke this notification in both the didRestore === false
branch and the catch block and swallow/handle any HTTP errors locally so the
handler never silently abandons the run.

In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 433-501: The endpoint registered at
"/api/v1/compute/snapshot-complete" (handler for ComputeSnapshotCallbackBody)
lacks authentication allowing arbitrary suspend submissions and instance
deletions; fix by validating an immutable callback secret/signature before
processing: when generating callback URLs embed a per-run or global callback
token, then in this handler verify a signed header or Bearer token (e.g.,
X-Callback-Signature or Authorization) and reject requests with reply.empty(401)
if verification fails; additionally, once authenticated, cross-check the
provided body.instance_id against the expected instance for the runId (look up
the canonical instance for runId before calling
this.computeManager?.deleteInstance) and only call computeManager.deleteInstance
after both signature and instance match succeed; keep logging but avoid acting
on unauthenticated or mismatched requests and return appropriate HTTP codes.
- Around line 270-272: The callback URL construction in workloadServer/index.ts
uses a fallback to "localhost" which breaks external gateways; update validation
and usage so TRIGGER_WORKLOAD_API_DOMAIN is required when
COMPUTE_SNAPSHOTS_ENABLED is true (similar to TRIGGER_METADATA_URL), and remove
the silent default to "localhost" in the callbackUrl assembly. Specifically,
add/adjust env.ts validation to mark TRIGGER_WORKLOAD_API_DOMAIN as mandatory
when COMPUTE_SNAPSHOTS_ENABLED is enabled (or throw early), and change the code
that builds callbackUrl (the const callbackUrl) to fail loudly if
TRIGGER_WORKLOAD_API_DOMAIN is missing rather than using "localhost".
- Around line 266-288: Return 202 is sent before awaiting
computeManager.snapshot, and if snapshotResult is false there is no recovery
path; after the existing snapshotResult check in the block that calls
computeManager.snapshot({ runnerId, callbackUrl, metadata: { runId:
params.runFriendlyId, snapshotFriendlyId: params.snapshotFriendlyId }}), call
workerClient.submitSuspendCompletion (or equivalent API) to notify the platform
of the suspend failure including runnerId,
params.runFriendlyId/snapshotFriendlyId and error context, log the failure via
this.logger.error, and ensure the submitSuspendCompletion call handles/awaits
errors so the run can be re-queued or failed cleanly when snapshotResult is
false.

In `@packages/core/src/v3/runEngineWorker/supervisor/session.ts`:
- Around line 83-93: This change touched the public package `@trigger.dev/core`
(method onDequeue in supervisor/session.ts), so add a changeset entry for
`@trigger.dev/core`: run pnpm run changeset:add, select the `@trigger.dev/core`
package, describe the change and bump the appropriate version, then commit the
generated changeset file alongside your PR so the release tooling will include
this package.

---

Nitpick comments:
In `@apps/supervisor/src/env.ts`:
- Around line 81-85: Add a zod superRefine to the environment schema to enforce
that when COMPUTE_GATEWAY_URL is defined, COMPUTE_GATEWAY_AUTH_TOKEN must also
be provided: update the env schema around the COMPUTE_GATEWAY_URL and
COMPUTE_GATEWAY_AUTH_TOKEN entries to call superRefine (similar to the
snapshots/metadata validation) and add an error on the token field when URL is
set but token is missing; keep behavior unchanged if URL is undefined to support
unauthenticated gateways.

In `@apps/supervisor/src/index.ts`:
- Around line 229-231: The runner ID construction is brittle: replace("run_",
"") and slice(-8) assume formats; update code around runIdShort,
checkpointSuffix, and runnerId to be defensive by (1) deriving runIdShort using
a check like message.run.friendlyId.startsWith("run_") ?
message.run.friendlyId.substring(4) : message.run.friendlyId (or use a regex to
extract the suffix) and (2) computing checkpointSuffix by using
checkpoint.id.length >= 8 ? checkpoint.id.slice(-8) : checkpoint.id (or another
safe fallback) so you never slice past bounds; also add a short comment above
the logic documenting the expected friendlyId and checkpoint.id formats.

In `@apps/supervisor/src/workloadManager/compute.ts`:
- Around line 72-78: The create() method duplicates header construction that
already exists in the private getter authHeaders; replace the inline headers
block in create() with a single reference to this.authHeaders so it reuses the
centralized logic (similar to snapshot, deleteInstance, restore). Locate
create() in compute.ts, remove the manual headers Record<string,string> and the
conditional Authorization assignment, and use this.authHeaders wherever headers
are needed to avoid duplication.

In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 454-499: The handler currently awaits submitSuspendCompletion and
deleteInstance before calling reply.empty(200), which can block the gateway and
cause duplicate callbacks; change flow to validate the incoming body, call
reply.empty(200) immediately, then process the suspend completion asynchronously
(e.g. spawn a background task/Promise) that calls
this.workerClient.submitSuspendCompletion(runId, snapshotFriendlyId, ...) and
this.computeManager.deleteInstance(body.instance_id) without blocking the
response; ensure the background task wraps calls in try/catch, logs errors,
makes submitSuspendCompletion idempotent for the same (runId, snapshotId) pair,
and makes deleteInstance tolerant of already-deleted instances (or check
existence before deleting) so retries are safe.

In `@packages/cli-v3/src/deploy/buildImage.ts`:
- Around line 1128-1153: The conditional that chooses "type=docker" relies on a
truthy check (isLocalBuild && load && !push) which allows load to be undefined;
change the guard to explicitly test load === true so the branch only triggers
when load is explicitly true. Locate the build function parameters
(isLocalBuild, load, push, imageTag) and update the if condition (and any
related comment if helpful) so it reads isLocalBuild && load === true && !push,
keeping the imageTag handling and returned outputOptions logic unchanged.

In `@packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts`:
- Line 18: Extract the inline timing shape into a shared exported type (e.g.,
export type DequeueTiming = { dequeueResponseMs: number; pollingIntervalMs:
number; }) in a central types file, then replace the inline occurrences with
that type in the function signature for onDequeue (currently onDequeue:
(messages: WorkerApiDequeueResponseBody, timing?: { dequeueResponseMs: number;
pollingIntervalMs: number }) => Promise<void>), and update the other files
referencing the same shape (consumerPool.ts and session.ts) to import and use
DequeueTiming to avoid duplication across the dequeue chain.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a70546 and 5089bba.

📒 Files selected for processing (11)
  • .changeset/fix-local-build-load.md
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
{packages,integrations}/**/*

📄 CodeRabbit inference engine (CLAUDE.md)

Add a changeset when modifying any public package in packages/* or integrations/* using pnpm run changeset:add

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
🧠 Learnings (9)
📚 Learning: 2026-01-15T11:50:06.067Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-15T11:50:06.067Z
Learning: Applies to {packages,integrations}/**/* : Add a changeset when modifying any public package in `packages/*` or `integrations/*` using `pnpm run changeset:add`

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Use build extensions in trigger.config.ts (additionalFiles, additionalPackages, aptGet, prismaExtension, etc.) to customize the build

Applied to files:

  • .changeset/fix-local-build-load.md
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure build process in trigger.config.ts using `build` object with external packages, extensions, and JSX settings

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Run `npx trigger.devlatest dev` to start the Trigger.dev development server

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-14T19:24:39.536Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 2685
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.settings/route.tsx:1234-1257
Timestamp: 2025-11-14T19:24:39.536Z
Learning: In the trigger.dev project, version validation for the `useNativeBuildServer` setting cannot be performed at the settings form level because the SDK version is only known at build/deployment time, not when saving project settings.

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure OpenTelemetry instrumentations and exporters in trigger.config.ts for enhanced logging

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-06-04T16:02:22.957Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2150
File: apps/supervisor/src/workloadManager/docker.ts:115-115
Timestamp: 2025-06-04T16:02:22.957Z
Learning: In the Trigger.dev codebase, the supervisor component uses DOCKER_ENFORCE_MACHINE_PRESETS while the docker provider component uses ENFORCE_MACHINE_PRESETS. These are separate components with separate environment variable configurations for the same logical concept of enforcing machine presets.

Applied to files:

  • apps/supervisor/src/env.ts
🧬 Code graph analysis (4)
packages/core/src/v3/runEngineWorker/supervisor/session.ts (1)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
  • WorkerApiDequeueResponseBody (73-73)
  • WorkerApiDequeueResponseBody (74-74)
apps/supervisor/src/index.ts (3)
apps/supervisor/src/workloadManager/compute.ts (1)
  • ComputeWorkloadManager (17-311)
packages/core/src/v3/serverOnly/checkpointClient.ts (1)
  • CheckpointClient (16-122)
apps/supervisor/src/env.ts (1)
  • env (164-164)
packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (2)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
  • WorkerApiDequeueResponseBody (73-73)
  • WorkerApiDequeueResponseBody (74-74)
packages/core/src/v3/runEngineWorker/supervisor/types.ts (1)
  • PreDequeueFn (12-15)
apps/supervisor/src/workloadManager/compute.ts (2)
apps/supervisor/src/workloadManager/types.ts (3)
  • WorkloadManagerOptions (3-14)
  • WorkloadManager (16-18)
  • WorkloadManagerCreateOptions (20-42)
apps/supervisor/src/util.ts (1)
  • getRunnerId (17-25)
🔇 Additional comments (13)
packages/cli-v3/src/deploy/buildImage.ts (2)

201-209: LGTM.

isLocalBuild: false correctly gates the new type=docker path out of remote builds. Since push: true is hardcoded and depot uses --save, the load option is intentionally a no-op here.


535-543: LGTM.

load and push are already resolved to boolean via shouldLoad/shouldPush, so the truthy check in getOutputOptions is reliable here.

.changeset/fix-local-build-load.md (1)

1-5: LGTM!

Changeset correctly documents a patch-level fix for the --load flag issue.

apps/supervisor/src/env.ts (1)

154-162: Good conditional validation.

The superRefine correctly enforces TRIGGER_METADATA_URL as a dependency when COMPUTE_SNAPSHOTS_ENABLED is true. The error is scoped to the right path for clear diagnostics.

packages/core/src/v3/runEngineWorker/supervisor/events.ts (1)

9-10: LGTM!

Optional timing fields are cleanly added to the event type, maintaining backward compatibility.

apps/supervisor/src/workloadManager/types.ts (1)

27-30: LGTM! Clean addition of timing context fields for observability.

Nit: The file uses interface throughout, but coding guidelines prefer type. Since this is pre-existing, not worth changing in this PR.

packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts (1)

354-359: LGTM!

Timing parameter is correctly threaded through the consumer pool's interceptor to the original handler.

apps/supervisor/src/index.ts (2)

282-284: Good use of performance.now() for high-resolution timing.

Clean pattern for measuring warm-start duration.


83-98: Compute mode initialization is clean.

The triple assignment pattern (create → store as computeManager → assign as workloadManager) is clear. The isComputeMode flag derived from the URL's presence is a good single source of truth.

packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (2)

116-127: LGTM! Timing instrumentation is correctly scoped.

dequeueResponseMs measures only the HTTP call, and pollingIntervalMs captures the interval that scheduled this dequeue cycle. The lastScheduledIntervalMs tracking across scheduleNextDequeue → next dequeue() call is sound.


33-33: lastScheduledIntervalMs tracking is correct.

Initialized to idleIntervalMs (first poll starts from idle state), then updated in scheduleNextDequeue before the callback fires. The value read in the next dequeue() cycle accurately reflects the delay used.

Also applies to: 42-42, 148-148

apps/supervisor/src/workloadManager/compute.ts (2)

120-120: Memory field name and unit differ between create and restore — verify gateway API contracts.

  • create (Line 120): memory_gb: opts.machine.memory — sends machine memory as-is, using field name memory_gb
  • restore (Line 273): memory_mb: opts.machine.memory * 1024 — multiplies by 1024, using field name memory_mb

If opts.machine.memory is in GiB (e.g., 2 = 2 GiB), the create endpoint receives 2 while the restore endpoint receives 2048. Confirm this asymmetry is intentional per the gateway's API schemas for /api/instances vs /api/snapshots/{id}/restore.

Also applies to: 273-273


9-9: No action required. The import is correct.

import { tryCatch } from "@trigger.dev/core" is a valid import. The root index.ts explicitly re-exports tryCatch from utils.ts, making it an intentional root-level export. This does not violate the subpath-only guideline.

…nabled

Remove the silent `localhost` fallback for the snapshot callback URL,
which would be unreachable from external compute gateways. Add env
validation and a runtime guard matching the existing metadata URL pattern.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
apps/supervisor/src/env.ts (1)

154-169: Good cross-field validation — consider also requiring COMPUTE_GATEWAY_URL.

The superRefine correctly gates TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN behind COMPUTE_SNAPSHOTS_ENABLED. However, COMPUTE_GATEWAY_URL is not validated here. If someone sets COMPUTE_SNAPSHOTS_ENABLED=true without COMPUTE_GATEWAY_URL, the env validation passes (requiring the metadata/domain vars) but no ComputeWorkloadManager is instantiated — so snapshots never actually work, with no warning at startup.

Adding a third check would make the fail-fast behavior consistent:

Suggested addition
   .superRefine((data, ctx) => {
+    if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.COMPUTE_GATEWAY_URL) {
+      ctx.addIssue({
+        code: z.ZodIssueCode.custom,
+        message: "COMPUTE_GATEWAY_URL is required when COMPUTE_SNAPSHOTS_ENABLED is true",
+        path: ["COMPUTE_GATEWAY_URL"],
+      });
+    }
     if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.TRIGGER_METADATA_URL) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/env.ts` around lines 154 - 169, The superRefine validator
currently requires TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN when
COMPUTE_SNAPSHOTS_ENABLED is true but misses COMPUTE_GATEWAY_URL; add a third
conditional inside superRefine checking if data.COMPUTE_SNAPSHOTS_ENABLED &&
!data.COMPUTE_GATEWAY_URL and call ctx.addIssue with code:
z.ZodIssueCode.custom, a message like "COMPUTE_GATEWAY_URL is required when
COMPUTE_SNAPSHOTS_ENABLED is true", and path: ["COMPUTE_GATEWAY_URL"] so
validation fails fast and surfaces the missing gateway variable.
apps/supervisor/src/workloadServer/index.ts (1)

439-507: Callback handler logic is sound; one defensive improvement to consider.

The completed/failed branching, metadata validation, and suspend-completion submission all look correct. One thing to note: if submitSuspendCompletion succeeds but deleteInstance fails (Line 478), the failure is silently swallowed — deleteInstance returns a boolean and logs internally, but the callback handler doesn't log the deletion outcome. This could make debugging orphaned instances harder.

Optional: log deletion failure
-            await this.computeManager?.deleteInstance(body.instance_id);
+            const deleted = await this.computeManager?.deleteInstance(body.instance_id);
+            if (deleted === false) {
+              this.logger.warn("Failed to delete compute instance after suspend completion", {
+                runId,
+                instanceId: body.instance_id,
+              });
+            }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 439 - 507, The
handler currently calls await
this.computeManager?.deleteInstance(body.instance_id) without checking the
boolean result; update the completed branch after submitSuspendCompletion to
capture the deleteOutcome = await
this.computeManager?.deleteInstance(body.instance_id) and then log a
warning/error via this.logger.warn or this.logger.error if deleteOutcome is
falsy (include runId, instanceId, snapshotFriendlyId and any contextual info),
otherwise log successful deletion—this ensures failures in
computeManager.deleteInstance are visible; reference submitSuspendCompletion,
deleteInstance, computeManager, and this.logger when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 266-273: The error response in the branch where
this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED and
env.TRIGGER_WORKLOAD_API_DOMAIN is missing returns { error: "..." } which
doesn't match WorkloadSuspendRunResponseBody; change the reply.json call in that
branch (the block referencing this.computeManager,
env.TRIGGER_WORKLOAD_API_DOMAIN and reply.json) to return the same shape as the
other handlers: { ok: false, error: "TRIGGER_WORKLOAD_API_DOMAIN is not set,
cannot create snapshot callback URL" } (or equivalent message) so the response
conforms to WorkloadSuspendRunResponseBody and the runner can parse it.

---

Nitpick comments:
In `@apps/supervisor/src/env.ts`:
- Around line 154-169: The superRefine validator currently requires
TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN when
COMPUTE_SNAPSHOTS_ENABLED is true but misses COMPUTE_GATEWAY_URL; add a third
conditional inside superRefine checking if data.COMPUTE_SNAPSHOTS_ENABLED &&
!data.COMPUTE_GATEWAY_URL and call ctx.addIssue with code:
z.ZodIssueCode.custom, a message like "COMPUTE_GATEWAY_URL is required when
COMPUTE_SNAPSHOTS_ENABLED is true", and path: ["COMPUTE_GATEWAY_URL"] so
validation fails fast and surfaces the missing gateway variable.

In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 439-507: The handler currently calls await
this.computeManager?.deleteInstance(body.instance_id) without checking the
boolean result; update the completed branch after submitSuspendCompletion to
capture the deleteOutcome = await
this.computeManager?.deleteInstance(body.instance_id) and then log a
warning/error via this.logger.warn or this.logger.error if deleteOutcome is
falsy (include runId, instanceId, snapshotFriendlyId and any contextual info),
otherwise log successful deletion—this ensures failures in
computeManager.deleteInstance are visible; reference submitSuspendCompletion,
deleteInstance, computeManager, and this.logger when making the change.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5089bba and 7ed9221.

📒 Files selected for processing (2)
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Cloudflare Workers
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
🧠 Learnings (4)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:44.997Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
📚 Learning: 2026-02-23T12:56:44.997Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:44.997Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.

Applied to files:

  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/supervisor/src/env.ts
🧬 Code graph analysis (2)
apps/supervisor/src/workloadServer/index.ts (2)
apps/supervisor/src/workloadManager/compute.ts (1)
  • ComputeWorkloadManager (17-311)
packages/core/src/v3/runEngineWorker/workload/schemas.ts (2)
  • WorkloadSuspendRunResponseBody (22-30)
  • WorkloadSuspendRunResponseBody (31-31)
apps/supervisor/src/env.ts (2)
packages/core/src/v3/utils/crypto.ts (1)
  • randomUUID (1-5)
apps/supervisor/src/envUtil.ts (2)
  • BoolEnv (15-17)
  • AdditionalEnvVars (19-47)
🔇 Additional comments (4)
apps/supervisor/src/env.ts (1)

81-85: LGTM — new compute settings are well-structured.

Clean schema additions with sensible defaults (30_000 ms timeout, snapshots off by default).

apps/supervisor/src/workloadServer/index.ts (3)

27-27: LGTM — type-only import for ComputeWorkloadManager.

Correctly uses import type to avoid runtime dependency on the compute module.


57-63: LGTM — ComputeSnapshotCallbackBody schema is well-defined.

Uses z.object with appropriate optional fields and a bounded string enum for status.


275-294: Fire-and-forget pattern looks correct for the compute snapshot path.

The 202 is returned before the async snapshot() call, and the failure is logged. Recovery for failed snapshots is acknowledged as a future enhancement per prior discussion.

Comment on lines +266 to +273
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
if (!env.TRIGGER_WORKLOAD_API_DOMAIN) {
this.logger.error(
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL"
);
reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Response at Line 271 doesn't conform to WorkloadSuspendRunResponseBody.

The other error responses in this handler (Lines 255–262, 297–304) correctly return { ok: false, error: "..." } satisfies WorkloadSuspendRunResponseBody. This branch returns { error: "..." } without the ok discriminator, so the runner may fail to parse the response.

Proposed fix
-              reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
+              reply.json(
+                {
+                  ok: false,
+                  error: "Snapshot callbacks not configured",
+                } satisfies WorkloadSuspendRunResponseBody,
+                false,
+                500
+              );
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
if (!env.TRIGGER_WORKLOAD_API_DOMAIN) {
this.logger.error(
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL"
);
reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
return;
}
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
if (!env.TRIGGER_WORKLOAD_API_DOMAIN) {
this.logger.error(
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL"
);
reply.json(
{
ok: false,
error: "Snapshot callbacks not configured",
} satisfies WorkloadSuspendRunResponseBody,
false,
500
);
return;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 266 - 273, The
error response in the branch where this.computeManager &&
env.COMPUTE_SNAPSHOTS_ENABLED and env.TRIGGER_WORKLOAD_API_DOMAIN is missing
returns { error: "..." } which doesn't match WorkloadSuspendRunResponseBody;
change the reply.json call in that branch (the block referencing
this.computeManager, env.TRIGGER_WORKLOAD_API_DOMAIN and reply.json) to return
the same shape as the other handlers: { ok: false, error:
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL" }
(or equivalent message) so the response conforms to
WorkloadSuspendRunResponseBody and the runner can parse it.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/supervisor/src/index.ts`:
- Around line 430-433: Change shutdown ordering to quiesce and drain the worker
session before closing the workload API: in the async stop() method, first call
and await a worker-session quiesce/drain operation (e.g.,
workerSession.quiesce() or workerSession.drainPendingWork()) to stop intake and
let in-flight runners/callbacks complete, then call await
this.workloadServer.stop(), and finally call await this.workerSession.stop() if
you need to fully teardown session state; if workerSession lacks a quiesce/drain
API, add one (e.g., WorkerSession.quiesce/ drainPendingWork) that stops new
intake, waits for active tasks/callbacks to finish, and resolves before
workloadServer.stop() is invoked.

In `@apps/supervisor/src/services/timerWheel.ts`:
- Around line 43-50: The delay-to-slot conversion in the TimerWheel constructor
(setting this.delaySlots) currently uses Math.round(opts.delayMs / TICK_MS)
which can undershoot the requested delay; change it to Math.ceil(opts.delayMs /
TICK_MS) so delays never shorten, and search the file for any other uses of
Math.round(... / TICK_MS) (e.g., later scheduling/adjustment logic around
delaySlots) and replace those with Math.ceil as well to keep consistent, correct
rounding behavior.
- Around line 22-24: The onExpire callback should allow async handlers and be
awaited with error handling: change TimerWheelOptions.onExpire signature to
return Promise<void> (or void | Promise<void>), then where you invoke onExpire
for a TimerWheelItem<T> (the current expiration path that deletes the entry
before calling onExpire), await the call inside a try/catch; if it throws, catch
the error, log it, and re-insert/re-schedule the same TimerWheelItem back into
the wheel (so delayed work like dispatchComputeSnapshot is not permanently lost)
instead of letting the rejection go unhandled after deletion.

In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 478-523: The handler currently always calls reply.empty(200) even
when this.workerClient.submitSuspendCompletion returns result.success === false,
which hides failures and prevents retries; update the logic around the two calls
to submitSuspendCompletion (the completed branch and the failure branch) so that
if result.success is false you return a non-2xx response (e.g.,
reply.code(500).send({ error: result.error })) instead of reply.empty(200),
preserving the existing logger.error calls; ensure both places reference the
same result variable from submitSuspendCompletion and that success paths still
reply 200.
- Around line 752-760: When stopping the snapshot delay wheel on shutdown
(this.snapshotDelayWheel?.stop()) we currently log and discard the returned
pending items, which loses pending suspend/snapshot work after we've already
responded 202; instead iterate the returned remaining items and either
re-enqueue them or invoke the same expiration handler used by the wheel so each
pending waitpoint still triggers the suspend/snapshot workflow (i.e. call the
TimerWheel onExpire handler or the snapshot processing function for each
item.key/item), or persist them for resume — do not drop them. Ensure you
reference snapshotDelayWheel, TimerWheel.stop(), onExpire() and the remaining
array when making the change so pending entries are handled the same way as when
they naturally expire.
- Around line 117-123: The TimerWheel's onExpire handler is calling the async
dispatchComputeSnapshot() synchronously so its returned promise can reject
unhandled; update the onExpire callback passed to TimerWheel
(snapshotDelayWheel) to call dispatchComputeSnapshot(item.data) and attach
explicit error handling (e.g., .catch(...) to log/report failures) so any
rejection is caught and does not become an unhandled promise rejection;
reference TimerWheel, TimerWheel.tick, snapshotDelayWheel, and
dispatchComputeSnapshot when locating the handler to wrap the call with a
rejection handler.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1d829bf2-d50e-47e1-9d3f-417818c8b908

📥 Commits

Reviewing files that changed from the base of the PR and between 0531a23 and 80b62d4.

📒 Files selected for processing (5)
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/workloadServer/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Cloudflare Workers
🧰 Additional context used
📓 Path-based instructions (11)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: In TypeScript SDK usage, always import from @trigger.dev/sdk, never from @trigger.dev/sdk/v3 or use deprecated client.defineJob
Import from @trigger.dev/core subpaths only, never from the root
Use the Run Engine 2.0 (@internal/run-engine) and redis-worker for all new work, not legacy V1 MarQS queue or deprecated V1 functions

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
apps/**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

When modifying only server components (apps/webapp/, apps/supervisor/, etc.) with no package changes, add a .server-changes/ file instead of a changeset

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Core service logic should be organized in the src/services/ directory

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • apps/supervisor/src/services/timerWheel.test.ts
**/*.test.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Tests should avoid mocks or stubs and use the helpers from @internal/testcontainers when Redis or Postgres are needed
Use vitest for running unit tests

Files:

  • apps/supervisor/src/services/timerWheel.test.ts
**/*.test.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.{ts,tsx,js}: Use vitest exclusively for testing - never mock anything, use testcontainers instead
Place test files next to source files with .test.ts naming convention (e.g., MyService.ts -> MyService.test.ts)
Test files using Redis or PostgreSQL should use testcontainers helpers (redisTest, postgresTest, containerTest) instead of mocks

Files:

  • apps/supervisor/src/services/timerWheel.test.ts
apps/supervisor/src/env.ts

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Environment configuration should be defined in src/env.ts

Files:

  • apps/supervisor/src/env.ts
apps/supervisor/src/workloadServer/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

HTTP server for workload communication (heartbeats, snapshots) should be implemented in src/workloadServer/

Files:

  • apps/supervisor/src/workloadServer/index.ts
🧠 Learnings (13)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`
📚 Learning: 2026-03-03T13:07:33.177Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3166
File: internal-packages/run-engine/src/batch-queue/tests/index.test.ts:711-713
Timestamp: 2026-03-03T13:07:33.177Z
Learning: In `internal-packages/run-engine/src/batch-queue/tests/index.test.ts`, test assertions for rate limiter stubs can use `toBeGreaterThanOrEqual` rather than exact equality (`toBe`) because the consumer loop may call the rate limiter during empty pops in addition to actual item processing, and this over-calling is acceptable in integration tests.

Applied to files:

  • apps/supervisor/src/services/timerWheel.test.ts
📚 Learning: 2026-03-02T12:43:25.254Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: internal-packages/run-engine/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:25.254Z
Learning: Applies to internal-packages/run-engine/src/engine/tests/**/*.test.ts : Implement tests for RunEngine in `src/engine/tests/` using testcontainers for Redis and PostgreSQL containerization

Applied to files:

  • apps/supervisor/src/services/timerWheel.test.ts
📚 Learning: 2026-02-23T12:56:51.147Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadManager/**/*.{js,ts} : Container orchestration abstraction (Docker or Kubernetes) should be implemented in `src/workloadManager/`

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:41.110Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:41.110Z
Learning: Applies to **/*.{ts,tsx} : Use the Run Engine 2.0 (internal/run-engine) and redis-worker for all new work, not legacy V1 MarQS queue or deprecated V1 functions

Applied to files:

  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:56.114Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/webapp/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:56.114Z
Learning: All run lifecycle operations (triggering, completing, cancelling, etc.) should be performed through the singleton run engine instance in `app/v3/runEngine.server.ts` via service calls

Applied to files:

  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: WorkloadManager should abstract Docker vs Kubernetes execution to allow switching between container orchestration platforms

Applied to files:

  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/env.ts : Environment configuration should be defined in `src/env.ts`

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-02T12:43:37.906Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: packages/core/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:37.906Z
Learning: Exercise caution with changes to trigger.dev/core as they affect both the customer-facing SDK and server-side webapp - breaking changes can impact deployed user tasks and the platform simultaneously

Applied to files:

  • apps/supervisor/src/workloadServer/index.ts

Comment on lines 430 to 433
async stop() {
this.logger.log("Shutting down");
await this.workloadServer.stop();
await this.workerSession.stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Quiesce the worker session before tearing down the workload API.

workloadServer fronts heartbeats, attempt completion, suspend/continue, and compute snapshot callbacks. Closing it first can strand active runners and in-flight callbacks against a dead server while workerSession is still live. This shutdown path needs a “stop intake, drain, then close the API” order instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/index.ts` around lines 430 - 433, Change shutdown
ordering to quiesce and drain the worker session before closing the workload
API: in the async stop() method, first call and await a worker-session
quiesce/drain operation (e.g., workerSession.quiesce() or
workerSession.drainPendingWork()) to stop intake and let in-flight
runners/callbacks complete, then call await this.workloadServer.stop(), and
finally call await this.workerSession.stop() if you need to fully teardown
session state; if workerSession lacks a quiesce/drain API, add one (e.g.,
WorkerSession.quiesce/ drainPendingWork) that stops new intake, waits for active
tasks/callbacks to finish, and resolves before workloadServer.stop() is invoked.

Comment on lines +22 to +24
export type TimerWheelOptions<T> = {
/** Called when an item's delay expires. */
onExpire: (item: TimerWheelItem<T>) => void;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "timerWheel.ts" -type f

Repository: triggerdotdev/trigger.dev

Length of output: 113


🏁 Script executed:

find . -name "workloadServer" -type f

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

fd -t f timerWheel

Repository: triggerdotdev/trigger.dev

Length of output: 159


🏁 Script executed:

cat -n apps/supervisor/src/services/timerWheel.ts

Repository: triggerdotdev/trigger.dev

Length of output: 5895


🏁 Script executed:

cat -n apps/supervisor/src/workloadServer/index.ts | sed -n '115,125p;725,745p'

Repository: triggerdotdev/trigger.dev

Length of output: 1425


🏁 Script executed:

cat -n apps/supervisor/src/workloadServer/index.ts | head -200 | tail -100

Repository: triggerdotdev/trigger.dev

Length of output: 3881


🏁 Script executed:

cat -n apps/supervisor/src/services/timerWheel.test.ts | head -100

Repository: triggerdotdev/trigger.dev

Length of output: 3290


🏁 Script executed:

rg "dispatchComputeSnapshot" apps/supervisor/src/workloadServer/index.ts -A 5 -B 5

Repository: triggerdotdev/trigger.dev

Length of output: 993


🏁 Script executed:

rg "unhandledRejection|uncaughtException" apps/supervisor/src --type ts -B 3 -A 3

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

rg "catch|\.on\(" apps/supervisor/src/workloadServer/index.ts | head -20

Repository: triggerdotdev/trigger.dev

Length of output: 522


🏁 Script executed:

git ls-files apps/supervisor/src | head -20

Repository: triggerdotdev/trigger.dev

Length of output: 886


🏁 Script executed:

rg "process\.on\(" apps/supervisor --type ts

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

cat -n apps/supervisor/src/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 18909


Handle async onExpire callbacks explicitly.

The onExpire callback type declares () => void, but TypeScript permits assigning async functions here since Promise<void> is assignable to void. Line 157 invokes the callback without awaiting or catching, so if dispatchComputeSnapshot() (line 725, apps/supervisor/src/workloadServer/index.ts) rejects, the rejection goes unhandled. The entry is already deleted at line 156, making the delayed snapshot permanently lost.

Suggested fix
 export type TimerWheelOptions<T> = {
   /** Called when an item's delay expires. */
-  onExpire: (item: TimerWheelItem<T>) => void;
+  onExpire: (item: TimerWheelItem<T>) => void | Promise<void>;
+  onExpireError?: (error: unknown, item: TimerWheelItem<T>) => void;
   /** Delay in milliseconds before items fire. Clamped to [100, 60000]. */
   delayMs: number;
 };
@@
-  private onExpire: (item: TimerWheelItem<T>) => void;
+  private onExpire: (item: TimerWheelItem<T>) => void | Promise<void>;
+  private onExpireError?: (error: unknown, item: TimerWheelItem<T>) => void;
@@
     this.intervalId = null;
     this.onExpire = opts.onExpire;
+    this.onExpireError = opts.onExpireError;
     this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(opts.delayMs / TICK_MS)));
   }
@@
-      this.entries.delete(key);
-      this.onExpire({ key, data: entry.data });
+      this.entries.delete(key);
+      const item = { key, data: entry.data };
+
+      try {
+        void Promise.resolve(this.onExpire(item)).catch((error) => {
+          this.onExpireError?.(error, item);
+        });
+      } catch (error) {
+        this.onExpireError?.(error, item);
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/services/timerWheel.ts` around lines 22 - 24, The
onExpire callback should allow async handlers and be awaited with error
handling: change TimerWheelOptions.onExpire signature to return Promise<void>
(or void | Promise<void>), then where you invoke onExpire for a
TimerWheelItem<T> (the current expiration path that deletes the entry before
calling onExpire), await the call inside a try/catch; if it throws, catch the
error, log it, and re-insert/re-schedule the same TimerWheelItem back into the
wheel (so delayed work like dispatchComputeSnapshot is not permanently lost)
instead of letting the rejection go unhandled after deletion.

Comment on lines +43 to +50
constructor(opts: TimerWheelOptions<T>) {
this.slots = Array.from({ length: NUM_SLOTS }, () => new Set());
this.entries = new Map();
this.cursor = 0;
this.intervalId = null;
this.onExpire = opts.onExpire;
this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(opts.delayMs / TICK_MS)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use Math.ceil() for slot conversion.

Math.round(delayMs / TICK_MS) can shave almost half a tick off the requested delay. For example, a 149ms delay submitted on a tick boundary becomes a single 100ms slot, so the grace window can expire earlier than configured.

🛠️ Minimal fix
-    this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(opts.delayMs / TICK_MS)));
+    this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.ceil(opts.delayMs / TICK_MS)));
@@
-    this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(delayMs / TICK_MS)));
+    this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.ceil(delayMs / TICK_MS)));

Also applies to: 90-92

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/services/timerWheel.ts` around lines 43 - 50, The
delay-to-slot conversion in the TimerWheel constructor (setting this.delaySlots)
currently uses Math.round(opts.delayMs / TICK_MS) which can undershoot the
requested delay; change it to Math.ceil(opts.delayMs / TICK_MS) so delays never
shorten, and search the file for any other uses of Math.round(... / TICK_MS)
(e.g., later scheduling/adjustment logic around delaySlots) and replace those
with Math.ceil as well to keep consistent, correct rounding behavior.

Comment on lines +117 to +123
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
this.snapshotDelayWheel = new TimerWheel<DelayedSnapshot>({
delayMs: env.COMPUTE_SNAPSHOT_DELAY_MS,
onExpire: (item) => this.dispatchComputeSnapshot(item.data),
});
this.snapshotDelayWheel.start();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Expected result:
# - timerWheel.tick() calls onExpire(...) with no await/.catch
# - workloadServer wires onExpire to async dispatchComputeSnapshot()
sed -n '141,159p' apps/supervisor/src/services/timerWheel.ts
echo
sed -n '117,123p' apps/supervisor/src/workloadServer/index.ts
echo
sed -n '721,745p' apps/supervisor/src/workloadServer/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 1785


Catch timer-wheel dispatch failures before they become unhandled rejections.

TimerWheel.tick() invokes onExpire() synchronously without awaiting it, so the promise returned by dispatchComputeSnapshot() is currently dropped. Any rejection inside that async function becomes an unhandled promise rejection, which can crash the supervisor process.

🐛 Proposed fix
     if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
       this.snapshotDelayWheel = new TimerWheel<DelayedSnapshot>({
         delayMs: env.COMPUTE_SNAPSHOT_DELAY_MS,
-        onExpire: (item) => this.dispatchComputeSnapshot(item.data),
+        onExpire: (item) => {
+          void this.dispatchComputeSnapshot(item.data).catch((error) => {
+            this.logger.error("Unhandled compute snapshot dispatch error", {
+              runId: item.data.runFriendlyId,
+              runnerId: item.data.runnerId,
+              error,
+            });
+          });
+        },
       });
       this.snapshotDelayWheel.start();
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 117 - 123, The
TimerWheel's onExpire handler is calling the async dispatchComputeSnapshot()
synchronously so its returned promise can reject unhandled; update the onExpire
callback passed to TimerWheel (snapshotDelayWheel) to call
dispatchComputeSnapshot(item.data) and attach explicit error handling (e.g.,
.catch(...) to log/report failures) so any rejection is caught and does not
become an unhandled promise rejection; reference TimerWheel, TimerWheel.tick,
snapshotDelayWheel, and dispatchComputeSnapshot when locating the handler to
wrap the call with a rejection handler.

Comment on lines +478 to +523
if (body.status === "completed") {
const result = await this.workerClient.submitSuspendCompletion({
runId,
snapshotId: snapshotFriendlyId,
body: {
success: true,
checkpoint: {
type: "KUBERNETES",
location: body.snapshot_id,
},
},
});

if (result.success) {
this.logger.info("Suspend completion submitted", {
runId,
instanceId: body.instance_id,
snapshotId: body.snapshot_id,
});
} else {
this.logger.error("Failed to submit suspend completion", {
runId,
snapshotFriendlyId,
error: result.error,
});
}
} else {
const result = await this.workerClient.submitSuspendCompletion({
runId,
snapshotId: snapshotFriendlyId,
body: {
success: false,
error: body.error ?? "Snapshot failed",
},
});

if (!result.success) {
this.logger.error("Failed to submit suspend failure", {
runId,
snapshotFriendlyId,
error: result.error,
});
}
}

reply.empty(200);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Return a non-2xx when suspend completion was not recorded.

Both branches log !result.success and still reply 200. That acknowledges the compute callback even though the platform never received the suspend result, which drops the only retry signal on transient worker API failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 478 - 523, The
handler currently always calls reply.empty(200) even when
this.workerClient.submitSuspendCompletion returns result.success === false,
which hides failures and prevents retries; update the logic around the two calls
to submitSuspendCompletion (the completed branch and the failure branch) so that
if result.success is false you return a non-2xx response (e.g.,
reply.code(500).send({ error: result.error })) instead of reply.empty(200),
preserving the existing logger.error calls; ensure both places reference the
same result variable from submitSuspendCompletion and that success paths still
reply 200.

Comment on lines +752 to +760
const remaining = this.snapshotDelayWheel?.stop() ?? [];
if (remaining.length > 0) {
this.logger.info("Snapshot delay wheel stopped, dropped pending snapshots", {
count: remaining.length,
});
this.logger.debug("Dropped snapshot details", {
runs: remaining.map((item) => item.key),
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't drop delayed suspends on shutdown after already returning 202.

TimerWheel.stop() returns pending items without firing onExpire(). At this point the runner has already been told to suspend, so logging and discarding these entries means no snapshot request or failure is ever reported for that waitpoint across a supervisor restart.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 752 - 760, When
stopping the snapshot delay wheel on shutdown (this.snapshotDelayWheel?.stop())
we currently log and discard the returned pending items, which loses pending
suspend/snapshot work after we've already responded 202; instead iterate the
returned remaining items and either re-enqueue them or invoke the same
expiration handler used by the wheel so each pending waitpoint still triggers
the suspend/snapshot workflow (i.e. call the TimerWheel onExpire handler or the
snapshot processing function for each item.key/item), or persist them for resume
— do not drop them. Ensure you reference snapshotDelayWheel, TimerWheel.stop(),
onExpire() and the remaining array when making the change so pending entries are
handled the same way as when they naturally expire.

nicktrn added 4 commits March 24, 2026 13:52
…races

Supervisor emits OTel spans for compute lifecycle events so they appear
in the run's trace view with per-stage timing breakdowns.

Spans:
- compute.provision: emitted after gateway create returns, includes
  gateway/agent/fcrun timing and cache indicators from _timing response
- compute.restore: emitted after gateway restore returns (supervisor-side
  timing only, gateway restore timing not yet surfaced)
- compute.snapshot: emitted from snapshot callback handler using
  duration_ms from the agent, trace context from in-memory map (best-effort,
  does not survive restarts - TRI-7992)

Implementation:
- Hand-rolled OTLP JSON client (otlpTrace.ts) - builds ExportTraceServiceRequest
  payload and fire-and-forget POSTs to TRIGGER_API_URL/otel
- Trace context (traceparent) from DequeuedMessage links spans to run trace
- Resource attributes (ctx.environment.id, ctx.run.id, etc.) link to the
  correct run in the trace view
- COMPUTE_TRACE_SPANS_ENABLED env var (default true) to disable in prod
- Span start time offset by -1ms to ensure stable sort order before attempt

Also adds .claude/rules/span-timeline-events.md documenting how the trace
view timeline events system works (trigger.dev/ prefix, admin visibility,
ClickHouse SPAN_EVENT storage, start_time filter constraint).
… remaining logs

Add optional COMPUTE_TRACE_OTLP_ENDPOINT env var to override the OTLP
endpoint for supervisor-emitted spans (defaults to TRIGGER_API_URL/otel).
Useful for sending spans to an OTel collector instead of the webapp.

Also demotes remaining per-run logs in compute workload manager and
workload server to debug/verbose.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (4)
apps/supervisor/src/workloadServer/index.ts (4)

294-301: ⚠️ Potential issue | 🟡 Minor

Response doesn't conform to WorkloadSuspendRunResponseBody.

The error response at line 299 returns { error: "..." } without the ok discriminator. Other error responses in this handler correctly return { ok: false, error: "..." }.

🐛 Proposed fix
-              reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
+              reply.json(
+                {
+                  ok: false,
+                  error: "Snapshot callbacks not configured",
+                } satisfies WorkloadSuspendRunResponseBody,
+                false,
+                500
+              );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 294 - 301, The
response returned when TRIGGER_WORKLOAD_API_DOMAIN is missing doesn't conform to
WorkloadSuspendRunResponseBody; update the reply.json call inside the block
guarded by snapshotDelayWheel, computeManager and env.COMPUTE_SNAPSHOTS_ENABLED
so it returns the proper discriminator shape (e.g., { ok: false, error:
"Snapshot callbacks not configured" }) instead of { error: "..." }, and keep the
existing logger.error call (symbols to find: snapshotDelayWheel, computeManager,
env.COMPUTE_SNAPSHOTS_ENABLED, TRIGGER_WORKLOAD_API_DOMAIN, logger.error,
reply.json, WorkloadSuspendRunResponseBody).

817-825: ⚠️ Potential issue | 🟠 Major

Pending snapshots are dropped on shutdown after 202 was already returned.

TimerWheel.stop() returns pending items without firing onExpire(). At this point runners have already been told to suspend (via 202 response), so logging and discarding these entries means no snapshot request is ever sent for those waitpoints.

Consider dispatching pending snapshots during drain rather than dropping them:

🐛 Proposed approach
   async stop() {
     const remaining = this.snapshotDelayWheel?.stop() ?? [];
     if (remaining.length > 0) {
-      this.logger.info("Snapshot delay wheel stopped, dropped pending snapshots", {
+      this.logger.info("Snapshot delay wheel stopped, draining pending snapshots", {
         count: remaining.length,
       });
-      this.logger.debug("Dropped snapshot details", {
-        runs: remaining.map((item) => item.key),
-      });
+      // Dispatch all pending snapshots before shutdown
+      await Promise.allSettled(
+        remaining.map((item) =>
+          this.dispatchComputeSnapshot(item.data).catch((error) => {
+            this.logger.error("Failed to dispatch snapshot during drain", {
+              runId: item.data.runFriendlyId,
+              error,
+            });
+          })
+        )
+      );
     }
     await this.httpServer.stop();
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 817 - 825, The
shutdown currently calls snapshotDelayWheel.stop() and drops returned pending
items, which prevents onExpire() from running and thus never sends snapshot
requests after a 202 suspend; change the shutdown/drain logic so after calling
this.snapshotDelayWheel.stop() you iterate the remaining array and for each item
invoke the same handler used when an item expires (i.e. call the onExpire
handler or the method that sends the snapshot request) instead of discarding
them, preserving the existing log entries (this.logger.info/debug) and ensuring
any errors from dispatching are caught and logged.

129-135: ⚠️ Potential issue | 🟠 Major

Catch timer-wheel dispatch failures to prevent unhandled rejections.

TimerWheel.tick() invokes onExpire() synchronously without awaiting, so the promise returned by dispatchComputeSnapshot() is dropped. Any rejection becomes an unhandled promise rejection that can crash the supervisor process.

🐛 Proposed fix
     if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
       this.snapshotDelayWheel = new TimerWheel<DelayedSnapshot>({
         delayMs: env.COMPUTE_SNAPSHOT_DELAY_MS,
-        onExpire: (item) => this.dispatchComputeSnapshot(item.data),
+        onExpire: (item) => {
+          this.dispatchComputeSnapshot(item.data).catch((error) => {
+            this.logger.error("Unhandled compute snapshot dispatch error", {
+              runId: item.data.runFriendlyId,
+              runnerId: item.data.runnerId,
+              error,
+            });
+          });
+        },
       });
       this.snapshotDelayWheel.start();
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 129 - 135, The
TimerWheel onExpire currently calls this.dispatchComputeSnapshot(item.data) and
drops its returned promise, allowing rejections to become unhandled; update the
onExpire handler in the snapshotDelayWheel initialization so it explicitly
catches and handles promise rejections (e.g. replace onExpire: (item) =>
this.dispatchComputeSnapshot(item.data) with an inline handler that calls
this.dispatchComputeSnapshot(item.data).catch(err => /* log or handle error */)
or wraps the call in Promise.resolve(...).catch(...)) to prevent unhandled
promise rejections from crashing the supervisor; reference snapshotDelayWheel,
TimerWheel, and dispatchComputeSnapshot when making the change.

507-539: ⚠️ Potential issue | 🟠 Major

Return non-2xx when suspend completion fails.

Both success and failure branches always reply with 200 even when submitSuspendCompletion returns result.success === false. This acknowledges the compute callback even though the platform never received the suspend result, which drops the only retry signal on transient API failures.

🐛 Proposed fix
         if (result.success) {
           this.logger.debug("Suspend completion submitted", {
             runId,
             instanceId: body.instance_id,
             snapshotId: body.snapshot_id,
           });
+          reply.empty(200);
         } else {
           this.logger.error("Failed to submit suspend completion", {
             runId,
             snapshotFriendlyId,
             error: result.error,
           });
+          reply.empty(500);
+          return;
         }
       } else {
         const result = await this.workerClient.submitSuspendCompletion({
           // ...
         });

         if (!result.success) {
           this.logger.error("Failed to submit suspend failure", {
             runId,
             snapshotFriendlyId,
             error: result.error,
           });
+          reply.empty(500);
+          return;
         }
+        reply.empty(200);
       }

-      reply.empty(200);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 507 - 539, The
handler currently always calls reply.empty(200) even when
this.workerClient.submitSuspendCompletion(...) returns result.success === false;
update the logic in apps/supervisor/src/workloadServer/index.ts so that after
each call to this.workerClient.submitSuspendCompletion (both the success-path
branch that checks if (result.success) and the failure-path branch that checks
if (!result.success)) you return a non-2xx response (e.g.,
reply.status(502).send(...) or reply.code(500).send(...)) when result.success is
false, including contextual fields (runId, snapshotFriendlyId, body.error) in
the error path; only reply.empty(200) when submitSuspendCompletion succeeded.
🧹 Nitpick comments (4)
apps/supervisor/src/otlpPayload.ts (2)

36-37: Timestamp conversion may lose precision for epoch milliseconds.

Multiplying current epoch milliseconds (~1.7×10¹² ms) by 1,000,000 yields values exceeding Number.MAX_SAFE_INTEGER (9×10¹⁵), which can cause floating-point precision loss before the String() conversion.

Consider using BigInt for the conversion to preserve precision:

♻️ Suggested fix using BigInt
-              startTimeUnixNano: String(opts.startTimeMs * 1_000_000),
-              endTimeUnixNano: String(opts.endTimeMs * 1_000_000),
+              startTimeUnixNano: String(BigInt(opts.startTimeMs) * 1_000_000n),
+              endTimeUnixNano: String(BigInt(opts.endTimeMs) * 1_000_000n),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/otlpPayload.ts` around lines 36 - 37, The timestamp
conversion for startTimeUnixNano and endTimeUnixNano currently multiplies
opts.startTimeMs/opts.endTimeMs by 1_000_000 using Number, which can exceed
Number.MAX_SAFE_INTEGER and lose precision; change the conversion to use BigInt
(e.g., BigInt(opts.startTimeMs) * 1_000_000n and BigInt(opts.endTimeMs) *
1_000_000n) and then call toString() to set startTimeUnixNano and
endTimeUnixNano, ensuring any surrounding types or interfaces accept the string
result.

3-11: Prefer type over interface for consistency.

Per coding guidelines, use types over interfaces for TypeScript definitions.

♻️ Suggested change
-export interface OtlpTraceOptions {
+export type OtlpTraceOptions = {
   traceId: string;
   parentSpanId?: string;
   spanName: string;
   startTimeMs: number;
   endTimeMs: number;
   resourceAttributes: Record<string, string | number | boolean>;
   spanAttributes: Record<string, string | number | boolean>;
-}
+};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/otlpPayload.ts` around lines 3 - 11, Replace the exported
interface OtlpTraceOptions with an exported type alias using the same name;
change "export interface OtlpTraceOptions { ... }" to "export type
OtlpTraceOptions = { ... }", keeping all property names and types (traceId,
parentSpanId, spanName, startTimeMs, endTimeMs, resourceAttributes,
spanAttributes) identical; then update any downstream code if it relied on
declaration merging (interfaces) — verify usages of OtlpTraceOptions compile
unchanged.
apps/supervisor/src/workloadManager/compute.ts (2)

76-82: Consolidate duplicate header construction.

The create() method manually builds headers (lines 76-82) while authHeaders getter (lines 179-187) does the same. Consider using authHeaders in create() for consistency.

♻️ Suggested change in create()
-    const headers: Record<string, string> = {
-      "Content-Type": "application/json",
-    };
-
-    if (this.opts.gatewayAuthToken) {
-      headers["Authorization"] = `Bearer ${this.opts.gatewayAuthToken}`;
-    }

     // ...

     const [fetchError, response] = await tryCatch(
       fetch(url, {
         method: "POST",
-        headers,
+        headers: this.authHeaders,
         signal: AbortSignal.timeout(this.opts.gatewayTimeoutMs),
         // ...
       })
     );

Also applies to: 179-187

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadManager/compute.ts` around lines 76 - 82, The
create() method is duplicating header construction that already exists in the
authHeaders getter; replace the manual headers object in create() with a call to
the authHeaders getter (e.g., use this.authHeaders) so header logic is
centralized in the authHeaders accessor, ensuring the resulting value is a
Record<string,string> (clone it if you plan to mutate headers later) and remove
the duplicated Authorization/Content-Type assembly in create(); keep authHeaders
as the single source of truth.

32-177: create() doesn't return success/failure indicator.

Unlike snapshot(), deleteInstance(), and restore() which return Promise<boolean>, create() returns Promise<void> and relies solely on logging for failure signaling. This inconsistency may make it harder for callers to react to creation failures.

Consider returning a boolean for consistency, or document that failures are intentionally fire-and-forget at this layer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadManager/compute.ts` around lines 32 - 177, The
create() method currently returns void but should return Promise<boolean> for
consistency with snapshot()/deleteInstance()/restore(); change the signature of
create to return Promise<boolean>, ensure all early failure paths (fetchError
branch, non-ok response branch, parseError branch) return false, and return true
after successful parsing and setting event.ok (just before calling
this.#emitProvisionSpan or immediately after); also ensure the finally block
still logs and does not suppress the boolean return (i.e., keep returns in try
branches and return true on success). Update any callers expecting a boolean if
required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 294-301: The response returned when TRIGGER_WORKLOAD_API_DOMAIN is
missing doesn't conform to WorkloadSuspendRunResponseBody; update the reply.json
call inside the block guarded by snapshotDelayWheel, computeManager and
env.COMPUTE_SNAPSHOTS_ENABLED so it returns the proper discriminator shape
(e.g., { ok: false, error: "Snapshot callbacks not configured" }) instead of {
error: "..." }, and keep the existing logger.error call (symbols to find:
snapshotDelayWheel, computeManager, env.COMPUTE_SNAPSHOTS_ENABLED,
TRIGGER_WORKLOAD_API_DOMAIN, logger.error, reply.json,
WorkloadSuspendRunResponseBody).
- Around line 817-825: The shutdown currently calls snapshotDelayWheel.stop()
and drops returned pending items, which prevents onExpire() from running and
thus never sends snapshot requests after a 202 suspend; change the
shutdown/drain logic so after calling this.snapshotDelayWheel.stop() you iterate
the remaining array and for each item invoke the same handler used when an item
expires (i.e. call the onExpire handler or the method that sends the snapshot
request) instead of discarding them, preserving the existing log entries
(this.logger.info/debug) and ensuring any errors from dispatching are caught and
logged.
- Around line 129-135: The TimerWheel onExpire currently calls
this.dispatchComputeSnapshot(item.data) and drops its returned promise, allowing
rejections to become unhandled; update the onExpire handler in the
snapshotDelayWheel initialization so it explicitly catches and handles promise
rejections (e.g. replace onExpire: (item) =>
this.dispatchComputeSnapshot(item.data) with an inline handler that calls
this.dispatchComputeSnapshot(item.data).catch(err => /* log or handle error */)
or wraps the call in Promise.resolve(...).catch(...)) to prevent unhandled
promise rejections from crashing the supervisor; reference snapshotDelayWheel,
TimerWheel, and dispatchComputeSnapshot when making the change.
- Around line 507-539: The handler currently always calls reply.empty(200) even
when this.workerClient.submitSuspendCompletion(...) returns result.success ===
false; update the logic in apps/supervisor/src/workloadServer/index.ts so that
after each call to this.workerClient.submitSuspendCompletion (both the
success-path branch that checks if (result.success) and the failure-path branch
that checks if (!result.success)) you return a non-2xx response (e.g.,
reply.status(502).send(...) or reply.code(500).send(...)) when result.success is
false, including contextual fields (runId, snapshotFriendlyId, body.error) in
the error path; only reply.empty(200) when submitSuspendCompletion succeeded.

---

Nitpick comments:
In `@apps/supervisor/src/otlpPayload.ts`:
- Around line 36-37: The timestamp conversion for startTimeUnixNano and
endTimeUnixNano currently multiplies opts.startTimeMs/opts.endTimeMs by
1_000_000 using Number, which can exceed Number.MAX_SAFE_INTEGER and lose
precision; change the conversion to use BigInt (e.g., BigInt(opts.startTimeMs) *
1_000_000n and BigInt(opts.endTimeMs) * 1_000_000n) and then call toString() to
set startTimeUnixNano and endTimeUnixNano, ensuring any surrounding types or
interfaces accept the string result.
- Around line 3-11: Replace the exported interface OtlpTraceOptions with an
exported type alias using the same name; change "export interface
OtlpTraceOptions { ... }" to "export type OtlpTraceOptions = { ... }", keeping
all property names and types (traceId, parentSpanId, spanName, startTimeMs,
endTimeMs, resourceAttributes, spanAttributes) identical; then update any
downstream code if it relied on declaration merging (interfaces) — verify usages
of OtlpTraceOptions compile unchanged.

In `@apps/supervisor/src/workloadManager/compute.ts`:
- Around line 76-82: The create() method is duplicating header construction that
already exists in the authHeaders getter; replace the manual headers object in
create() with a call to the authHeaders getter (e.g., use this.authHeaders) so
header logic is centralized in the authHeaders accessor, ensuring the resulting
value is a Record<string,string> (clone it if you plan to mutate headers later)
and remove the duplicated Authorization/Content-Type assembly in create(); keep
authHeaders as the single source of truth.
- Around line 32-177: The create() method currently returns void but should
return Promise<boolean> for consistency with
snapshot()/deleteInstance()/restore(); change the signature of create to return
Promise<boolean>, ensure all early failure paths (fetchError branch, non-ok
response branch, parseError branch) return false, and return true after
successful parsing and setting event.ok (just before calling
this.#emitProvisionSpan or immediately after); also ensure the finally block
still logs and does not suppress the boolean return (i.e., keep returns in try
branches and return true on success). Update any callers expecting a boolean if
required.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0365242a-c60d-4579-8784-2716f7b1454f

📥 Commits

Reviewing files that changed from the base of the PR and between 80b62d4 and 7349acb.

📒 Files selected for processing (13)
  • .claude/rules/span-timeline-events.md
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/otlpPayload.ts
  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/services/failedPodHandler.ts
  • apps/supervisor/src/services/podCleaner.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/workloadManager/docker.ts
  • apps/supervisor/src/workloadManager/kubernetes.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
✅ Files skipped from review due to trivial changes (5)
  • apps/supervisor/src/workloadManager/kubernetes.ts
  • apps/supervisor/src/workloadManager/docker.ts
  • apps/supervisor/src/services/failedPodHandler.ts
  • apps/supervisor/src/services/podCleaner.ts
  • .claude/rules/span-timeline-events.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/supervisor/src/workloadManager/types.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (10)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: For apps and internal packages (apps/*, internal-packages/*), use pnpm run typecheck --filter <package> for verification, never use build as it proves almost nothing about correctness
Use testcontainers helpers (redisTest, postgresTest, containerTest from @internal/testcontainers) for integration tests with Redis and PostgreSQL instead of mocking
When writing Trigger.dev tasks, always import from @trigger.dev/sdk - never use @trigger.dev/sdk/v3 or deprecated client.defineJob

Files:

  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/otlpPayload.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Use pnpm for package management in this monorepo (version 10.23.0) with Turborepo for orchestration - run commands from root with pnpm run
Add crumbs as you write code for debug tracing using // @Crumbs comments or `// `#region` `@crumbs blocks - they stay on the branch throughout development and are stripped via agentcrumbs strip before merge

Files:

  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/otlpPayload.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • apps/supervisor/src/otlpTrace.test.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/otlpPayload.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/otlpPayload.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
**/*.test.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Tests should avoid mocks or stubs and use the helpers from @internal/testcontainers when Redis or Postgres are needed
Use vitest for running unit tests

**/*.test.{ts,tsx,js,jsx}: Use vitest exclusively for testing and never mock anything - use testcontainers instead for Redis/PostgreSQL
Place test files next to source files using naming convention: MyService.ts -> MyService.test.ts

Files:

  • apps/supervisor/src/otlpTrace.test.ts
apps/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (CLAUDE.md)

When modifying only server components (apps/webapp/, apps/supervisor/, etc.) with no package changes, add a .server-changes/ file instead of a changeset

Files:

  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/otlpPayload.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
apps/supervisor/src/env.ts

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Environment configuration should be defined in src/env.ts

Files:

  • apps/supervisor/src/env.ts
apps/supervisor/src/workloadServer/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

HTTP server for workload communication (heartbeats, snapshots) should be implemented in src/workloadServer/

Files:

  • apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/workloadManager/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Container orchestration abstraction (Docker or Kubernetes) should be implemented in src/workloadManager/

Files:

  • apps/supervisor/src/workloadManager/compute.ts
🧠 Learnings (27)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadManager/**/*.{js,ts} : Container orchestration abstraction (Docker or Kubernetes) should be implemented in `src/workloadManager/`
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`
📚 Learning: 2025-11-27T16:26:37.432Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-27T16:26:37.432Z
Learning: Applies to **/*.{test,spec}.{ts,tsx} : Use vitest for all tests in the Trigger.dev repository

Applied to files:

  • apps/supervisor/src/otlpTrace.test.ts
📚 Learning: 2026-03-02T12:43:25.254Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: internal-packages/run-engine/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:25.254Z
Learning: Applies to internal-packages/run-engine/src/engine/systems/**/*.ts : Integrate OpenTelemetry tracer and meter instrumentation in RunEngine systems for observability

Applied to files:

  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/otlpPayload.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure OpenTelemetry instrumentations and exporters in trigger.config.ts for enhanced logging

Applied to files:

  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/otlpPayload.ts
📚 Learning: 2026-01-15T10:48:02.687Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-15T10:48:02.687Z
Learning: Applies to **/*.test.{ts,tsx,js,jsx} : Use vitest for running unit tests

Applied to files:

  • apps/supervisor/src/otlpTrace.test.ts
📚 Learning: 2026-03-02T12:43:25.254Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: internal-packages/run-engine/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:25.254Z
Learning: Applies to internal-packages/run-engine/src/engine/tests/**/*.test.ts : Implement tests for RunEngine in `src/engine/tests/` using testcontainers for Redis and PostgreSQL containerization

Applied to files:

  • apps/supervisor/src/otlpTrace.test.ts
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/otlpPayload.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • apps/supervisor/src/otlpTrace.test.ts
  • apps/supervisor/src/otlpTrace.ts
  • apps/supervisor/src/otlpPayload.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use logger methods (debug, log, info, warn, error) from `trigger.dev/sdk/v3` for structured logging in tasks

Applied to files:

  • apps/supervisor/src/otlpTrace.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadManager/**/*.{js,ts} : Container orchestration abstraction (Docker or Kubernetes) should be implemented in `src/workloadManager/`

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-02-23T12:56:51.147Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
📚 Learning: 2026-03-23T06:24:14.555Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-23T06:24:14.555Z
Learning: Applies to apps/webapp/app/v3/**/*.{ts,tsx} : In the webapp v3 directory, only modify V2 code paths when encountering V1/V2 branching in services - all new work uses Run Engine 2.0 (`internal/run-engine`) and redis-worker, not legacy V1 engine code

Applied to files:

  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:43:43.173Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: packages/redis-worker/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:43.173Z
Learning: Applies to packages/redis-worker/**/redis-worker/src/worker.ts : Worker loop and job processing should implement concurrency control in src/worker.ts

Applied to files:

  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-23T06:24:25.016Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/webapp/CLAUDE.md:0-0
Timestamp: 2026-03-23T06:24:25.016Z
Learning: Applies to apps/webapp/app/v3/services/**/*.server.ts : Only modify V2 code paths when editing services that branch on `RunEngineVersion` to support both V1 and V2 (e.g., `cancelTaskRun.server.ts`, `batchTriggerV3.server.ts`)

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: WorkloadManager should abstract Docker vs Kubernetes execution to allow switching between container orchestration platforms

Applied to files:

  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/env.ts : Environment configuration should be defined in `src/env.ts`

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-23T06:24:25.016Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/webapp/CLAUDE.md:0-0
Timestamp: 2026-03-23T06:24:25.016Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx,server.ts} : Access environment variables via `env` export from `app/env.server.ts`. Never use `process.env` directly

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-08-14T18:35:44.370Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2390
File: apps/webapp/app/env.server.ts:764-765
Timestamp: 2025-08-14T18:35:44.370Z
Learning: The BoolEnv helper in apps/webapp/app/utils/boolEnv.ts uses z.preprocess with inconsistent default value types across the codebase - some usages pass boolean defaults (correct) while others pass string defaults (incorrect), leading to type confusion. The helper should enforce boolean-only defaults or have clearer documentation.

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:26:37.432Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-27T16:26:37.432Z
Learning: Applies to {packages/core,apps/webapp}/**/*.{ts,tsx} : Use zod for validation in packages/core and apps/webapp

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-06-04T16:02:22.957Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2150
File: apps/supervisor/src/workloadManager/docker.ts:115-115
Timestamp: 2025-06-04T16:02:22.957Z
Learning: In the Trigger.dev codebase, the supervisor component uses DOCKER_ENFORCE_MACHINE_PRESETS while the docker provider component uses ENFORCE_MACHINE_PRESETS. These are separate components with separate environment variable configurations for the same logical concept of enforcing machine presets.

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-21T21:23:31.557Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/runs/v3/ai/extractAISummarySpanData.ts:149-150
Timestamp: 2026-03-21T21:23:31.557Z
Learning: In `apps/webapp/app/components/runs/v3/ai/extractAISummarySpanData.ts` (and related AI span extraction files in `apps/webapp/app/components/runs/v3/ai/`), manual JSON.parse with typeof guards and type assertions is intentional. These functions are on the hot path for span rendering, so Zod validation is deliberately avoided for performance reasons. Do not suggest replacing manual JSON parsing with Zod schemas in these files.

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-02T12:43:37.906Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: packages/core/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:37.906Z
Learning: Exercise caution with changes to trigger.dev/core as they affect both the customer-facing SDK and server-side webapp - breaking changes can impact deployed user tasks and the platform simultaneously

Applied to files:

  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-22T19:27:26.469Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/createAlertChannel.server.ts:104-112
Timestamp: 2026-03-22T19:27:26.469Z
Learning: In `apps/webapp/app/v3/services/alerts/createAlertChannel.server.ts`, the `#scheduleErrorAlertEvaluation` helper intentionally uses the same job id (`evaluateErrorAlerts:${projectId}`) as the evaluator's periodic self-chain. The deduplication is desired: if a future run is already queued, the immediate enqueue becomes a no-op, preventing two evaluations firing in quick succession. Do not flag this as a bug or suggest a unique/timestamped id.

Applied to files:

  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2025-11-27T16:26:37.432Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-27T16:26:37.432Z
Learning: Applies to packages/trigger-sdk/**/*.{ts,tsx} : In the Trigger.dev SDK (packages/trigger-sdk), prefer isomorphic code like fetch and ReadableStream instead of Node.js-specific code

Applied to files:

  • apps/supervisor/src/workloadServer/index.ts
🔇 Additional comments (12)
apps/supervisor/src/index.ts (3)

454-457: Shutdown ordering concern from previous review.

Past review flagged that stopping workloadServer before workerSession could strand active runners and in-flight callbacks. The ordering was changed but the concern about quiescing/draining before teardown may still apply.


83-98: LGTM - Compute mode initialization.

The conditional instantiation of ComputeWorkloadManager vs Docker/Kubernetes managers is clean and correctly gates on env.COMPUTE_GATEWAY_URL.


226-256: LGTM - Compute restore flow.

The restore implementation correctly:

  • Derives a unique runnerId per restore cycle
  • Passes all required context (machine specs, trace context, identifiers)
  • Handles failures gracefully with logging
  • Returns early to skip the standard checkpoint flow

Based on learnings, the run engine handles retries for failed restore operations, so the error-and-return pattern is appropriate.

apps/supervisor/src/otlpTrace.ts (1)

1-19: LGTM!

Clean fire-and-forget OTLP trace implementation with appropriate error handling and timeout. The .catch() pattern correctly prevents unhandled rejections from crashing the supervisor.

apps/supervisor/src/otlpTrace.test.ts (1)

1-114: LGTM!

Comprehensive test coverage for the OTLP payload builder. The tests correctly verify resource attributes, span attribute type conversions, spanId format, and timestamp handling.

Minor note: The test file is named otlpTrace.test.ts but tests buildOtlpTracePayload from otlpPayload.ts. Consider renaming to otlpPayload.test.ts for clarity, though this is not blocking.

apps/supervisor/src/env.ts (2)

81-88: LGTM - Compute environment variables.

Well-structured configuration with sensible defaults:

  • COMPUTE_SNAPSHOTS_ENABLED defaults to false (safe default)
  • COMPUTE_TRACE_SPANS_ENABLED defaults to true (observability by default)
  • COMPUTE_SNAPSHOT_DELAY_MS has appropriate bounds (0-60s)

157-176: LGTM - Cross-field validation and transform.

The superRefine validation correctly enforces that TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN are required when COMPUTE_SNAPSHOTS_ENABLED is true. This addresses the previous review concern about localhost callback URLs breaking external gateways.

The transform properly derives COMPUTE_TRACE_OTLP_ENDPOINT from TRIGGER_API_URL when not explicitly set.

apps/supervisor/src/workloadServer/index.ts (2)

61-81: LGTM - Schema and type definitions.

The ComputeSnapshotCallbackBody Zod schema and helper types are well-defined with appropriate optional fields.


744-806: LGTM - Snapshot dispatch and OTLP span emission.

The dispatchComputeSnapshot method correctly constructs the callback URL and forwards metadata. The #emitSnapshotSpan helper properly handles missing trace context gracefully.

apps/supervisor/src/workloadManager/compute.ts (3)

261-317: LGTM - OTLP span emission helpers.

The #emitProvisionSpan and #emitRestoreSpan helpers correctly:

  • Check COMPUTE_TRACE_SPANS_ENABLED guard
  • Parse traceparent and handle missing context gracefully
  • Calculate epoch-aligned timestamps
  • Build appropriate resource and span attributes

Also applies to: 397-449


189-259: LGTM - snapshot and deleteInstance methods.

Both methods use consistent error handling with tryCatch, appropriate logging, and return boolean success indicators. The snapshot method correctly expects HTTP 202 for async operations.


344-349: No changes needed. The code correctly converts memory from GB to MB. The machine.memory field consistently represents gigabytes (as documented in test files and used throughout the codebase with memory: 1 meaning 1GB), so multiplying by 1024 to obtain memory_mb is the correct conversion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant