fix(producer): restore calibration timeout ceiling + add pipeline observability by miguel-heygen · Pull Request #1233 · heygen-com/hyperframes

miguel-heygen · 2026-06-06T01:59:22Z

Summary

Fixes #1231 — CLI render hangs at "Initializing calibration session..." on v0.6.74.

Root cause: PR #1221 changed Math.min → Math.max in createCaptureCalibrationConfig, raising the calibration protocol timeout from 30s to the default 300s. When any CDP call stalls during session init (page.goto, pollHfReady, page.evaluate), the 300s timeout makes the render appear to hang indefinitely. Note: BeginFrame is Linux-only — the reporter is on macOS where screenshot mode is always used, so this is a stalled CDP call in the screenshot init path, not a BeginFrame wedge.

Commit 1: Restore calibration timeout ceiling

Math.max → Math.min in createCaptureCalibrationConfig — calibration stays capped at 30s so a stuck CDP call fails fast and the fallback path recovers
Users who need longer timeouts for the main render can still use --protocol-timeout
Per-phase timing logs in initializeSession ([initSession:<mode>] <phase> (<ms>ms)) to pinpoint which step stalls on affected machines

Commit 2: Render pipeline observability

Added structured logging at four key points so future issues contain enough context for triage:

Log line	What it captures
`[Render] Pipeline started`	Platform, arch, node version, all timeout values, GPU mode
`[Render] Calibration config`	Calibration timeout vs parent timeout, forceScreenshot, frame count
`[BrowserManager] Browser launched`	Chrome version, capture mode, headless shell path, platform
`[Render] Failure summary`	Failed stage, error, elapsed time, stage timings, last 5 console errors

Test plan

renderOrchestrator.test.ts — 59/59 pass (calibration safeguard tests updated)
TypeScript compiles cleanly (engine + producer)
Lint + format clean (oxlint + oxfmt)
Local render verified: all new log lines appear, healthy init completes in ~120ms
Pre-commit hooks pass (lefthook: lint, format, typecheck, fallow)

…der hang The v0.6.74 change (Math.min → Math.max in createCaptureCalibrationConfig) raised the calibration protocol timeout from 30s to the default 300s. When a CDP call stalls during session init — page.goto, pollHfReady, or any page.evaluate — the 300s timeout makes the render appear to hang indefinitely at "Initializing calibration session...". Restore Math.min so calibration stays capped at 30s: if Chrome is stuck, fail fast and let the fallback path recover. Also add phase-level timing logs to initializeSession so the next report pinpoints which step stalls. Closes #1231

Log the resolved environment at pipeline start (platform, arch, node version, all timeout values, GPU mode), the calibration config showing the actual timeout being used vs the parent, Chrome version and capture mode at browser launch, and a structured failure summary on error with stage timings and console errors. These four log categories give agents and users enough context to file actionable issues without needing to reproduce the problem.

james-russo-rames-d-jusso

Hey Miguel — two-part PR, both clean: a real production regression fix from #1221 + observability instrumentation that's well-shaped for triage. No blockers, one small instrumentation asymmetry, a couple of nits.

The fix itself

Math.max → Math.min in createCaptureCalibrationConfig (captureCost.ts:163). One character, exactly the right semantic — calibration's protocolTimeout is a ceiling (cap a wedged BeginFrame fast so screenshot fallback kicks in), not a floor. The flip in #1221 made the default 300s cfg.protocolTimeout get picked, which is the user-visible "hangs at Initializing calibration session..." that #1231 reports. Confirmed: Math.min(300_000, 30_000) = 30_000 (capped) and Math.min(5_000, 30_000) = 5_000 (preserved). The docstring rewrite ("ceiling, not a floor") locks the intent for the next reader.
Test rename matches the restored semantic. The previous tests were renamed to "respects user protocol timeout when higher" / "uses calibration floor when user timeout is lower" in #1221 to match the (wrong) Math.max behavior — so CI stayed green through the regression. Worth a small process callout: when a test name flips its premise (floor ↔ ceiling, respects ↔ caps) the underlying contract has changed, and that's the moment for an extra sanity check on the behavior shift. Not actionable for this PR; the new test names ("caps protocol timeout at calibration ceiling for fast fallback" / "preserves user timeout when already below calibration ceiling") self-document the restored semantic cleanly.

Observability — concerns

Asymmetric pollVideosReady instrumentation. The BeginFrame branch logs logInitPhase("pollVideosReady complete") at frameCapture.ts:936. The screenshot branch calls pollVideosReady at frameCapture.ts:784 (the exact same gate, same pageReadyTimeout) but doesn't log the phase. The reporter on #1231 is on macOS = screenshot mode — and the per-phase logs are explicitly there to pinpoint which CDP call wedges. A future stall in pollVideosReady on the screenshot path would show up as a gap between applyVideoMetadataHints complete and images ready + decoded, which doesn't name the wedged step. One extra logInitPhase("pollVideosReady complete") line after frameCapture.ts:802 closes the gap.

Nits

Per-phase logs use console.log; the orchestrator-level logs use log.info. Defensible because frameCapture.ts is a lower-level module that already uses console.warn for the video/image timeout warnings — internally consistent. Worth noting only because the on-call diagnosing #1231-shaped issues will be grep'ing for both prefixes; a single structured logger pass eventually would make field aggregation simpler. Future cleanup, not blocking. (nit)
Log volume on the producer's hot path. ~9 logInitPhase lines per render, fires on every render. For a one-off CLI render, fine. For a producer pumping 100s/hour, that's a measurable log stream. Worth either a process.env.HF_RENDER_DEBUG gate, or pushing them to log.debug (assuming the producer's logger has level filtering) so they only stream in verbose mode. The triage value is highest when something is wrong; the 80% happy-path case doesn't need the timing trail. (nit)
[Render] Failure summary includes a nice browserConsoleErrors slice but filters by string-match on "ERROR"/"PAGEERROR". The actual buffer format isn't visible from the diff; if those tokens are part of a formatter prefix you control, you'll catch what you want — if they're upstream Chrome console substrings, edge-case error categories (e.g. WARNING-level entries that still indicate the wedge) get dropped. Worth a quick check that the regex matches what the buffer actually emits. (nit)
pollSubCompositionTimelines complete phase log fires unconditionally in both branches. If a composition has zero sub-comp timelines, the call returns ~instantly and the log line is still 100% noise. Could elide when the inner work was a no-op, but the cost is one console.log so not worth the conditional. (nit, lean toward leave as-is)
Browser launch log uses await browser.version().catch(() => "unknown") — adds a CDP round-trip to the launch path. browser.version() is typically cheap (cached after first call), but if you ever launch many short-lived browsers (Lambda burst case from the #1225 conversation) it's worth knowing this is per-browser, not per-render. (nit, almost certainly fine in practice)

Questions

Long-term, do you want these phase timings going to whatever metrics pipeline the producer uses (Datadog, Honeycomb), or is structured stdout sufficient for the on-call use case? The shape you've chosen is great for grep + tail diagnostics; metric ingestion would want the same data with a slightly different envelope (e.g. tag = phase, value = ms). Mention because the next iteration of this work might want both.
Were the [initSession] phase logs sized against any specific real wedged-render trace from #1231, or are they the staff-engineer-intuition pick of "the obvious split points"? If you have a wedge trace from the reporter, worth a comment in the PR describing which phase it actually stalled at — closes the feedback loop and proves the instrumentation answers the question that drove it.

What I didn't verify

The executeRenderJob's catch block scope — confirmed pipelineStart (renderOrchestrator.ts:1454) and lastBrowserConsole (renderOrchestrator.ts:1458) are live inside the new failure summary; the perfStages reference is presumed in scope from the surrounding context.
The exact format of lastBrowserConsole entries — only the filter is visible in the diff.
pollHfReady / pollSubCompositionTimelines internals — assumed they each have their own internal timeouts and don't depend on protocolTimeout in a way the calibration cap affects.
Local reproducer of #1231 (no Chromium-on-stalling-macOS rig).
That the asymmetry in pollVideosReady instrumentation isn't intentional (e.g., screenshot mode's call is somehow lower-cost / not worth instrumenting) — read it as oversight, happy to be told otherwise.

— Review by Rames D Jusso

The BeginFrame path logged this phase but the screenshot path didn't, creating an instrumentation gap when diagnosing hangs on macOS where screenshot mode is always used.

miguel-heygen added 2 commits June 5, 2026 21:58

miguel-heygen changed the title ~~fix(producer): restore 30s calibration timeout ceiling to prevent render hang~~ fix(producer): restore calibration timeout ceiling + add pipeline observability Jun 6, 2026

james-russo-rames-d-jusso reviewed Jun 6, 2026

View reviewed changes

fix(engine): add missing pollVideosReady phase log in screenshot path

d74e7b6

The BeginFrame path logged this phase but the screenshot path didn't, creating an instrumentation gap when diagnosing hangs on macOS where screenshot mode is always used.

miguel-heygen merged commit 1d16216 into main Jun 6, 2026
29 checks passed

miguel-heygen deleted the fix/calibration-timeout-regression branch June 6, 2026 02:16

vanceingalls mentioned this pull request Jun 6, 2026

fix(engine): scale V8 heap and GPU budget proportionally on low-memory systems #1235

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(producer): restore calibration timeout ceiling + add pipeline observability#1233

fix(producer): restore calibration timeout ceiling + add pipeline observability#1233
miguel-heygen merged 3 commits into
mainfrom
fix/calibration-timeout-regression

miguel-heygen commented Jun 6, 2026 •

edited

Loading

Uh oh!

james-russo-rames-d-jusso left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

miguel-heygen commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Commit 1: Restore calibration timeout ceiling

Commit 2: Render pipeline observability

Test plan

Uh oh!

james-russo-rames-d-jusso left a comment

Choose a reason for hiding this comment

The fix itself

Observability — concerns

Nits

Questions

What I didn't verify

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

miguel-heygen commented Jun 6, 2026 •

edited

Loading