fix(producer): restore calibration timeout ceiling + add pipeline observability#1233
Merged
Merged
Conversation
…der hang The v0.6.74 change (Math.min → Math.max in createCaptureCalibrationConfig) raised the calibration protocol timeout from 30s to the default 300s. When a CDP call stalls during session init — page.goto, pollHfReady, or any page.evaluate — the 300s timeout makes the render appear to hang indefinitely at "Initializing calibration session...". Restore Math.min so calibration stays capped at 30s: if Chrome is stuck, fail fast and let the fallback path recover. Also add phase-level timing logs to initializeSession so the next report pinpoints which step stalls. Closes #1231
Log the resolved environment at pipeline start (platform, arch, node version, all timeout values, GPU mode), the calibration config showing the actual timeout being used vs the parent, Chrome version and capture mode at browser launch, and a structured failure summary on error with stage timings and console errors. These four log categories give agents and users enough context to file actionable issues without needing to reproduce the problem.
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
Hey Miguel — two-part PR, both clean: a real production regression fix from #1221 + observability instrumentation that's well-shaped for triage. No blockers, one small instrumentation asymmetry, a couple of nits.
The fix itself
Math.max→Math.minincreateCaptureCalibrationConfig(captureCost.ts:163). One character, exactly the right semantic — calibration'sprotocolTimeoutis a ceiling (cap a wedged BeginFrame fast so screenshot fallback kicks in), not a floor. The flip in #1221 made the default 300scfg.protocolTimeoutget picked, which is the user-visible "hangs at Initializing calibration session..." that #1231 reports. Confirmed:Math.min(300_000, 30_000) = 30_000(capped) andMath.min(5_000, 30_000) = 5_000(preserved). The docstring rewrite ("ceiling, not a floor") locks the intent for the next reader.- Test rename matches the restored semantic. The previous tests were renamed to "respects user protocol timeout when higher" / "uses calibration floor when user timeout is lower" in #1221 to match the (wrong) Math.max behavior — so CI stayed green through the regression. Worth a small process callout: when a test name flips its premise (
floor↔ceiling,respects↔caps) the underlying contract has changed, and that's the moment for an extra sanity check on the behavior shift. Not actionable for this PR; the new test names ("caps protocol timeout at calibration ceiling for fast fallback" / "preserves user timeout when already below calibration ceiling") self-document the restored semantic cleanly.
Observability — concerns
- Asymmetric
pollVideosReadyinstrumentation. The BeginFrame branch logslogInitPhase("pollVideosReady complete")atframeCapture.ts:936. The screenshot branch callspollVideosReadyatframeCapture.ts:784(the exact same gate, samepageReadyTimeout) but doesn't log the phase. The reporter on #1231 is on macOS = screenshot mode — and the per-phase logs are explicitly there to pinpoint which CDP call wedges. A future stall inpollVideosReadyon the screenshot path would show up as a gap betweenapplyVideoMetadataHints completeandimages ready + decoded, which doesn't name the wedged step. One extralogInitPhase("pollVideosReady complete")line afterframeCapture.ts:802closes the gap.
Nits
- Per-phase logs use
console.log; the orchestrator-level logs uselog.info. Defensible becauseframeCapture.tsis a lower-level module that already usesconsole.warnfor the video/image timeout warnings — internally consistent. Worth noting only because the on-call diagnosing #1231-shaped issues will be grep'ing for both prefixes; a single structured logger pass eventually would make field aggregation simpler. Future cleanup, not blocking. (nit) - Log volume on the producer's hot path. ~9
logInitPhaselines per render, fires on every render. For a one-off CLI render, fine. For a producer pumping 100s/hour, that's a measurable log stream. Worth either aprocess.env.HF_RENDER_DEBUGgate, or pushing them tolog.debug(assuming the producer's logger has level filtering) so they only stream in verbose mode. The triage value is highest when something is wrong; the 80% happy-path case doesn't need the timing trail. (nit) [Render] Failure summaryincludes a nicebrowserConsoleErrorsslice but filters by string-match on"ERROR"/"PAGEERROR". The actual buffer format isn't visible from the diff; if those tokens are part of a formatter prefix you control, you'll catch what you want — if they're upstream Chrome console substrings, edge-case error categories (e.g.WARNING-level entries that still indicate the wedge) get dropped. Worth a quick check that the regex matches what the buffer actually emits. (nit)pollSubCompositionTimelines completephase log fires unconditionally in both branches. If a composition has zero sub-comp timelines, the call returns ~instantly and the log line is still 100% noise. Could elide when the inner work was a no-op, but the cost is oneconsole.logso not worth the conditional. (nit, lean toward leave as-is)- Browser launch log uses
await browser.version().catch(() => "unknown")— adds a CDP round-trip to the launch path.browser.version()is typically cheap (cached after first call), but if you ever launch many short-lived browsers (Lambda burst case from the #1225 conversation) it's worth knowing this is per-browser, not per-render. (nit, almost certainly fine in practice)
Questions
- Long-term, do you want these phase timings going to whatever metrics pipeline the producer uses (Datadog, Honeycomb), or is structured stdout sufficient for the on-call use case? The shape you've chosen is great for grep + tail diagnostics; metric ingestion would want the same data with a slightly different envelope (e.g. tag =
phase, value =ms). Mention because the next iteration of this work might want both. - Were the
[initSession]phase logs sized against any specific real wedged-render trace from #1231, or are they the staff-engineer-intuition pick of "the obvious split points"? If you have a wedge trace from the reporter, worth a comment in the PR describing which phase it actually stalled at — closes the feedback loop and proves the instrumentation answers the question that drove it.
What I didn't verify
- The
executeRenderJob's catch block scope — confirmedpipelineStart(renderOrchestrator.ts:1454) andlastBrowserConsole(renderOrchestrator.ts:1458) are live inside the new failure summary; theperfStagesreference is presumed in scope from the surrounding context. - The exact format of
lastBrowserConsoleentries — only the filter is visible in the diff. pollHfReady/pollSubCompositionTimelinesinternals — assumed they each have their own internal timeouts and don't depend onprotocolTimeoutin a way the calibration cap affects.- Local reproducer of #1231 (no Chromium-on-stalling-macOS rig).
- That the asymmetry in pollVideosReady instrumentation isn't intentional (e.g., screenshot mode's call is somehow lower-cost / not worth instrumenting) — read it as oversight, happy to be told otherwise.
— Review by Rames D Jusso
The BeginFrame path logged this phase but the screenshot path didn't, creating an instrumentation gap when diagnosing hangs on macOS where screenshot mode is always used.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1231 — CLI render hangs at "Initializing calibration session..." on v0.6.74.
Root cause: PR #1221 changed
Math.min→Math.maxincreateCaptureCalibrationConfig, raising the calibration protocol timeout from 30s to the default 300s. When any CDP call stalls during session init (page.goto,pollHfReady,page.evaluate), the 300s timeout makes the render appear to hang indefinitely. Note: BeginFrame is Linux-only — the reporter is on macOS where screenshot mode is always used, so this is a stalled CDP call in the screenshot init path, not a BeginFrame wedge.Commit 1: Restore calibration timeout ceiling
Math.max→Math.minincreateCaptureCalibrationConfig— calibration stays capped at 30s so a stuck CDP call fails fast and the fallback path recovers--protocol-timeoutinitializeSession([initSession:<mode>] <phase> (<ms>ms)) to pinpoint which step stalls on affected machinesCommit 2: Render pipeline observability
Added structured logging at four key points so future issues contain enough context for triage:
[Render] Pipeline started[Render] Calibration config[BrowserManager] Browser launched[Render] Failure summaryTest plan
renderOrchestrator.test.ts— 59/59 pass (calibration safeguard tests updated)