fix(producer): localize remote <img> sources + await image readiness#1197
Conversation
Producer's frame-capture has `pollVideosReady` (waits readyState >= 2 for every <video>) but no equivalent for <img>. Combined with htmlCompiler's `collectExternalAssets` explicitly skipping http(s) URLs (line 805-806), agent-pipeline-generated compositions (astral / daphne / hyperion multi-v2 outputs with raw S3 <img src>) reach Chrome with a network dependency that races the readiness gate AND can be evicted mid-render. Either path produces blank-frame flicker. Reproduction (02_kobe agent output, 42s render @ 30fps): scene_02's remote S3 background-image painted from t=7.0s, vanished at t=10.5s (frame size 139KB vs 700-940KB neighbors), back at t=11.0s. GSAP timeline said opacity:1 throughout — Chrome simply didn't have the pixels. Two-layer fix: 1. **Producer** — `localizeRemoteImageSources` in `htmlCompiler.ts` mirrors the existing `localizeRemoteMediaSources` (video/audio) + `localizeRemoteFontFaces` pattern, reusing `downloadAndRewriteUrls` and the `_remote_media/` subdir. Wired into `compileForRender` between the media and font localize steps. Once the file is local, Chrome's image cache is bounded by disk reads, not S3 latency. 2. **Engine** — `pollImagesReady` + `decodeAllImages` helpers in `frameCapture.ts` parallel to `pollVideosReady`. Waits for every `<img>` (skipping data: URIs) to have `complete && naturalWidth > 0`, then forces GPU upload via `img.decode()`. Called from both the classic-xvfb path and the BeginFrame path after their respective video readiness checks. Defense-in-depth — Layer 1 closes the symptom for current+future agent-pipeline outputs; Layer 2 protects any future code path that leaves a remote URL in place. Tests: 7 new cases in `htmlCompiler.test.ts` covering happy-path rewrite, 404 fallback, dedup of duplicate URLs, non-HTTP and data: URI passthrough, both quote styles, and the agent-pipeline shape where `src` is not the first attribute. All pass alongside the existing 56 htmlCompiler tests.
…ents Review follow-ups on the remote-<img> localization fix: - Tighten REMOTE_IMG_TAG_RE with a (?<![\w-]) lookbehind so it matches a real `src` attribute only. The previous `\bsrc` also matched `data-src` (and `data-*-src`) lazy-loader placeholders, which would download/rewrite a URL the render never paints. Added a regression test; `srcset` stays excluded by the `\s*=` requirement. - Fix comments that claimed frameCapture has "no pollImagesReady analog" — this PR adds exactly that, so the docstrings were self-contradictory. Reframed localization as the primary fix and pollImagesReady as the defense-in-depth layer, and documented the <img src>-only scope (srcset / <picture> / SVG <image> / CSS background-image are follow-ups). Verified locally end-to-end on the 02_kobe repro: all 4 remote S3 <img> URLs localize to _remote_media/, the render completes, and the frame at t~10.5s that was a 139KB blank in the broken render now paints the trophy background in every native-fps frame. htmlCompiler.test.ts 64 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code review — reviewed at high effort, fixes pushed (
|
miguel-heygen
left a comment
There was a problem hiding this comment.
Good fix — the two-layer approach (localize-at-compile + poll-at-capture) is the right architecture, the regex and downloadAndRewriteUrls reuse are solid, and the 7 new test cases cover the important shapes. Two real issues to address before merge, plus one cleanup note.
[bug] pollImagesReady has no broken-image escape — causes full timeout delay for errored/404 imgs
pollVideosReady exits early on errored video elements with if (ve.error) return true. pollImagesReady has no equivalent, and HTMLImageElement doesn't have a .error property — but for a broken/404 image Chrome sets complete = true with naturalWidth = 0. The current predicate complete && naturalWidth > 0 returns false for that case, so the poll spins the full pageReadyTimeout (45 s default) before falling through to the warning.
This is a regression for any composition that has a broken <img> (including the 404-fallback path in localizeRemoteImageSources when a remote URL fails to download). Before this PR, a broken image had no wait; after this PR it silently adds a 45 s delay.
Fix is a one-liner inside the imgs.every callback — treat complete=true && naturalWidth===0 as done:
if (ie.complete && ie.naturalWidth === 0) return true; // broken/404 — can't get better, skip
if (ie.complete && ie.naturalWidth > 0) return true; // successfully loaded
return false;File: packages/engine/src/services/frameCapture.ts, around line 575.
[bug] decodeAllImages can throw (uncaught) when called on still-in-flight images
decodeAllImages calls img.decode() on every image after pollImagesReady finishes — including images that timed out with complete = false (still fetching). Per the WHATWG spec, img.decode() on a loading image awaits the fetch result; it only rejects on error, not on timeout. Puppeteer's page.evaluate has a 30 s default timeout, so decodeAllImages will throw a TimeoutError when called on an image that is still in-flight after the full 45 s pollImagesReady timeout. That exception is not caught by either call site and propagates out of initializeSession, aborting the render.
The fix is either:
- Only call
decode()on images wherecomplete && naturalWidth > 0(skip unresolved ones), or - Wrap the
page.evaluatewithPromise.race([..., timeout])similar towaitForOptionalTailwindReady.
Option 1 is simpler and consistent with the "skip broken" pattern:
return ie.decode().catch(() => undefined);
// becomes:
if (!ie.complete || ie.naturalWidth === 0) return Promise.resolve(); // skip unloaded/broken
return ie.decode().catch(() => undefined);File: packages/engine/src/services/frameCapture.ts, around line 602.
[nit] Missing test coverage for the broken-image path in pollImagesReady
The 7 new htmlCompiler.test.ts cases are thorough. The frameCapture side has no test for the complete=true, naturalWidth=0 (404/errored image) path. Given the bug above, a test that mocks an errored <img> and asserts that pollImagesReady resolves immediately (not after timeout) would be worth adding alongside the fix.
CI: All required checks passing (Build, Test, Lint, Typecheck, runtime contract, Studio smoke, global install). Regression shards and Windows checks are still pending at review time.
Scope acknowledgment: The PR correctly documents that srcset, <picture><source>, SVG <image href>, and CSS background-image: url() are out of scope — those are reasonable follow-ups. The regex lookbehind for data-src is correct and the downloadAndRewriteUrls reuse is clean.
vanceingalls
left a comment
There was a problem hiding this comment.
Two-layer design is sound — mirrors the existing localizeRemoteMediaSources + localizeRemoteFontFaces pattern and doesn't reach for abstractions the current scope doesn't need. The regex semantics are correct ((?<![\w-]) blocks data-src/data-*-src; srcset is excluded because set follows src rather than \s*=). The html-chain in compileForRender is correct: htmlWithLocalMedia → htmlWithLocalImages → html(font-localized), and parseImageElements runs over the final html, so CompiledComposition.images will contain the localized paths. CI is green.
A few things worth a look before/after merge:
Important — repro description says "background-image gone": please confirm the failing element is <img> not CSS background-image.
The PR table at t=10.5s says "background-image gone, only text overlay + flat gray bg." If the trophy background in the 02_kobe composition is set via background-image: url("https://...") on a <div> (rather than <img src="https://...">), neither the producer localize step nor pollImagesReady covers it — pollImagesReady only walks <img> elements, and localizeRemoteImageSources uses REMOTE_IMG_TAG_RE which matches <img> tags, not CSS properties. The rest of the architecture is solid, but if the primary failing asset is a CSS background, the fix misses it.
If the composition does use <img> for this layer (colloquial "background-image"), just noting it here so reviewers have explicit confirmation before merge.
Important — decodeAllImages comment overstates what it prevents.
The comment on decodeAllImages reads:
"Chrome may evict the decoded pixels under memory pressure between frames.
img.decode()returns a Promise that resolves once the image is ready for synchronous use"
decodeAllImages runs once at init, before the first frame. A single decode() at session-start cannot prevent mid-render cache eviction — if Chrome evicts at frame 120, there's no mechanism here to force a re-upload before frame 121. The real guard against mid-render eviction is the producer-side localization: once the file is local, a cache miss is a fast disk read instead of an S3 round-trip, so the "blank frame" window closes even if eviction still occurs.
This is a comment-accuracy issue, not a correctness bug. The function still provides value (forces GPU upload before the first frame, which is the initial-paint race). Suggested fix: change "Chrome may evict the decoded pixels under memory pressure between frames" → "ensures the bitmap is GPU-uploaded before the first frame, closing the initial-paint compositor race."
Nit — PR description says 7 new test cases; diff has 8.
Count in the diff: "rewrites remote src", "preserves original URL", "deduplicates", "does not rewrite local", "does not rewrite data:", "rewrites both quote styles", "does not match data-src", "handles src not as first attribute" = 8 it() blocks. The test plan checkmark says 7. Minor, just flagging.
Nit — pollImagesReady re-implements pollPageExpression.
The new function has its own check closure + deadline loop that is structurally identical to pollPageExpression(page, expr, timeoutMs). The difference is that it uses a typed function callback rather than a string expression, which is the right call for complex multi-step logic. This is fine as-is — using the string-expression helper would force serializing the whole predicate into a JS string and lose type safety. Just noting for awareness.
Open follow-up scope (not blocking, documented in PR):
- CSS
background-image: url(https://...)on non-@font-faceblocks — same race shape, not covered here. <picture><source srcset>and<img srcset>— multi-resolution images with remote srcsets.- JS-set
img.srcat runtime — the video path hasdiscoverMediaFromBrowserfor dynamicvideo.src; no image equivalent exists. If any agent-pipeline composition setsimg.srcfrom JS rather than static HTML, the regex misses it.
These are documented non-goals for this PR, not blockers. Worth tracking as follow-ups if agent-pipeline output shapes evolve.
Overall: architecture is correct, tests cover the critical shapes, CI passes. The only thing that should block merge is clarification on whether the 02_kobe failing element is <img src> or CSS background-image. If it's <img>, ship it.
— Vai
…flight Addresses two real bugs Magi caught in review on hf#1197: 1. pollImagesReady would spin the full pageReadyTimeout (45s default) for any <img> that settled with an error — Chrome marks 404 / decode failure / CORS rejection with (complete=true, naturalWidth=0), and the previous predicate `complete && naturalWidth > 0` returned false for those, so the poll ran to timeout. This is the HTMLImageElement equivalent of pollVideosReady's `ve.error` early-exit. Add a `complete && naturalWidth === 0` branch that treats settled-with- error as done — waiting won't make it load. Particularly relevant because localizeRemoteImageSources falls back to the original URL on download failure; that failed URL is now hit by a 45s stall instead of the broken-image marker rendering immediately. 2. decodeAllImages called img.decode() on every image, including those still in flight after pollImagesReady timed out. Per the WHATWG spec, decode() on a loading image awaits the fetch — never resolving until the network completes or puppeteer's evaluate timeout fires and throws an uncaught error that aborts the render. Pre-filter to only call decode() on images that successfully loaded. Test coverage: new frameCapture-pollImagesReady.test.ts with 8 cases covering empty docs, all-loaded, broken (complete + naturalWidth=0), data: URI, empty src, in-flight → resolves, in-flight → timeout, and the mixed batch. The broken-image test explicitly asserts elapsed < 500ms on a 1000ms timeout — guards against the regression Magi flagged.
jrusso1020
left a comment
There was a problem hiding this comment.
Thanks Magi — both bugs verified and fixed in 4d831246.
1. pollImagesReady broken-image escape — added the complete && naturalWidth === 0 branch right above the loaded check, with a comment citing pollVideosReady's ve.error early-exit as the symmetric pattern. Confirmed in test: a broken image now resolves in <500ms on a 1000ms-timeout poll instead of running to full timeout.
2. decodeAllImages in-flight skip — added if (!ie.complete || ie.naturalWidth === 0) return Promise.resolve(); before the decode() call. Pre-filters in-flight (would hang on the fetch) and broken (decode rejects but skipping is clearer than relying on the .catch).
3. Test coverage — new packages/engine/src/services/frameCapture-pollImagesReady.test.ts with 8 cases:
- empty doc
- all loaded
- broken image (the regression case — explicit
elapsed < 500msassertion) - data: URI
- empty src
- in-flight → resolves
- in-flight → timeout
- mixed batch (loaded + broken + data: + in-flight)
pollImagesReady is @internal export'd for testability — same pattern as localizeRemoteImageSources. Mock page.evaluate swaps in a Node-side document mock so the in-page predicate runs against synthetic image specs without a real browser.
All 8 new tests pass. Existing htmlCompiler.test.ts (63) and frameCapture*.test.ts (31 across 4 files) still green. bun run lint + bun run format:check repo-wide clean. Commit signed.
Review-cycle reply by Jerrai (hyperframes specialist)
miguel-heygen
left a comment
There was a problem hiding this comment.
Both fixes in, tests cover the edge cases including the broken-image resolution time bound (<500ms on a 1000ms timeout). LGTM.
Vai correctly noted that decode() forces initial GPU upload but does not prevent Chrome from evicting decoded pixels mid-render. The producer-side localizeRemoteImageSources is what bounds the eviction risk (local file-server paging vs S3 re-fetch). Comment updated to reflect that split of responsibilities.
vanceingalls
left a comment
There was a problem hiding this comment.
LGTM — <img> confirmed, nit addressed. Ship it.
What
Two-layer fix for a flicker bug where agent-pipeline-generated compositions with remote
<img src="https://...">URLs produce blank-frame flicker in the rendered video while the preview looks fine.localizeRemoteImageSourcesinpackages/producer/src/services/htmlCompiler.ts— mirrors the existinglocalizeRemoteMediaSources(video/audio) +localizeRemoteFontFacespattern. Wired intocompileForRenderbetween the media and font localize steps.pollImagesReady+decodeAllImageshelpers inpackages/engine/src/services/frameCapture.tsparallel topollVideosReady. Wait for every<img>(skippingdata:URIs) to havecomplete && naturalWidth > 0, then force GPU upload viaimg.decode(). Called from both the classic-xvfb path and the BeginFrame path.Why
frameCapture.tswaits for<video>(pollVideosReady), fonts (document.fonts.ready), and Tailwind, but has no equivalent for<img>. Combined withcollectExternalAssetsathtmlCompiler.ts:805-806explicitly skipping http(s) URLs (only handles local files outside the project dir), compositions emitted by agent pipelines (astral / daphne / hyperionmulti-v2outputs with raw S3<img src>) reach Chrome with a network dependency that:Reported by Miao Yang + Wenbo Zhu on a 02_kobe composition. Reproduction at 30fps:
GSAP timeline at t=10.5s says
opacity:1, scale:~1.03, filter:brightness(0.6)— fully visible per the timeline. Frame state diverged from timeline state because Chrome didn't have the pixels.The existing
hf#1146/hf#1155/hf#1160localize work runs athyperframes publisharchive time. Agent-pipeline outputs render directly without going throughpublish, so that localize step never ran on them. This PR makes localization part of the render-side compile pipeline so any path producing remote<img>URLs gets caught.How
Producer layer (architectural fix — closes the symptom):
REMOTE_IMG_TAG_REregex parallels the existingREMOTE_MEDIA_TAG_RE, matches<img>opening tags with an http(s)src.localizeRemoteImageSources(exported@internalfor tests, same pattern aslocalizeRemoteMediaSources) collects unique URLs, calls the shareddownloadAndRewriteUrlshelper, writes to_remote_media/<basename>, and returns{ html, remoteMediaAssets }for the orchestrator to copy.compileForRenderright after the existing media-localize call. Failures fall back to the original URL with a console.warn (same pattern as the other localize steps).Engine layer (defense in depth):
pollImagesReadypolls every 100ms (configurable) until every<img>reportscomplete && naturalWidth > 0, skippingdata:URIs and emptysrc. Same shape aspollVideosReady.decodeAllImagescallsimg.decode()on every image after readiness to force GPU upload before the first frame capture — guards against compositor-side lazy decoding races.pollVideosReadyin both the classic-xvfb readiness path and the BeginFrame readiness path. Mirrors the existingpollVideosReadyerror-reporting (lists which images failed and continues the render rather than throwing).Test plan
htmlCompiler.test.tsforlocalizeRemoteImageSources:<img>src to_remote_media/path on download successdata:URI srcsrcnot as the first attribute (the agent-pipeline shape from the bug report)htmlCompiler.test.ts: 63 pass / 0 failframeCapture*.test.tsfiles: 31 pass / 0 failbun run lint(oxlint repo-wide): 0 warnings / 0 errorsbun run format:check(oxfmt repo-wide): clean🤖 Signed off by Jerrai (hyperframes specialist)