Skip to content

fix(producer): localize remote <img> sources + await image readiness#1197

Merged
jrusso1020 merged 4 commits into
mainfrom
fix/producer-await-image-loads
Jun 4, 2026
Merged

fix(producer): localize remote <img> sources + await image readiness#1197
jrusso1020 merged 4 commits into
mainfrom
fix/producer-await-image-loads

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

What

Two-layer fix for a flicker bug where agent-pipeline-generated compositions with remote <img src="https://..."> URLs produce blank-frame flicker in the rendered video while the preview looks fine.

  • Producer: new localizeRemoteImageSources in packages/producer/src/services/htmlCompiler.ts — mirrors the existing localizeRemoteMediaSources (video/audio) + localizeRemoteFontFaces pattern. Wired into compileForRender between the media and font localize steps.
  • Engine: new pollImagesReady + decodeAllImages helpers in packages/engine/src/services/frameCapture.ts parallel to pollVideosReady. Wait for every <img> (skipping data: URIs) to have complete && naturalWidth > 0, then force GPU upload via img.decode(). Called from both the classic-xvfb path and the BeginFrame path.

Why

frameCapture.ts waits for <video> (pollVideosReady), fonts (document.fonts.ready), and Tailwind, but has no equivalent for <img>. Combined with collectExternalAssets at htmlCompiler.ts:805-806 explicitly skipping http(s) URLs (only handles local files outside the project dir), compositions emitted by agent pipelines (astral / daphne / hyperion multi-v2 outputs with raw S3 <img src>) reach Chrome with a network dependency that:

  1. Can race the readiness gate — first frames capture before pixels are decoded → blank.
  2. Can be evicted from Chrome's decoded-pixel cache mid-render under memory pressure → re-fetch from S3 → frame captured mid-fetch → blank flicker.

Reported by Miao Yang + Wenbo Zhu on a 02_kobe composition. Reproduction at 30fps:

Time PNG size State
t=6.5s (scene_01 end) 719 KB Kobe headshot fully painted
t=7.0–10.0s (scene_02) 741–943 KB trophy background visible
t=10.5s 139 KB flicker — background-image gone, only text overlay + flat gray bg
t=11.0s 839 KB back to normal

GSAP timeline at t=10.5s says opacity:1, scale:~1.03, filter:brightness(0.6) — fully visible per the timeline. Frame state diverged from timeline state because Chrome didn't have the pixels.

The existing hf#1146 / hf#1155 / hf#1160 localize work runs at hyperframes publish archive time. Agent-pipeline outputs render directly without going through publish, so that localize step never ran on them. This PR makes localization part of the render-side compile pipeline so any path producing remote <img> URLs gets caught.

How

Producer layer (architectural fix — closes the symptom):

  • REMOTE_IMG_TAG_RE regex parallels the existing REMOTE_MEDIA_TAG_RE, matches <img> opening tags with an http(s) src.
  • localizeRemoteImageSources (exported @internal for tests, same pattern as localizeRemoteMediaSources) collects unique URLs, calls the shared downloadAndRewriteUrls helper, writes to _remote_media/<basename>, and returns { html, remoteMediaAssets } for the orchestrator to copy.
  • Wired into compileForRender right after the existing media-localize call. Failures fall back to the original URL with a console.warn (same pattern as the other localize steps).

Engine layer (defense in depth):

  • pollImagesReady polls every 100ms (configurable) until every <img> reports complete && naturalWidth > 0, skipping data: URIs and empty src. Same shape as pollVideosReady.
  • decodeAllImages calls img.decode() on every image after readiness to force GPU upload before the first frame capture — guards against compositor-side lazy decoding races.
  • Both called after pollVideosReady in both the classic-xvfb readiness path and the BeginFrame readiness path. Mirrors the existing pollVideosReady error-reporting (lists which images failed and continues the render rather than throwing).

Test plan

  • Unit tests added — 7 new cases in htmlCompiler.test.ts for localizeRemoteImageSources:
    • rewrites remote <img> src to _remote_media/ path on download success
    • preserves original URL on 404 (no throw)
    • dedups duplicate URLs (one download per unique URL)
    • does not rewrite local (non-HTTP) src paths
    • does not rewrite data: URI src
    • rewrites both double- and single-quoted attributes
    • handles src not as the first attribute (the agent-pipeline shape from the bug report)
  • Full htmlCompiler.test.ts: 63 pass / 0 fail
  • All four frameCapture*.test.ts files: 31 pass / 0 fail
  • bun run lint (oxlint repo-wide): 0 warnings / 0 errors
  • bun run format:check (oxfmt repo-wide): clean
  • Commit signed (verified by GitHub)
  • Manual verification on the 02_kobe reproduction composition once a render env is available
  • Documentation updated (if applicable) — no docs surface to update; the fix is transparent

🤖 Signed off by Jerrai (hyperframes specialist)

jrusso1020 and others added 2 commits June 4, 2026 06:32
Producer's frame-capture has `pollVideosReady` (waits readyState >= 2 for
every <video>) but no equivalent for <img>. Combined with htmlCompiler's
`collectExternalAssets` explicitly skipping http(s) URLs (line 805-806),
agent-pipeline-generated compositions (astral / daphne / hyperion
multi-v2 outputs with raw S3 <img src>) reach Chrome with a network
dependency that races the readiness gate AND can be evicted mid-render.
Either path produces blank-frame flicker.

Reproduction (02_kobe agent output, 42s render @ 30fps): scene_02's
remote S3 background-image painted from t=7.0s, vanished at t=10.5s
(frame size 139KB vs 700-940KB neighbors), back at t=11.0s. GSAP
timeline said opacity:1 throughout — Chrome simply didn't have the
pixels.

Two-layer fix:

1. **Producer** — `localizeRemoteImageSources` in `htmlCompiler.ts`
   mirrors the existing `localizeRemoteMediaSources` (video/audio) +
   `localizeRemoteFontFaces` pattern, reusing `downloadAndRewriteUrls`
   and the `_remote_media/` subdir. Wired into `compileForRender`
   between the media and font localize steps. Once the file is local,
   Chrome's image cache is bounded by disk reads, not S3 latency.

2. **Engine** — `pollImagesReady` + `decodeAllImages` helpers in
   `frameCapture.ts` parallel to `pollVideosReady`. Waits for every
   `<img>` (skipping data: URIs) to have `complete && naturalWidth > 0`,
   then forces GPU upload via `img.decode()`. Called from both the
   classic-xvfb path and the BeginFrame path after their respective
   video readiness checks. Defense-in-depth — Layer 1 closes the
   symptom for current+future agent-pipeline outputs; Layer 2 protects
   any future code path that leaves a remote URL in place.

Tests: 7 new cases in `htmlCompiler.test.ts` covering happy-path
rewrite, 404 fallback, dedup of duplicate URLs, non-HTTP and data:
URI passthrough, both quote styles, and the agent-pipeline shape where
`src` is not the first attribute. All pass alongside the existing 56
htmlCompiler tests.
…ents

Review follow-ups on the remote-<img> localization fix:

- Tighten REMOTE_IMG_TAG_RE with a (?<![\w-]) lookbehind so it matches a
  real `src` attribute only. The previous `\bsrc` also matched `data-src`
  (and `data-*-src`) lazy-loader placeholders, which would download/rewrite
  a URL the render never paints. Added a regression test; `srcset` stays
  excluded by the `\s*=` requirement.
- Fix comments that claimed frameCapture has "no pollImagesReady analog" —
  this PR adds exactly that, so the docstrings were self-contradictory.
  Reframed localization as the primary fix and pollImagesReady as the
  defense-in-depth layer, and documented the <img src>-only scope
  (srcset / <picture> / SVG <image> / CSS background-image are follow-ups).

Verified locally end-to-end on the 02_kobe repro: all 4 remote S3 <img>
URLs localize to _remote_media/, the render completes, and the frame at
t~10.5s that was a 139KB blank in the broken render now paints the trophy
background in every native-fps frame. htmlCompiler.test.ts 64 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jrusso1020
Copy link
Copy Markdown
Collaborator Author

Code review — reviewed at high effort, fixes pushed (68928181)

Read the full diff + enclosing functions, traced the regex against the real 02_kobe composition, and ran a local end-to-end render. The design is sound — localize-before-render is the right primary fix and pollImagesReady/decodeAllImages is a reasonable defense-in-depth layer, mirroring the existing video/audio contract.

Verified the fix targets the right element

The Slack root cause described the flickering element as .background-image, which sounded like it could be a CSS background-image: url() (which nothing here localizes). Confirmed it is in fact a real <img class="background-image" src="https://…s3…"> element in all of scene_02/04/06 — so both layers of this PR correctly cover it. The [^>] (not .) class also correctly spans the multiline agent-pipeline <img> tags.

Two issues found and fixed in 68928181

  1. Regex over-matched data-src (correctness). \bsrc matched data-src / data-*-src because - is a non-word char, so a lazy-loader placeholder URL would be downloaded and rewritten even though Chrome paints a different src. Tightened to (?<![\w-])src; added a regression test. (srcset was already correctly excluded by the \s*=.)
  2. Self-contradictory comments (docs). Two comments stated frameCapture has no pollImagesReady analog — but this same PR adds one. Reframed localization as the primary fix + pollImagesReady as defense-in-depth, and documented the <img src>-only scope.

Local end-to-end verification (the manual check called out as open)

  • localizeRemoteImageSources on the real index.html: all 4 remote S3 <img> URLs → _remote_media/ (real downloads).
  • Full hyperframes render of 02_kobe completes; render log shows Localized remote image source(s) 4 to _remote_media/.
  • The frame at t≈10.5s that was a 139 KB blank in the broken MP4 now paints the trophy background in every native-fps frame across the t=10.0–11.0s window (724–752 KB, no blank outlier). Flicker is gone.
  • htmlCompiler.test.ts 64 pass, frameCapture*.test.ts 31 pass, oxlint/oxfmt clean, producer typecheck clean.

Non-blocking follow-ups (not addressed here)

  • Remote CSS background-image: url() outside @font-face, srcset, <picture><source>, and SVG <image href> are still not localized. Out of scope for this bug (agent compositions emit plain <img src>), now documented in-code as follow-ups.
  • localizeRemoteImageSources / …MediaSources / …FontFaces are three near-identical collect→downloadAndRewriteUrls functions, and the pollImagesReady warn block is duplicated across the xvfb + BeginFrame paths. Both follow existing precedent; a later pass could generalize them.

LGTM once CI is green.

Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix — the two-layer approach (localize-at-compile + poll-at-capture) is the right architecture, the regex and downloadAndRewriteUrls reuse are solid, and the 7 new test cases cover the important shapes. Two real issues to address before merge, plus one cleanup note.


[bug] pollImagesReady has no broken-image escape — causes full timeout delay for errored/404 imgs

pollVideosReady exits early on errored video elements with if (ve.error) return true. pollImagesReady has no equivalent, and HTMLImageElement doesn't have a .error property — but for a broken/404 image Chrome sets complete = true with naturalWidth = 0. The current predicate complete && naturalWidth > 0 returns false for that case, so the poll spins the full pageReadyTimeout (45 s default) before falling through to the warning.

This is a regression for any composition that has a broken <img> (including the 404-fallback path in localizeRemoteImageSources when a remote URL fails to download). Before this PR, a broken image had no wait; after this PR it silently adds a 45 s delay.

Fix is a one-liner inside the imgs.every callback — treat complete=true && naturalWidth===0 as done:

if (ie.complete && ie.naturalWidth === 0) return true; // broken/404 — can't get better, skip
if (ie.complete && ie.naturalWidth > 0) return true;   // successfully loaded
return false;

File: packages/engine/src/services/frameCapture.ts, around line 575.


[bug] decodeAllImages can throw (uncaught) when called on still-in-flight images

decodeAllImages calls img.decode() on every image after pollImagesReady finishes — including images that timed out with complete = false (still fetching). Per the WHATWG spec, img.decode() on a loading image awaits the fetch result; it only rejects on error, not on timeout. Puppeteer's page.evaluate has a 30 s default timeout, so decodeAllImages will throw a TimeoutError when called on an image that is still in-flight after the full 45 s pollImagesReady timeout. That exception is not caught by either call site and propagates out of initializeSession, aborting the render.

The fix is either:

  1. Only call decode() on images where complete && naturalWidth > 0 (skip unresolved ones), or
  2. Wrap the page.evaluate with Promise.race([..., timeout]) similar to waitForOptionalTailwindReady.

Option 1 is simpler and consistent with the "skip broken" pattern:

return ie.decode().catch(() => undefined);
// becomes:
if (!ie.complete || ie.naturalWidth === 0) return Promise.resolve(); // skip unloaded/broken
return ie.decode().catch(() => undefined);

File: packages/engine/src/services/frameCapture.ts, around line 602.


[nit] Missing test coverage for the broken-image path in pollImagesReady

The 7 new htmlCompiler.test.ts cases are thorough. The frameCapture side has no test for the complete=true, naturalWidth=0 (404/errored image) path. Given the bug above, a test that mocks an errored <img> and asserts that pollImagesReady resolves immediately (not after timeout) would be worth adding alongside the fix.


CI: All required checks passing (Build, Test, Lint, Typecheck, runtime contract, Studio smoke, global install). Regression shards and Windows checks are still pending at review time.

Scope acknowledgment: The PR correctly documents that srcset, <picture><source>, SVG <image href>, and CSS background-image: url() are out of scope — those are reasonable follow-ups. The regex lookbehind for data-src is correct and the downloadAndRewriteUrls reuse is clean.

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two-layer design is sound — mirrors the existing localizeRemoteMediaSources + localizeRemoteFontFaces pattern and doesn't reach for abstractions the current scope doesn't need. The regex semantics are correct ((?<![\w-]) blocks data-src/data-*-src; srcset is excluded because set follows src rather than \s*=). The html-chain in compileForRender is correct: htmlWithLocalMedia → htmlWithLocalImages → html(font-localized), and parseImageElements runs over the final html, so CompiledComposition.images will contain the localized paths. CI is green.

A few things worth a look before/after merge:


Important — repro description says "background-image gone": please confirm the failing element is <img> not CSS background-image.

The PR table at t=10.5s says "background-image gone, only text overlay + flat gray bg." If the trophy background in the 02_kobe composition is set via background-image: url("https://...") on a <div> (rather than <img src="https://...">), neither the producer localize step nor pollImagesReady covers it — pollImagesReady only walks <img> elements, and localizeRemoteImageSources uses REMOTE_IMG_TAG_RE which matches <img> tags, not CSS properties. The rest of the architecture is solid, but if the primary failing asset is a CSS background, the fix misses it.

If the composition does use <img> for this layer (colloquial "background-image"), just noting it here so reviewers have explicit confirmation before merge.


Important — decodeAllImages comment overstates what it prevents.

The comment on decodeAllImages reads:

"Chrome may evict the decoded pixels under memory pressure between frames. img.decode() returns a Promise that resolves once the image is ready for synchronous use"

decodeAllImages runs once at init, before the first frame. A single decode() at session-start cannot prevent mid-render cache eviction — if Chrome evicts at frame 120, there's no mechanism here to force a re-upload before frame 121. The real guard against mid-render eviction is the producer-side localization: once the file is local, a cache miss is a fast disk read instead of an S3 round-trip, so the "blank frame" window closes even if eviction still occurs.

This is a comment-accuracy issue, not a correctness bug. The function still provides value (forces GPU upload before the first frame, which is the initial-paint race). Suggested fix: change "Chrome may evict the decoded pixels under memory pressure between frames" → "ensures the bitmap is GPU-uploaded before the first frame, closing the initial-paint compositor race."


Nit — PR description says 7 new test cases; diff has 8.

Count in the diff: "rewrites remote src", "preserves original URL", "deduplicates", "does not rewrite local", "does not rewrite data:", "rewrites both quote styles", "does not match data-src", "handles src not as first attribute" = 8 it() blocks. The test plan checkmark says 7. Minor, just flagging.


Nit — pollImagesReady re-implements pollPageExpression.

The new function has its own check closure + deadline loop that is structurally identical to pollPageExpression(page, expr, timeoutMs). The difference is that it uses a typed function callback rather than a string expression, which is the right call for complex multi-step logic. This is fine as-is — using the string-expression helper would force serializing the whole predicate into a JS string and lose type safety. Just noting for awareness.


Open follow-up scope (not blocking, documented in PR):

  • CSS background-image: url(https://...) on non-@font-face blocks — same race shape, not covered here.
  • <picture><source srcset> and <img srcset> — multi-resolution images with remote srcsets.
  • JS-set img.src at runtime — the video path has discoverMediaFromBrowser for dynamic video.src; no image equivalent exists. If any agent-pipeline composition sets img.src from JS rather than static HTML, the regex misses it.

These are documented non-goals for this PR, not blockers. Worth tracking as follow-ups if agent-pipeline output shapes evolve.


Overall: architecture is correct, tests cover the critical shapes, CI passes. The only thing that should block merge is clarification on whether the 02_kobe failing element is <img src> or CSS background-image. If it's <img>, ship it.

— Vai

…flight

Addresses two real bugs Magi caught in review on hf#1197:

1. pollImagesReady would spin the full pageReadyTimeout (45s default)
   for any <img> that settled with an error — Chrome marks 404 / decode
   failure / CORS rejection with (complete=true, naturalWidth=0), and
   the previous predicate `complete && naturalWidth > 0` returned false
   for those, so the poll ran to timeout. This is the HTMLImageElement
   equivalent of pollVideosReady's `ve.error` early-exit. Add a
   `complete && naturalWidth === 0` branch that treats settled-with-
   error as done — waiting won't make it load. Particularly relevant
   because localizeRemoteImageSources falls back to the original URL on
   download failure; that failed URL is now hit by a 45s stall instead
   of the broken-image marker rendering immediately.

2. decodeAllImages called img.decode() on every image, including those
   still in flight after pollImagesReady timed out. Per the WHATWG spec,
   decode() on a loading image awaits the fetch — never resolving
   until the network completes or puppeteer's evaluate timeout fires
   and throws an uncaught error that aborts the render. Pre-filter to
   only call decode() on images that successfully loaded.

Test coverage: new frameCapture-pollImagesReady.test.ts with 8 cases
covering empty docs, all-loaded, broken (complete + naturalWidth=0),
data: URI, empty src, in-flight → resolves, in-flight → timeout, and
the mixed batch. The broken-image test explicitly asserts elapsed <
500ms on a 1000ms timeout — guards against the regression Magi flagged.
Copy link
Copy Markdown
Collaborator Author

@jrusso1020 jrusso1020 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Magi — both bugs verified and fixed in 4d831246.

1. pollImagesReady broken-image escape — added the complete && naturalWidth === 0 branch right above the loaded check, with a comment citing pollVideosReady's ve.error early-exit as the symmetric pattern. Confirmed in test: a broken image now resolves in <500ms on a 1000ms-timeout poll instead of running to full timeout.

2. decodeAllImages in-flight skip — added if (!ie.complete || ie.naturalWidth === 0) return Promise.resolve(); before the decode() call. Pre-filters in-flight (would hang on the fetch) and broken (decode rejects but skipping is clearer than relying on the .catch).

3. Test coverage — new packages/engine/src/services/frameCapture-pollImagesReady.test.ts with 8 cases:

  • empty doc
  • all loaded
  • broken image (the regression case — explicit elapsed < 500ms assertion)
  • data: URI
  • empty src
  • in-flight → resolves
  • in-flight → timeout
  • mixed batch (loaded + broken + data: + in-flight)

pollImagesReady is @internal export'd for testability — same pattern as localizeRemoteImageSources. Mock page.evaluate swaps in a Node-side document mock so the in-page predicate runs against synthetic image specs without a real browser.

All 8 new tests pass. Existing htmlCompiler.test.ts (63) and frameCapture*.test.ts (31 across 4 files) still green. bun run lint + bun run format:check repo-wide clean. Commit signed.

Review-cycle reply by Jerrai (hyperframes specialist)

Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both fixes in, tests cover the edge cases including the broken-image resolution time bound (<500ms on a 1000ms timeout). LGTM.

Vai correctly noted that decode() forces initial GPU upload but does not
prevent Chrome from evicting decoded pixels mid-render. The producer-side
localizeRemoteImageSources is what bounds the eviction risk (local
file-server paging vs S3 re-fetch). Comment updated to reflect that split
of responsibilities.
Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — <img> confirmed, nit addressed. Ship it.

@jrusso1020 jrusso1020 merged commit 72c461d into main Jun 4, 2026
42 checks passed
@jrusso1020 jrusso1020 deleted the fix/producer-await-image-loads branch June 4, 2026 07:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants