Skip to content

feat(core): ensureHfIds node-id pass + shared hf- mint helper (R1)#1269

Merged
vanceingalls merged 5 commits into
mainfrom
06-07-feat_core_ensurehfids_node-id_pass_shared_hf-_mint_helper_r1_
Jun 9, 2026
Merged

feat(core): ensureHfIds node-id pass + shared hf- mint helper (R1)#1269
vanceingalls merged 5 commits into
mainfrom
06-07-feat_core_ensurehfids_node-id_pass_shared_hf-_mint_helper_r1_

Conversation

@vanceingalls

@vanceingalls vanceingalls commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

What

Adds ensureHfIds(html: string): string in packages/core/src/parsers/hfIds.ts. Single DOM pass (via linkedom) that mints a data-hf-id attribute on every eligible element before the caller sees the markup.

Id derivation: FNV-1a 32-bit hash of tagName | sorted-attrs(\x00/\x01 separated) | ownText, last 4 chars of base-36, hf- prefix. Collision resolution appends a sibling counter and re-hashes. Preserves existing ids (elements with data-hf-id already set are skipped). Excludes non-visual tags: script, style, template, meta, link, noscript, base.

Fragment handling: detects bare HTML fragments (no <!doctype / <html) and wraps in a full document shell before parsing, then returns body.innerHTML — matching the pattern used by parseSourceDocument in sourceMutation.

Why

Counter-based ids (element-0, element-1, …) are positional. Inserting a new layer at position 0 shifts every id below it. The R1 milestone requires content-based, stable ids so that targeting operations (split, patch, probe) stay valid across re-parses and element insertions. T2 spec (stableIds.test.ts) defines the contract: same content → same id, adding a sibling doesn't change other ids, format matches /^hf-[a-z0-9]{4}$/.

How

  • toHfId(hash)slice(-4) of hash.toString(36) for better distribution across the suffix space
  • data-hf-id is excluded from the hash input (prevents circular dependency on the attribute being set)
  • Already-assigned ids tracked in a Set; duplicates get a counter suffix before re-hashing
  • Fragment detection: /<!doctype|<html[\s>]/i.test(html) — if bare, wrap→parse→body.innerHTML

Test plan

@vanceingalls vanceingalls force-pushed the 06-07-feat_core_ensurehfids_node-id_pass_shared_hf-_mint_helper_r1_ branch from b2bfae7 to e275151 Compare June 8, 2026 17:24
@vanceingalls vanceingalls marked this pull request as draft June 8, 2026 18:09
@vanceingalls vanceingalls marked this pull request as ready for review June 8, 2026 21:44

@miguel-heygen miguel-heygen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid foundation for R1. The design choices all hold up — FNV-1a is the right hash for this (pure, deterministic, fast), content-key exclusion of data-hf-* is correct, and the fragment-detection/wrap/unwrap mirrors what parseSourceDocument does. Tests are excellent, especially the "KNOWN LIMITATION" assertions that pin the lifecycle gap honestly.

One thing worth flagging:

P3 — collision limit throws in production:

if (dup > 1000) throw new Error("ensureHfIds: hash collision limit exceeded");

1000+ identical elements (same tag, attrs, text) is unlikely in real compositions, but an uncaught throw from ensureHfIds during rendering would crash the entire parse call. Prefer a graceful fallback — e.g. ${key}#pos${positionIndex} or a counter suffix derived from the element's position in the body.querySelectorAll("*") walk — over a hard throw. The Set-based collision resolution already handles the common case; this guard is only for the adversarial edge.

Everything else looks correct. ✅

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walked hfIds.ts end-to-end. The mint algorithm is well-thought: FNV-1a 32-bit over tagName | sorted(non-data-hf-* attrs) | ownText, last-4 base-36 chars for the id, seeds the assigned set with pre-existing pins BEFORE walking so fresh mints never collide with persisted ids, and the test suite explicitly pins BOTH the R1 guarantees (idempotence, pinned-id survival, distinct-sibling uniqueness, determinism) AND the four known limitations of pure-content-hash mode (edit-drifts-id, attr-edit-drifts-id, identical-sibling-2nd-occurrence-has-no-content-stable-handle). That last block — KNOWN LIMITATION tests with comments pointing at the design doc's "Implementation status & verified lifecycle gap" — is exactly the right shape for a foundation PR. Future contributors won't accidentally rely on guarantees that don't hold yet.

Concerns

[The assigned set is seeded only from document.querySelectorAll("[data-hf-id]"), but minting walks body.querySelectorAll("*")] Lines 79-82:

const assigned = new Set<string>();
for (const el of Array.from(document.querySelectorAll("[data-hf-id]"))) {
  const existing = el.getAttribute("data-hf-id");
  if (existing) assigned.add(existing);
}

The seed walks the WHOLE document including <head> elements; minting walks only <body>. If a <head> element somehow carries a data-hf-id (it shouldn't, but <title> could be edited with one in a hand-rolled HTML), the head's id is pinned into assigned and a body element trying to mint that same value gets bumped to the dup branch.

Probably never fires in practice — but the asymmetry is worth either (a) seeding from body.querySelectorAll("[data-hf-id]") to match the mint scope, or (b) one-line comment noting the deliberate pin-across-document choice. Right now it reads like an inconsistency.

[Two parses per parseHtml call] ensureHfIds uses linkedom to mint, then the existing parseHtml re-parses with DOMParser. Each call to parseHtml (which is in the hot path for Studio operations, per the briefing) now does:

  1. linkedom parseHTML + walk + serialize → output HTML
  2. DOMParser parseFromString → AST for clip extraction

Likely fine at clip-model scale, but worth one of: (a) a perf note in the PR body, (b) a benchmark on the largest realistic clip-model size, or (c) a flag to skip ensureHfIds when the input is already known to be from generateElementHtml (which itself emits data-hf-id).

Also worth verifying: linkedom serialization is byte-stable across a parseHTML → toString() round-trip when the input has a <!doctype> shell. The idempotence test (expect(twice).toBe(once)) pins f(f(x)) === f(x) — but doesn't pin f(x) === x for inputs that already have all ids minted. If linkedom normalizes attribute quoting or whitespace, every parseHtml call subtly mutates the source. Probably fine since downstream consumers don't care, but worth confirming.

[4-char id space, no rationale comment] ~1.68M total ids (36^4). Birthday-paradox collision at N elements: P ≈ N²/(2·36^4). At 1000 elements in a single document: ~30% probability of at least one collision. The dup++ rehash handles it cleanly (and the 1000-iter throw is a safety net), but a one-line comment in toHfId on the chosen width vs. larger alternatives would help future readers — especially since the test fixture (fnv1a of <p class="x">same</p> minted once then minted again deliberately collides) shows the dup path actually runs.

Not asking to bump to 6 chars (4 is plenty for real clip-models) — just a // 4 chars · 36^4 ≈ 1.68M, sufficient for ≤1000 elements/doc with <1% collision per doc after dup rehash would document the choice.

Nits

[mintHfId is exported but only consumed internally] The doc comment says "shared by htmlParser for clip ids", but #1270's parseHtml reads el.getAttribute("data-hf-id") directly without calling mintHfId. If the future intent is to import mintHfId from parsers/hfIds.ts in htmlParser.ts, fine — flag it. If not, demoting it to non-exported tightens the public surface.

[Hash-input encoding] ${a.name}\x00${a.value} + \x01 between pairs is a good choice (control chars are invalid in HTML attribute names/values per the spec) — but el.tagName.toLowerCase() is concatenated with | separator on the outer level. If an attribute somehow contained a literal | (which IS valid in HTML), it'd be ambiguous with the separator. Edge case — but consistency with the \x00/\x01 choice would suggest using \x02 as the outer separator too. Cosmetic.

[EXCLUDED_TAGS lookup is per-element] Inside the hot loop, EXCLUDED_TAGS.has(el.tagName.toLowerCase()) recomputes the lowercase tag name for every element. Could cache el.tagName.toLowerCase() in a local once. Sub-microsecond savings; ignore unless profiling says otherwise.

Questions

[Document-vs-body scope for the pin seed] Same as the first concern but framed as intent: was pinning across the whole document deliberate (defense for future use of <head> ids?), or coincidental from copying querySelectorAll without scoping to body?

[Is there a plan for the wired-in callers (R2+) to also call ensureHfIds?] Right now ensureHfIds runs inside parseHtml only. If R2's mutation flow round-trips HTML without going through parseHtml, the data-hf-id values could go stale. Knowing whether ensureHfIds will be a guard at multiple boundaries or just one would help me reason about the contract.

CI status

CLI smoke (required) shows FAILURE — read the log: Navigation timeout of 60000 ms exceeded during capture_calibration, with browserGpuMode auto → software (WebGL unavailable) preamble. This is the headless-shell calibration timeout you've been actively fixing in #1233 / #1237 / #1239. The diff in this PR is hfIds.ts + hfIds.test.ts — pure parser-level code, zero render-path impact. CI infra flake, not a real failure. Re-run should pass.

Other 36 checks SUCCESS, 8 pending (regression-shards, Windows, etc).

What I didn't verify

  • That parseHTML(html).toString() for the has-document-shell case is genuinely byte-stable across input → output for inputs that don't change semantically. The PR's idempotence test pins f(f(x)) === f(x) but not f(x) === x for fully-pinned inputs.
  • That linkedom and DOMParser agree on attribute parsing edge cases (e.g. unquoted attributes, attributes with embedded >, etc). If they disagree, the cross-parser hop could produce different element sets — but the test fixtures look pretty clean.
  • The expected throughput of parseHtml on the largest realistic clip-model size; the perf comment is theoretical until measured.

Verdict

Keystone PR for the R1 stack. Mint algorithm is sound, test discipline is exemplary (pinning both guarantees and known limitations is the right move for foundation work), and the API surface is small. The two concerns above are worth addressing but neither is structurally blocking — most worth fixing is the assigned-set scope asymmetry, which is a 1-line change. Clean execution from where I sit; leaving as a comment.

Review by Rames D Jusso

vanceingalls added a commit that referenced this pull request Jun 8, 2026
…iew)

Addresses review on #1269 (Miguel P3 + Rames):
- mintHfId no longer throws on the pathological collision cap; it widens the id
  with the dup counter instead, so a degenerate document can never crash the
  whole parse call.
- Seed the `assigned` set from `body.querySelectorAll("[data-hf-id]")` (was the
  whole document) to match the mint walk's scope — a stray head id can't bump a
  body element off its hash.
- Document the 4-char id-space rationale in toHfId.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vanceingalls

Copy link
Copy Markdown
Collaborator Author

Addressed review (daecb2d): P3mintHfId no longer throws on the collision cap; it widens the id with the dup counter so a degenerate doc can't crash the parse. Rames — seed the assigned set from body.querySelectorAll (was whole document) to match the mint scope, and added the 4-char id-space rationale comment to toHfId. Lint/test green.

vanceingalls and others added 5 commits June 8, 2026 21:02
- Filter all data-hf-* attrs from contentKey (not just data-hf-id and data-hf-studio-*)
- Use \x00/\x01 byte separators to eliminate ambiguous & and = in attr serialization
- Use suffix slice for toHfId (most-avalanched FNV bits)
- Add 1000-iteration cap to mintHfId collision loop
- Fix misleading "Browser-safe" comment (linkedom is Node-only)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…apping

- EXCLUDED_TAGS: add link, noscript, base (non-editable body elements that
  should never receive a stable id and become mutation targets)
- ensureHfIds: wrap bare HTML fragments before parsing, matching the logic
  in parseSourceDocument — without this, linkedom places bare-fragment
  content outside <body> and body.querySelectorAll returns nothing, so no
  ids are minted and document.toString() returns a corrupted structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Locks the pin guarantee and documents the two current limitations (no
write-back yet, design §3): pinned id survives a content edit; an unpinned id
drifts on text/attribute edits (flip to .toBe once write-back lands); a second
identical-content sibling has only a position-derived dedup id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iew)

Addresses review on #1269 (Miguel P3 + Rames):
- mintHfId no longer throws on the pathological collision cap; it widens the id
  with the dup counter instead, so a degenerate document can never crash the
  whole parse call.
- Seed the `assigned` set from `body.querySelectorAll("[data-hf-id]")` (was the
  whole document) to match the mint walk's scope — a stray head id can't bump a
  body element off its hash.
- Document the 4-char id-space rationale in toHfId.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vanceingalls vanceingalls force-pushed the 06-07-feat_core_ensurehfids_node-id_pass_shared_hf-_mint_helper_r1_ branch from daecb2d to 4476246 Compare June 9, 2026 04:03
@vanceingalls vanceingalls merged commit c12854b into main Jun 9, 2026
46 checks passed
@vanceingalls vanceingalls deleted the 06-07-feat_core_ensurehfids_node-id_pass_shared_hf-_mint_helper_r1_ branch June 9, 2026 06:28
vanceingalls added a commit that referenced this pull request Jun 9, 2026
…Target (R7, T5a) (#1296)

## Summary

- Adds `hfId` field to `resolveDomEditSelection` — reads `data-hf-id` off the live element and stores it in `DomEditSelection.hfId`
- `DomEditSelection extends PatchTarget` which already declares `hfId?: string`, so this is a single new line at the return site
- Widens `MutationTarget` in `files.ts` to include `hfId?: string` (type hygiene — the value already survives through `parseMutationBody`'s by-reference pass, so this is documentation not a behaviour change)

## Why

R7 / Task 5a. The full hf-id write-back and patch-engine infrastructure (R1 + R7 Tasks 0–4, PRs #1269#1292) is server-complete. The only missing piece was: the Studio client never read `data-hf-id` off a hit-tested element, so `target.hfId` was always `undefined` and the `hfId`-first lookup branches in both patch engines were unreachable in production. This PR fixes the selection side — the commit wire (#1297) completes the path.

## Test plan

- [ ] `packages/studio/src/components/editor/domEditingLayers.test.ts` — two new tests with jsdom environment:
  - `resolveDomEditSelection` on an element with `data-hf-id` → `selection.hfId` is populated
  - element without `data-hf-id` → `selection.hfId` is `undefined`
- [ ] All 65 studio test files pass, all 72 core test files pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants