feat(core): ensureHfIds node-id pass + shared hf- mint helper (R1)#1269
Conversation
b2bfae7 to
e275151
Compare
miguel-heygen
left a comment
There was a problem hiding this comment.
Solid foundation for R1. The design choices all hold up — FNV-1a is the right hash for this (pure, deterministic, fast), content-key exclusion of data-hf-* is correct, and the fragment-detection/wrap/unwrap mirrors what parseSourceDocument does. Tests are excellent, especially the "KNOWN LIMITATION" assertions that pin the lifecycle gap honestly.
One thing worth flagging:
P3 — collision limit throws in production:
if (dup > 1000) throw new Error("ensureHfIds: hash collision limit exceeded");1000+ identical elements (same tag, attrs, text) is unlikely in real compositions, but an uncaught throw from ensureHfIds during rendering would crash the entire parse call. Prefer a graceful fallback — e.g. ${key}#pos${positionIndex} or a counter suffix derived from the element's position in the body.querySelectorAll("*") walk — over a hard throw. The Set-based collision resolution already handles the common case; this guard is only for the adversarial edge.
Everything else looks correct. ✅
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
Walked hfIds.ts end-to-end. The mint algorithm is well-thought: FNV-1a 32-bit over tagName | sorted(non-data-hf-* attrs) | ownText, last-4 base-36 chars for the id, seeds the assigned set with pre-existing pins BEFORE walking so fresh mints never collide with persisted ids, and the test suite explicitly pins BOTH the R1 guarantees (idempotence, pinned-id survival, distinct-sibling uniqueness, determinism) AND the four known limitations of pure-content-hash mode (edit-drifts-id, attr-edit-drifts-id, identical-sibling-2nd-occurrence-has-no-content-stable-handle). That last block — KNOWN LIMITATION tests with comments pointing at the design doc's "Implementation status & verified lifecycle gap" — is exactly the right shape for a foundation PR. Future contributors won't accidentally rely on guarantees that don't hold yet.
Concerns
[The assigned set is seeded only from document.querySelectorAll("[data-hf-id]"), but minting walks body.querySelectorAll("*")] Lines 79-82:
const assigned = new Set<string>();
for (const el of Array.from(document.querySelectorAll("[data-hf-id]"))) {
const existing = el.getAttribute("data-hf-id");
if (existing) assigned.add(existing);
}The seed walks the WHOLE document including <head> elements; minting walks only <body>. If a <head> element somehow carries a data-hf-id (it shouldn't, but <title> could be edited with one in a hand-rolled HTML), the head's id is pinned into assigned and a body element trying to mint that same value gets bumped to the dup branch.
Probably never fires in practice — but the asymmetry is worth either (a) seeding from body.querySelectorAll("[data-hf-id]") to match the mint scope, or (b) one-line comment noting the deliberate pin-across-document choice. Right now it reads like an inconsistency.
[Two parses per parseHtml call] ensureHfIds uses linkedom to mint, then the existing parseHtml re-parses with DOMParser. Each call to parseHtml (which is in the hot path for Studio operations, per the briefing) now does:
- linkedom
parseHTML+ walk + serialize → output HTML - DOMParser
parseFromString→ AST for clip extraction
Likely fine at clip-model scale, but worth one of: (a) a perf note in the PR body, (b) a benchmark on the largest realistic clip-model size, or (c) a flag to skip ensureHfIds when the input is already known to be from generateElementHtml (which itself emits data-hf-id).
Also worth verifying: linkedom serialization is byte-stable across a parseHTML → toString() round-trip when the input has a <!doctype> shell. The idempotence test (expect(twice).toBe(once)) pins f(f(x)) === f(x) — but doesn't pin f(x) === x for inputs that already have all ids minted. If linkedom normalizes attribute quoting or whitespace, every parseHtml call subtly mutates the source. Probably fine since downstream consumers don't care, but worth confirming.
[4-char id space, no rationale comment] ~1.68M total ids (36^4). Birthday-paradox collision at N elements: P ≈ N²/(2·36^4). At 1000 elements in a single document: ~30% probability of at least one collision. The dup++ rehash handles it cleanly (and the 1000-iter throw is a safety net), but a one-line comment in toHfId on the chosen width vs. larger alternatives would help future readers — especially since the test fixture (fnv1a of <p class="x">same</p> minted once then minted again deliberately collides) shows the dup path actually runs.
Not asking to bump to 6 chars (4 is plenty for real clip-models) — just a // 4 chars · 36^4 ≈ 1.68M, sufficient for ≤1000 elements/doc with <1% collision per doc after dup rehash would document the choice.
Nits
[mintHfId is exported but only consumed internally] The doc comment says "shared by htmlParser for clip ids", but #1270's parseHtml reads el.getAttribute("data-hf-id") directly without calling mintHfId. If the future intent is to import mintHfId from parsers/hfIds.ts in htmlParser.ts, fine — flag it. If not, demoting it to non-exported tightens the public surface.
[Hash-input encoding] ${a.name}\x00${a.value} + \x01 between pairs is a good choice (control chars are invalid in HTML attribute names/values per the spec) — but el.tagName.toLowerCase() is concatenated with | separator on the outer level. If an attribute somehow contained a literal | (which IS valid in HTML), it'd be ambiguous with the separator. Edge case — but consistency with the \x00/\x01 choice would suggest using \x02 as the outer separator too. Cosmetic.
[EXCLUDED_TAGS lookup is per-element] Inside the hot loop, EXCLUDED_TAGS.has(el.tagName.toLowerCase()) recomputes the lowercase tag name for every element. Could cache el.tagName.toLowerCase() in a local once. Sub-microsecond savings; ignore unless profiling says otherwise.
Questions
[Document-vs-body scope for the pin seed] Same as the first concern but framed as intent: was pinning across the whole document deliberate (defense for future use of <head> ids?), or coincidental from copying querySelectorAll without scoping to body?
[Is there a plan for the wired-in callers (R2+) to also call ensureHfIds?] Right now ensureHfIds runs inside parseHtml only. If R2's mutation flow round-trips HTML without going through parseHtml, the data-hf-id values could go stale. Knowing whether ensureHfIds will be a guard at multiple boundaries or just one would help me reason about the contract.
CI status
CLI smoke (required) shows FAILURE — read the log: Navigation timeout of 60000 ms exceeded during capture_calibration, with browserGpuMode auto → software (WebGL unavailable) preamble. This is the headless-shell calibration timeout you've been actively fixing in #1233 / #1237 / #1239. The diff in this PR is hfIds.ts + hfIds.test.ts — pure parser-level code, zero render-path impact. CI infra flake, not a real failure. Re-run should pass.
Other 36 checks SUCCESS, 8 pending (regression-shards, Windows, etc).
What I didn't verify
- That
parseHTML(html).toString()for the has-document-shell case is genuinely byte-stable across input → output for inputs that don't change semantically. The PR's idempotence test pinsf(f(x)) === f(x)but notf(x) === xfor fully-pinned inputs. - That linkedom and DOMParser agree on attribute parsing edge cases (e.g. unquoted attributes, attributes with embedded
>, etc). If they disagree, the cross-parser hop could produce different element sets — but the test fixtures look pretty clean. - The expected throughput of
parseHtmlon the largest realistic clip-model size; the perf comment is theoretical until measured.
Verdict
Keystone PR for the R1 stack. Mint algorithm is sound, test discipline is exemplary (pinning both guarantees and known limitations is the right move for foundation work), and the API surface is small. The two concerns above are worth addressing but neither is structurally blocking — most worth fixing is the assigned-set scope asymmetry, which is a 1-line change. Clean execution from where I sit; leaving as a comment.
— Review by Rames D Jusso
…iew) Addresses review on #1269 (Miguel P3 + Rames): - mintHfId no longer throws on the pathological collision cap; it widens the id with the dup counter instead, so a degenerate document can never crash the whole parse call. - Seed the `assigned` set from `body.querySelectorAll("[data-hf-id]")` (was the whole document) to match the mint walk's scope — a stray head id can't bump a body element off its hash. - Document the 4-char id-space rationale in toHfId. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed review (daecb2d): P3 — |
- Filter all data-hf-* attrs from contentKey (not just data-hf-id and data-hf-studio-*) - Use \x00/\x01 byte separators to eliminate ambiguous & and = in attr serialization - Use suffix slice for toHfId (most-avalanched FNV bits) - Add 1000-iteration cap to mintHfId collision loop - Fix misleading "Browser-safe" comment (linkedom is Node-only) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…apping - EXCLUDED_TAGS: add link, noscript, base (non-editable body elements that should never receive a stable id and become mutation targets) - ensureHfIds: wrap bare HTML fragments before parsing, matching the logic in parseSourceDocument — without this, linkedom places bare-fragment content outside <body> and body.querySelectorAll returns nothing, so no ids are minted and document.toString() returns a corrupted structure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Locks the pin guarantee and documents the two current limitations (no write-back yet, design §3): pinned id survives a content edit; an unpinned id drifts on text/attribute edits (flip to .toBe once write-back lands); a second identical-content sibling has only a position-derived dedup id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iew) Addresses review on #1269 (Miguel P3 + Rames): - mintHfId no longer throws on the pathological collision cap; it widens the id with the dup counter instead, so a degenerate document can never crash the whole parse call. - Seed the `assigned` set from `body.querySelectorAll("[data-hf-id]")` (was the whole document) to match the mint walk's scope — a stray head id can't bump a body element off its hash. - Document the 4-char id-space rationale in toHfId. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
daecb2d to
4476246
Compare
…Target (R7, T5a) (#1296) ## Summary - Adds `hfId` field to `resolveDomEditSelection` — reads `data-hf-id` off the live element and stores it in `DomEditSelection.hfId` - `DomEditSelection extends PatchTarget` which already declares `hfId?: string`, so this is a single new line at the return site - Widens `MutationTarget` in `files.ts` to include `hfId?: string` (type hygiene — the value already survives through `parseMutationBody`'s by-reference pass, so this is documentation not a behaviour change) ## Why R7 / Task 5a. The full hf-id write-back and patch-engine infrastructure (R1 + R7 Tasks 0–4, PRs #1269–#1292) is server-complete. The only missing piece was: the Studio client never read `data-hf-id` off a hit-tested element, so `target.hfId` was always `undefined` and the `hfId`-first lookup branches in both patch engines were unreachable in production. This PR fixes the selection side — the commit wire (#1297) completes the path. ## Test plan - [ ] `packages/studio/src/components/editor/domEditingLayers.test.ts` — two new tests with jsdom environment: - `resolveDomEditSelection` on an element with `data-hf-id` → `selection.hfId` is populated - element without `data-hf-id` → `selection.hfId` is `undefined` - [ ] All 65 studio test files pass, all 72 core test files pass 🤖 Generated with [Claude Code](https://claude.com/claude-code)

What
Adds
ensureHfIds(html: string): stringinpackages/core/src/parsers/hfIds.ts. Single DOM pass (via linkedom) that mints adata-hf-idattribute on every eligible element before the caller sees the markup.Id derivation: FNV-1a 32-bit hash of
tagName | sorted-attrs(\x00/\x01 separated) | ownText, last 4 chars of base-36,hf-prefix. Collision resolution appends a sibling counter and re-hashes. Preserves existing ids (elements withdata-hf-idalready set are skipped). Excludes non-visual tags:script,style,template,meta,link,noscript,base.Fragment handling: detects bare HTML fragments (no
<!doctype/<html) and wraps in a full document shell before parsing, then returnsbody.innerHTML— matching the pattern used byparseSourceDocumentinsourceMutation.Why
Counter-based ids (
element-0,element-1, …) are positional. Inserting a new layer at position 0 shifts every id below it. The R1 milestone requires content-based, stable ids so that targeting operations (split, patch, probe) stay valid across re-parses and element insertions. T2 spec (stableIds.test.ts) defines the contract: same content → same id, adding a sibling doesn't change other ids, format matches/^hf-[a-z0-9]{4}$/.How
toHfId(hash)—slice(-4)ofhash.toString(36)for better distribution across the suffix spacedata-hf-idis excluded from the hash input (prevents circular dependency on the attribute being set)Set; duplicates get a counter suffix before re-hashing/<!doctype|<html[\s>]/i.test(html)— if bare, wrap→parse→body.innerHTMLTest plan
packages/core/src/parsers/stableIds.test.ts— 7 tests pass (3 were.failsstubs targeting R1; 4 were pre-existing baselines that must not regress)