[rust-compiler] Emit loc.column/index as UTF-16 code units in SWC frontend #36507

Merged
poteto merged 1 commit into pr-36173 from lauren/swc-06-utf16-loc on May 21, 2026

@poteto poteto commented May 21, 2026

After the cluster-1 BytePos shift, `ConvertCtx::position()` emitted
`loc.column` and `loc.index` as 0-based UTF-8 byte offsets. Babel emits
them as 0-based UTF-16 code unit offsets (matching JS string indexing).
For files containing any character above U+FFFF (e.g. an emoji like
🔴 U+1F534), the two diverge by +2 per such character because the
char is 4 bytes in UTF-8 but 2 code units in UTF-16.
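
To make the +2 drift concrete, here is a minimal sketch using only the Rust standard library. The helper name `column_drift` is hypothetical and is not taken from the PR; it just measures how far a byte-based column runs ahead of a UTF-16-based column for the text that precedes a position on a line.

```rust
/// Hypothetical helper (not from the PR): byte-based column minus
/// UTF-16-based column for the text preceding a position on one line.
fn column_drift(line_prefix: &str) -> usize {
    // `str::len` counts UTF-8 bytes.
    let utf8_bytes = line_prefix.len();
    // `encode_utf16` yields one item per UTF-16 code unit.
    let utf16_units = line_prefix.encode_utf16().count();
    // A char never needs fewer UTF-8 bytes than UTF-16 code units,
    // so this subtraction cannot underflow.
    utf8_bytes - utf16_units
}
```

For an ASCII-only prefix the drift is 0; each character above U+FFFF before the position adds 2.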

Precompute a `utf16_offsets: Vec<u32>` table in `ConvertCtx::new`
that maps each source byte index to its 0-based UTF-16 code unit
offset. `position()` then looks up `index` directly and computes
`column` as `index - utf16_index_of_line_start`. O(1) per call; the
table costs ~4× the source length in memory, which is bounded for
fixture/file inputs.
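
One way such a table could be built is sketched below. This is an illustration under assumptions, not the code from the PR: the function names `build_utf16_offsets` and `utf16_index_and_column` are hypothetical, and mapping bytes in the middle of a multi-byte character to the offset of that character's start is a choice made here for simplicity.

```rust
/// Hypothetical sketch: one entry per source byte, plus a sentinel entry
/// for byte index == src.len().
fn build_utf16_offsets(src: &str) -> Vec<u32> {
    let mut table = Vec::with_capacity(src.len() + 1);
    let mut units: u32 = 0;
    for ch in src.chars() {
        // Assumption: every byte of a char maps to the UTF-16 offset
        // at which that char starts.
        for _ in 0..ch.len_utf8() {
            table.push(units);
        }
        units += ch.len_utf16() as u32;
    }
    // Sentinel: the UTF-16 length of the whole source.
    table.push(units);
    table
}

/// `index` is a direct lookup; `column` is the index minus the UTF-16
/// index of the line start. Both are O(1).
fn utf16_index_and_column(
    table: &[u32],
    byte_index: usize,
    line_start_byte: usize,
) -> (u32, u32) {
    let index = table[byte_index];
    (index, index - table[line_start_byte])
}
```

Each entry is a `u32`, so the table takes about 4 bytes per source byte, which is where the ~4× figure comes from.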

Considered an alternative that walks the source line on each
`position()` call to count UTF-16 code units. More memory-frugal but
O(line length) per call. The precomputed table wins on O(1) lookup
and the per-call cost matters because `position()` is invoked on
every node, comment, and reference in the converter.
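
For comparison, the rejected alternative might look roughly like the following. The function name and signature are hypothetical; the sketch assumes both byte offsets fall on character boundaries, since slicing a `&str` otherwise panics.

```rust
/// Hypothetical sketch of the walk-the-line alternative: no table,
/// but each call re-scans the line prefix, so it is O(line length).
fn utf16_column_by_walking(src: &str, line_start_byte: usize, byte_index: usize) -> u32 {
    // Assumption: both offsets lie on char boundaries.
    src[line_start_byte..byte_index].encode_utf16().count() as u32
}
```

This needs no extra memory, but the scan repeats on every call, which adds up when the function runs for every node, comment, and reference.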

Clamp the byte index in `position()` to the sentinel at
`utf16_offsets.len() - 1`. Synthetic spans (e.g. compiler-generated
imports given `BytePos(1)`) can point past EOF in degenerate cases;
clamping avoids a panic.
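
The clamp can be sketched as a single `min` against the last valid index. The helper name `clamped_lookup` is hypothetical, and the sketch assumes the table is never empty because it always carries at least the sentinel entry.

```rust
/// Hypothetical sketch: look up a byte index, clamping anything past
/// EOF to the sentinel entry instead of indexing out of bounds.
fn clamped_lookup(utf16_offsets: &[u32], byte_index: usize) -> u32 {
    // Assumption: the table is non-empty (it holds at least the sentinel).
    let sentinel = utf16_offsets.len() - 1;
    utf16_offsets[byte_index.min(sentinel)]
}
```

An out-of-range byte index then resolves to the end-of-source offset rather than panicking.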

Line numbers stay 1-based and the binary-search remains keyed on
byte offsets, since the underlying `line_offsets` table is byte-based.
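
A byte-keyed line lookup of this kind could be written as below. This is a sketch, not the PR's code: the function name is hypothetical, and it assumes `line_offsets` is sorted, holds the byte offset at which each line starts, and begins with 0.

```rust
/// Hypothetical sketch: returns (1-based line number, byte offset of
/// that line's start) for a byte index, via binary search.
fn line_and_start(line_offsets: &[u32], byte_index: u32) -> (u32, u32) {
    // Number of line starts at or before `byte_index`.
    let count = line_offsets.partition_point(|&start| start <= byte_index);
    // Assumption: line_offsets[0] == 0, so count >= 1 for any input.
    let line_idx = count.saturating_sub(1);
    (line_idx as u32 + 1, line_offsets[line_idx])
}
```

The returned line-start byte offset is the value that would then be translated to UTF-16 to compute the column.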

Fixes 4 e2e parity fixtures (3 targeted + 1 latent):

  • effect-derived-computations/invalid-derived-computation-in-effect.js
  • error.invalid-derived-computation-in-effect.js
  • fbt/error.todo-multiple-fbt-plural.tsx
  • (one additional latent fixture passes for free)

Test plan:

  • bash compiler/scripts/test-e2e.sh --variant swc:
    Before: Total 1770/1795
    After: Total 1774/1795 (4 fixed)
  • bash compiler/scripts/test-e2e.sh --variant babel: 1788/1795 (unchanged)
  • bash compiler/scripts/test-e2e.sh --variant oxc: 1702/1795 (unchanged)
  • cargo test --workspace: 56 passed, 0 failed

@github-actions github-actions Bot added the React Core Team Opened by a member of the React Core Team label May 21, 2026
@meta-cla meta-cla Bot added the CLA Signed label May 21, 2026
Base automatically changed from lauren/swc-05-ts-cast-wrapper-types to pr-36173 May 21, 2026 07:09
@poteto poteto force-pushed the lauren/swc-06-utf16-loc branch from bab4929 to 08cf7d6 Compare May 21, 2026 07:14
@poteto poteto merged commit 0d9885c into pr-36173 May 21, 2026
14 of 20 checks passed
@poteto poteto deleted the lauren/swc-06-utf16-loc branch May 21, 2026 07:15
