[rust-compiler] Emit loc.column/index as UTF-16 code units in SWC frontend by poteto · Pull Request #36507 · facebook/react

poteto · 2026-05-21T06:53:30Z

After the cluster-1 BytePos shift, ConvertCtx::position() emitted
loc.column and loc.index as 0-based UTF-8 byte offsets. Babel emits
them as 0-based UTF-16 code unit offsets (matching JS string indexing).
The two diverge after any non-ASCII character. For a character above
U+FFFF (e.g. an emoji like 🔴 U+1F534), they diverge by +2 per such
character because the char is 4 bytes in UTF-8 but 2 code units in
UTF-16.
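To make the divergence concrete, here is a minimal sketch that measures the same source position in both units. The helper name `offsets_of` is hypothetical and is not part of the converter; it uses only the standard library.

```rust
/// Offset of the first occurrence of `needle` in `src`, measured two ways:
/// (UTF-8 byte offset, UTF-16 code unit offset). Returns None if absent.
/// Hypothetical helper for illustration only.
fn offsets_of(src: &str, needle: &str) -> Option<(usize, usize)> {
    // `str::find` returns a byte offset into the UTF-8 encoded string.
    let byte_offset = src.find(needle)?;
    // Re-encode the prefix as UTF-16 and count code units, which is how
    // JavaScript string indexing (and therefore Babel) measures it.
    let utf16_offset = src[..byte_offset].encode_utf16().count();
    Some((byte_offset, utf16_offset))
}
```

For the source text `x = '🔴'; y`, the trailing `y` sits at byte offset 12 but at UTF-16 offset 10: the single emoji before it accounts for the difference of 2.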

Precompute a utf16_offsets: Vec<u32> table in ConvertCtx::new
that maps each source byte index to its 0-based UTF-16 code unit
offset. position() then looks up index directly and computes
column as index - utf16_index_of_line_start. Each call is O(1); the
table costs ~4× the source length in memory (one u32 per source
byte), which is bounded for fixture/file inputs.
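The table-based lookup described above can be sketched roughly as follows. This is not the code from the PR: the free functions, the `Position` struct, and the choice to take a 0-based byte index (rather than an SWC BytePos) are all assumptions made for illustration.

```rust
/// Maps every byte index in `src`, plus one end-of-file sentinel, to the
/// 0-based UTF-16 code unit offset of the character containing that byte.
fn build_utf16_offsets(src: &str) -> Vec<u32> {
    let mut table = Vec::with_capacity(src.len() + 1);
    let mut utf16 = 0u32;
    for ch in src.chars() {
        // Every byte of a multi-byte char maps to the char's starting offset.
        for _ in 0..ch.len_utf8() {
            table.push(utf16);
        }
        utf16 += ch.len_utf16() as u32;
    }
    table.push(utf16); // sentinel entry for byte index == src.len()
    table
}

/// Byte offsets at which each line starts (line 1 starts at byte 0).
fn build_line_offsets(src: &str) -> Vec<u32> {
    let mut starts = vec![0u32];
    for (i, b) in src.bytes().enumerate() {
        if b == b'\n' {
            starts.push(i as u32 + 1);
        }
    }
    starts
}

#[derive(Debug, PartialEq)]
struct Position {
    line: u32,   // 1-based
    column: u32, // 0-based, UTF-16 code units
    index: u32,  // 0-based, UTF-16 code units
}

/// Assumes `byte_index` is a 0-based byte offset within the table's range.
fn position(byte_index: u32, line_offsets: &[u32], utf16_offsets: &[u32]) -> Position {
    // Binary search on byte offsets: last line start that is <= byte_index.
    let line_idx = line_offsets.partition_point(|&start| start <= byte_index) - 1;
    let index = utf16_offsets[byte_index as usize];
    let line_start = utf16_offsets[line_offsets[line_idx] as usize];
    Position {
        line: line_idx as u32 + 1,
        column: index - line_start,
        index,
    }
}
```

With the two-line source `a🔴\nb🔴c`, the `c` is at byte index 11; the sketch reports line 2, column 3, index 7, which matches what JavaScript string indexing would give.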

Considered an alternative that walks the source line on each
position() call to count UTF-16 code units. More memory-frugal but
O(line length) per call. The precomputed table wins on O(1) lookup
and the per-call cost matters because position() is invoked on
every node, comment, and reference in the converter.
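For comparison, the rejected line-walking approach might look something like the sketch below. The function name and signature are hypothetical; it assumes `line_start` is a valid char boundary within `src` and that `byte_index` does not fall in the middle of a character.

```rust
/// Counts UTF-16 code units between `line_start` and `byte_index` by walking
/// the characters of the line. No table is needed, but every call costs
/// time proportional to the distance walked along the line.
fn utf16_column_by_walking(src: &str, line_start: usize, byte_index: usize) -> u32 {
    src[line_start..]
        .char_indices()
        // Keep characters that start before the target byte index.
        .take_while(|&(i, _)| line_start + i < byte_index)
        .map(|(_, ch)| ch.len_utf16() as u32)
        .sum()
}
```

This version allocates nothing, which is the memory advantage mentioned above; the cost is that a call near the end of a long line has to re-scan the whole line every time.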

Clamp the byte index in position() to the sentinel at
utf16_offsets.len() - 1. Synthetic spans (e.g. compiler-generated
imports given BytePos(1)) can point past EOF in degenerate cases;
clamping avoids a panic.
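A minimal sketch of the clamping idea, assuming the table ends with a sentinel entry for the end of the file. The function name is hypothetical, and the fallback to 0 for an empty table is an extra safety choice in this sketch rather than something the PR describes.

```rust
/// Looks up the UTF-16 offset for `byte_index`, clamping out-of-range
/// indices to the final (sentinel) entry instead of panicking.
fn utf16_index_clamped(utf16_offsets: &[u32], byte_index: usize) -> u32 {
    // Index of the sentinel; saturating_sub keeps an empty table from underflowing.
    let last = utf16_offsets.len().saturating_sub(1);
    utf16_offsets
        .get(byte_index.min(last))
        .copied()
        .unwrap_or(0) // only reached when the table is empty
}
```

With the table for `a🔴` (`[0, 1, 1, 1, 1, 3]`), an index of 100 resolves to the sentinel value 3, the same as the in-range end-of-file index 5.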

Line numbers stay 1-based and the binary search remains keyed on
byte offsets, since the underlying line_offsets table is byte-based.

Fixes 4 e2e parity fixtures (3 targeted + 1 latent):

- effect-derived-computations/invalid-derived-computation-in-effect.js
- error.invalid-derived-computation-in-effect.js
- fbt/error.todo-multiple-fbt-plural.tsx
- (one additional latent fixture passes for free)

Test plan:

- bash compiler/scripts/test-e2e.sh --variant swc:
  Before: Total 1770/1795
  After: Total 1774/1795 (4 fixed)
- bash compiler/scripts/test-e2e.sh --variant babel: 1788/1795 (unchanged)
- bash compiler/scripts/test-e2e.sh --variant oxc: 1702/1795 (unchanged)
- cargo test --workspace: 56 passed, 0 failed

github-actions Bot added the "React Core Team" label May 21, 2026

meta-cla Bot added the CLA Signed label May 21, 2026

Base automatically changed from lauren/swc-05-ts-cast-wrapper-types to pr-36173 May 21, 2026 07:09

poteto force-pushed the lauren/swc-06-utf16-loc branch from bab4929 to 08cf7d6 Compare May 21, 2026 07:14

poteto merged commit 0d9885c into pr-36173 May 21, 2026
14 of 20 checks passed

poteto deleted the lauren/swc-06-utf16-loc branch May 21, 2026 07:15
