[rust-compiler] Emit loc.column/index as UTF-16 code units in SWC frontend #36507
Merged
After the cluster-1 BytePos shift, `ConvertCtx::position()` emitted
`loc.column` and `loc.index` as 0-based UTF-8 byte offsets. Babel emits
them as 0-based UTF-16 code unit offsets (matching JS string indexing).
For files containing any character above U+FFFF (e.g. an emoji like
🔴 U+1F534), the byte offset runs 2 ahead of the UTF-16 offset per such
character, because the char is 4 bytes in UTF-8 but 2 code units in
UTF-16.
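To make the divergence concrete, here is a minimal sketch using only the Rust standard library. The helper name `offsets_of` is hypothetical and is not part of the converter; it reports both offsets for the first occurrence of a character.

```rust
/// Hypothetical helper (not from the PR): returns the 0-based
/// (UTF-8 byte offset, UTF-16 code unit offset) of the first
/// occurrence of `needle` in `src`, or None if it is absent.
fn offsets_of(src: &str, needle: char) -> Option<(usize, usize)> {
    // `find` yields a byte index that always lies on a char boundary.
    let byte_idx = src.find(needle)?;
    // Count the UTF-16 code units that precede that byte index.
    let utf16_idx = src[..byte_idx].encode_utf16().count();
    Some((byte_idx, utf16_idx))
}
```

For the source `"a🔴b"`, `offsets_of(src, 'b')` gives byte offset 5 and UTF-16 offset 3, which is the gap of 2 described above.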
Precompute a `utf16_offsets: Vec<u32>` table in `ConvertCtx::new`
that maps each source byte index to its 0-based UTF-16 code unit
offset. `position()` then looks up `index` directly and computes
`column` as `index - utf16_index_of_line_start`. The UTF-16 lookup is
O(1) per call; the table costs 4 bytes per source byte (~4× the source
length) in memory, which is bounded for fixture/file inputs.
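The table described above could be built roughly as follows. This is a sketch under assumptions, not the code from the PR: the function names are hypothetical, and mapping the continuation bytes of a multi-byte character to the offset of that character's start is a choice made here for illustration.

```rust
/// Hypothetical sketch: entry `i` holds the 0-based UTF-16 code unit
/// offset of source byte `i`. One extra sentinel entry covers the
/// byte index equal to `src.len()` (end of file).
fn build_utf16_offsets(src: &str) -> Vec<u32> {
    let mut table = Vec::with_capacity(src.len() + 1);
    let mut units: u32 = 0;
    for ch in src.chars() {
        // Assumption: every byte of a char maps to the char's start.
        for _ in 0..ch.len_utf8() {
            table.push(units);
        }
        units += ch.len_utf16() as u32;
    }
    table.push(units); // sentinel for byte index == src.len()
    table
}

/// Column as the difference of two table lookups. Both indices must be
/// within the table, and `line_start_byte` must not exceed `byte_idx`.
fn utf16_column(table: &[u32], line_start_byte: usize, byte_idx: usize) -> u32 {
    table[byte_idx] - table[line_start_byte]
}
```

For `"a🔴b\nc🔴d"`, the second line starts at byte 7 and `d` sits at byte 12, so `utf16_column(&table, 7, 12)` yields 3 rather than the byte-based column 5.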
Considered an alternative that walks the source line on each
`position()` call to count UTF-16 code units. More memory-frugal but
O(line length) per call. The precomputed table wins on O(1) lookup,
and the per-call cost matters because `position()` is invoked on
every node, comment, and reference in the converter.
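For comparison, the rejected per-call walk might look like the sketch below. The function name is hypothetical; it assumes both byte indices fall on char boundaries, since slicing a `&str` elsewhere panics.

```rust
/// Hypothetical sketch of the alternative: count UTF-16 code units
/// between the line start and the target byte on every call.
/// Needs no table, but costs O(line length) per call.
fn utf16_column_by_walk(src: &str, line_start_byte: usize, byte_idx: usize) -> u32 {
    // Assumption: both indices are valid char boundaries in `src`.
    src[line_start_byte..byte_idx].encode_utf16().count() as u32
}
```

It returns the same column as the table-based lookup while re-scanning the line each time, which is the per-call cost the paragraph above weighs against the table's memory.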
Clamp the byte index in `position()` to the sentinel at
`utf16_offsets.len() - 1`. Synthetic spans (e.g. compiler-generated
imports given `BytePos(1)`) can point past EOF in degenerate cases;
clamping avoids a panic.
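The clamp can be illustrated with a small sketch. The function name is hypothetical, and returning 0 for an empty table is an assumption added here so the sketch never panics; the table in the PR always ends with a sentinel entry.

```rust
/// Hypothetical sketch: look up a byte index, clamping anything past
/// the end of the table to the final (sentinel) entry.
fn utf16_index_clamped(table: &[u32], byte_idx: usize) -> u32 {
    // Index of the sentinel; saturating_sub guards an empty table.
    let last = table.len().saturating_sub(1);
    // Assumption: an empty table yields 0 instead of panicking.
    table.get(byte_idx.min(last)).copied().unwrap_or(0)
}
```

With the table for `"a🔴b"` (`[0, 1, 1, 1, 1, 3, 4]`), an out-of-range index such as 100 resolves to the sentinel value 4 instead of indexing out of bounds.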
Line numbers stay 1-based and the binary-search remains keyed on
byte offsets, since the underlying `line_offsets` table is byte-based.
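A byte-keyed line lookup of this kind can be sketched with the standard library's `partition_point`. The function name and the shape of `line_offsets` (sorted byte offsets of each line start, beginning with 0) are assumptions for illustration.

```rust
/// Hypothetical sketch: 1-based line number for a byte index, found
/// by binary search over the sorted byte offsets of line starts.
fn line_of(line_offsets: &[usize], byte_idx: usize) -> usize {
    // Number of line starts at or before `byte_idx`; that count is
    // the 1-based line number. `max(1)` covers an empty table.
    line_offsets
        .partition_point(|&start| start <= byte_idx)
        .max(1)
}
```

For `"a🔴b\nc🔴d"` the line starts are at bytes 0 and 7, so byte 12 maps to line 2 and the newline at byte 6 still belongs to line 1. Only the column and index are converted to UTF-16 afterwards.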
Fixes 4 e2e parity fixtures (3 targeted + 1 latent):
- effect-derived-computations/invalid-derived-computation-in-effect.js
- error.invalid-derived-computation-in-effect.js
- fbt/error.todo-multiple-fbt-plural.tsx
- (one additional latent fixture passes for free)
Test plan:
- bash compiler/scripts/test-e2e.sh --variant swc:
    Before: Total 1770/1795
    After:  Total 1774/1795 (4 fixed)
- bash compiler/scripts/test-e2e.sh --variant babel: 1788/1795 (unchanged)
- bash compiler/scripts/test-e2e.sh --variant oxc: 1702/1795 (unchanged)
- cargo test --workspace: 56 passed, 0 failed
bab4929 to 08cf7d6