fix: Use codepoints in `lpad`, `rpad`, `translate` by neilconway · Pull Request #21405 · apache/datafusion

neilconway · 2026-04-06T14:46:58Z

Which issue does this PR close?

Closes lpad, rpad, translate should use codepoints, not graphemes #21060.

Rationale for this change

lpad, rpad, and translate use grapheme segmentation. This is inconsistent with how these functions behave in Postgres and DuckDB, as well as the SQL standard -- segmentation based on Unicode codepoints is used instead. It also happens that grapheme-based segmentation is significantly more expensive than codepoint-based segmentation.

In the case of lpad and rpad, graphemes and codepoints were used inconsistently: the input string was measured in code points but the fill string was measured in graphemes.

#3054 switched to using codepoints for most string-related functions in DataFusion but these three functions still need to be changed.

Benchmarks (M4 Max):

lpad size=1024:

lpad utf8 [str_len=5, target=20]: 12.4 µs → 12.8 µs, +3.0%
lpad stringview [str_len=5, target=20]: 11.5 µs → 11.7 µs, +1.4%
lpad utf8 [str_len=20, target=50]: 11.3 µs → 11.3 µs, +0.1%
lpad stringview [str_len=20, target=50]: 11.8 µs → 12.0 µs, +1.6%
lpad utf8 unicode [target=20]: 98.4 µs → 24.4 µs, -75.1%
lpad stringview unicode [target=20]: 99.8 µs → 26.0 µs, -74.0%
lpad utf8 scalar [str_len=5, target=20, fill='x']: 8.7 µs → 8.8 µs, +1.0%
lpad stringview scalar [str_len=5, target=20, fill='x']: 10.2 µs → 10.1 µs, -0.1%
lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 44.7 µs → 10.9 µs, -75.7%
lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 152.5 µs → 11.7 µs, -92.3%

lpad size=4096:

lpad utf8 [str_len=5, target=20]: 55.9 µs → 55.1 µs, -1.4%
lpad stringview [str_len=5, target=20]: 49.2 µs → 50.1 µs, +1.8%
lpad utf8 [str_len=20, target=50]: 46.6 µs → 46.4 µs, -0.5%
lpad stringview [str_len=20, target=50]: 47.5 µs → 48.5 µs, +2.1%
lpad utf8 unicode [target=20]: 401.3 µs → 100.1 µs, -75.0%
lpad stringview unicode [target=20]: 397.7 µs → 104.9 µs, -73.6%
lpad utf8 scalar [str_len=5, target=20, fill='x']: 34.2 µs → 35.0 µs, +2.4%
lpad stringview scalar [str_len=5, target=20, fill='x']: 40.1 µs → 40.4 µs, +0.6%
lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 178.3 µs → 42.9 µs, -76.0%
lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 601.3 µs → 46.2 µs, -92.3%

rpad size=1024:

rpad utf8 [str_len=5, target=20]: 15.5 µs → 14.4 µs, -7.1%
rpad stringview [str_len=5, target=20]: 13.8 µs → 14.0 µs, +1.7%
rpad utf8 [str_len=20, target=50]: 12.6 µs → 12.7 µs, +1.3%
rpad stringview [str_len=20, target=50]: 13.0 µs → 13.1 µs, +0.7%
rpad utf8 unicode [target=20]: 103.5 µs → 26.0 µs, -74.8%
rpad stringview unicode [target=20]: 101.2 µs → 27.6 µs, -72.7%
rpad utf8 scalar [str_len=5, target=20, fill='x']: 11.4 µs → 10.9 µs, -3.9%
rpad stringview scalar [str_len=5, target=20, fill='x']: 12.2 µs → 12.6 µs, +2.8%
rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 46.3 µs → 12.4 µs, -73.1%
rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 155.6 µs → 11.6 µs, -92.4%

rpad size=4096:

rpad utf8 [str_len=5, target=20]: 70.1 µs → 61.6 µs, -12.2%
rpad stringview [str_len=5, target=20]: 60.4 µs → 56.8 µs, -6.0%
rpad utf8 [str_len=20, target=50]: 50.6 µs → 51.2 µs, +1.2%
rpad stringview [str_len=20, target=50]: 53.7 µs → 53.3 µs, -0.8%
rpad utf8 unicode [target=20]: 407.1 µs → 104.0 µs, -74.5%
rpad stringview unicode [target=20]: 404.8 µs → 114.5 µs, -71.7%
rpad utf8 scalar [str_len=5, target=20, fill='x']: 47.5 µs → 45.6 µs, -4.0%
rpad stringview scalar [str_len=5, target=20, fill='x']: 56.4 µs → 58.5 µs, +3.6%
rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 184.1 µs → 48.1 µs, -73.9%
rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 606.4 µs → 45.6 µs, -92.5%

translate size=1024:

array_from_to [str_len=8]: 140.0 µs → 37.6 µs, -73.2%
scalar_from_to [str_len=8]: 9.0 µs → 8.8 µs, -2.7%
array_from_to [str_len=32]: 371.3 µs → 65.6 µs, -82.3%
scalar_from_to [str_len=32]: 19.9 µs → 19.2 µs, -3.6%
array_from_to [str_len=128]: 1249.6 µs → 188.7 µs, -84.9%
scalar_from_to [str_len=128]: 70.2 µs → 64.7 µs, -7.9%
array_from_to [str_len=1024]: 9349.4 µs → 1378.1 µs, -85.3%
scalar_from_to [str_len=1024]: 506.5 µs → 445.8 µs, -12.0%

translate size=4096:

array_from_to [str_len=8]: 548.0 µs → 147.1 µs, -73.2%
scalar_from_to [str_len=8]: 33.9 µs → 32.8 µs, -3.1%
array_from_to [str_len=32]: 1457.2 µs → 266.0 µs, -81.7%
scalar_from_to [str_len=32]: 78.0 µs → 75.5 µs, -3.2%
array_from_to [str_len=128]: 4935.0 µs → 791.1 µs, -84.0%
scalar_from_to [str_len=128]: 278.2 µs → 260.7 µs, -6.3%
array_from_to [str_len=1024]: 37496 µs → 5591 µs, -85.1%
scalar_from_to [str_len=1024]: 2058.0 µs → 1770 µs, -14.0%

What changes are included in this PR?

Switch from grapheme segmentation to codepoint segmentation for lpad, rpad, and translate
Add SLT tests
Refactor a few helper functions
Remove dependency on unicode_segmentation crate as it is no longer used

Are these changes tested?

Yes. The new SLT tests were also run against DuckDB and Postgres to confirm the behavior is consistent.

Are there any user-facing changes?

Yes. This PR changes the behavior of lpad, rpad, and translate, although the new behavior is more consistent with the SQL standard and with other SQL implementations.

neilconway · 2026-04-06T15:55:48Z

FYI @martin-g -- this came up as part of reviewing #20657

Dandandan

LGTM

martin-g · 2026-04-07T07:00:11Z

datafusion/functions/src/unicode/lpad.rs

-                    builder.append_value(&string[..end]);
+                if target_len < char_count {
+                    builder
+                        .append_value(&string[..byte_offset_of_char(string, target_len)]);


Both byte_offset_of_char(string, ...) and let char_count = string.chars().count(); iterate over the chars in string.
It could be optimized to iterate just once with something like:

let mut char_count = 0; let mut truncate_offset = None; for (i, (byte_idx, _)) in string.char_indices().enumerate() { if i == target_len { truncate_offset = Some(byte_idx); } char_count += 1; } if let Some(offset) = truncate_offset { builder.append_value(&string[..offset]); ...

Good idea! I added a helper here to do this in one place and DRY up the code a bit.

I posted updated benchmark results, this is a nice win.

datafusion/functions/src/unicode/rpad.rs

datafusion/sqllogictest/test_files/string/string_literal.slt

Dandandan · 2026-04-07T09:17:59Z

datafusion/functions/src/unicode/rpad.rs

                    } else {
                        builder.write_str(string)?;
-                        for _ in 0..(target_len - graphemes_buf.len()) {
+                        for _ in 0..(target_len - char_count) {


couldn't this use some repeat extend or something?

You could, although this is just what the pre-existing code did. I think repeat would allocate and write_str in a loop will not; not sure offhand which is better but I'd think it's a wash? We could look at this in a followup PR if you'd like?

Dandandan · 2026-04-07T09:19:38Z

datafusion/functions/src/unicode/rpad.rs

-                        let end: usize =
-                            graphemes_buf[..target_len].iter().map(|g| g.len()).sum();
-                        builder.append_value(&string[..end]);
+                    if target_len < char_count {


Would <= not be correct here?

I refactored the code here as part of some improvements @martin-g suggested, so I think this is moot now.

martin-g · 2026-04-07T10:37:41Z

Maybe a note should be added to the migration guide because there is a small behavior change in these functions.

…te-codepoints

neilconway · 2026-04-07T15:10:58Z

Maybe a note should be added to the migration guide because there is a small behavior change in these functions.

Thanks! Done.

neilconway · 2026-04-08T18:46:36Z

@martin-g @Dandandan Is this ready to be merged?

neilconway added 2 commits April 6, 2026 09:02

.

de35dcf

.

2919ce7

github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Apr 6, 2026

Dandandan approved these changes Apr 6, 2026

View reviewed changes

martin-g approved these changes Apr 7, 2026

View reviewed changes

Dandandan reviewed Apr 7, 2026

View reviewed changes

neilconway added 4 commits April 7, 2026 09:04

Fixes per code review

c63800a

Clean up repeated string iteration, per review

82d8128

Merge remote-tracking branch 'origin/main' into neilc/fix-pad-transla…

488b2bf

…te-codepoints

Improve function name

7232c34

github-actions bot added the documentation Improvements or additions to documentation label Apr 7, 2026

cargo fmt

8a17d52

martin-g approved these changes Apr 7, 2026

View reviewed changes

Conversation

neilconway commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

neilconway commented Apr 6, 2026

Uh oh!

Dandandan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martin-g commented Apr 7, 2026

Uh oh!

neilconway commented Apr 7, 2026

Uh oh!

neilconway commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

neilconway commented Apr 6, 2026 •

edited

Loading