Skip to content

fix: Use codepoints in lpad, rpad, translate#21405

Open
neilconway wants to merge 7 commits intoapache:mainfrom
neilconway:neilc/fix-pad-translate-codepoints
Open

fix: Use codepoints in lpad, rpad, translate#21405
neilconway wants to merge 7 commits intoapache:mainfrom
neilconway:neilc/fix-pad-translate-codepoints

Conversation

@neilconway
Copy link
Copy Markdown
Contributor

@neilconway neilconway commented Apr 6, 2026

Which issue does this PR close?

Rationale for this change

lpad, rpad, and translate use grapheme segmentation. This is inconsistent with how these functions behave in Postgres and DuckDB, as well as the SQL standard -- segmentation based on Unicode codepoints is used instead. It also happens that grapheme-based segmentation is significantly more expensive than codepoint-based segmentation.

In the case of lpad and rpad, graphemes and codepoints were used inconsistently: the input string was measured in code points but the fill string was measured in graphemes.

#3054 switched to using codepoints for most string-related functions in DataFusion but these three functions still need to be changed.

Benchmarks (M4 Max):

lpad size=1024:

  • lpad utf8 [str_len=5, target=20]: 12.4 µs → 12.8 µs, +3.0%
  • lpad stringview [str_len=5, target=20]: 11.5 µs → 11.7 µs, +1.4%
  • lpad utf8 [str_len=20, target=50]: 11.3 µs → 11.3 µs, +0.1%
  • lpad stringview [str_len=20, target=50]: 11.8 µs → 12.0 µs, +1.6%
  • lpad utf8 unicode [target=20]: 98.4 µs → 24.4 µs, -75.1%
  • lpad stringview unicode [target=20]: 99.8 µs → 26.0 µs, -74.0%
  • lpad utf8 scalar [str_len=5, target=20, fill='x']: 8.7 µs → 8.8 µs, +1.0%
  • lpad stringview scalar [str_len=5, target=20, fill='x']: 10.2 µs → 10.1 µs, -0.1%
  • lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 44.7 µs → 10.9 µs, -75.7%
  • lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 152.5 µs → 11.7 µs, -92.3%

lpad size=4096:

  • lpad utf8 [str_len=5, target=20]: 55.9 µs → 55.1 µs, -1.4%
  • lpad stringview [str_len=5, target=20]: 49.2 µs → 50.1 µs, +1.8%
  • lpad utf8 [str_len=20, target=50]: 46.6 µs → 46.4 µs, -0.5%
  • lpad stringview [str_len=20, target=50]: 47.5 µs → 48.5 µs, +2.1%
  • lpad utf8 unicode [target=20]: 401.3 µs → 100.1 µs, -75.0%
  • lpad stringview unicode [target=20]: 397.7 µs → 104.9 µs, -73.6%
  • lpad utf8 scalar [str_len=5, target=20, fill='x']: 34.2 µs → 35.0 µs, +2.4%
  • lpad stringview scalar [str_len=5, target=20, fill='x']: 40.1 µs → 40.4 µs, +0.6%
  • lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 178.3 µs → 42.9 µs, -76.0%
  • lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 601.3 µs → 46.2 µs, -92.3%

rpad size=1024:

  • rpad utf8 [str_len=5, target=20]: 15.5 µs → 14.4 µs, -7.1%
  • rpad stringview [str_len=5, target=20]: 13.8 µs → 14.0 µs, +1.7%
  • rpad utf8 [str_len=20, target=50]: 12.6 µs → 12.7 µs, +1.3%
  • rpad stringview [str_len=20, target=50]: 13.0 µs → 13.1 µs, +0.7%
  • rpad utf8 unicode [target=20]: 103.5 µs → 26.0 µs, -74.8%
  • rpad stringview unicode [target=20]: 101.2 µs → 27.6 µs, -72.7%
  • rpad utf8 scalar [str_len=5, target=20, fill='x']: 11.4 µs → 10.9 µs, -3.9%
  • rpad stringview scalar [str_len=5, target=20, fill='x']: 12.2 µs → 12.6 µs, +2.8%
  • rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 46.3 µs → 12.4 µs, -73.1%
  • rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 155.6 µs → 11.6 µs, -92.4%

rpad size=4096:

  • rpad utf8 [str_len=5, target=20]: 70.1 µs → 61.6 µs, -12.2%
  • rpad stringview [str_len=5, target=20]: 60.4 µs → 56.8 µs, -6.0%
  • rpad utf8 [str_len=20, target=50]: 50.6 µs → 51.2 µs, +1.2%
  • rpad stringview [str_len=20, target=50]: 53.7 µs → 53.3 µs, -0.8%
  • rpad utf8 unicode [target=20]: 407.1 µs → 104.0 µs, -74.5%
  • rpad stringview unicode [target=20]: 404.8 µs → 114.5 µs, -71.7%
  • rpad utf8 scalar [str_len=5, target=20, fill='x']: 47.5 µs → 45.6 µs, -4.0%
  • rpad stringview scalar [str_len=5, target=20, fill='x']: 56.4 µs → 58.5 µs, +3.6%
  • rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 184.1 µs → 48.1 µs, -73.9%
  • rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 606.4 µs → 45.6 µs, -92.5%

translate size=1024:

  • array_from_to [str_len=8]: 140.0 µs → 37.6 µs, -73.2%
  • scalar_from_to [str_len=8]: 9.0 µs → 8.8 µs, -2.7%
  • array_from_to [str_len=32]: 371.3 µs → 65.6 µs, -82.3%
  • scalar_from_to [str_len=32]: 19.9 µs → 19.2 µs, -3.6%
  • array_from_to [str_len=128]: 1249.6 µs → 188.7 µs, -84.9%
  • scalar_from_to [str_len=128]: 70.2 µs → 64.7 µs, -7.9%
  • array_from_to [str_len=1024]: 9349.4 µs → 1378.1 µs, -85.3%
  • scalar_from_to [str_len=1024]: 506.5 µs → 445.8 µs, -12.0%

translate size=4096:

  • array_from_to [str_len=8]: 548.0 µs → 147.1 µs, -73.2%
  • scalar_from_to [str_len=8]: 33.9 µs → 32.8 µs, -3.1%
  • array_from_to [str_len=32]: 1457.2 µs → 266.0 µs, -81.7%
  • scalar_from_to [str_len=32]: 78.0 µs → 75.5 µs, -3.2%
  • array_from_to [str_len=128]: 4935.0 µs → 791.1 µs, -84.0%
  • scalar_from_to [str_len=128]: 278.2 µs → 260.7 µs, -6.3%
  • array_from_to [str_len=1024]: 37496 µs → 5591 µs, -85.1%
  • scalar_from_to [str_len=1024]: 2058.0 µs → 1770 µs, -14.0%

What changes are included in this PR?

  • Switch from grapheme segmentation to codepoint segmentation for lpad, rpad, and translate
  • Add SLT tests
  • Refactor a few helper functions
  • Remove dependency on unicode_segmentation crate as it is no longer used

Are these changes tested?

Yes. The new SLT tests were also run against DuckDB and Postgres to confirm the behavior is consistent.

Are there any user-facing changes?

Yes. This PR changes the behavior of lpad, rpad, and translate, although the new behavior is more consistent with the SQL standard and with other SQL implementations.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Apr 6, 2026
@neilconway
Copy link
Copy Markdown
Contributor Author

FYI @martin-g -- this came up as part of reviewing #20657

Copy link
Copy Markdown
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

builder.append_value(&string[..end]);
if target_len < char_count {
builder
.append_value(&string[..byte_offset_of_char(string, target_len)]);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both byte_offset_of_char(string, ...) and let char_count = string.chars().count(); iterate over the chars in string.
It could be optimized to iterate just once with something like:

let mut char_count = 0;
let mut truncate_offset = None;
for (i, (byte_idx, _)) in string.char_indices().enumerate() {
    if i == target_len {
        truncate_offset = Some(byte_idx);
    }
    char_count += 1;
}
if let Some(offset) = truncate_offset {
    builder.append_value(&string[..offset]);
    ... 

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I added a helper here to do this in one place and DRY up the code a bit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I posted updated benchmark results, this is a nice win.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

} else {
builder.write_str(string)?;
for _ in 0..(target_len - graphemes_buf.len()) {
for _ in 0..(target_len - char_count) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't this use some repeat extend or something?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could, although this is just what the pre-existing code did. I think repeat would allocate and write_str in a loop will not; not sure offhand which is better but I'd think it's a wash? We could look at this in a followup PR if you'd like?

let end: usize =
graphemes_buf[..target_len].iter().map(|g| g.len()).sum();
builder.append_value(&string[..end]);
if target_len < char_count {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would <= not be correct here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored the code here as part of some improvements @martin-g suggested, so I think this is moot now.

@martin-g
Copy link
Copy Markdown
Member

martin-g commented Apr 7, 2026

Maybe a note should be added to the migration guide because there is a small behavior change in these functions.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 7, 2026
@neilconway
Copy link
Copy Markdown
Contributor Author

Maybe a note should be added to the migration guide because there is a small behavior change in these functions.

Thanks! Done.

@neilconway
Copy link
Copy Markdown
Contributor Author

@martin-g @Dandandan Is this ready to be merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

lpad, rpad, translate should use codepoints, not graphemes

3 participants