
feat(arrow-row): add MSD radix sort kernel for row-encoded keys#9683

Open
mbutrovich wants to merge 9 commits into apache:main from mbutrovich:radix_row_sort

Conversation

Contributor

@mbutrovich mbutrovich commented Apr 9, 2026

Which issue does this PR close?

N/A.

Rationale for this change

The existing lexsort_to_indices uses comparison sort on columnar arrays, which is O(n log n × comparison_cost) where comparison cost scales with the number of columns. The Arrow row format (RowConverter) produces memcmp-comparable byte sequences, making it a natural fit for radix sort — O(n × key_width) — which can overcome the encoding overhead by eliminating per-comparison column traversal.

Inspired by DuckDB's sorting redesign, which uses MSD radix sort on normalized key prefixes with a comparison sort fallback, this PR adds a radix_sort_to_indices kernel that operates directly on row-encoded keys. Like DuckDB, we limit radix depth to 8 bytes before falling back to comparison sort, balancing radix efficiency against diminishing returns on deep recursion.
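The overall strategy can be sketched in plain Rust. This is a hypothetical stand-in operating on byte-slice keys rather than the real arrow_row::Rows type, and `msd_radix_sort` is an illustrative name, not the PR's actual function; the constants mirror the PR's defaults (fall back to comparison sort for buckets of ≤32 elements or after 8 bytes of radix depth).

```rust
// Sketch only: simplified MSD radix sort over `&[u8]` keys, sorting an
// index permutation rather than the keys themselves.
const FALLBACK_THRESHOLD: usize = 32;
const MAX_RADIX_DEPTH: usize = 8;

fn msd_radix_sort(keys: &[&[u8]], indices: &mut [u32], depth: usize) {
    if indices.len() <= FALLBACK_THRESHOLD || depth >= MAX_RADIX_DEPTH {
        // Comparison fallback: compare only from `depth` onward, since all
        // keys in this bucket share the first `depth` bytes.
        indices.sort_unstable_by(|&a, &b| {
            let (ka, kb) = (keys[a as usize], keys[b as usize]);
            ka[depth.min(ka.len())..].cmp(&kb[depth.min(kb.len())..])
        });
        return;
    }
    // Histogram over the byte at `depth`. Keys that end before `depth`
    // get a dedicated "exhausted" bucket 0 so shorter keys sort first.
    let byte_of = |i: u32| {
        keys[i as usize].get(depth).map(|&b| b as usize + 1).unwrap_or(0)
    };
    let mut counts = [0usize; 257];
    for &i in indices.iter() {
        counts[byte_of(i)] += 1;
    }
    // Exclusive prefix sum -> bucket start offsets.
    let mut offsets = [0usize; 257];
    let mut sum = 0;
    for b in 0..257 {
        offsets[b] = sum;
        sum += counts[b];
    }
    // Out-of-place scatter (the PR's ping-pong buffers are collapsed into
    // a scratch buffer plus copy-back here for brevity).
    let mut scratch = vec![0u32; indices.len()];
    let mut cursor = offsets;
    for &i in indices.iter() {
        let b = byte_of(i);
        scratch[cursor[b]] = i;
        cursor[b] += 1;
    }
    indices.copy_from_slice(&scratch);
    // Recurse into each non-trivial bucket (bucket 0 holds exhausted,
    // hence equal, keys and is already sorted).
    for b in 1..257 {
        let (start, len) = (offsets[b], counts[b]);
        if len > 1 {
            msd_radix_sort(keys, &mut indices[start..start + len], depth + 1);
        }
    }
}
```

Because the row format is memcmp-comparable, byte-at-a-time bucketing like this yields the same order as a full lexicographic comparison.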

What changes are included in this PR?

  • arrow-row/src/radix.rs (new): MSD radix sort on Rows with:
    • Ping-pong buffers to eliminate per-level memcpy
    • Byte extraction buffer for single-pass indirection through Rows
    • 256-bucket histogram + out-of-place scatter per byte position
    • Comparison sort fallback for small buckets (≤32 elements) or after 8 bytes of radix depth
    • Fallback compares from byte_pos onward via Row::data_from(), skipping the shared prefix
  • arrow-row/src/lib.rs: Exposes pub mod radix, adds Row::byte_from() and Row::data_from() APIs
  • arrow/benches/lexsort.rs: Adds lexsort_radix benchmark variant alongside existing lexsort_to_indices and lexsort_rows

Benchmark results

All three variants include the full pipeline (encoding + sort) so the comparison against lexsort_to_indices (which doesn't encode) is apples-to-apples.

| Schema | N | lexsort_to_indices | lexsort_rows | lexsort_radix |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 88.156 µs | 117.32 µs | 76.358 µs |
| [i32, i32_opt] | 32768 | 857.82 µs | 1.2250 ms | 331.26 µs |
| [i32, str_opt(16)] | 32768 | 861.89 µs | 1.8851 ms | 488.62 µs |
| [str_opt(16), str(16)] | 32768 | 2.4453 ms | 1.6565 ms | 972.92 µs |
| [3x str] | 32768 | 2.4723 ms | 1.9196 ms | 1.1955 ms |
| [5x str] | 32768 | 2.4513 ms | 2.2403 ms | 1.5101 ms |
| [i32_opt, dict] | 32768 | 1.1853 ms | 1.3607 ms | 696.75 µs |
| [dict, dict] | 32768 | 501.86 µs | 726.90 µs | 985.69 µs |
| [3x dict, str(16)] | 32768 | 4.1119 ms | 2.0401 ms | 1.5996 ms |
| [3x dict, str_opt(50)] | 32768 | 4.2241 ms | 2.1783 ms | 1.7227 ms |
| [i32_opt, i32_list] | 32768 | 1.4701 ms | 2.9004 ms | 1.4326 ms |
| [i32, i32_list, str(16)] | 32768 | 874.37 µs | 2.2616 ms | 1.1727 ms |

Radix sort is the fastest in the majority of cases. The main exception is pure low-cardinality dictionary columns, where lexsort_to_indices avoids encoding overhead entirely. The module documentation provides guidance on when to use each approach.

Are these changes tested?

17 tests including:

  • Deterministic tests for integers, strings, multi-column, nulls, all-equal, empty, single-element
  • All 4 sort option combinations (ascending/descending × nulls_first)
  • Float64 with NaN/Infinity, booleans
  • Threshold boundary tests (sizes 1–1000 around the fallback threshold)
  • Fuzz test: 100 iterations × 1–4 random columns × random types × random sort options × 5–500 rows
  • Cross-validation: verifies radix output matches comparison sort on the same Rows

Are there any user-facing changes?

New public APIs:

  • arrow_row::radix::radix_sort_to_indices(&Rows) -> Vec<u32> — sort with default parameters
  • arrow_row::radix::radix_sort_to_indices_with(&Rows, max_depth, fallback_threshold) -> Vec<u32> — sort with tunable radix depth and fallback threshold
  • Row::byte_from(offset) -> u8 — single byte access at offset
  • Row::data_from(offset) -> &[u8] — suffix slice from offset
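The intent of the two Row accessors can be illustrated with a minimal stand-in type. This `Row` struct is a hypothetical simplification (the real arrow_row::Row wraps a slice of the row encoding plus config); the method bodies below are a sketch of the documented semantics, not the PR's implementation.

```rust
// Hypothetical minimal stand-in for arrow_row::Row, for illustration only.
struct Row<'a> {
    data: &'a [u8],
}

impl<'a> Row<'a> {
    /// Single byte of the encoding at `offset` (sketch of Row::byte_from).
    fn byte_from(&self, offset: usize) -> u8 {
        self.data[offset]
    }

    /// Suffix of the encoding starting at `offset` (sketch of Row::data_from).
    fn data_from(&self, offset: usize) -> &'a [u8] {
        &self.data[offset..]
    }
}
```

In the radix kernel's fallback path, two rows that agree on their first `byte_pos` bytes can be ordered with `a.data_from(byte_pos).cmp(b.data_from(byte_pos))`, skipping the shared prefix instead of re-comparing the whole encoding.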

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 9, 2026
Contributor Author

mbutrovich commented Apr 9, 2026

Full cargo bench --bench lexsort results on M3 Max:

| Schema | N | lexsort_to_indices | lexsort_rows | lexsort_radix | Radix speedup vs lexsort_to_indices |
|---|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 88.156 µs | 117.32 µs | 76.358 µs | 1.15x |
| [i32, i32_opt] | 32768 | 857.82 µs | 1.2250 ms | 331.26 µs | 2.59x |
| [i32, str_opt(16)] | 4096 | 85.824 µs | 157.61 µs | 93.776 µs | 0.92x |
| [i32, str_opt(16)] | 32768 | 861.89 µs | 1.8851 ms | 488.62 µs | 1.76x |
| [i32, str(16)] | 4096 | 88.002 µs | 137.62 µs | 94.440 µs | 0.93x |
| [i32, str(16)] | 32768 | 862.82 µs | 1.4557 ms | 432.29 µs | 2.00x |
| [str_opt(16), str(16)] | 4096 | 230.53 µs | 159.14 µs | 104.02 µs | 2.22x |
| [str_opt(16), str(16)] | 32768 | 2.4453 ms | 1.6565 ms | 972.92 µs | 2.51x |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 237.83 µs | 201.11 µs | 133.01 µs | 1.79x |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 2.4723 ms | 1.9196 ms | 1.1955 ms | 2.07x |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 4096 | 231.52 µs | 233.41 µs | 167.42 µs | 1.38x |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 32768 | 2.4513 ms | 2.2403 ms | 1.5101 ms | 1.62x |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 110.39 µs | 137.04 µs | 102.88 µs | 1.07x |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 1.1853 ms | 1.3607 ms | 696.75 µs | 1.70x |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 4096 | 50.525 µs | 86.935 µs | 118.20 µs | 0.43x |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 32768 | 501.86 µs | 726.90 µs | 985.69 µs | 0.51x |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str(16)] | 4096 | 376.04 µs | 193.07 µs | 156.42 µs | 2.40x |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str(16)] | 32768 | 4.1119 ms | 2.0401 ms | 1.5996 ms | 2.57x |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str_opt(50)] | 4096 | 381.40 µs | 213.88 µs | 171.46 µs | 2.22x |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str_opt(50)] | 32768 | 4.2241 ms | 2.1783 ms | 1.7227 ms | 2.45x |
| [i32_opt, i32_list] | 4096 | 138.51 µs | 244.34 µs | 158.92 µs | 0.87x |
| [i32_opt, i32_list] | 32768 | 1.4701 ms | 2.9004 ms | 1.4326 ms | 1.03x |
| [i32_opt, i32_list_opt] | 4096 | 149.11 µs | 256.12 µs | 163.10 µs | 0.91x |
| [i32_opt, i32_list_opt] | 32768 | 1.6334 ms | 2.8821 ms | 1.4583 ms | 1.12x |
| [i32_list_opt, i32_opt] | 4096 | 251.76 µs | 218.67 µs | 179.29 µs | 1.40x |
| [i32_list_opt, i32_opt] | 32768 | 2.7634 ms | 2.5661 ms | 1.5591 ms | 1.77x |
| [i32, str_list(4)] | 4096 | 86.279 µs | 419.34 µs | 305.55 µs | 0.28x |
| [i32, str_list(4)] | 32768 | 880.38 µs | 4.7173 ms | 3.2387 ms | 0.27x |
| [str_list(4), i32] | 4096 | 295.34 µs | 346.00 µs | 275.48 µs | 1.07x |
| [str_list(4), i32] | 32768 | 3.3798 ms | 4.3955 ms | 3.7986 ms | 0.89x |
| [i32, str_list_opt(4)] | 4096 | 85.896 µs | 412.01 µs | 297.34 µs | 0.29x |
| [i32, str_list_opt(4)] | 32768 | 859.64 µs | 4.6929 ms | 3.0876 ms | 0.28x |
| [str_list_opt(4), i32] | 4096 | 353.23 µs | 349.57 µs | 293.80 µs | 1.20x |
| [str_list_opt(4), i32] | 32768 | 4.1794 ms | 4.3348 ms | 3.8228 ms | 1.09x |
| [i32, i32_list, str(16)] | 4096 | 86.872 µs | 225.27 µs | 183.78 µs | 0.47x |
| [i32, i32_list, str(16)] | 32768 | 874.37 µs | 2.2616 ms | 1.1727 ms | 0.75x |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 157.11 µs | 239.66 µs | 196.12 µs | 0.80x |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.7631 ms | 2.4358 ms | 1.6973 ms | 1.04x |

@mbutrovich changed the title from "Radix row sort" to "feat(arrow-row): add MSD radix sort kernel for row-encoded keys" on Apr 9, 2026
Contributor Author

mbutrovich commented Apr 10, 2026

Thanks for the feedback @Dandandan!

I tried the 4 optimizations you suggested, and here are the results:

| Schema | N | Before | After | Change |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 71.987 µs | 70.200 µs | -2.5% |
| [i32, i32_opt] | 32768 | 359.07 µs | 341.22 µs | -5.0% |
| [i32, str_opt(16)] | 4096 | 88.564 µs | 88.874 µs | +0.4% |
| [i32, str_opt(16)] | 32768 | 510.93 µs | 493.01 µs | -3.5% |
| [i32, str(16)] | 4096 | 85.021 µs | 92.958 µs | +9.3% |
| [i32, str(16)] | 32768 | 457.38 µs | 425.87 µs | -6.9% |
| [str_opt(16), str(16)] | 4096 | 123.71 µs | 133.51 µs | +7.9% |
| [str_opt(16), str(16)] | 32768 | 946.89 µs | 993.63 µs | +4.9% |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 155.40 µs | 165.69 µs | +6.6% |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 1.2148 ms | 1.2654 ms | +4.2% |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 4096 | 188.98 µs | 199.13 µs | +5.4% |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 32768 | 1.5222 ms | 1.5486 ms | +1.7% |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 100.16 µs | 101.17 µs | +1.0% |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 627.81 µs | 686.16 µs | +9.3% |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 4096 | 90.044 µs | 89.670 µs | -0.4% |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 32768 | 722.44 µs | 741.11 µs | +2.6% |
| [dict(100,str_opt(50))x3, str(16)] | 4096 | 179.50 µs | 187.09 µs | +4.2% |
| [dict(100,str_opt(50))x3, str(16)] | 32768 | 1.5303 ms | 1.5696 ms | +2.6% |
| [dict(100,str_opt(50))x3, str_opt(50)] | 4096 | 192.86 µs | 200.81 µs | +4.1% |
| [dict(100,str_opt(50))x3, str_opt(50)] | 32768 | 1.7082 ms | 1.7309 ms | +1.3% |
| [i32_opt, i32_list] | 4096 | 157.57 µs | 160.09 µs | +1.6% |
| [i32_opt, i32_list] | 32768 | 1.3393 ms | 1.4233 ms | +6.3% |
| [i32_opt, i32_list_opt] | 4096 | 161.53 µs | 164.17 µs | +1.6% |
| [i32_opt, i32_list_opt] | 32768 | 1.3731 ms | 1.4992 ms | +9.2% |
| [i32_list_opt, i32_opt] | 4096 | 173.67 µs | 179.41 µs | +3.3% |
| [i32_list_opt, i32_opt] | 32768 | 1.5963 ms | 1.5744 ms | -1.4% |
| [i32, str_list(4)] | 4096 | 287.85 µs | 291.31 µs | +1.2% |
| [i32, str_list(4)] | 32768 | 3.3054 ms | 3.2674 ms | -1.1% |
| [str_list(4), i32] | 4096 | 297.47 µs | 360.26 µs | +21.1% |
| [str_list(4), i32] | 32768 | 3.8276 ms | 3.8494 ms | +0.6% |
| [i32, str_list_opt(4)] | 4096 | 261.91 µs | 325.47 µs | +24.3% |
| [i32, str_list_opt(4)] | 32768 | 3.1478 ms | 3.1075 ms | -1.3% |
| [str_list_opt(4), i32] | 4096 | 300.98 µs | 334.20 µs | +11.0% |
| [str_list_opt(4), i32] | 32768 | 3.7982 ms | 3.8360 ms | +1.0% |
| [i32, i32_list, str(16)] | 4096 | 170.03 µs | 175.74 µs | +3.4% |
| [i32, i32_list, str(16)] | 32768 | 1.1633 ms | 1.1742 ms | +0.9% |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 195.77 µs | 198.17 µs | +1.2% |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.6350 ms | 1.7419 ms | +6.5% |

Claude's attempt at interpreting the results:

Integer schemas at large N improved ~3-5%. Most other schemas regressed, with string-list schemas at small N hit hardest (+21-24%).

Likely causes:

  1. Ping-pong fallback overhead: The old code did one bulk copy_from_slice per level. Ping-pong eliminates that, but replaces it with many small per-bucket copies in the fallback path. For schemas that produce many small buckets (strings, lists, dicts), the scattered copies are less efficient than one contiguous memcpy.
  2. data_from comparison overhead: The old fallback used ra.cmp(&rb), a direct memcmp. The new data_from(byte_pos).cmp(...) adds a bounds check and a subslice per row per comparison. For schemas that hit the fallback frequently, this outweighs the benefit of skipping the shared prefix.

Integer schemas benefit because they have fewer, larger buckets that stay in the radix path longer where the eliminated per-level memcpy pays off.
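For context, the ping-pong scheme under discussion can be sketched as a single scatter level: indices are scattered from `src` into `dst` by the byte at the current depth, and the two buffers swap roles at the next level, so no per-level copy-back is needed. This is an assumed structure for illustration, not the PR's actual code, and for brevity keys shorter than `depth` are bucketed with byte 0x00 rather than given a separate bucket.

```rust
// Sketch of one level of the ping-pong radix scatter: stable, out-of-place
// redistribution of `src` into `dst` keyed on the byte at `depth`.
fn scatter_level(keys: &[&[u8]], src: &[u32], dst: &mut [u32], depth: usize) {
    // 256-bucket histogram over the byte at `depth`.
    let mut counts = [0usize; 256];
    for &i in src {
        let b = keys[i as usize].get(depth).copied().unwrap_or(0) as usize;
        counts[b] += 1;
    }
    // Exclusive prefix sum -> running write cursors per bucket.
    let mut cursors = [0usize; 256];
    let mut sum = 0;
    for b in 0..256 {
        cursors[b] = sum;
        sum += counts[b];
    }
    // Stable scatter: equal bytes keep their relative order.
    for &i in src {
        let b = keys[i as usize].get(depth).copied().unwrap_or(0) as usize;
        dst[cursors[b]] = i;
        cursors[b] += 1;
    }
}
```

The trade-off described above follows from this shape: the radix path touches each element once per level regardless of bucket layout, but once a bucket falls to the comparison fallback, its elements must be copied back into the output buffer individually, which is where many small buckets lose to one bulk memcpy.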

Do you want the commit anyway, just to iterate on?

Contributor Author

mbutrovich commented Apr 10, 2026

I updated the numbers in the PR description and the full table in the comment after it to reflect the current PR status.

My attempt to bring it to DataFusion has so far been disappointing: apache/datafusion#21525

I think we need a deeper redesign of that stream rather than just replacing the existing logic. apache/datafusion#21525 (comment)

@mbutrovich
Contributor Author

@Dandandan I also made a couple of API additions to Row. Let me know if that's what you had in mind.
