
feat(arrow-row): add MSD radix sort kernel for row-encoded keys#9683

Open
mbutrovich wants to merge 9 commits into apache:main from mbutrovich:radix_row_sort

Conversation

Contributor

@mbutrovich mbutrovich commented Apr 9, 2026

Which issue does this PR close?

N/A.

Rationale for this change

The existing lexsort_to_indices uses comparison sort on columnar arrays, which is O(n log n × comparison_cost) where comparison cost scales with the number of columns. The Arrow row format (RowConverter) produces memcmp-comparable byte sequences, making it a natural fit for radix sort — O(n × key_width) — which can overcome the encoding overhead by eliminating per-comparison column traversal.

Inspired by DuckDB's sorting redesign, which uses MSD radix sort on normalized key prefixes with a comparison sort fallback, this PR adds a radix_sort_to_indices kernel that operates directly on row-encoded keys. Like DuckDB, we limit radix depth to 8 bytes before falling back to comparison sort, balancing radix efficiency against diminishing returns on deep recursion.
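The overall strategy can be sketched in plain Rust. This is a hypothetical stand-in operating on byte-slice keys rather than the real arrow_row::Rows type, and `msd_radix_sort` is an illustrative name, not the PR's actual function; the constants mirror the PR's defaults (fall back to comparison sort for buckets of ≤32 elements or after 8 bytes of radix depth).

```rust
// Sketch only: simplified MSD radix sort over `&[u8]` keys, sorting an
// index permutation rather than the keys themselves.
const FALLBACK_THRESHOLD: usize = 32;
const MAX_RADIX_DEPTH: usize = 8;

fn msd_radix_sort(keys: &[&[u8]], indices: &mut [u32], depth: usize) {
    if indices.len() <= FALLBACK_THRESHOLD || depth >= MAX_RADIX_DEPTH {
        // Comparison fallback: compare only from `depth` onward, since all
        // keys in this bucket share the first `depth` bytes.
        indices.sort_unstable_by(|&a, &b| {
            let (ka, kb) = (keys[a as usize], keys[b as usize]);
            ka[depth.min(ka.len())..].cmp(&kb[depth.min(kb.len())..])
        });
        return;
    }
    // Histogram over the byte at `depth`. Keys that end before `depth`
    // get a dedicated "exhausted" bucket 0 so shorter keys sort first.
    let byte_of = |i: u32| {
        keys[i as usize].get(depth).map(|&b| b as usize + 1).unwrap_or(0)
    };
    let mut counts = [0usize; 257];
    for &i in indices.iter() {
        counts[byte_of(i)] += 1;
    }
    // Exclusive prefix sum -> bucket start offsets.
    let mut offsets = [0usize; 257];
    let mut sum = 0;
    for b in 0..257 {
        offsets[b] = sum;
        sum += counts[b];
    }
    // Out-of-place scatter (the PR's ping-pong buffers are collapsed into
    // a scratch buffer plus copy-back here for brevity).
    let mut scratch = vec![0u32; indices.len()];
    let mut cursor = offsets;
    for &i in indices.iter() {
        let b = byte_of(i);
        scratch[cursor[b]] = i;
        cursor[b] += 1;
    }
    indices.copy_from_slice(&scratch);
    // Recurse into each non-trivial bucket (bucket 0 holds exhausted,
    // hence equal, keys and is already sorted).
    for b in 1..257 {
        let (start, len) = (offsets[b], counts[b]);
        if len > 1 {
            msd_radix_sort(keys, &mut indices[start..start + len], depth + 1);
        }
    }
}
```

Because the row format is memcmp-comparable, byte-at-a-time bucketing like this yields the same order as a full lexicographic comparison.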

What changes are included in this PR?

  • arrow-row/src/radix.rs (new): MSD radix sort on Rows with:
    • Ping-pong buffers to eliminate per-level memcpy
    • Byte extraction buffer for single-pass indirection through Rows
    • 256-bucket histogram + out-of-place scatter per byte position
    • Comparison sort fallback for small buckets (≤32 elements) or after 8 bytes of radix depth
    • Fallback compares from byte_pos onward via Row::data_from(), skipping the shared prefix
  • arrow-row/src/lib.rs: Exposes pub mod radix, adds Row::byte_from() and Row::data_from() APIs
  • arrow/benches/lexsort.rs: Adds lexsort_radix benchmark variant alongside existing lexsort_to_indices and lexsort_rows

Benchmark results

All three variants include the full pipeline (encoding + sort) so the comparison against lexsort_to_indices (which doesn't encode) is apples-to-apples.

| Schema | N | lexsort_to_indices | lexsort_rows | lexsort_radix |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 88.156 µs | 117.32 µs | 76.358 µs |
| [i32, i32_opt] | 32768 | 857.82 µs | 1.2250 ms | 331.26 µs |
| [i32, str_opt(16)] | 32768 | 861.89 µs | 1.8851 ms | 488.62 µs |
| [str_opt(16), str(16)] | 32768 | 2.4453 ms | 1.6565 ms | 972.92 µs |
| [3x str] | 32768 | 2.4723 ms | 1.9196 ms | 1.1955 ms |
| [5x str] | 32768 | 2.4513 ms | 2.2403 ms | 1.5101 ms |
| [i32_opt, dict] | 32768 | 1.1853 ms | 1.3607 ms | 696.75 µs |
| [dict, dict] | 32768 | 501.86 µs | 726.90 µs | 985.69 µs |
| [3x dict, str(16)] | 32768 | 4.1119 ms | 2.0401 ms | 1.5996 ms |
| [3x dict, str_opt(50)] | 32768 | 4.2241 ms | 2.1783 ms | 1.7227 ms |
| [i32_opt, i32_list] | 32768 | 1.4701 ms | 2.9004 ms | 1.4326 ms |
| [i32, i32_list, str(16)] | 32768 | 874.37 µs | 2.2616 ms | 1.1727 ms |

Radix sort is the fastest in the majority of cases. The main exception is pure low-cardinality dictionary columns, where lexsort_to_indices avoids encoding overhead entirely. The module documentation provides guidance on when to use each approach.

Are these changes tested?

17 tests including:

  • Deterministic tests for integers, strings, multi-column, nulls, all-equal, empty, single-element
  • All 4 sort option combinations (ascending/descending × nulls_first)
  • Float64 with NaN/Infinity, booleans
  • Threshold boundary tests (sizes 1–1000 around the fallback threshold)
  • Fuzz test: 100 iterations × 1–4 random columns × random types × random sort options × 5–500 rows
  • Cross-validation: verifies radix output matches comparison sort on the same Rows

Are there any user-facing changes?

New public APIs:

  • arrow_row::radix::radix_sort_to_indices(&Rows) -> Vec<u32> — sort with default parameters
  • arrow_row::radix::radix_sort_to_indices_with(&Rows, max_depth, fallback_threshold) -> Vec<u32> — sort with tunable radix depth and fallback threshold
  • Row::byte_from(offset) -> u8 — single byte access at offset
  • Row::data_from(offset) -> &[u8] — suffix slice from offset
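The intent of the two Row accessors can be illustrated with a minimal stand-in type. This `Row` struct is a hypothetical simplification (the real arrow_row::Row wraps a slice of the row encoding plus config); the method bodies below are a sketch of the documented semantics, not the PR's implementation.

```rust
// Hypothetical minimal stand-in for arrow_row::Row, for illustration only.
struct Row<'a> {
    data: &'a [u8],
}

impl<'a> Row<'a> {
    /// Single byte of the encoding at `offset` (sketch of Row::byte_from).
    fn byte_from(&self, offset: usize) -> u8 {
        self.data[offset]
    }

    /// Suffix of the encoding starting at `offset` (sketch of Row::data_from).
    fn data_from(&self, offset: usize) -> &'a [u8] {
        &self.data[offset..]
    }
}
```

In the radix kernel's fallback path, two rows that agree on their first `byte_pos` bytes can be ordered with `a.data_from(byte_pos).cmp(b.data_from(byte_pos))`, skipping the shared prefix instead of re-comparing the whole encoding.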

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 9, 2026
Contributor Author

mbutrovich commented Apr 9, 2026

Full cargo bench --bench lexsort results on M3 Max:

| Schema | N | lexsort_to_indices | lexsort_rows | lexsort_radix | Radix speedup vs lexsort_to_indices |
|---|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 88.156 µs | 117.32 µs | 76.358 µs | 1.15x |
| [i32, i32_opt] | 32768 | 857.82 µs | 1.2250 ms | 331.26 µs | 2.59x |
| [i32, str_opt(16)] | 4096 | 85.824 µs | 157.61 µs | 93.776 µs | 0.92x |
| [i32, str_opt(16)] | 32768 | 861.89 µs | 1.8851 ms | 488.62 µs | 1.76x |
| [i32, str(16)] | 4096 | 88.002 µs | 137.62 µs | 94.440 µs | 0.93x |
| [i32, str(16)] | 32768 | 862.82 µs | 1.4557 ms | 432.29 µs | 2.00x |
| [str_opt(16), str(16)] | 4096 | 230.53 µs | 159.14 µs | 104.02 µs | 2.22x |
| [str_opt(16), str(16)] | 32768 | 2.4453 ms | 1.6565 ms | 972.92 µs | 2.51x |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 237.83 µs | 201.11 µs | 133.01 µs | 1.79x |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 2.4723 ms | 1.9196 ms | 1.1955 ms | 2.07x |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 4096 | 231.52 µs | 233.41 µs | 167.42 µs | 1.38x |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 32768 | 2.4513 ms | 2.2403 ms | 1.5101 ms | 1.62x |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 110.39 µs | 137.04 µs | 102.88 µs | 1.07x |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 1.1853 ms | 1.3607 ms | 696.75 µs | 1.70x |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 4096 | 50.525 µs | 86.935 µs | 118.20 µs | 0.43x |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 32768 | 501.86 µs | 726.90 µs | 985.69 µs | 0.51x |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str(16)] | 4096 | 376.04 µs | 193.07 µs | 156.42 µs | 2.40x |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str(16)] | 32768 | 4.1119 ms | 2.0401 ms | 1.5996 ms | 2.57x |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str_opt(50)] | 4096 | 381.40 µs | 213.88 µs | 171.46 µs | 2.22x |
| [dict(100,str_opt(50)), dict(100,str_opt(50)), dict(100,str_opt(50)), str_opt(50)] | 32768 | 4.2241 ms | 2.1783 ms | 1.7227 ms | 2.45x |
| [i32_opt, i32_list] | 4096 | 138.51 µs | 244.34 µs | 158.92 µs | 0.87x |
| [i32_opt, i32_list] | 32768 | 1.4701 ms | 2.9004 ms | 1.4326 ms | 1.03x |
| [i32_opt, i32_list_opt] | 4096 | 149.11 µs | 256.12 µs | 163.10 µs | 0.91x |
| [i32_opt, i32_list_opt] | 32768 | 1.6334 ms | 2.8821 ms | 1.4583 ms | 1.12x |
| [i32_list_opt, i32_opt] | 4096 | 251.76 µs | 218.67 µs | 179.29 µs | 1.40x |
| [i32_list_opt, i32_opt] | 32768 | 2.7634 ms | 2.5661 ms | 1.5591 ms | 1.77x |
| [i32, str_list(4)] | 4096 | 86.279 µs | 419.34 µs | 305.55 µs | 0.28x |
| [i32, str_list(4)] | 32768 | 880.38 µs | 4.7173 ms | 3.2387 ms | 0.27x |
| [str_list(4), i32] | 4096 | 295.34 µs | 346.00 µs | 275.48 µs | 1.07x |
| [str_list(4), i32] | 32768 | 3.3798 ms | 4.3955 ms | 3.7986 ms | 0.89x |
| [i32, str_list_opt(4)] | 4096 | 85.896 µs | 412.01 µs | 297.34 µs | 0.29x |
| [i32, str_list_opt(4)] | 32768 | 859.64 µs | 4.6929 ms | 3.0876 ms | 0.28x |
| [str_list_opt(4), i32] | 4096 | 353.23 µs | 349.57 µs | 293.80 µs | 1.20x |
| [str_list_opt(4), i32] | 32768 | 4.1794 ms | 4.3348 ms | 3.8228 ms | 1.09x |
| [i32, i32_list, str(16)] | 4096 | 86.872 µs | 225.27 µs | 183.78 µs | 0.47x |
| [i32, i32_list, str(16)] | 32768 | 874.37 µs | 2.2616 ms | 1.1727 ms | 0.75x |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 157.11 µs | 239.66 µs | 196.12 µs | 0.80x |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.7631 ms | 2.4358 ms | 1.6973 ms | 1.04x |

@mbutrovich changed the title from "Radix row sort" to "feat(arrow-row): add MSD radix sort kernel for row-encoded keys" on Apr 9, 2026
Contributor Author

mbutrovich commented Apr 10, 2026

Thanks for the feedback @Dandandan!

I tried the 4 optimizations you suggested, and here are the results:

| Schema | N | Before | After | Change |
|---|---|---|---|---|
| [i32, i32_opt] | 4096 | 71.987 µs | 70.200 µs | -2.5% |
| [i32, i32_opt] | 32768 | 359.07 µs | 341.22 µs | -5.0% |
| [i32, str_opt(16)] | 4096 | 88.564 µs | 88.874 µs | +0.4% |
| [i32, str_opt(16)] | 32768 | 510.93 µs | 493.01 µs | -3.5% |
| [i32, str(16)] | 4096 | 85.021 µs | 92.958 µs | +9.3% |
| [i32, str(16)] | 32768 | 457.38 µs | 425.87 µs | -6.9% |
| [str_opt(16), str(16)] | 4096 | 123.71 µs | 133.51 µs | +7.9% |
| [str_opt(16), str(16)] | 32768 | 946.89 µs | 993.63 µs | +4.9% |
| [str_opt(16), str_opt(50), str(16)] | 4096 | 155.40 µs | 165.69 µs | +6.6% |
| [str_opt(16), str_opt(50), str(16)] | 32768 | 1.2148 ms | 1.2654 ms | +4.2% |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 4096 | 188.98 µs | 199.13 µs | +5.4% |
| [str_opt(16), str(16), str_opt(16), str_opt(16), str_opt(16)] | 32768 | 1.5222 ms | 1.5486 ms | +1.7% |
| [i32_opt, dict(100,str_opt(50))] | 4096 | 100.16 µs | 101.17 µs | +1.0% |
| [i32_opt, dict(100,str_opt(50))] | 32768 | 627.81 µs | 686.16 µs | +9.3% |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 4096 | 90.044 µs | 89.670 µs | -0.4% |
| [dict(100,str_opt(50)), dict(100,str_opt(50))] | 32768 | 722.44 µs | 741.11 µs | +2.6% |
| [dict(100,str_opt(50))x3, str(16)] | 4096 | 179.50 µs | 187.09 µs | +4.2% |
| [dict(100,str_opt(50))x3, str(16)] | 32768 | 1.5303 ms | 1.5696 ms | +2.6% |
| [dict(100,str_opt(50))x3, str_opt(50)] | 4096 | 192.86 µs | 200.81 µs | +4.1% |
| [dict(100,str_opt(50))x3, str_opt(50)] | 32768 | 1.7082 ms | 1.7309 ms | +1.3% |
| [i32_opt, i32_list] | 4096 | 157.57 µs | 160.09 µs | +1.6% |
| [i32_opt, i32_list] | 32768 | 1.3393 ms | 1.4233 ms | +6.3% |
| [i32_opt, i32_list_opt] | 4096 | 161.53 µs | 164.17 µs | +1.6% |
| [i32_opt, i32_list_opt] | 32768 | 1.3731 ms | 1.4992 ms | +9.2% |
| [i32_list_opt, i32_opt] | 4096 | 173.67 µs | 179.41 µs | +3.3% |
| [i32_list_opt, i32_opt] | 32768 | 1.5963 ms | 1.5744 ms | -1.4% |
| [i32, str_list(4)] | 4096 | 287.85 µs | 291.31 µs | +1.2% |
| [i32, str_list(4)] | 32768 | 3.3054 ms | 3.2674 ms | -1.1% |
| [str_list(4), i32] | 4096 | 297.47 µs | 360.26 µs | +21.1% |
| [str_list(4), i32] | 32768 | 3.8276 ms | 3.8494 ms | +0.6% |
| [i32, str_list_opt(4)] | 4096 | 261.91 µs | 325.47 µs | +24.3% |
| [i32, str_list_opt(4)] | 32768 | 3.1478 ms | 3.1075 ms | -1.3% |
| [str_list_opt(4), i32] | 4096 | 300.98 µs | 334.20 µs | +11.0% |
| [str_list_opt(4), i32] | 32768 | 3.7982 ms | 3.8360 ms | +1.0% |
| [i32, i32_list, str(16)] | 4096 | 170.03 µs | 175.74 µs | +3.4% |
| [i32, i32_list, str(16)] | 32768 | 1.1633 ms | 1.1742 ms | +0.9% |
| [i32_opt, i32_list_opt, str_opt(50)] | 4096 | 195.77 µs | 198.17 µs | +1.2% |
| [i32_opt, i32_list_opt, str_opt(50)] | 32768 | 1.6350 ms | 1.7419 ms | +6.5% |

Claude's attempt at interpreting the results:

Integer schemas at large N improved ~3-5%. Most other schemas regressed, with string-list schemas at small N hit hardest (+21-24%).

Likely causes:

  1. Ping-pong fallback overhead: The old code did one bulk copy_from_slice per level. Ping-pong eliminates that, but replaces it with many small per-bucket copies in the fallback path. For schemas that produce many small buckets (strings, lists, dicts), the scattered copies are less efficient than one contiguous memcpy.
  2. data_from comparison overhead: The old fallback used ra.cmp(&rb), a direct memcmp. The new data_from(byte_pos).cmp(...) adds a bounds check and a subslice per row per comparison. For schemas that hit the fallback frequently, this outweighs the benefit of skipping the shared prefix.

Integer schemas benefit because they have fewer, larger buckets that stay in the radix path longer where the eliminated per-level memcpy pays off.
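For context, the ping-pong scheme under discussion can be sketched as a single scatter level: indices are scattered from `src` into `dst` by the byte at the current depth, and the two buffers swap roles at the next level, so no per-level copy-back is needed. This is an assumed structure for illustration, not the PR's actual code, and for brevity keys shorter than `depth` are bucketed with byte 0x00 rather than given a separate bucket.

```rust
// Sketch of one level of the ping-pong radix scatter: stable, out-of-place
// redistribution of `src` into `dst` keyed on the byte at `depth`.
fn scatter_level(keys: &[&[u8]], src: &[u32], dst: &mut [u32], depth: usize) {
    // 256-bucket histogram over the byte at `depth`.
    let mut counts = [0usize; 256];
    for &i in src {
        let b = keys[i as usize].get(depth).copied().unwrap_or(0) as usize;
        counts[b] += 1;
    }
    // Exclusive prefix sum -> running write cursors per bucket.
    let mut cursors = [0usize; 256];
    let mut sum = 0;
    for b in 0..256 {
        cursors[b] = sum;
        sum += counts[b];
    }
    // Stable scatter: equal bytes keep their relative order.
    for &i in src {
        let b = keys[i as usize].get(depth).copied().unwrap_or(0) as usize;
        dst[cursors[b]] = i;
        cursors[b] += 1;
    }
}
```

The trade-off described above follows from this shape: the radix path touches each element once per level regardless of bucket layout, but once a bucket falls to the comparison fallback, its elements must be copied back into the output buffer individually, which is where many small buckets lose to one bulk memcpy.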

Do you want the commit anyway, just to iterate on?

Contributor Author

mbutrovich commented Apr 10, 2026

I updated the numbers in the PR description and the full table in the comment after it to reflect the current PR status.

My attempt to bring it to DataFusion has so far been disappointing: apache/datafusion#21525

I think we need a deeper redesign of that stream rather than just replacing the existing logic. apache/datafusion#21525 (comment)

@mbutrovich
Contributor Author

@Dandandan I also made a couple of API additions to Row. Let me know if that's what you had in mind.
