PERF: add an Arrow/NumPy hybrid groupby path for decimal and string types. #63416

fangchenli · 2025-12-18T23:07:03Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
If I used AI to develop this pull request, I prompted it to follow AGENTS.md.

decimal128: 1.6-2.1x faster; std and sem now work (previously raised NotImplementedError).
string: 2.5-2.7x faster for min/max. (first/last show a 1.3-1.4x speedup, but that number is questionable: this PR still doesn't support first/last, so the speedup may come from a code path change and needs more investigation.)

This PR was also tested with int and float types. For those, pandas' existing groupby is around 20-35% faster than the Arrow-native one, so the new path is not enabled for them. The condition could easily be changed to turn it on for int and float if Arrow's groupby becomes more optimized in the future.

The reordering stage falls back to NumPy because of a limitation in pyarrow.compute.scatter; working around that limitation within Arrow is more expensive. This hybrid approach is not perfect, but it gets us one step closer to an Arrow-native implementation.

…branches
- Split test_groupby_aggregations into test_groupby_decimal_aggregations and test_groupby_string_aggregations
- Split test_groupby_dropna into test_groupby_dropna_true and test_groupby_dropna_false
- Use explicit Decimal values instead of range() casts for decimal tests
- Parametrize values directly to avoid runtime branching

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>

fangchenli and others added 5 commits December 18, 2025 10:04

format

e350cfd

Merge remote-tracking branch 'upstream' into perf/groupby-arrow-native

ec9659c

improve comments and benchmark, turn on arrow path for int and float

79892b3

Merge remote-tracking branch 'upstream' into perf/groupby-arrow-native

0b55c93

fangchenli added the Performance (memory or execution speed performance) and Arrow (pyarrow functionality) labels Dec 18, 2025

fangchenli and others added 4 commits December 18, 2025 15:40

TYP: fix mypy errors in _groupby_op_pyarrow

1322e72

- Add type annotation for aggs list to handle mixed tuple types
- Rename result to fallback_result to avoid type conflict

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>

BUG: skip unsupported string aggregations in ASV benchmark

7da132c

String types only support min, max, count; skip sum, prod, mean, std, var.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>

Merge remote-tracking branch 'upstream' into perf/groupby-arrow-native

5b1f31e

remove int and float

81d328e

fangchenli marked this pull request as ready for review December 19, 2025 06:17

fangchenli marked this pull request as draft December 19, 2025 06:21

fangchenli marked this pull request as ready for review December 19, 2025 08:14

let string type fall through

949061c

fangchenli marked this pull request as draft December 19, 2025 08:16

fangchenli added 3 commits December 19, 2025 10:32

Merge remote-tracking branch 'upstream' into perf/groupby-arrow-native

5af4252

fix bug in string dtype fall throught

b360811

Merge remote-tracking branch 'upstream' into perf/groupby-arrow-native

903de1e
