@fangchenli fangchenli commented Dec 18, 2025

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
  • If I used AI to develop this pull request, I prompted it to follow AGENTS.md.

xref: #55234

  • decimal128: 1.6-2.1x faster, std and sem now work (previously raised NotImplementedError)
  • string: 2.5-2.7x faster for min/max. first/last also measured 1.3-1.4x faster, but that number is questionable: this PR still doesn't support first/last, so the speedup might come from a code path change and needs more investigation.

I also tested this PR with int and float. For those types, pandas' existing groupby is around 20-35% faster than the Arrow-native one, so the Arrow path stays off for them. We could easily change the condition to turn it on for int and float if Arrow's groupby becomes better optimized in the future.

The reordering stage falls back to NumPy because of a limitation in pyarrow.compute.scatter, and the workaround for that limitation is more expensive. This hybrid approach is not perfect, but it gets us one step closer to an Arrow-native implementation.
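
To show what a scatter-style reorder means here, a minimal NumPy sketch follows. The array names and the position mapping are assumptions made for illustration; the actual reordering code in this PR may differ.

```python
import numpy as np

n_groups = 4  # number of groups pandas expects in the result (assumed)

# Aggregated values in the order the aggregation produced them (assumed).
agg_values = np.array([30.0, 10.0, 20.0])
# Hypothetical mapping: agg_values[i] belongs at result position target_pos[i].
target_pos = np.array([2, 0, 3])

# Start from all-NaN so groups with no aggregated value stay missing.
out = np.full(n_groups, np.nan)
# Scatter: write each aggregated value into its target slot.
out[target_pos] = agg_values

print(out)
```

Position 1 never receives a value, so it remains NaN, which is how a group with no rows would show up in this sketch.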

fangchenli and others added 5 commits December 18, 2025 10:04
…branches

- Split test_groupby_aggregations into test_groupby_decimal_aggregations and test_groupby_string_aggregations
- Split test_groupby_dropna into test_groupby_dropna_true and test_groupby_dropna_false
- Use explicit Decimal values instead of range() casts for decimal tests
- Parametrize values directly to avoid runtime branching

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@fangchenli fangchenli added the Performance (memory or execution speed performance) and Arrow (pyarrow functionality) labels Dec 18, 2025
fangchenli and others added 4 commits December 18, 2025 15:40
- Add type annotation for aggs list to handle mixed tuple types
- Rename result to fallback_result to avoid type conflict

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
String types only support min, max, count - skip sum, prod, mean, std, var.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@fangchenli fangchenli marked this pull request as ready for review December 19, 2025 06:17
@fangchenli fangchenli marked this pull request as draft December 19, 2025 06:21
@fangchenli fangchenli marked this pull request as ready for review December 19, 2025 08:14
@fangchenli fangchenli marked this pull request as draft December 19, 2025 08:16