Skip to content

refactor(pruning): remove column param from PruningStatistics::row_counts#21369

Merged
adriangb merged 2 commits intoapache:mainfrom
pydantic:remove-row-counts-column-param
Apr 6, 2026
Merged

refactor(pruning): remove column param from PruningStatistics::row_counts#21369
adriangb merged 2 commits intoapache:mainfrom
pydantic:remove-row-counts-column-param

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented Apr 4, 2026

Which issue does this PR close?

N/A — standalone API improvement, prerequisite for #21157.

Rationale for this change

PruningStatistics::row_counts(&self, column: &Column) takes a column parameter, but row counts are container-level (same for all columns). 8 of 11 implementations ignore the parameter with _column. The Parquet impl (RowGroupPruningStatistics) unnecessarily constructs a StatisticsConverter from the column just to call row_group_row_counts(), which doesn't use the column at all.

The existing code even has a comment acknowledging this:

"row counts are the same for all columns in a row group"

And a test comment:

"This is debatable, personally I think row_count should not take a Column as an argument at all since all columns should have the same number of rows."

What changes are included in this PR?

Breaking change: fn row_counts(&self, column: &Column) -> Option<ArrayRef> becomes fn row_counts(&self) -> Option<ArrayRef>.

  • Remove column parameter from trait definition and all 11 implementations
  • RowGroupPruningStatistics: read num_rows() directly from row group metadata instead of routing through StatisticsConverter
  • PrunableStatistics: remove column-exists validation (row count is container-level)
  • Update all call sites and tests

Are these changes tested?

Yes — all existing tests updated and passing. The behavior change is:

  • row_counts() on PrunableStatistics now returns data even for non-existent columns (correct, since row count is container-level)
  • RowGroupPruningStatistics::row_counts() always returns row counts (previously could fail if column wasn't in Parquet schema)

Are there any user-facing changes?

Yes — breaking change to PruningStatistics trait. Downstream implementations need to remove the column parameter from their row_counts method.

🤖 Generated with Claude Code

Row counts are container-level (same for all columns), but the trait
method forced a `column: &Column` parameter. 8 of 11 implementations
ignored it with `_column`. The Parquet impl used it unnecessarily to
construct a `StatisticsConverter` just to call `row_group_row_counts()`,
which doesn't need the column.

This is a breaking change to the `PruningStatistics` trait:
`fn row_counts(&self, column: &Column)` → `fn row_counts(&self)`.

Also fixes RowGroupPruningStatistics to read row counts directly from
row group metadata instead of routing through StatisticsConverter, and
removes PrunableStatistics' column-exists validation for row counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added common Related to common crate datasource Changes to the datasource crate labels Apr 4, 2026
@adriangb
Copy link
Copy Markdown
Contributor Author

adriangb commented Apr 4, 2026

cc @kumarUjjawal

@adriangb adriangb changed the title fix(pruning): remove column param from PruningStatistics::row_counts refactor(pruning): remove column param from PruningStatistics::row_counts Apr 4, 2026
@adriangb adriangb requested a review from alamb April 4, 2026 20:31
@alamb alamb added the api change Changes the API exposed to users of the crate label Apr 6, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good change but it is a breaking API change as this is part of the public API.

Maybe worth a small note in the upgrade guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 6, 2026
> always return `None`. Use the `downcast_ref` method above instead, or
> dereference through the `Arc` first with `plan.as_ref() as &dyn Any`.

### `PruningStatistics::row_counts` no longer takes a `column` parameter
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb added upgrade guide here

@adriangb adriangb added this pull request to the merge queue Apr 6, 2026
Merged via the queue into apache:main with commit 206ac6b Apr 6, 2026
32 checks passed
@adriangb adriangb deleted the remove-row-counts-column-param branch April 6, 2026 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate common Related to common crate datasource Changes to the datasource crate documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants