Is your feature request related to a problem or challenge?
This issue is filed to track the changes proposed in #21110.
DataFusion currently does not expose the new Parquet Content-Defined Chunking (CDC) support added in parquet-rs by apache/arrow-rs#9450. Traditional Parquet writing splits data pages at fixed sizes, so inserting or deleting a row causes subsequent pages to shift and can force nearly all bytes to be re-uploaded in content-addressable storage systems.
CDC instead determines page boundaries using a rolling hash over column values, so unchanged data can produce identical pages across writes. This can reduce storage and upload costs and improve deduplication behavior for rewritten datasets.
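To make the mechanism concrete, here is a minimal, illustrative sketch of content-defined chunking — not the parquet-rs algorithm. A boundary is declared wherever a hash over a small sliding window of values matches a bit mask; because the decision depends only on local content, boundaries realign after an insertion instead of shifting everything downstream:

```rust
// Illustrative sketch of content-defined chunking, NOT the parquet-rs
// implementation: a boundary is declared wherever a hash over a small
// sliding window of values matches a bit mask.
fn chunk_boundaries(values: &[u64], window: usize, mask: u64) -> Vec<usize> {
    let mut boundaries = Vec::new();
    for i in window..=values.len() {
        // Hash only the `window` values ending at position `i`, so the
        // decision depends purely on local content, not absolute offsets.
        let mut h: u64 = 0;
        for v in &values[i - window..i] {
            h = h.rotate_left(7) ^ v.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        }
        if h & mask == 0 {
            boundaries.push(i); // a chunk ends just before index `i`
        }
    }
    boundaries
}

fn main() {
    let original: Vec<u64> = (0..64).collect();
    // Insert one value at the front; everything after it shifts by one.
    let mut edited = vec![999u64];
    edited.extend(0..64);

    // Every boundary found in the original reappears in the edited data,
    // shifted by exactly one position: the chunk *contents* are stable,
    // which is what enables deduplication in content-addressable storage.
    println!("original boundaries: {:?}", chunk_boundaries(&original, 4, 0x3));
    println!("edited boundaries:   {:?}", chunk_boundaries(&edited, 4, 0x3));
}
```

With fixed-size splitting, the same one-value insertion would shift the contents of every subsequent page.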
Describe the solution you'd like
Expose the Parquet CDC writer options in DataFusion so users can enable the feature when writing Parquet files.
This should cover the configuration surface introduced upstream in parquet-rs, including:
- enabling content-defined chunking with the default settings
- configuring explicit CDC parameters such as `min_chunk_size`, `max_chunk_size`, and `norm_level`
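As a sketch only, the user-facing surface could resemble DataFusion's existing `datafusion.execution.parquet.*` session options; the option names below are hypothetical placeholders, since the actual names are determined by the design in #21110:

```sql
-- Hypothetical option names, for illustration only; the real
-- configuration surface is defined by #21110 and may differ.
SET datafusion.execution.parquet.content_defined_chunking = true;
SET datafusion.execution.parquet.cdc_min_chunk_size = 262144;
SET datafusion.execution.parquet.cdc_max_chunk_size = 1048576;
SET datafusion.execution.parquet.cdc_norm_level = 1;
```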
The implementation and rationale are largely derived from apache/arrow-rs#9450, and this issue exists to track carrying those changes through in DataFusion via #21110.
Describe alternatives you've considered
Continue using the existing fixed-size Parquet page splitting behavior and do not expose CDC-related writer options in DataFusion.
That preserves current behavior, but it means users cannot take advantage of the improved page stability and deduplication characteristics now available in parquet-rs.
Additional context
Most of the technical content above is intentionally derived from apache/arrow-rs#9450, with the additional context that this issue tracks the corresponding DataFusion work in #21110.