Is your feature request related to a problem or challenge?
This issue is filed to track the changes proposed in #21110.
DataFusion currently does not expose the new Parquet Content-Defined Chunking (CDC) support added in parquet-rs by apache/arrow-rs#9450. Traditional Parquet writing splits data pages at fixed sizes, so inserting or deleting a row causes subsequent pages to shift and can force nearly all bytes to be re-uploaded in content-addressable storage systems.
CDC instead determines page boundaries using a rolling hash over column values, so unchanged data can produce identical pages across writes. This can reduce storage and upload costs and improve deduplication behavior for rewritten datasets.
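To make the mechanism concrete, here is a minimal, illustrative sketch of content-defined chunking — not the parquet-rs algorithm. A boundary is declared wherever a hash over a small sliding window of values matches a bit mask; because the decision depends only on local content, boundaries realign after an insertion instead of shifting everything downstream:

```rust
// Illustrative sketch of content-defined chunking, NOT the parquet-rs
// implementation: a boundary is declared wherever a hash over a small
// sliding window of values matches a bit mask.
fn chunk_boundaries(values: &[u64], window: usize, mask: u64) -> Vec<usize> {
    let mut boundaries = Vec::new();
    for i in window..=values.len() {
        // Hash only the `window` values ending at position `i`, so the
        // decision depends purely on local content, not absolute offsets.
        let mut h: u64 = 0;
        for v in &values[i - window..i] {
            h = h.rotate_left(7) ^ v.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        }
        if h & mask == 0 {
            boundaries.push(i); // a chunk ends just before index `i`
        }
    }
    boundaries
}

fn main() {
    let original: Vec<u64> = (0..64).collect();
    // Insert one value at the front; everything after it shifts by one.
    let mut edited = vec![999u64];
    edited.extend(0..64);

    // Every boundary found in the original reappears in the edited data,
    // shifted by exactly one position: the chunk *contents* are stable,
    // which is what enables deduplication in content-addressable storage.
    println!("original boundaries: {:?}", chunk_boundaries(&original, 4, 0x3));
    println!("edited boundaries:   {:?}", chunk_boundaries(&edited, 4, 0x3));
}
```

With fixed-size splitting, the same one-value insertion would shift the contents of every subsequent page.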
Describe the solution you'd like
Expose the Parquet CDC writer options in DataFusion so users can enable the feature when writing Parquet files.
This should cover the configuration surface introduced upstream in parquet-rs, including:
- enabling content-defined chunking with the default settings
- configuring explicit CDC parameters such as `min_chunk_size`, `max_chunk_size`, and `norm_level`
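As a sketch only, the user-facing surface could resemble DataFusion's existing `datafusion.execution.parquet.*` session options; the option names below are hypothetical placeholders, since the actual names are determined by the design in #21110:

```sql
-- Hypothetical option names, for illustration only; the real
-- configuration surface is defined by #21110 and may differ.
SET datafusion.execution.parquet.content_defined_chunking = true;
SET datafusion.execution.parquet.cdc_min_chunk_size = 262144;
SET datafusion.execution.parquet.cdc_max_chunk_size = 1048576;
SET datafusion.execution.parquet.cdc_norm_level = 1;
```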
The implementation and rationale are largely derived from apache/arrow-rs#9450, and this issue exists to track carrying those changes through in DataFusion via #21110.
Describe alternatives you've considered
Continue using the existing fixed-size Parquet page splitting behavior and do not expose CDC-related writer options in DataFusion.
That preserves current behavior, but it means users cannot take advantage of the improved page stability and deduplication characteristics now available in parquet-rs.
Additional context
Most of the technical content above is intentionally derived from apache/arrow-rs#9450, with the additional context that this issue tracks the corresponding DataFusion work in #21110.