Is your feature request related to a problem or challenge?
In the parquet opener, DataFusion currently does per-file schema adaptation and pruning setup, including predicate rewrites and pruning predicate construction:
/// Returns a `ArrowReaderMetadata` with the page index loaded, loading
/// it from the underlying `AsyncFileReader` if necessary.
async fn load_page_index<T: AsyncFileReader>(
    reader_metadata: ArrowReaderMetadata,
    input: &mut T,
    options: ArrowReaderOptions,
) -> Result<ArrowReaderMetadata> {
    let parquet_metadata = reader_metadata.metadata();
    let missing_column_index = parquet_metadata.column_index().is_none();
    let missing_offset_index = parquet_metadata.offset_index().is_none();
    // You may ask yourself: why are we even checking if the page index is already loaded here?
    // Didn't we explicitly *not* load it above?
    // Well it's possible that a custom implementation of `AsyncFileReader` gives you
    // the page index even if you didn't ask for it (e.g. because it's cached)
    // so it's important to check that here to avoid extra work.
    if missing_column_index || missing_offset_index {
As @adriangb noted on #21480 (comment), many deployments only have a small number of physical schemas, often just one, so repeating the same work across many files is wasteful.
PR #21480 from @fpetkovski improved this area by avoiding page pruning predicate construction unless page indexes are enabled, but we can do better and cache equivalent pruning setup across files with the same physical schema.
datafusion/datafusion/datasource-parquet/src/opener.rs
Lines 743 to 788 in 590a517
datafusion/datafusion/datasource-parquet/src/opener.rs
Lines 1523 to 1547 in 590a517
Describe the solution you'd like
Cache parquet pruning setup across files when the physical schema and other correctness-relevant inputs are the same.
This likely includes:
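To make the caching idea concrete, here is a minimal sketch of one possible shape, using only the Rust standard library. Every name in it (`PruningSetupKey`, `PruningSetup`, `PruningSetupCache`, `build_setup`) is hypothetical and is not an existing DataFusion API. The schema and predicate are stood in for by plain strings; a real implementation would presumably key on the Arrow physical schema, the physical expression, and whatever other inputs affect correctness.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical cache key: every input that can change the pruning setup.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct PruningSetupKey {
    /// (column name, type name) pairs standing in for the physical file schema.
    physical_schema: Vec<(String, String)>,
    /// Textual form of the predicate, standing in for the physical expression.
    predicate: String,
    /// Whether page-index pruning is enabled, since it changes what gets built.
    page_index_enabled: bool,
}

/// Hypothetical stand-in for the per-schema work (rewritten predicate,
/// pruning predicates) that is currently redone for every file.
#[derive(Debug)]
struct PruningSetup {
    rewritten_predicate: String,
    has_page_pruning: bool,
}

/// Placeholder for the expensive step; here it only formats a string.
fn build_setup(key: &PruningSetupKey) -> PruningSetup {
    PruningSetup {
        rewritten_predicate: format!("adapted({})", key.predicate),
        has_page_pruning: key.page_index_enabled,
    }
}

/// Cache shared by all file openers of one scan.
#[derive(Default)]
struct PruningSetupCache {
    entries: Mutex<HashMap<PruningSetupKey, Arc<PruningSetup>>>,
    /// Counts how often `build_setup` actually ran (for illustration only).
    builds: AtomicUsize,
}

impl PruningSetupCache {
    /// Returns the cached setup for `key`, building it on the first request.
    /// The lock is held while building, so concurrent openers with the same
    /// key never duplicate the work (at the cost of briefly serializing them).
    fn get_or_build(&self, key: &PruningSetupKey) -> Arc<PruningSetup> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(key) {
            return Arc::clone(hit);
        }
        let setup = Arc::new(build_setup(key));
        self.builds.fetch_add(1, Ordering::Relaxed);
        entries.insert(key.clone(), Arc::clone(&setup));
        setup
    }

    fn builds(&self) -> usize {
        self.builds.load(Ordering::Relaxed)
    }
}
```

With a key like this, two files that share a physical schema, predicate, and page-index setting would receive the same `Arc<PruningSetup>`, while a file that differs in any keyed input would trigger a fresh build. Which inputs must be part of the key to stay correct is the open design question in this issue.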
Describe alternatives you've considered
Do nothing
Additional context
Relevant links:
Conditionally build page pruning predicates #21480 (comment)
Conditionally build page pruning predicates #21480
datafusion/datafusion/datasource-parquet/src/opener.rs
Lines 793 to 839 in 590a517