feat: add ExtractLeafExpressions optimizer rule for get_field pushdown #20117

Open
adriangb wants to merge 22 commits into apache:main from pydantic:get-field-pushdown-try-3

Conversation

Contributor

@adriangb adriangb commented Feb 2, 2026

Summary

This PR adds a new optimizer rule ExtractLeafExpressions that extracts MoveTowardsLeafNodes sub-expressions (like get_field) from Filter, Sort, Limit, Aggregate, and Projection nodes into intermediate projections.

This normalization allows OptimizeProjections (which runs next) to merge consecutive projections and push get_field expressions down to the scan, enabling Parquet column pruning for struct fields.

Example

SELECT id, s['label'] FROM t WHERE s['value'] > 150

Before: get_field(s, 'label') stayed in ProjectionExec, reading full struct s

After: Both get_field expressions pushed to DataSourceExec:

DataSourceExec: projection=[get_field(s, value) as __leaf_5, get_field(s, label) as __leaf_4, id]

How It Works

The rule:

  1. Extracts MoveTowardsLeafNodes expressions into __datafusion_extracted_N aliases
  2. Creates inner projections with extracted expressions + pass-through columns
  3. Creates outer projections to restore original schema names
  4. Handles deduplication of identical expressions
  5. Skips expressions already aliased with __datafusion_extracted_* to ensure idempotency
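
The extraction and deduplication steps above can be sketched roughly as follows. This is a toy model, not the code in this PR: the `Expr` enum and the `Extractor` type are hypothetical stand-ins for DataFusion's types, and only the `__datafusion_extracted_N` naming is taken from the description.

```rust
use std::collections::HashMap;

/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Expr {
    Column(String),
    Literal(i64),
    GetField(Box<Expr>, String),
    Gt(Box<Expr>, Box<Expr>),
}

const PREFIX: &str = "__datafusion_extracted_";

/// Collects extracted expressions and hands out stable aliases.
#[derive(Default)]
struct Extractor {
    /// Expression -> alias, used to deduplicate identical expressions.
    aliases: HashMap<Expr, String>,
    /// Inner-projection entries in first-seen order: (expression, alias).
    extracted: Vec<(Expr, String)>,
}

impl Extractor {
    /// Returns the alias for `expr`, reusing it if the expression was seen before.
    fn alias_for(&mut self, expr: &Expr) -> String {
        if let Some(alias) = self.aliases.get(expr) {
            return alias.clone();
        }
        let alias = format!("{PREFIX}{}", self.extracted.len() + 1);
        self.aliases.insert(expr.clone(), alias.clone());
        self.extracted.push((expr.clone(), alias.clone()));
        alias
    }

    /// Rewrites `expr`, replacing each `GetField` subtree with a column reference.
    /// Plain columns (including already-extracted aliases) pass through untouched.
    fn rewrite(&mut self, expr: &Expr) -> Expr {
        match expr {
            Expr::GetField(..) => Expr::Column(self.alias_for(expr)),
            Expr::Gt(l, r) => {
                Expr::Gt(Box::new(self.rewrite(l)), Box::new(self.rewrite(r)))
            }
            other => other.clone(),
        }
    }
}
```

In this sketch, `extracted` would become the inner projection (alongside pass-through columns, which are omitted here), and running `rewrite` on an already rewritten expression extracts nothing, which is the idempotency property listed in step 5.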

This is partially modeled after:

  • CommonSubexprEliminate which also creates expressions with aliases and extracts them into "2 phase" projections
  • PushDownFilter which handles pushing expressions past joins, aggregates, etc.
  • OptimizeProjections which also manipulates projections

Interaction with other optimizer rules

This rule has some interaction with PushDownFilter. I had to teach PushDownFilter to not push filters past the pushed down projections, otherwise it would undo the work this optimizer rule did. There is no point in pushing filters past these expressions as they are so cheap to compute it's better to evaluate them before filters.

Test plan

  • New unit tests for projection extraction in extract_leaf_expressions.rs
  • Updated sqllogictest expectations in projection_pushdown.slt
  • All optimizer tests pass (cargo test -p datafusion-optimizer)

🤖 Generated with Claude Code

This PR adds a new optimizer rule `ExtractLeafExpressions` that extracts
`MoveTowardsLeafNodes` sub-expressions (like `get_field`) from Filter,
Sort, Limit, Aggregate, and Projection nodes into intermediate projections.

This normalization allows `OptimizeProjections` (which runs next) to merge
consecutive projections and push `get_field` expressions down to the scan,
enabling Parquet column pruning for struct fields.

Example transformation for projections:
```sql
SELECT id, s['label'] FROM t WHERE s['value'] > 150
```

Before: `get_field(s, 'label')` stayed in ProjectionExec, reading full struct
After: Both `get_field` expressions pushed to DataSourceExec

The rule:
- Extracts `MoveTowardsLeafNodes` expressions into `__leaf_N` aliases
- Creates inner projections with extracted expressions + pass-through columns
- Creates outer projections to restore original schema names
- Handles deduplication of identical expressions
- Skips expressions already aliased with `__leaf_*` to ensure idempotency

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels Feb 2, 2026
@adriangb adriangb marked this pull request as draft February 2, 2026 20:17
@adriangb adriangb requested a review from Copilot February 2, 2026 21:53
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces the ExtractLeafExpressions optimizer rule to enable better Parquet column pruning by extracting get_field expressions into intermediate projections. The rule normalizes query plans so that field accessor expressions can be pushed down to DataSource nodes, allowing only required struct fields to be read from Parquet files.

Changes:

  • New ExtractLeafExpressions optimizer rule that extracts MoveTowardsLeafNodes expressions (like get_field) from Filter, Sort, Limit, Aggregate, and Projection nodes
  • Modified PushDownFilter to avoid pushing filters through __leaf_* extraction projections
  • Updated test expectations across multiple SQL logic test files to reflect new query plans with extracted field expressions

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| datafusion/optimizer/src/extract_leaf_expressions.rs | New optimizer rule implementation with bottom-up traversal to extract and push down leaf expressions |
| datafusion/optimizer/src/optimizer.rs | Registers ExtractLeafExpressions to run before OptimizeProjections |
| datafusion/optimizer/src/lib.rs | Exports the new extract_leaf_expressions module |
| datafusion/optimizer/src/push_down_filter.rs | Adds logic to prevent filter pushdown through __leaf_* extraction projections |
| datafusion/optimizer/src/test/mod.rs | Adds test helper functions for tables with struct fields |
| datafusion/sqllogictest/test_files/projection_pushdown.slt | Updates expected query plans showing __leaf_* aliases and extraction projections |
| datafusion/sqllogictest/test_files/struct.slt | Updates expected projection output to include AS clause for field access |
| datafusion/sqllogictest/test_files/projection.slt | Updates expected logical plan to include AS clause for field access |
| datafusion/sqllogictest/test_files/push_down_filter.slt | Updates expected physical plan showing extraction projection before FilterExec |
| datafusion/sqllogictest/test_files/explain.slt | Adds new optimizer stage output line for extract_leaf_expressions |

@github-actions github-actions bot added the logical-expr (Logical plan and expressions) label Feb 3, 2026
@adriangb adriangb marked this pull request as ready for review February 3, 2026 00:11
adriangb and others added 2 commits February 2, 2026 19:30
Implement `extract_from_join` to extract `MoveTowardsLeafNodes`
sub-expressions (like get_field) from Join nodes:

- Extract from `on` expressions (equijoin keys)
- Extract from `filter` expressions (non-equi conditions)
- Route extractions to appropriate side (left/right) based on columns
- Add recovery projection to restore original schema
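
The left/right routing step can be sketched as below. This is only an illustration under assumptions: `Side` and `route` are hypothetical names, columns are modelled as plain strings, and the real `extract_from_join` may decide differently (for example for expressions that span both inputs).

```rust
use std::collections::HashSet;

/// Which join input an extracted expression should be pushed into.
#[derive(Debug, PartialEq)]
enum Side {
    Left,
    Right,
}

/// Picks the join input that can evaluate an expression referencing `cols`.
/// Returns `None` when the columns span both inputs or there are none,
/// in which case this toy leaves the expression where it is.
fn route(cols: &[&str], left: &HashSet<&str>, right: &HashSet<&str>) -> Option<Side> {
    if cols.is_empty() {
        return None;
    }
    if cols.iter().all(|c| left.contains(c)) {
        Some(Side::Left)
    } else if cols.iter().all(|c| right.contains(c)) {
        Some(Side::Right)
    } else {
        None
    }
}
```

With a left schema of `{l.id, l.s}` and a right schema of `{r.id, r.s}`, an expression over `l.s` routes to the left input, one over `r.s` to the right, and one mentioning both stays in the join.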

Also adds unit tests and sqllogictest integration tests for:
- Join with get_field in equijoin condition
- Join with get_field in filter (WHERE clause)
- Join with extractions from both sides
- Left join with get_field extraction
- Baseline join without extraction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Comment on lines +140 to +141
// Everything else passes through unchanged
_ => Ok(Transformed::no(plan)),
Contributor Author

@adriangb adriangb Feb 3, 2026

I'm not sure what else we could handle here. Maybe Extension?

Before we merge this PR we should expand this to explicitly ignore all other nodes, so that if a new node is added one has to decide how this rule should handle it. I'll wait to do that since that's another +30 LOC diff.

Contributor

If we used the map_expressions API, as suggested above, we would get support for Extension nodes "for free"

----
logical_plan
-01)Projection: simple_struct.id, get_field(simple_struct.s, Utf8("value")) + Int64(1)
+01)Projection: simple_struct.id, get_field(simple_struct.s, Utf8("value")) + Int64(1) AS simple_struct.s[value] + Int64(1)
Contributor Author

@adriangb adriangb Feb 3, 2026

Note that this is not a change in the output schema name: it is already simple_struct.s[value].


#####################
-# Section 12: Cleanup
+# Section 12: Join Tests - get_field Extraction from Join Nodes
Contributor Author

I can break these out into another PR to reduce the diff if that's helpful.

adriangb and others added 8 commits February 3, 2026 09:05
When `find_extraction_target` returns a Projection that renames columns
(e.g. `user AS x`), both `build_extraction_projection` and
`merge_into_extracted_projection` were adding extracted expressions that
reference the target's output columns (e.g. `col("x")`) to a projection
evaluated against the target's input (which only has `user`).

Fix by resolving extracted expressions and columns_needed through the
projection's rename mapping using `replace_cols_by_name` before merging.
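
A minimal sketch of the idea behind this fix, assuming columns are plain names and the projection only renames them. `resolve_through_renames` is a hypothetical helper used for illustration; it is not `replace_cols_by_name` and does not reproduce its signature.

```rust
use std::collections::HashMap;

/// Translates column names from a renaming projection's output schema back to
/// the names available in its input. `renames` maps output name -> input name,
/// so for `user AS x` it holds `x -> user`. Names without an entry are assumed
/// to pass through unchanged.
fn resolve_through_renames(columns: &[&str], renames: &HashMap<&str, &str>) -> Vec<String> {
    columns
        .iter()
        .map(|c| renames.get(c).copied().unwrap_or(*c).to_string())
        .collect()
}
```

In this toy, an extracted expression that mentions `x` is rewritten to mention `user` before it is merged into a projection evaluated against the target's input.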

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Comment on lines +121 to +122
04)------ProjectionExec: expr=[get_field(__unnest_placeholder(d.column2,depth=1)@1, a) as __datafusion_extracted_1, column1@0 as column1, __unnest_placeholder(d.column2,depth=1)@1 as __unnest_placeholder(d.column2,depth=1)]
05)--------UnnestExec
Contributor Author

I think it would be quite complex to try to push the get_field through the unnest, and ultimately I don't think Parquet would be able to optimize the scan (maybe I'm wrong about this?) so there would be little point.

Contributor

Since it wasn't pushed through before I think it is fine that it (still) isn't pushed through

// Don't push filters through extracted expression projections.
// Pushing filters through would rewrite expressions like `__datafusion_extracted_1 > 150`
// back to `get_field(s,'value') > 150`, undoing the extraction.
if is_extracted_expr_projection(&projection) {
Contributor Author

This is obviously not great, but I don't see another way to avoid this. Otherwise if we have:

Filter: get_field(col, 'foo') > 1
  TableScan: projection=[col]

And we run our new rule to get:

Projection: col('col')
  Filter: __datafusion_extracted_1  > 1
    Projection: get_field(col, 'foo') as  __datafusion_extracted_1, col
        TableScan: projection=[col]

Then this rule runs and will produce:

Projection: col('col')
    Projection: get_field(col, 'foo') as  __datafusion_extracted_1, col
        Filter: get_field(col, 'foo') > 1
             TableScan projection=[col]

Because it wants to push the filter under the projection.
I'd argue as a general rule there's no point in pushing a filter under a projection that is purely column selections / get_field expressions especially if we can't then push it further.
Maybe a more robust fix would be to have the filter pushdown optimizer rule traverse the rest of the plan, find the position it plans to push into, and then check whether there is any advantage to doing so (i.e. is it pushing the filter under an expensive operator that benefits from less input data, or is it just doing a trivial, pointless pushdown like in the case above). But that would be a lot more involved, so I chose this simpler solution for now.
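
For readers following along, the guard amounts to something like the sketch below. It is a toy under assumptions: the `Projection` struct only models output names, and checking for the `__datafusion_extracted_` prefix is a guess at what `is_extracted_expr_projection` looks at, not a copy of it.

```rust
const EXTRACTED_PREFIX: &str = "__datafusion_extracted_";

/// Toy projection: only the output names of its expressions are modelled.
struct Projection {
    output_names: Vec<String>,
}

/// Hypothetical check: true if any output name carries the extraction prefix.
fn is_extraction_projection(p: &Projection) -> bool {
    p.output_names
        .iter()
        .any(|name| name.starts_with(EXTRACTED_PREFIX))
}

/// Returns true if a filter sitting above `p` may be pushed below it.
/// Pushing below an extraction projection would inline the aliases again
/// and undo the extraction, so it is refused.
fn may_push_filter_below(p: &Projection) -> bool {
    !is_extraction_projection(p)
}
```

Applied to the plans above, the projection producing `__datafusion_extracted_1` blocks the pushdown, while an ordinary `Projection: col` does not.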

Contributor

I'd argue as a general rule there's no point in pushing a filter under a projection that is purely column selections / get_field expressions especially if we can't then push it further.

Yes I agree with this statement.

I don't really have a better suggestion, other than perhaps making the exception more general: "don't push filters under projections that don't do any computation".

Contributor

I think the code comment would make the need for the special case easier to understand if it included the great example from your comment:

Projection: col('col')
  Filter: __datafusion_extracted_1  > 1
    Projection: get_field(col, 'foo') as  __datafusion_extracted_1, col
        TableScan: projection=[col]

Then this rule runs and will produce:

Projection: col('col')
    Projection: get_field(col, 'foo') as  __datafusion_extracted_1, col
        Filter: get_field(col, 'foo') > 1
             TableScan projection=[col]

Contributor

Won't projection pushdown push the projection later inside Filter again? What does a final plan look like?

Contributor

alamb commented Feb 3, 2026

run benchmark sql_planner

@alamb-ghbot

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing get-field-pushdown-try-3 (e95acd3) to 9962911 diff
BENCH_NAME=sql_planner
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner
BENCH_FILTER=
BENCH_BRANCH_NAME=get-field-pushdown-try-3
Results will be posted here when complete

Contributor

@alamb alamb left a comment

Thank you @adriangb -- this is very exciting to see so close.

After this PR, what else is left to close out these issues?

Major points:

One concern, which you have also touched on, is the coupling of ExtractLeafExpressions and OptimizeProjections, in the sense that those passes now have implicit dependencies on this new pass

Did you consider incorporating this logic directly into OptimizeProjections? It seems like this transformation is really just a mechanism to enable OptimizeProjections 🤔

Cc @AdamGS as you said you are interested in this for Vortex as well

///
/// This is used by optimizers to make decisions about expression placement,
/// such as whether to push expressions down through projections.
pub fn placement(&self) -> ExpressionPlacement {
Contributor

I really like the name ExpressionPlacement 👍

03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[id, s], file_type=parquet, predicate=id@0 > 2, pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 2, required_guarantees=[]
01)ProjectionExec: expr=[id@1 as id, __datafusion_extracted_1@0 as simple_struct.s[value]]
02)--FilterExec: id@1 > 2
03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[get_field(s@1, value) as __datafusion_extracted_1, id], file_type=parquet, predicate=id@0 > 2, pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 2, required_guarantees=[]
Contributor

Here is the optimizer pass in action -- the get_field was pushed down -- the plan looks good to me

/// Extracts `MoveTowardsLeafNodes` sub-expressions from all nodes into projections.
///
/// This normalizes the plan so that all `MoveTowardsLeafNodes` computations (like field
/// accessors) live in Projection nodes, making them eligible for pushdown.
Contributor

What does "live in projection nodes" mean here? That all MoveTowardsLeafNodes computations appear as top-level Exprs in a ProjectionExec?

initial_logical_plan
01)Projection: simple_explain_test.a, simple_explain_test.b, simple_explain_test.c
02)--TableScan: simple_explain_test
logical_plan after resolve_grouping_function SAME TEXT AS ABOVE
Contributor

😓 that is a lot of rewrites (not related to this PR, I am just thinking about planning speed in general)

//! - `Limit` - passes all input columns through
//!
//! **Projection Nodes** (merge through):
//! - Replace column refs with underlying expressions from the child projection
Contributor

Is there a reason to split the comments between the module and the struct?

It might make sense to leave the module level comments relatively minimal and move #Algorithm and everything else down to the doc comment on ExtractLeafExpressions so the algorithm and examples are close together

/// The `OptimizeProjections` rule can then push this projection down to the scan.
///
/// **Important:** The `PushDownFilter` rule is aware of projections created by this rule
/// and will not push filters through them. See `is_extracted_expr_projection` in utils.rs.
Contributor

It would be nice to make this a link too, so that rustdoc checks it automatically and it can't get out of sync.

Suggested change
-/// and will not push filters through them. See `is_extracted_expr_projection` in utils.rs.
+/// and will not push filters through them. See [`is_extracted_expr_projection`]

match &plan {
// Schema-preserving nodes - extract and push down
LogicalPlan::Filter(_) | LogicalPlan::Sort(_) | LogicalPlan::Limit(_) => {
extract_from_schema_preserving(plan, alias_generator)
Contributor

I don't understand why there needs to be specialized code for different LogicalPlan types -- this seems like it is exactly the use case LogicalPlan::map_expressions() is designed to handle.

Couldn't you use map_expressions to rewrite any get_field expressions, and then add the relevant projection below it?
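
To make the suggestion concrete, here is a rough sketch of the shape such a rewrite could take. It uses toy `Expr` and `Plan` enums and a free function named `map_expressions` as a stand-in for the generic mapper; none of this is DataFusion's actual API, and pass-through columns in the inserted projection are left out for brevity.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    GetField(Box<Expr>, String),
    Gt(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan,
    Filter { predicate: Expr, input: Box<Plan> },
    Sort { keys: Vec<Expr>, input: Box<Plan> },
    Projection { exprs: Vec<(Expr, String)>, input: Box<Plan> },
}

/// Toy analogue of a generic expression mapper: applies `f` to every
/// expression the node owns, without the caller matching on the node type.
fn map_expressions(plan: Plan, f: &mut dyn FnMut(Expr) -> Expr) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => Plan::Filter { predicate: f(predicate), input },
        Plan::Sort { keys, input } => Plan::Sort {
            keys: keys.into_iter().map(|k| f(k)).collect(),
            input,
        },
        other => other,
    }
}

/// Replaces each `GetField` subtree with a column and records it in `out`.
fn extract(expr: Expr, out: &mut Vec<(Expr, String)>) -> Expr {
    match expr {
        Expr::GetField(..) => {
            let alias = format!("__datafusion_extracted_{}", out.len() + 1);
            out.push((expr, alias.clone()));
            Expr::Column(alias)
        }
        Expr::Gt(l, r) => Expr::Gt(Box::new(extract(*l, out)), Box::new(extract(*r, out))),
        other => other,
    }
}

/// Rewrites the node's expressions, then slots one projection under it.
fn extract_from_node(plan: Plan) -> Plan {
    let mut extracted = Vec::new();
    let rewritten = map_expressions(plan, &mut |e| extract(e, &mut extracted));
    if extracted.is_empty() {
        return rewritten;
    }
    let wrap = |input: Box<Plan>| Box::new(Plan::Projection { exprs: extracted, input });
    match rewritten {
        Plan::Filter { predicate, input } => Plan::Filter { predicate, input: wrap(input) },
        Plan::Sort { keys, input } => Plan::Sort { keys, input: wrap(input) },
        other => other,
    }
}
```

The point of the sketch is that the per-node logic shrinks to "where does the new projection go", while the expression rewrite itself is shared across node types.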

Contributor Author

I’ll give it a try

let rebuilt_input = extractor.build_extraction_projection(&target, path)?;

// Create the node with new input
let new_inputs: Vec<LogicalPlan> = std::iter::once(rebuilt_input)
Contributor

the code above seems to assume there is a single input -- so it seems strange to have code here for multiple inputs 🤔

Contributor

alamb commented Feb 3, 2026

BTW codex found a test that shows a single projection being extracted doesn't get pushed down

I can make this a separate PR if you like

note how the get_field is not pushed into the datasource:

###
# Test 2.1b: Projection-only get_field (potential optimization target)
###

query TT
EXPLAIN SELECT s['label'] FROM simple_struct;
----
logical_plan
01)Projection: get_field(simple_struct.s, Utf8("label"))
02)--TableScan: simple_struct projection=[s]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[get_field(s@1, label) as simple_struct.s[label]], file_type=parquet

# Verify correctness
query T
SELECT s['label'] FROM simple_struct ORDER BY s['label'];
----
alpha
beta
delta
epsilon
gamma

@alamb-ghbot

🤖: Benchmark completed

group                                                 get-field-pushdown-try-3               main
-----                                                 ------------------------               ----
logical_aggregate_with_join                           1.01   643.3±10.62µs        ? ?/sec    1.00    639.4±6.12µs        ? ?/sec
logical_select_all_from_1000                          1.00     10.3±0.09ms        ? ?/sec    1.12     11.5±0.15ms        ? ?/sec
logical_select_one_from_700                           1.01    421.7±3.89µs        ? ?/sec    1.00    415.8±2.36µs        ? ?/sec
logical_trivial_join_high_numbered_columns            1.01   380.8±12.39µs        ? ?/sec    1.00    376.5±3.19µs        ? ?/sec
logical_trivial_join_low_numbered_columns             1.01    365.3±4.47µs        ? ?/sec    1.00   361.5±11.63µs        ? ?/sec
physical_intersection                                 1.00  1627.7±31.01µs        ? ?/sec    1.00  1622.2±125.58µs        ? ?/sec
physical_join_consider_sort                           1.02      2.3±0.05ms        ? ?/sec    1.00      2.3±0.02ms        ? ?/sec
physical_join_distinct                                1.01    356.9±8.17µs        ? ?/sec    1.00    353.0±3.93µs        ? ?/sec
physical_many_self_joins                              1.01     12.7±0.08ms        ? ?/sec    1.00     12.5±0.26ms        ? ?/sec
physical_plan_clickbench_all                          1.01    203.2±1.51ms        ? ?/sec    1.00    200.9±1.71ms        ? ?/sec
physical_plan_clickbench_q1                           1.01      2.2±0.02ms        ? ?/sec    1.00      2.1±0.02ms        ? ?/sec
physical_plan_clickbench_q10                          1.05      3.8±0.14ms        ? ?/sec    1.00      3.7±0.10ms        ? ?/sec
physical_plan_clickbench_q11                          1.04      4.3±0.09ms        ? ?/sec    1.00      4.1±0.04ms        ? ?/sec
physical_plan_clickbench_q12                          1.05      4.5±0.09ms        ? ?/sec    1.00      4.2±0.04ms        ? ?/sec
physical_plan_clickbench_q13                          1.06      4.0±0.06ms        ? ?/sec    1.00      3.8±0.12ms        ? ?/sec
physical_plan_clickbench_q14                          1.05      4.3±0.09ms        ? ?/sec    1.00      4.1±0.04ms        ? ?/sec
physical_plan_clickbench_q15                          1.04      4.0±0.07ms        ? ?/sec    1.00      3.9±0.09ms        ? ?/sec
physical_plan_clickbench_q16                          1.05      3.9±0.07ms        ? ?/sec    1.00      3.7±0.04ms        ? ?/sec
physical_plan_clickbench_q17                          1.05      4.0±0.07ms        ? ?/sec    1.00      3.8±0.02ms        ? ?/sec
physical_plan_clickbench_q18                          1.02      2.7±0.03ms        ? ?/sec    1.00      2.7±0.02ms        ? ?/sec
physical_plan_clickbench_q19                          1.02      4.3±0.12ms        ? ?/sec    1.00      4.2±0.09ms        ? ?/sec
physical_plan_clickbench_q2                           1.03      2.9±0.03ms        ? ?/sec    1.00      2.8±0.06ms        ? ?/sec
physical_plan_clickbench_q20                          1.01      2.2±0.03ms        ? ?/sec    1.00      2.2±0.02ms        ? ?/sec
physical_plan_clickbench_q21                          1.01      2.8±0.04ms        ? ?/sec    1.00      2.8±0.05ms        ? ?/sec
physical_plan_clickbench_q22                          1.04      4.1±0.07ms        ? ?/sec    1.00      3.9±0.05ms        ? ?/sec
physical_plan_clickbench_q23                          1.03      4.3±0.08ms        ? ?/sec    1.00      4.2±0.03ms        ? ?/sec
physical_plan_clickbench_q24                          1.02      4.9±0.07ms        ? ?/sec    1.00      4.8±0.04ms        ? ?/sec
physical_plan_clickbench_q25                          1.01      3.5±0.03ms        ? ?/sec    1.00      3.5±0.09ms        ? ?/sec
physical_plan_clickbench_q26                          1.02      3.0±0.06ms        ? ?/sec    1.00      2.9±0.04ms        ? ?/sec
physical_plan_clickbench_q27                          1.01      3.6±0.10ms        ? ?/sec    1.00      3.5±0.06ms        ? ?/sec
physical_plan_clickbench_q28                          1.02      4.5±0.10ms        ? ?/sec    1.00      4.5±0.06ms        ? ?/sec
physical_plan_clickbench_q29                          1.02      4.8±0.07ms        ? ?/sec    1.00      4.7±0.04ms        ? ?/sec
physical_plan_clickbench_q3                           1.02      2.6±0.03ms        ? ?/sec    1.00      2.5±0.03ms        ? ?/sec
physical_plan_clickbench_q30                          1.04     16.1±0.21ms        ? ?/sec    1.00     15.5±0.22ms        ? ?/sec
physical_plan_clickbench_q31                          1.02      4.5±0.03ms        ? ?/sec    1.00      4.5±0.05ms        ? ?/sec
physical_plan_clickbench_q32                          1.02      4.5±0.04ms        ? ?/sec    1.00      4.5±0.04ms        ? ?/sec
physical_plan_clickbench_q33                          1.02      3.7±0.09ms        ? ?/sec    1.00      3.6±0.08ms        ? ?/sec
physical_plan_clickbench_q34                          1.02      3.3±0.03ms        ? ?/sec    1.00      3.2±0.03ms        ? ?/sec
physical_plan_clickbench_q35                          1.01      3.4±0.03ms        ? ?/sec    1.00      3.3±0.03ms        ? ?/sec
physical_plan_clickbench_q36                          1.03      4.3±0.08ms        ? ?/sec    1.00      4.2±0.04ms        ? ?/sec
physical_plan_clickbench_q37                          1.02      4.7±0.09ms        ? ?/sec    1.00      4.7±0.06ms        ? ?/sec
physical_plan_clickbench_q38                          1.01      4.7±0.04ms        ? ?/sec    1.00      4.7±0.07ms        ? ?/sec
physical_plan_clickbench_q39                          1.01      4.1±0.07ms        ? ?/sec    1.00      4.1±0.05ms        ? ?/sec
physical_plan_clickbench_q4                           1.03      2.2±0.04ms        ? ?/sec    1.00      2.2±0.02ms        ? ?/sec
physical_plan_clickbench_q40                          1.01      5.0±0.05ms        ? ?/sec    1.00      4.9±0.07ms        ? ?/sec
physical_plan_clickbench_q41                          1.01      4.3±0.03ms        ? ?/sec    1.00      4.3±0.08ms        ? ?/sec
physical_plan_clickbench_q42                          1.01      4.3±0.07ms        ? ?/sec    1.00      4.2±0.05ms        ? ?/sec
physical_plan_clickbench_q43                          1.01      4.6±0.06ms        ? ?/sec    1.00      4.5±0.06ms        ? ?/sec
physical_plan_clickbench_q44                          1.01      2.3±0.08ms        ? ?/sec    1.00      2.3±0.02ms        ? ?/sec
physical_plan_clickbench_q45                          1.01      2.3±0.02ms        ? ?/sec    1.00      2.3±0.03ms        ? ?/sec
physical_plan_clickbench_q46                          1.01      3.2±0.08ms        ? ?/sec    1.00      3.2±0.03ms        ? ?/sec
physical_plan_clickbench_q47                          1.02      4.8±0.05ms        ? ?/sec    1.00      4.7±0.07ms        ? ?/sec
physical_plan_clickbench_q48                          1.00      5.2±0.06ms        ? ?/sec    1.00      5.3±0.19ms        ? ?/sec
physical_plan_clickbench_q49                          1.01      5.5±0.11ms        ? ?/sec    1.00      5.4±0.08ms        ? ?/sec
physical_plan_clickbench_q5                           1.04      2.6±0.10ms        ? ?/sec    1.00      2.5±0.02ms        ? ?/sec
physical_plan_clickbench_q50                          1.01      4.2±0.11ms        ? ?/sec    1.00      4.2±0.04ms        ? ?/sec
physical_plan_clickbench_q51                          1.01      3.6±0.08ms        ? ?/sec    1.00      3.6±0.07ms        ? ?/sec
physical_plan_clickbench_q6                           1.05      2.6±0.07ms        ? ?/sec    1.00      2.5±0.02ms        ? ?/sec
physical_plan_clickbench_q7                           1.04      2.2±0.04ms        ? ?/sec    1.00      2.1±0.02ms        ? ?/sec
physical_plan_clickbench_q8                           1.05      3.6±0.10ms        ? ?/sec    1.00      3.5±0.03ms        ? ?/sec
physical_plan_clickbench_q9                           1.04      3.8±0.09ms        ? ?/sec    1.00      3.6±0.08ms        ? ?/sec
physical_plan_tpcds_all                               1.00  1968.6±20.06ms        ? ?/sec    1.00  1966.2±18.94ms        ? ?/sec
physical_plan_tpch_all                                1.00    129.4±1.21ms        ? ?/sec    1.00    129.9±1.06ms        ? ?/sec
physical_plan_tpch_q1                                 1.03      3.1±0.03ms        ? ?/sec    1.00      3.0±0.06ms        ? ?/sec
physical_plan_tpch_q10                                1.01      7.4±0.06ms        ? ?/sec    1.00      7.3±0.10ms        ? ?/sec
physical_plan_tpch_q11                                1.01      8.7±0.15ms        ? ?/sec    1.00      8.6±0.17ms        ? ?/sec
physical_plan_tpch_q12                                1.01      3.1±0.03ms        ? ?/sec    1.00      3.1±0.03ms        ? ?/sec
physical_plan_tpch_q13                                1.01      3.1±0.06ms        ? ?/sec    1.00      3.0±0.02ms        ? ?/sec
physical_plan_tpch_q14                                1.02      3.2±0.03ms        ? ?/sec    1.00      3.1±0.03ms        ? ?/sec
physical_plan_tpch_q16                                1.01      5.3±0.04ms        ? ?/sec    1.00      5.2±0.05ms        ? ?/sec
physical_plan_tpch_q17                                1.03      5.8±0.04ms        ? ?/sec    1.00      5.6±0.07ms        ? ?/sec
physical_plan_tpch_q18                                1.01      6.1±0.06ms        ? ?/sec    1.00      6.0±0.10ms        ? ?/sec
physical_plan_tpch_q19                                1.03      5.3±0.03ms        ? ?/sec    1.00      5.1±0.10ms        ? ?/sec
physical_plan_tpch_q2                                 1.03     12.7±0.13ms        ? ?/sec    1.00     12.4±0.17ms        ? ?/sec
physical_plan_tpch_q20                                1.00      8.2±0.06ms        ? ?/sec    1.01      8.3±0.15ms        ? ?/sec
physical_plan_tpch_q21                                1.01     10.3±0.10ms        ? ?/sec    1.00     10.2±0.14ms        ? ?/sec
physical_plan_tpch_q22                                1.01      6.6±0.10ms        ? ?/sec    1.00      6.5±0.06ms        ? ?/sec
physical_plan_tpch_q3                                 1.02      5.7±0.04ms        ? ?/sec    1.00      5.6±0.04ms        ? ?/sec
physical_plan_tpch_q4                                 1.01      3.1±0.05ms        ? ?/sec    1.00      3.0±0.07ms        ? ?/sec
physical_plan_tpch_q5                                 1.01      6.0±0.04ms        ? ?/sec    1.00      6.0±0.10ms        ? ?/sec
physical_plan_tpch_q6                                 1.01  1617.3±16.07µs        ? ?/sec    1.00  1604.1±25.37µs        ? ?/sec
physical_plan_tpch_q7                                 1.02      7.4±0.08ms        ? ?/sec    1.00      7.2±0.12ms        ? ?/sec
physical_plan_tpch_q8                                 1.01      9.5±0.07ms        ? ?/sec    1.00      9.3±0.09ms        ? ?/sec
physical_plan_tpch_q9                                 1.00      6.8±0.04ms        ? ?/sec    1.00      6.8±0.10ms        ? ?/sec
physical_select_aggregates_from_200                   1.00     17.7±0.13ms        ? ?/sec    1.00     17.7±0.11ms        ? ?/sec
physical_select_all_from_1000                         1.00     23.6±0.24ms        ? ?/sec    1.05     24.7±0.17ms        ? ?/sec
physical_select_one_from_700                          1.02   1357.1±8.44µs        ? ?/sec    1.00  1327.1±11.77µs        ? ?/sec
physical_sorted_union_order_by_10_int64               1.02     11.3±0.09ms        ? ?/sec    1.00     11.1±0.11ms        ? ?/sec
physical_sorted_union_order_by_10_uint64              1.02     30.8±0.37ms        ? ?/sec    1.00     30.2±0.37ms        ? ?/sec
physical_sorted_union_order_by_50_int64               1.01    200.9±2.34ms        ? ?/sec    1.00    199.8±4.52ms        ? ?/sec
physical_sorted_union_order_by_50_uint64              1.04   1148.2±7.76ms        ? ?/sec    1.00  1104.5±15.61ms        ? ?/sec
physical_theta_join_consider_sort                     1.07      2.8±0.02ms        ? ?/sec    1.00      2.6±0.02ms        ? ?/sec
physical_unnest_to_join                               1.05      3.2±0.04ms        ? ?/sec    1.00      3.1±0.04ms        ? ?/sec
physical_window_function_partition_by_12_on_values    1.02  1612.1±18.59µs        ? ?/sec    1.00  1587.3±11.05µs        ? ?/sec
physical_window_function_partition_by_30_on_values    1.02      3.0±0.03ms        ? ?/sec    1.00      2.9±0.06ms        ? ?/sec
physical_window_function_partition_by_4_on_values     1.01  1095.9±10.86µs        ? ?/sec    1.00  1085.1±20.62µs        ? ?/sec
physical_window_function_partition_by_7_on_values     1.02  1280.3±30.33µs        ? ?/sec    1.00  1261.3±17.40µs        ? ?/sec
physical_window_function_partition_by_8_on_values     1.02   1347.7±7.20µs        ? ?/sec    1.00  1323.7±12.37µs        ? ?/sec
with_param_values_many_columns                        1.00    577.1±5.70µs        ? ?/sec    1.05   606.5±14.52µs        ? ?/sec

@adriangb
Contributor Author

adriangb commented Feb 3, 2026

🤖: Benchmark completed

@alamb are there any of these that use structs? It seems like this has no impact on the benchmarks (good!) but maybe we should add some that hit the full rewrite?

@adriangb
Contributor Author

adriangb commented Feb 3, 2026

BTW codex found a test that shows a single projection being extracted doesn't get pushed down

I can make this a separate PR if you like

note how the get_field is not pushed into the datasource:

###
# Test 2.1b: Projection-only get_field (potential optimization target)
###

query TT
EXPLAIN SELECT s['label'] FROM simple_struct;
----
logical_plan
01)Projection: get_field(simple_struct.s, Utf8("label"))
02)--TableScan: simple_struct projection=[s]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[get_field(s@1, label) as simple_struct.s[label]], file_type=parquet

# Verify correctness
query T
SELECT s['label'] FROM simple_struct ORDER BY s['label'];
----
alpha
beta
delta
epsilon
gamma

Thanks for reporting. I’ll make a new PR with this test + join tests + benchmarks.

@adriangb
Contributor Author

adriangb commented Feb 3, 2026

Did you consider incorporating this logic directly into the OptimizeProjections? It seems like this transformation is really just a mechanism to enable OptimizeProjections 🤔

OptimizeProjections merges projections but it does not push them down. There is no logical optimizer rule for pushing down projections. We could move that part of this rule (the pushdown) into OptimizeProjections or into a new rule, but I’d find it easier to keep it all in one place for now.
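To make "merges projections" concrete, here is a minimal sketch of the merge step only, on a toy expression type. `Expr`, `NamedExpr`, `substitute` and `merge_projections` are hypothetical names made up for illustration and are not DataFusion's API; the `__datafusion_extracted_1` alias follows the naming scheme from the PR description. The sketch collapses an outer projection over an inner one by inlining the inner aliases, and it deliberately does nothing about moving the result towards the scan.

```rust
/// Toy expression type; a stand-in for illustration, not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Reference to a column (or to an alias produced by the input projection).
    Column(String),
    /// Struct field access, e.g. `s['label']`.
    GetField(Box<Expr>, String),
}

/// An expression together with its output name.
type NamedExpr = (Expr, String);

/// Replaces column references that match an alias of the inner projection
/// with the expression that alias stands for.
fn substitute(expr: &Expr, inner: &[NamedExpr]) -> Expr {
    match expr {
        Expr::Column(name) => inner
            .iter()
            .find(|(_, alias)| alias == name)
            .map(|(e, _)| e.clone())
            .unwrap_or_else(|| expr.clone()),
        Expr::GetField(base, field) => {
            Expr::GetField(Box::new(substitute(base, inner)), field.clone())
        }
    }
}

/// Collapses `outer` over `inner` into a single projection list that keeps
/// the outer output names.
fn merge_projections(outer: &[NamedExpr], inner: &[NamedExpr]) -> Vec<NamedExpr> {
    outer
        .iter()
        .map(|(expr, name)| (substitute(expr, inner), name.clone()))
        .collect()
}
```

With an inner projection `[get_field(s, label) AS __datafusion_extracted_1, id]` and an outer projection `[id, __datafusion_extracted_1 AS s[label]]`, the merged list is `[id, get_field(s, label) AS s[label]]`. Whether that single projection then ends up in the scan is a separate step, which is the part this comment says no logical rule handles today.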

@alamb
Contributor

alamb commented Feb 3, 2026

@alamb are there any of these that use structs? It seems like this has no impact on the benchmarks (good!) but maybe we should add some that hit the full rewrite?

I was more trying to quantify the effect on planning time of adding a new optimizer pass -- it seems like it is a small but noticeable slowdown (1-3%). I'll see if I can reproduce those results

@alamb
Contributor

alamb commented Feb 3, 2026

run benchmark sql_planner

@adriangb
Contributor Author

adriangb commented Feb 3, 2026

@alamb are there any of these that use structs? It seems like this has no impact on the benchmarks (good!) but maybe we should add some that hit the full rewrite?

I was more trying to quantify the effect on planning time of adding a new optimizer pass -- it seems like it is a small but noticeable slowdown (1-3%). I'll see if I can reproduce those results

It makes sense that there’s a small slowdown: the rule has to visit every node in the plan even if it doesn’t modify anything. That said, a lot of the numbers were within the variation, i.e. not statistically different.
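A rough way to check the "within the variation" point against individual rows of the tables is sketched below. It assumes the `±` figure can be treated as a spread around the central estimate and uses a plain interval-overlap check, which is a heuristic and not a proper significance test. `Measurement`, `slowdown_percent` and `intervals_overlap` are hypothetical helpers written for this illustration.

```rust
/// One benchmark reading: central estimate and the ± spread, in the same unit.
#[derive(Debug, Clone, Copy)]
struct Measurement {
    mean: f64,
    spread: f64,
}

/// Relative slowdown of `branch` versus `base`, in percent
/// (positive means the branch is slower).
fn slowdown_percent(branch: Measurement, base: Measurement) -> f64 {
    (branch.mean / base.mean - 1.0) * 100.0
}

/// True when the two `mean ± spread` intervals overlap, i.e. the difference
/// could plausibly be noise under this crude heuristic.
fn intervals_overlap(a: Measurement, b: Measurement) -> bool {
    (a.mean - a.spread) <= (b.mean + b.spread) && (b.mean - b.spread) <= (a.mean + a.spread)
}
```

For example, `physical_plan_tpch_q5` (6.0±0.04ms vs 6.0±0.10ms) overlaps, so that row says nothing either way, while `physical_theta_join_consider_sort` (2.8±0.02ms vs 2.6±0.02ms) does not overlap and works out to a slowdown of roughly 7-8% on the rounded figures.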

@alamb
Contributor

alamb commented Feb 3, 2026

It makes sense that there’s a small slowdown: the rule has to visit every node in the plan even if it doesn’t modify anything. That said, a lot of the numbers were within the variation, i.e. not statistically different.

Yeah, there is a tension here for sure

I do think that, in general, it would be a good idea to consolidate optimizer passes, given that each one basically deep-copies the plan.
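As a rough illustration of why consolidating passes could help, the toy sketch below compares two separate rewrites with one fused rewrite over the same tree and counts node visits. `Plan`, `rewrite`, `two_passes` and `one_fused_pass` are hypothetical and far simpler than DataFusion's optimizer; real rules cannot always be fused this easily, so treat this as a cost model only.

```rust
/// Toy plan tree; a stand-in for illustration, not DataFusion's `LogicalPlan`.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Node { label: String, input: Box<Plan> },
}

/// Rebuilds the whole tree, applying `f` to every label and counting visits.
/// Every node is visited and reconstructed, whether or not `f` changes it.
fn rewrite(plan: Plan, f: &dyn Fn(String) -> String, visits: &mut usize) -> Plan {
    *visits += 1;
    match plan {
        Plan::Scan(table) => Plan::Scan(f(table)),
        Plan::Node { label, input } => Plan::Node {
            label: f(label),
            input: Box::new(rewrite(*input, f, visits)),
        },
    }
}

/// Two independent passes: the tree is walked and rebuilt twice.
fn two_passes(plan: Plan, visits: &mut usize) -> Plan {
    let plan = rewrite(plan, &|l: String| l.to_uppercase(), visits);
    rewrite(plan, &|l: String| format!("{}!", l), visits)
}

/// The same two transformations fused into a single walk.
fn one_fused_pass(plan: Plan, visits: &mut usize) -> Plan {
    rewrite(plan, &|l: String| format!("{}!", l.to_uppercase()), visits)
}
```

On a three-node plan, `two_passes` records six visits and `one_fused_pass` records three, with identical output. That is the fixed per-pass cost being discussed here, independent of whether a pass changes anything.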

@alamb-ghbot

🤖 ./gh_compare_branch_bench.sh compare_branch_bench.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing get-field-pushdown-try-3 (e95acd3) to 9962911 diff
BENCH_NAME=sql_planner
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner
BENCH_FILTER=
BENCH_BRANCH_NAME=get-field-pushdown-try-3
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                 get-field-pushdown-try-3               main
-----                                                 ------------------------               ----
logical_aggregate_with_join                           1.00    645.5±4.79µs        ? ?/sec    1.01   650.9±30.77µs        ? ?/sec
logical_select_all_from_1000                          1.00     10.3±0.08ms        ? ?/sec    1.12     11.6±0.31ms        ? ?/sec
logical_select_one_from_700                           1.01    421.8±3.21µs        ? ?/sec    1.00    419.6±8.67µs        ? ?/sec
logical_trivial_join_high_numbered_columns            1.01   380.8±12.06µs        ? ?/sec    1.00    378.2±7.50µs        ? ?/sec
logical_trivial_join_low_numbered_columns             1.02   369.0±24.49µs        ? ?/sec    1.00    362.5±4.04µs        ? ?/sec
physical_intersection                                 1.02  1633.7±27.83µs        ? ?/sec    1.00  1601.4±29.28µs        ? ?/sec
physical_join_consider_sort                           1.02      2.3±0.04ms        ? ?/sec    1.00      2.3±0.02ms        ? ?/sec
physical_join_distinct                                1.03    360.7±3.20µs        ? ?/sec    1.00    351.4±2.61µs        ? ?/sec
physical_many_self_joins                              1.01     12.8±0.09ms        ? ?/sec    1.00     12.7±0.23ms        ? ?/sec
physical_plan_clickbench_all                          1.02    205.6±2.81ms        ? ?/sec    1.00    200.7±2.24ms        ? ?/sec
physical_plan_clickbench_q1                           1.02      2.2±0.03ms        ? ?/sec    1.00      2.1±0.03ms        ? ?/sec
physical_plan_clickbench_q10                          1.02      3.7±0.04ms        ? ?/sec    1.00      3.6±0.03ms        ? ?/sec
physical_plan_clickbench_q11                          1.11      4.6±0.18ms        ? ?/sec    1.00      4.1±0.10ms        ? ?/sec
physical_plan_clickbench_q12                          1.04      4.4±0.08ms        ? ?/sec    1.00      4.2±0.03ms        ? ?/sec
physical_plan_clickbench_q13                          1.03      3.8±0.07ms        ? ?/sec    1.00      3.8±0.03ms        ? ?/sec
physical_plan_clickbench_q14                          1.03      4.2±0.07ms        ? ?/sec    1.00      4.1±0.03ms        ? ?/sec
physical_plan_clickbench_q15                          1.02      3.9±0.03ms        ? ?/sec    1.00      3.8±0.04ms        ? ?/sec
physical_plan_clickbench_q16                          1.01      3.7±0.06ms        ? ?/sec    1.00      3.7±0.03ms        ? ?/sec
physical_plan_clickbench_q17                          1.02      3.9±0.07ms        ? ?/sec    1.00      3.8±0.08ms        ? ?/sec
physical_plan_clickbench_q18                          1.01      2.7±0.03ms        ? ?/sec    1.00      2.7±0.03ms        ? ?/sec
physical_plan_clickbench_q19                          1.05      4.4±0.09ms        ? ?/sec    1.00      4.2±0.04ms        ? ?/sec
physical_plan_clickbench_q2                           1.06      3.0±0.06ms        ? ?/sec    1.00      2.8±0.05ms        ? ?/sec
physical_plan_clickbench_q20                          1.05      2.3±0.04ms        ? ?/sec    1.00      2.2±0.02ms        ? ?/sec
physical_plan_clickbench_q21                          1.02      2.8±0.03ms        ? ?/sec    1.00      2.8±0.02ms        ? ?/sec
physical_plan_clickbench_q22                          1.02      4.0±0.05ms        ? ?/sec    1.00      3.9±0.05ms        ? ?/sec
physical_plan_clickbench_q23                          1.02      4.3±0.12ms        ? ?/sec    1.00      4.2±0.12ms        ? ?/sec
physical_plan_clickbench_q24                          1.03      4.9±0.08ms        ? ?/sec    1.00      4.8±0.04ms        ? ?/sec
physical_plan_clickbench_q25                          1.02      3.5±0.03ms        ? ?/sec    1.00      3.5±0.02ms        ? ?/sec
physical_plan_clickbench_q26                          1.02      3.0±0.03ms        ? ?/sec    1.00      2.9±0.03ms        ? ?/sec
physical_plan_clickbench_q27                          1.02      3.6±0.05ms        ? ?/sec    1.00      3.5±0.03ms        ? ?/sec
physical_plan_clickbench_q28                          1.04      4.6±0.19ms        ? ?/sec    1.00      4.5±0.11ms        ? ?/sec
physical_plan_clickbench_q29                          1.02      4.8±0.07ms        ? ?/sec    1.00      4.8±0.19ms        ? ?/sec
physical_plan_clickbench_q3                           1.02      2.6±0.07ms        ? ?/sec    1.00      2.5±0.04ms        ? ?/sec
physical_plan_clickbench_q30                          1.03     16.1±0.23ms        ? ?/sec    1.00     15.7±0.41ms        ? ?/sec
physical_plan_clickbench_q31                          1.02      4.6±0.03ms        ? ?/sec    1.00      4.5±0.11ms        ? ?/sec
physical_plan_clickbench_q32                          1.02      4.5±0.08ms        ? ?/sec    1.00      4.5±0.03ms        ? ?/sec
physical_plan_clickbench_q33                          1.02      3.7±0.04ms        ? ?/sec    1.00      3.6±0.03ms        ? ?/sec
physical_plan_clickbench_q34                          1.01      3.3±0.03ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_clickbench_q35                          1.02      3.4±0.07ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_clickbench_q36                          1.03      4.3±0.11ms        ? ?/sec    1.00      4.2±0.05ms        ? ?/sec
physical_plan_clickbench_q37                          1.01      4.8±0.05ms        ? ?/sec    1.00      4.7±0.12ms        ? ?/sec
physical_plan_clickbench_q38                          1.01      4.7±0.08ms        ? ?/sec    1.00      4.7±0.05ms        ? ?/sec
physical_plan_clickbench_q39                          1.01      4.1±0.15ms        ? ?/sec    1.00      4.1±0.09ms        ? ?/sec
physical_plan_clickbench_q4                           1.00      2.2±0.02ms        ? ?/sec    1.00      2.2±0.03ms        ? ?/sec
physical_plan_clickbench_q40                          1.02      5.0±0.04ms        ? ?/sec    1.00      4.9±0.05ms        ? ?/sec
physical_plan_clickbench_q41                          1.01      4.3±0.05ms        ? ?/sec    1.00      4.3±0.05ms        ? ?/sec
physical_plan_clickbench_q42                          1.02      4.3±0.09ms        ? ?/sec    1.00      4.2±0.05ms        ? ?/sec
physical_plan_clickbench_q43                          1.02      4.6±0.10ms        ? ?/sec    1.00      4.5±0.05ms        ? ?/sec
physical_plan_clickbench_q44                          1.01      2.3±0.03ms        ? ?/sec    1.00      2.3±0.01ms        ? ?/sec
physical_plan_clickbench_q45                          1.02      2.3±0.04ms        ? ?/sec    1.00      2.3±0.02ms        ? ?/sec
physical_plan_clickbench_q46                          1.02      3.3±0.03ms        ? ?/sec    1.00      3.2±0.02ms        ? ?/sec
physical_plan_clickbench_q47                          1.04      4.9±0.08ms        ? ?/sec    1.00      4.7±0.05ms        ? ?/sec
physical_plan_clickbench_q48                          1.03      5.3±0.07ms        ? ?/sec    1.00      5.2±0.07ms        ? ?/sec
physical_plan_clickbench_q49                          1.01      5.5±0.13ms        ? ?/sec    1.00      5.5±0.08ms        ? ?/sec
physical_plan_clickbench_q5                           1.02      2.6±0.03ms        ? ?/sec    1.00      2.5±0.01ms        ? ?/sec
physical_plan_clickbench_q50                          1.00      4.2±0.03ms        ? ?/sec    1.00      4.2±0.12ms        ? ?/sec
physical_plan_clickbench_q51                          1.02      3.6±0.04ms        ? ?/sec    1.00      3.5±0.04ms        ? ?/sec
physical_plan_clickbench_q6                           1.02      2.6±0.03ms        ? ?/sec    1.00      2.5±0.02ms        ? ?/sec
physical_plan_clickbench_q7                           1.01      2.1±0.02ms        ? ?/sec    1.00      2.1±0.02ms        ? ?/sec
physical_plan_clickbench_q8                           1.02      3.5±0.05ms        ? ?/sec    1.00      3.4±0.06ms        ? ?/sec
physical_plan_clickbench_q9                           1.01      3.6±0.04ms        ? ?/sec    1.00      3.6±0.10ms        ? ?/sec
physical_plan_tpcds_all                               1.03  1985.3±26.56ms        ? ?/sec    1.00  1934.2±12.37ms        ? ?/sec
physical_plan_tpch_all                                1.02    130.3±2.40ms        ? ?/sec    1.00    127.4±1.33ms        ? ?/sec
physical_plan_tpch_q1                                 1.08      3.2±0.07ms        ? ?/sec    1.00      3.0±0.04ms        ? ?/sec
physical_plan_tpch_q10                                1.08      7.9±0.11ms        ? ?/sec    1.00      7.3±0.08ms        ? ?/sec
physical_plan_tpch_q11                                1.04      8.9±0.09ms        ? ?/sec    1.00      8.6±0.14ms        ? ?/sec
physical_plan_tpch_q12                                1.03      3.2±0.05ms        ? ?/sec    1.00      3.1±0.06ms        ? ?/sec
physical_plan_tpch_q13                                1.04      3.2±0.15ms        ? ?/sec    1.00      3.0±0.07ms        ? ?/sec
physical_plan_tpch_q14                                1.03      3.3±0.08ms        ? ?/sec    1.00      3.2±0.07ms        ? ?/sec
physical_plan_tpch_q16                                1.04      5.4±0.17ms        ? ?/sec    1.00      5.2±0.04ms        ? ?/sec
physical_plan_tpch_q17                                1.05      5.9±0.18ms        ? ?/sec    1.00      5.7±0.05ms        ? ?/sec
physical_plan_tpch_q18                                1.04      6.2±0.11ms        ? ?/sec    1.00      6.0±0.08ms        ? ?/sec
physical_plan_tpch_q19                                1.06      5.3±0.09ms        ? ?/sec    1.00      5.1±0.07ms        ? ?/sec
physical_plan_tpch_q2                                 1.05     13.0±0.16ms        ? ?/sec    1.00     12.4±0.13ms        ? ?/sec
physical_plan_tpch_q20                                1.10      8.8±0.11ms        ? ?/sec    1.00      8.0±0.05ms        ? ?/sec
physical_plan_tpch_q21                                1.05     10.6±0.20ms        ? ?/sec    1.00     10.2±0.16ms        ? ?/sec
physical_plan_tpch_q22                                1.03      6.6±0.07ms        ? ?/sec    1.00      6.4±0.07ms        ? ?/sec
physical_plan_tpch_q3                                 1.04      5.9±0.10ms        ? ?/sec    1.00      5.6±0.14ms        ? ?/sec
physical_plan_tpch_q4                                 1.03      3.1±0.05ms        ? ?/sec    1.00      3.0±0.09ms        ? ?/sec
physical_plan_tpch_q5                                 1.03      6.2±0.06ms        ? ?/sec    1.00      6.0±0.04ms        ? ?/sec
physical_plan_tpch_q6                                 1.03  1652.5±25.45µs        ? ?/sec    1.00  1598.0±18.11µs        ? ?/sec
physical_plan_tpch_q7                                 1.03      7.5±0.16ms        ? ?/sec    1.00      7.2±0.04ms        ? ?/sec
physical_plan_tpch_q8                                 1.04      9.8±0.11ms        ? ?/sec    1.00      9.4±0.06ms        ? ?/sec
physical_plan_tpch_q9                                 1.05      7.0±0.09ms        ? ?/sec    1.00      6.7±0.07ms        ? ?/sec
physical_select_aggregates_from_200                   1.00     17.7±0.14ms        ? ?/sec    1.01     17.9±0.20ms        ? ?/sec
physical_select_all_from_1000                         1.00     23.6±0.18ms        ? ?/sec    1.06     25.1±0.22ms        ? ?/sec
physical_select_one_from_700                          1.02  1360.2±32.81µs        ? ?/sec    1.00  1337.9±14.27µs        ? ?/sec
physical_sorted_union_order_by_10_int64               1.03     11.4±0.44ms        ? ?/sec    1.00     11.1±0.05ms        ? ?/sec
physical_sorted_union_order_by_10_uint64              1.04     31.0±0.23ms        ? ?/sec    1.00     29.9±0.23ms        ? ?/sec
physical_sorted_union_order_by_50_int64               1.02    201.8±2.06ms        ? ?/sec    1.00    197.7±2.07ms        ? ?/sec
physical_sorted_union_order_by_50_uint64              1.06  1164.9±18.35ms        ? ?/sec    1.00   1095.4±9.10ms        ? ?/sec
physical_theta_join_consider_sort                     1.08      2.9±0.06ms        ? ?/sec    1.00      2.6±0.03ms        ? ?/sec
physical_unnest_to_join                               1.05      3.2±0.05ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
physical_window_function_partition_by_12_on_values    1.01  1623.0±34.06µs        ? ?/sec    1.00  1602.8±26.71µs        ? ?/sec
physical_window_function_partition_by_30_on_values    1.01      3.0±0.02ms        ? ?/sec    1.00      3.0±0.03ms        ? ?/sec
physical_window_function_partition_by_4_on_values     1.00   1103.3±7.88µs        ? ?/sec    1.00  1106.5±18.21µs        ? ?/sec
physical_window_function_partition_by_7_on_values     1.00  1282.8±10.81µs        ? ?/sec    1.00  1279.9±13.22µs        ? ?/sec
physical_window_function_partition_by_8_on_values     1.01  1370.3±24.28µs        ? ?/sec    1.00  1350.5±11.53µs        ? ?/sec
with_param_values_many_columns                        1.00    575.5±4.77µs        ? ?/sec    1.05    602.3±6.73µs        ? ?/sec

@adriangb
Contributor Author

adriangb commented Feb 4, 2026

BTW, Codex found a test showing that a single extracted projection doesn't get pushed down.
I can make this a separate PR if you like.
Note how the get_field is not pushed into the data source:

###
# Test 2.1b: Projection-only get_field (potential optimization target)
###

query TT
EXPLAIN SELECT s['label'] FROM simple_struct;
----
logical_plan
01)Projection: get_field(simple_struct.s, Utf8("label"))
02)--TableScan: simple_struct projection=[s]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[get_field(s@1, label) as simple_struct.s[label]], file_type=parquet

# Verify correctness
query T
SELECT s['label'] FROM simple_struct ORDER BY s['label'];
----
alpha
beta
delta
epsilon
gamma

Thanks for reporting. I’ll make a new PR with this test + join tests + benchmarks.

Done in #20143

adriangb added a commit to pydantic/datafusion that referenced this pull request Feb 4, 2026

Labels

logical-expr (Logical plan and expressions), optimizer (Optimizer rules), sqllogictest (SQL Logic Tests (.slt))

4 participants