feat: add ExtractLeafExpressions optimizer rule for get_field pushdown#20117
adriangb wants to merge 22 commits into apache:main
Conversation
This PR adds a new optimizer rule `ExtractLeafExpressions` that extracts `MoveTowardsLeafNodes` sub-expressions (like `get_field`) from Filter, Sort, Limit, Aggregate, and Projection nodes into intermediate projections. This normalization allows `OptimizeProjections` (which runs next) to merge consecutive projections and push `get_field` expressions down to the scan, enabling Parquet column pruning for struct fields.

Example transformation for projections:

```sql
SELECT id, s['label'] FROM t WHERE s['value'] > 150
```

Before: `get_field(s, 'label')` stayed in ProjectionExec, reading the full struct.
After: Both `get_field` expressions pushed to DataSourceExec.

The rule:
- Extracts `MoveTowardsLeafNodes` expressions into `__leaf_N` aliases
- Creates inner projections with extracted expressions + pass-through columns
- Creates outer projections to restore original schema names
- Handles deduplication of identical expressions
- Skips expressions already aliased with `__leaf_*` to ensure idempotency

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
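The two-phase idea above can be illustrated with a minimal, self-contained sketch. The `Expr` enum and `extract_leaves` helper here are toy stand-ins, not DataFusion's actual types: every `GetField` sub-expression is replaced by a `__leaf_N` column reference and recorded so it can become an inner projection, with identical expressions deduplicated to the same alias.

```rust
use std::collections::HashMap;

// Toy stand-in for DataFusion's `Expr` (illustration only).
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum Expr {
    Column(String),
    // A `MoveTowardsLeafNodes`-style expression, e.g. get_field(s, 'value')
    GetField(Box<Expr>, String),
    // Some non-leaf computation, e.g. `expr > literal`
    Gt(Box<Expr>, i64),
}

// Replace every `GetField` sub-expression with a `__leaf_N` column reference,
// recording the extracted expression so it can form the inner projection.
// Identical expressions are deduplicated to the same alias via `seen`.
fn extract_leaves(
    expr: Expr,
    extracted: &mut Vec<(String, Expr)>,
    seen: &mut HashMap<Expr, String>,
) -> Expr {
    match expr {
        e @ Expr::GetField(..) => {
            if let Some(alias) = seen.get(&e) {
                return Expr::Column(alias.clone());
            }
            let alias = format!("__leaf_{}", extracted.len() + 1);
            seen.insert(e.clone(), alias.clone());
            extracted.push((alias.clone(), e));
            Expr::Column(alias)
        }
        Expr::Gt(inner, n) => Expr::Gt(Box::new(extract_leaves(*inner, extracted, seen)), n),
        other => other,
    }
}

fn main() {
    // WHERE s['value'] > 150
    let pred = Expr::Gt(
        Box::new(Expr::GetField(Box::new(Expr::Column("s".into())), "value".into())),
        150,
    );
    let mut extracted = Vec::new();
    let mut seen = HashMap::new();
    let rewritten = extract_leaves(pred, &mut extracted, &mut seen);
    // The predicate now references the alias; the GetField moved to `extracted`,
    // which would become the inner projection.
    assert_eq!(rewritten, Expr::Gt(Box::new(Expr::Column("__leaf_1".into())), 150));
    assert_eq!(extracted.len(), 1);
    println!("rewritten: {rewritten:?}");
}
```

The outer projection (not shown) would then re-alias `__leaf_1` back to the original schema name, which is why the rewrite is invisible to the rest of the plan.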
Pull request overview
This PR introduces the ExtractLeafExpressions optimizer rule to enable better Parquet column pruning by extracting get_field expressions into intermediate projections. The rule normalizes query plans so that field accessor expressions can be pushed down to DataSource nodes, allowing only required struct fields to be read from Parquet files.
Changes:
- New `ExtractLeafExpressions` optimizer rule that extracts `MoveTowardsLeafNodes` expressions (like `get_field`) from Filter, Sort, Limit, Aggregate, and Projection nodes
- Modified `PushDownFilter` to avoid pushing filters through `__leaf_*` extraction projections
- Updated test expectations across multiple SQL logic test files to reflect new query plans with extracted field expressions
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| datafusion/optimizer/src/extract_leaf_expressions.rs | New optimizer rule implementation with bottom-up traversal to extract and push down leaf expressions |
| datafusion/optimizer/src/optimizer.rs | Registers ExtractLeafExpressions to run before OptimizeProjections |
| datafusion/optimizer/src/lib.rs | Exports the new extract_leaf_expressions module |
| datafusion/optimizer/src/push_down_filter.rs | Adds logic to prevent filter pushdown through __leaf_* extraction projections |
| datafusion/optimizer/src/test/mod.rs | Adds test helper functions for tables with struct fields |
| datafusion/sqllogictest/test_files/projection_pushdown.slt | Updates expected query plans showing __leaf_* aliases and extraction projections |
| datafusion/sqllogictest/test_files/struct.slt | Updates expected projection output to include AS clause for field access |
| datafusion/sqllogictest/test_files/projection.slt | Updates expected logical plan to include AS clause for field access |
| datafusion/sqllogictest/test_files/push_down_filter.slt | Updates expected physical plan showing extraction projection before FilterExec |
| datafusion/sqllogictest/test_files/explain.slt | Adds new optimizer stage output line for extract_leaf_expressions |
Implement `extract_from_join` to extract `MoveTowardsLeafNodes` sub-expressions (like `get_field`) from Join nodes:
- Extract from `on` expressions (equijoin keys)
- Extract from `filter` expressions (non-equi conditions)
- Route extractions to the appropriate side (left/right) based on columns
- Add a recovery projection to restore the original schema

Also adds unit tests and sqllogictest integration tests for:
- Join with get_field in equijoin condition
- Join with get_field in filter (WHERE clause)
- Join with extractions from both sides
- Left join with get_field extraction
- Baseline join without extraction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
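The routing step can be sketched as follows. This is a hedged illustration, not the actual implementation: the `Side` type and `route_extraction` helper are hypothetical, and columns not found in the left schema are assumed to come from the right input.

```rust
use std::collections::HashSet;

// Which join input an extracted expression can be evaluated against
// (hypothetical helper; the real rule works on DataFusion plans).
#[derive(Debug, PartialEq)]
enum Side {
    Left,
    Right,
    Both,
}

// Route based on the columns the expression references. Columns not found
// in the left schema are assumed to come from the right input.
fn route_extraction(referenced: &HashSet<&str>, left_cols: &HashSet<&str>) -> Side {
    let uses_left = referenced.iter().any(|c| left_cols.contains(c));
    let uses_right = referenced.iter().any(|c| !left_cols.contains(c));
    match (uses_left, uses_right) {
        (true, false) => Side::Left,
        (false, true) => Side::Right,
        // References both sides (or nothing): cannot be pushed to one
        // input, so the expression stays above the join.
        _ => Side::Both,
    }
}

fn main() {
    let left: HashSet<&str> = ["user", "id"].into_iter().collect();
    let refs: HashSet<&str> = ["user"].into_iter().collect();
    // `get_field(user, ...)` only touches left columns, so the extraction
    // projection can be inserted on the left input.
    assert_eq!(route_extraction(&refs, &left), Side::Left);
}
```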
```rust
// Everything else passes through unchanged
_ => Ok(Transformed::no(plan)),
```
I'm not sure what else we could handle here. Maybe Extension?
Before we merge this PR we should expand this to explicitly ignore all other nodes, so that if a new node is added one has to decide how this rule should handle it. I'll wait to do that since that's another +30 LOC diff.
If we used the map_expressions API, as suggested above, we would get support for Extension nodes "for free"
```diff
 ----
 logical_plan
-01)Projection: simple_struct.id, get_field(simple_struct.s, Utf8("value")) + Int64(1)
+01)Projection: simple_struct.id, get_field(simple_struct.s, Utf8("value")) + Int64(1) AS simple_struct.s[value] + Int64(1)
```
Note that this is not a change in the output schema name: it is already simple_struct.s[value].
```diff
 #####################
-# Section 12: Cleanup
+# Section 12: Join Tests - get_field Extraction from Join Nodes
```
I can break these out into another PR to reduce the diff if that's helpful.
When `find_extraction_target` returns a Projection that renames columns
(e.g. `user AS x`), both `build_extraction_projection` and
`merge_into_extracted_projection` were adding extracted expressions that
reference the target's output columns (e.g. `col("x")`) to a projection
evaluated against the target's input (which only has `user`).
Fix by resolving extracted expressions and columns_needed through the
projection's rename mapping using `replace_cols_by_name` before merging.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
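The resolution step in this fix can be sketched at the string level. This is a simplified stand-in for `replace_cols_by_name`, operating on plain column names rather than real `Expr` trees:

```rust
use std::collections::HashMap;

// Resolve column names through a projection's rename mapping before merging
// extracted expressions below it. E.g. `user AS x` yields the mapping
// {"x" -> "user"}, so an extracted expression referencing `x` must be
// rewritten to reference `user`, which is what the target's input provides.
// (Simplified, string-level stand-in for `replace_cols_by_name`.)
fn resolve_through_renames(cols: &[&str], renames: &HashMap<&str, &str>) -> Vec<String> {
    cols.iter()
        .map(|c| renames.get(c).copied().unwrap_or(*c).to_string())
        .collect()
}

fn main() {
    let mut renames = HashMap::new();
    renames.insert("x", "user"); // the target projection had `user AS x`
    // Columns the extraction needs, expressed against the projection's output:
    let needed = resolve_through_renames(&["x", "id"], &renames);
    // `x` resolves back to `user`; `id` passes through unchanged.
    assert_eq!(needed, vec!["user".to_string(), "id".to_string()]);
}
```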
```text
04)------ProjectionExec: expr=[get_field(__unnest_placeholder(d.column2,depth=1)@1, a) as __datafusion_extracted_1, column1@0 as column1, __unnest_placeholder(d.column2,depth=1)@1 as __unnest_placeholder(d.column2,depth=1)]
05)--------UnnestExec
```
I think it would be quite complex to try to push the `get_field` through the unnest, and ultimately I don't think Parquet would be able to optimize the scan (maybe I'm wrong about this?), so there would be little point.
Since it wasn't pushed through before I think it is fine that it (still) isn't pushed through
```rust
// Don't push filters through extracted expression projections.
// Pushing filters through would rewrite expressions like `__datafusion_extracted_1 > 150`
// back to `get_field(s,'value') > 150`, undoing the extraction.
if is_extracted_expr_projection(&projection) {
```
This is obviously not great, but I don't see another way to avoid this. Otherwise if we have:

```text
Filter: get_field(col, 'foo') > 1
TableScan: projection=[col]
```

and we run our new rule to get:

```text
Projection: col('col')
Filter: __datafusion_extracted_1 > 1
Projection: get_field(col, 'foo') as __datafusion_extracted_1, col
TableScan: projection=[col]
```

then this rule runs and will produce:

```text
Projection: col('col')
Projection: get_field(col, 'foo') as __datafusion_extracted_1, col
Filter: get_field(col, 'foo') > 1
TableScan projection=[col]
```

because it wants to push the filter under the projection.

I'd argue as a general rule there's no point in pushing a filter under a projection that is purely column selections / get_field expressions, especially if we can't then push it further.

Maybe a more robust fix would be to have the filter pushdown optimizer rule traverse the rest of the plan, find the position it plans to push into, and then check if there's any advantage to doing so (i.e. is it pushing the filter under an expensive operator that benefits from less input data, or is it just doing a trivial, pointless pushdown like in the case above). But that would be a lot more involved, so I chose this simpler solution for now.
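One plausible shape for the `is_extracted_expr_projection` check is sketched below, over a toy projection model rather than the actual DataFusion implementation: treat a projection as an extraction projection when it contains at least one `__datafusion_extracted_*` alias and everything else is a plain column pass-through.

```rust
// Toy projection-expression model (illustration only).
enum ProjExpr {
    // Plain column pass-through, e.g. `col`
    Column(String),
    // A computed expression under an alias,
    // e.g. `get_field(col, 'foo') AS __datafusion_extracted_1`
    Aliased { alias: String },
}

// A plausible shape for the check: the projection was created by
// `ExtractLeafExpressions` if it has at least one `__datafusion_extracted_*`
// alias and every other expression is a plain column. `PushDownFilter` can
// then leave the filter above it instead of undoing the extraction.
fn is_extracted_expr_projection(exprs: &[ProjExpr]) -> bool {
    let has_extracted = exprs.iter().any(|e| matches!(e, ProjExpr::Aliased { .. }));
    has_extracted
        && exprs.iter().all(|e| match e {
            ProjExpr::Column(_) => true,
            ProjExpr::Aliased { alias } => alias.starts_with("__datafusion_extracted_"),
        })
}

fn main() {
    // Projection: get_field(col, 'foo') as __datafusion_extracted_1, col
    let extraction = vec![
        ProjExpr::Aliased { alias: "__datafusion_extracted_1".to_string() },
        ProjExpr::Column("col".to_string()),
    ];
    assert!(is_extracted_expr_projection(&extraction));

    // An ordinary computing projection should not match.
    let ordinary = vec![ProjExpr::Aliased { alias: "total".to_string() }];
    assert!(!is_extracted_expr_projection(&ordinary));
}
```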
I'd argue as a general rule there's no point in pushing a filter under a projection that is purely column selections / get_field expressions especially if we can't then push it further.
Yes I agree with this statement.
I don't really have a better suggestion other than to perhaps make the exception more general "don't push filters under projections that doesn't do computation / etc"
I think the comment would be better / easier to understand the need for the special case if you included the great example from your comment:

```text
Projection: col('col')
Filter: __datafusion_extracted_1 > 1
Projection: get_field(col, 'foo') as __datafusion_extracted_1, col
TableScan: projection=[col]
```

Then this rule runs and will produce:

```text
Projection: col('col')
Projection: get_field(col, 'foo') as __datafusion_extracted_1, col
Filter: get_field(col, 'foo') > 1
TableScan projection=[col]
```
Won't projection pushdown push the projection later inside Filter again? What does a final plan look like?
run benchmark sql_planner |
alamb left a comment
Thank you @adriangb -- this is very exciting to see so close.
After this PR, what else is left to close out these issues?
Major points:
- I think we can avoid a lot of boilerplate code and make this code easier to maintain by using the map_expressions API: https://github.com/apache/datafusion/pull/20117/changes#r2760584800
- Can we avoid coupling the passes? (below)
One concern, which you have also touched on, is the coupling of ExtractLeafExpressions and OptimizeProjections, in the sense that those passes now have implicit dependencies on this new pass
Did you consider incorporating this logic directly into the OptimizeProjections? It seems like this transformation is really just a mechanism to enable OptimizeProjections 🤔
Cc @AdamGS as you said you are interested in this for Vortex as well
```rust
///
/// This is used by optimizers to make decisions about expression placement,
/// such as whether to push expressions down through projections.
pub fn placement(&self) -> ExpressionPlacement {
```
I really like the name ExpressionPlacement 👍
```diff
-03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[id, s], file_type=parquet, predicate=id@0 > 2, pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 2, required_guarantees=[]
+01)ProjectionExec: expr=[id@1 as id, __datafusion_extracted_1@0 as simple_struct.s[value]]
+02)--FilterExec: id@1 > 2
+03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[get_field(s@1, value) as __datafusion_extracted_1, id], file_type=parquet, predicate=id@0 > 2, pruning_predicate=id_null_count@1 != row_count@2 AND id_max@0 > 2, required_guarantees=[]
```
Here is the optimizer pass in action -- the get_field was pushed down -- the plan looks good to me
```rust
/// Extracts `MoveTowardsLeafNodes` sub-expressions from all nodes into projections.
///
/// This normalizes the plan so that all `MoveTowardsLeafNodes` computations (like field
/// accessors) live in Projection nodes, making them eligible for pushdown.
```
What does "live in projection nodes" mean here? Like that all MoveTowardsLeafNodes computations appear as top level Exprs in a ProjectionExec?
```text
initial_logical_plan
01)Projection: simple_explain_test.a, simple_explain_test.b, simple_explain_test.c
02)--TableScan: simple_explain_test
logical_plan after resolve_grouping_function SAME TEXT AS ABOVE
```
😓 that is a lot of rewrites (not related to this PR, I am just thinking about planning speed in general)
```rust
//! - `Limit` - passes all input columns through
//!
//! **Projection Nodes** (merge through):
//! - Replace column refs with underlying expressions from the child projection
```
Is there a reason to split up the comments between the module and the struct?
It might make sense to leave the module level comments relatively minimal and move #Algorithm and everything else down to the doc comment on ExtractLeafExpressions so the algorithm and examples are close together
```rust
/// The `OptimizeProjections` rule can then push this projection down to the scan.
///
/// **Important:** The `PushDownFilter` rule is aware of projections created by this rule
/// and will not push filters through them. See `is_extracted_expr_projection` in utils.rs.
```
would be nice to make this a link too so it is checked automatically by rustdoc rather than can get out of sync
```diff
-/// and will not push filters through them. See `is_extracted_expr_projection` in utils.rs.
+/// and will not push filters through them. See [`is_extracted_expr_projection`]
```
```rust
match &plan {
    // Schema-preserving nodes - extract and push down
    LogicalPlan::Filter(_) | LogicalPlan::Sort(_) | LogicalPlan::Limit(_) => {
        extract_from_schema_preserving(plan, alias_generator)
```
I don't understand why there needs to be specialized code for different LogicalPlan types -- this seems like it is exactly the use case LogicalPlan::map_expressions() is designed to handle.
Couldn't you use map_expressions to rewrite any get_field expressions, and then add the relevant projection below it?
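The suggested pattern can be sketched with stand-in types rather than DataFusion's `LogicalPlan`/`Expr`: a single generic pass applies a rewrite closure to every expression a node holds, so no per-node-type code is needed and new or extension node types are covered automatically.

```rust
// Stand-ins for DataFusion's `Expr` and a plan node (illustration only).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(String),
    GetField(Box<Expr>, String),
}

// Any node type -- Filter, Sort, Aggregate, or an Extension -- just exposes
// its expressions; the rewrite never needs to know which node it is visiting.
struct Node {
    exprs: Vec<Expr>,
}

impl Node {
    // Analogous to `LogicalPlan::map_expressions`: apply a rewrite closure
    // to every top-level expression this node holds.
    fn map_expressions(&mut self, mut f: impl FnMut(Expr) -> Expr) {
        let old = std::mem::take(&mut self.exprs);
        self.exprs = old.into_iter().map(|e| f(e)).collect();
    }
}

fn main() {
    // A filter predicate containing get_field, modeled as just its expressions.
    let mut filter = Node {
        exprs: vec![Expr::GetField(Box::new(Expr::Column("s".into())), "value".into())],
    };
    let mut extracted = Vec::new();
    // One generic rewrite works for every node type.
    filter.map_expressions(|e| match e {
        e @ Expr::GetField(..) => {
            let alias = format!("__datafusion_extracted_{}", extracted.len() + 1);
            extracted.push((alias.clone(), e));
            Expr::Column(alias)
        }
        other => other,
    });
    assert_eq!(filter.exprs, vec![Expr::Column("__datafusion_extracted_1".into())]);
    assert_eq!(extracted.len(), 1);
}
```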
```rust
let rebuilt_input = extractor.build_extraction_projection(&target, path)?;

// Create the node with new input
let new_inputs: Vec<LogicalPlan> = std::iter::once(rebuilt_input)
```
the code above seems to assume there is a single input -- so it seems strange to have code here for multiple inputs 🤔
BTW codex found a test that shows a single projection being extracted doesn't get pushed down. I can make this a separate PR if you like. Note how the get_field is not pushed into the datasource:

```text
###
# Test 2.1b: Projection-only get_field (potential optimization target)
###
query TT
EXPLAIN SELECT s['label'] FROM simple_struct;
----
logical_plan
01)Projection: get_field(simple_struct.s, Utf8("label"))
02)--TableScan: simple_struct projection=[s]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[get_field(s@1, label) as simple_struct.s[label]], file_type=parquet

# Verify correctness
query T
SELECT s['label'] FROM simple_struct ORDER BY s['label'];
----
alpha
beta
delta
epsilon
gamma
```
🤖: Benchmark completed Details
@alamb are there any of these that use structs? It seems like this has no impact on the benchmarks (good!) but maybe we should add some that hit the full rewrite?
Thanks for reporting. I'll make a new PR with this test + join tests + benchmarks.
I was more trying to quantify the effect on planning time of adding a new optimizer pass -- it seems like it is a small but noticeable slowdown (1-3%). I'll see if I can reproduce those results.
run benchmark sql_planner |
It makes sense that there's a small slowdown: the rule has to visit every node in the plan even if it doesn't modify it at all. That said, a lot of the numbers were within the run-to-run variation, i.e. not statistically different.
Yeah, there is a tension here for sure. I do think in general it would be a good idea to consolidate optimizer passes, given each basically deep copies the plan.
🤖: Benchmark completed Details
Done in #20143
Summary

This PR adds a new optimizer rule `ExtractLeafExpressions` that extracts `MoveTowardsLeafNodes` sub-expressions (like `get_field`) from Filter, Sort, Limit, Aggregate, and Projection nodes into intermediate projections.

This normalization allows `OptimizeProjections` (which runs next) to merge consecutive projections and push `get_field` expressions down to the scan, enabling Parquet column pruning for struct fields.

Example

Before: `get_field(s, 'label')` stayed in ProjectionExec, reading full structs.
After: Both `get_field` expressions pushed to DataSourceExec.

How It Works

The rule:
- Extracts `MoveTowardsLeafNodes` expressions into `__datafusion_extracted_N` aliases
- Skips expressions already aliased with `__datafusion_extracted_*` to ensure idempotency

This is partially modeled after:
- `CommonSubexprEliminate`, which also creates expressions with aliases and extracts them into "2 phase" projections
- `PushDownFilter`, which handles pushing expressions past joins, aggregates, etc.
- `OptimizeProjections`, which also manipulates projections

Interaction with other optimizer rules

This rule has some interaction with `PushDownFilter`. I had to teach `PushDownFilter` to not push filters past the pushed down projections, otherwise it would undo the work this optimizer rule did. There is no point in pushing filters past these expressions as they are so cheap to compute that it's better to evaluate them before filters.

Test plan

- Tests in `extract_leaf_expressions.rs`
- `projection_pushdown.slt`
- `cargo test -p datafusion-optimizer`

🤖 Generated with Claude Code
extract_leaf_expressions.rsprojection_pushdown.sltcargo test -p datafusion-optimizer)🤖 Generated with Claude Code