-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Improve performance of CASE WHEN x THEN y ELSE NULL expressions
#20097
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
run benchmark case_when |
|
🤖 |
|
run benchmark tpcds |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
| let return_type = self.data_type(&batch.schema())?; | ||
| let else_expr = | ||
| try_cast(Arc::clone(else_expr), &batch.schema(), return_type.clone()) | ||
| .unwrap_or_else(|_| Arc::clone(else_expr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why there is a fallback here ?
In which case the types may not match and the fallback is needed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good question. I think this is actually one of the few lines of code I did not modify. Let me see if I can figure out the answer from git history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I traced this back to https://github.com/apache/datafusion/pull/1601/changes#diff-e27e1014e17614bf2ddd8b375475c5c31a279e58e97d2950a9a5427428645855R372 in PR #1601
. That change is from 2022, but perhaps the author @xudong963 remembers the context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember the context either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pepijnve Can't recall the reason now, imo, if not any test is broken, it's good to go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we want to do that as part of this PR or a separate one? Changing this line would be completely unrelated to this PR's current description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new PR, please
| EvalMethod::ScalarOrScalar | ||
| } else if body.when_then_expr.len() == 1 && body.else_expr.is_some() { | ||
| } else if body.when_then_expr.len() == 1 { | ||
| EvalMethod::ExpressionOrExpression(body.project()?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/pepijnve/datafusion/blob/0d640690aa665da749d30148e6a9c6440d2020db/datafusion/physical-expr/src/expressions/case.rs#L76 says ... and both the thenandelse are expressions. But this change makes the else part optional.
Should ExpressionOrExpression's docstring be updated ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I should update the doc strings. I didn't check them to see if they got stale or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc string updated
727ac25 to
9124d37
Compare
|
The TPCDS improvements are plausible as they have CASE in them
These ones don't have case 🤷 |
|
run benchmark tpcds |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
run benchmark case_when |
|
BTW I think we could add some more test coverage -- here are some cases for case.slt that are worth considering (written by codex) # single WHEN, no ELSE (absent)
query
SELECT CASE WHEN a > 0 THEN b END
FROM (VALUES (1, 10), (0, 20)) AS t(a, b);
----
10
NULL
# single WHEN, explicit ELSE NULL
query
SELECT CASE WHEN a > 0 THEN b ELSE NULL END
FROM (VALUES (1, 10), (0, 20)) AS t(a, b);
----
10
NULL
# fallible THEN expression should only be evaluated on true rows
query
SELECT CASE WHEN a > 0 THEN 10 / a END
FROM (VALUES (1), (0)) AS t(a);
----
10
NULL
# all-false path returns typed NULLs
query
SELECT CASE WHEN a < 0 THEN b END
FROM (VALUES (1, 10), (2, 20)) AS t(a, b);
----
NULL
NULL |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Extracting the relevant bits that illustrate the optimisation clearly |
|
Thanks again everyone! |
Which issue does this PR close?
Rationale for this change
While reviewing #19994 it became clear the optimised
ExpressionOrExpressioncode path was not being used when the case expression has noelsebranch or haselse null. In those situations the general evaluation strategies could end up being used.This PR refines the
ExpressionOrExpressionimplementation to also handleelse nullexpressions.What changes are included in this PR?
Use
ExpressionOrExpressionfor expressions of the formCASE WHEN x THEN y [ELSE NULL]Are these changes tested?
Covered by existing SLTs
Are there any user-facing changes?
No