feat: add cast_to_type UDF for type-based casting #21322
adriangb wants to merge 18 commits into apache:main from
Add a `cast_to_type(expression, reference)` function that casts the first argument to the data type of the second argument, similar to DuckDB's cast_to_type. The second argument's type (not value) determines the target cast type, which is useful in macros and generic SQL where types need to be preserved dynamically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `try_cast_to_type(expression, reference)` which works like `cast_to_type` but returns NULL on cast failure instead of erroring, similar to the relationship between arrow_cast and arrow_try_cast. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
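The relationship between the two functions can be illustrated with a rough Rust analogue. This is only a sketch and not the DataFusion implementation: the generic helpers below are hypothetical, and string parsing via `FromStr` stands in for a real cast. The point it shows is that the target type comes from the type of `reference` while its value is never read, and that the `try_` variant turns a failed cast into an absent value instead of an error.

```rust
use std::str::FromStr;

/// Analogue of `cast_to_type`: the target type `T` is taken from the type of
/// `_reference`; its value is ignored. A failed cast is reported as an error.
fn cast_to_type<T: FromStr>(expression: &str, _reference: &T) -> Result<T, T::Err> {
    expression.parse::<T>()
}

/// Analogue of `try_cast_to_type`: same target-type rule, but a failed cast
/// yields `None` (standing in for SQL NULL) instead of an error.
fn try_cast_to_type<T: FromStr>(expression: &str, reference: &T) -> Option<T> {
    cast_to_type(expression, reference).ok()
}
```

Calling `cast_to_type("42", &0i32)` parses to an `i32` because the reference is an `i32`; passing `&0.0f64` instead would select `f64`, whatever the reference's value is.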
I like this. I just noticed we might need a bit more tests on
Generally we prefer SLT tests. They avoid more code to compile, are easier to grok and are closer to real world usage.
@martin-g would you mind reviewing this change?
}

fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
    let nullable = args.arg_fields.iter().any(|f| f.is_nullable());
This looks incorrect to me. reference is only used for its type, but its nullability is being propagated to the result schema. That means expressions like cast_to_type(42, NULL::INTEGER) become nullable in the logical plan schema.
I think cast_to_type should only inherit nullability from the first argument?
arrow_cast and arrow_try_cast follow the same pattern, so I thought this was consistent. The optimizer likely refines nullability downstream.
For DuckDB it is also nullable:
CREATE TABLE test AS SELECT cast_to_type('42', NULL::INTEGER) AS val;
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'test';
┌─────────────┬───────────┬─────────────┐
│ column_name │ data_type │ is_nullable │
│ varchar │ varchar │ varchar │
├─────────────┼───────────┼─────────────┤
│ val │ INTEGER │ YES │
└─────────────┴───────────┴─────────────┘
But apparently even casting a non-nullable column produces a nullable output, which doesn't make sense, since the cast would error if it failed:
CREATE TABLE data (x INT NOT NULL);
INSERT INTO data VALUES (1);
CREATE TABLE test AS SELECT cast(x AS INT) AS val FROM data;
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'test';
┌─────────────┬───────────┬─────────────┐
│ column_name │ data_type │ is_nullable │
│ varchar │ varchar │ varchar │
├─────────────┼───────────┼─────────────┤
│ val │ INTEGER │ YES │
└─────────────┴───────────┴─────────────┘
So I feel this is more of a limitation of DuckDB than anything.
DataFusion does preserve / compute nullability through similar expressions:
> create table test as select cast('42' AS INT) AS val;
0 row(s) fetched.
Elapsed 0.002 seconds.
> DESCRIBE test;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| val | Int32 | NO |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.000 seconds.
> drop table test;
0 row(s) fetched.
Elapsed 0.000 seconds.
> create table test as select arrow_cast('42', 'UInt16') AS val;
0 row(s) fetched.
Elapsed 0.002 seconds.
> DESCRIBE test;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| val | UInt16 | NO |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.000 seconds.
I changed this in 7891a1f.
Unlike cast('42' AS NULL) or arrow_cast('42', 'Null'), which both fail, cast_to_type('42', null) will succeed. My reasoning is that these expressions will be used programmatically, so callers are more likely to hit an edge case like this and want to proceed instead of failing. I'm not sure why arrow_cast('42', 'Null') fails.
The second argument (reference) is used solely for its data type, so its nullability should not propagate to the result. Previously `cast_to_type(42, NULL::INTEGER)` was incorrectly marked nullable in the schema even though the input literal is non-null. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
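The difference between the two nullability rules discussed in this thread can be sketched in plain Rust. This is a minimal illustration under stated assumptions: the `Field` struct and both helper functions are hypothetical stand-ins rather than the Arrow or DataFusion types, and data types are represented as plain strings.

```rust
/// Hypothetical stand-in for an Arrow field: only what this sketch needs.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    data_type: &'static str,
    nullable: bool,
}

/// Earlier rule: the result is nullable if ANY argument is nullable,
/// so a nullable `reference` leaks into the result schema.
fn return_field_any_nullable(args: &[Field]) -> Field {
    Field {
        data_type: args[1].data_type,
        nullable: args.iter().any(|f| f.nullable),
    }
}

/// Rule described in the commit above: the data type comes from the second
/// argument (`reference`), nullability only from the first (`expression`).
fn return_field_first_arg_only(args: &[Field]) -> Field {
    Field {
        data_type: args[1].data_type,
        nullable: args[0].nullable,
    }
}
```

For a non-null literal cast against a nullable `Int32` reference, as in `cast_to_type(42, NULL::INTEGER)`, the first rule marks the result nullable and the second does not.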
2024-01-15

# Error on invalid cast
statement error
Why is there no regex for the error message here?
Let's also add a test case for invalid target type, e.g. NULL::INVALID
@adriangb Only this is left. I guess you forgot to add it.
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
@martin-g any thoughts on how this should behave w.r.t. field metadata? Should we set the metadata to the metadata of the first argument, should we merge them or should we preserve only the metadata of the second argument? Currently these functions are implemented by simplifying into
* TryCast is buggy: #21390
Which issue does this PR close?
N/A (new feature)
Rationale for this change
DuckDB provides a `cast_to_type(expression, reference)` function that casts the first argument to the data type of the second argument. This is useful in macros and generic SQL where types need to be preserved or matched dynamically. This PR adds the equivalent function to DataFusion, along with a fallible `try_cast_to_type` variant.
What changes are included in this PR?
* `cast_to_type` scalar UDF in `datafusion/functions/src/core/cast_to_type.rs`
  * Uses `return_field_from_args` to infer the return type from the second argument's data type
  * `simplify()` rewrites to `Expr::Cast` (or no-op if types match), so there is zero runtime overhead
* `try_cast_to_type` scalar UDF in `datafusion/functions/src/core/try_cast_to_type.rs`
  * Works like `cast_to_type` but returns NULL on cast failure instead of erroring
  * `simplify()` rewrites to `Expr::TryCast`
* `datafusion/functions/src/core/mod.rs`
Are these changes tested?
Yes. New sqllogictest file `cast_to_type.slt` covering both functions:
* Error on invalid cast (`cast_to_type`) vs NULL on invalid cast (`try_cast_to_type`)
Are there any user-facing changes?
Two new SQL functions:
* `cast_to_type(expression, reference)`: casts expression to the type of reference
* `try_cast_to_type(expression, reference)`: same, but returns NULL on failure
🤖 Generated with Claude Code
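The `simplify()` rewrite mentioned in the description can be sketched roughly as follows. This is an illustration only: the `DataType` and `Expr` enums are small hypothetical stand-ins for the DataFusion types, and the helper names and signatures are assumptions. The sketch follows the description literally, so the no-op shortcut applies only to `cast_to_type`; whether `try_cast_to_type` also skips the cast for matching types is not stated there.

```rust
/// Hypothetical, reduced set of data types for the sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
}

/// Hypothetical, reduced expression tree.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Cast { expr: Box<Expr>, to: DataType },
    TryCast { expr: Box<Expr>, to: DataType },
}

/// `cast_to_type(expr, reference)`: no-op when the types already match,
/// otherwise a plain cast to the reference's type.
fn simplify_cast_to_type(expr: Expr, source: DataType, target: DataType) -> Expr {
    if source == target {
        expr
    } else {
        Expr::Cast { expr: Box::new(expr), to: target }
    }
}

/// `try_cast_to_type(expr, reference)`: rewritten to a fallible cast.
fn simplify_try_cast_to_type(expr: Expr, target: DataType) -> Expr {
    Expr::TryCast { expr: Box::new(expr), to: target }
}
```

Because the function call is replaced by an ordinary cast node at planning time, nothing specific to the UDF remains to run per row, which is the "zero runtime overhead" claim above.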