feat: add cast_to_type UDF for type-based casting #21322

Open
adriangb wants to merge 18 commits into apache:main from pydantic:cast-to

Conversation

@adriangb
Contributor

@adriangb adriangb commented Apr 2, 2026

Which issue does this PR close?

N/A — new feature

Rationale for this change

DuckDB provides a cast_to_type(expression, reference) function that casts the first argument to the data type of the second argument. This is useful in macros and generic SQL where types need to be preserved or matched dynamically. This PR adds the equivalent function to DataFusion, along with a fallible try_cast_to_type variant.

What changes are included in this PR?

  • New cast_to_type scalar UDF in datafusion/functions/src/core/cast_to_type.rs
    • Takes two arguments: the expression to cast, and a reference expression whose type (not value) determines the target cast type
    • Uses return_field_from_args to infer return type from the second argument's data type
    • simplify() rewrites to Expr::Cast (or no-op if types match), so there is zero runtime overhead
  • New try_cast_to_type scalar UDF in datafusion/functions/src/core/try_cast_to_type.rs
    • Same as cast_to_type but returns NULL on cast failure instead of erroring
    • simplify() rewrites to Expr::TryCast
    • Output is always nullable
  • Registration of both functions in datafusion/functions/src/core/mod.rs
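
The rewrite described above can be sketched with toy types. This is a minimal illustration, not the PR's code: `DataType`, `Expr`, and both `simplify_*` functions below are hypothetical stand-ins rather than DataFusion's real API. The description only states a same-type no-op for `cast_to_type`, so the sketch assumes the `try_` variant always wraps in `TryCast`.

```rust
// Toy stand-ins for illustration only; these are NOT DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Float64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Cast { expr: Box<Expr>, to: DataType },
    TryCast { expr: Box<Expr>, to: DataType },
}

/// Models the described rewrite for `cast_to_type`: a no-op when the types
/// already match, otherwise a plain cast to the reference's type.
/// Only the reference's *type* is consulted; its value never appears.
fn simplify_cast_to_type(expr: Expr, expr_type: &DataType, reference_type: &DataType) -> Expr {
    if expr_type == reference_type {
        expr
    } else {
        Expr::Cast {
            expr: Box::new(expr),
            to: reference_type.clone(),
        }
    }
}

/// Models the described rewrite for `try_cast_to_type`.
/// Assumption: always wraps in `TryCast`, since the PR description does not
/// say whether this variant short-circuits when the types match.
fn simplify_try_cast_to_type(expr: Expr, reference_type: &DataType) -> Expr {
    Expr::TryCast {
        expr: Box::new(expr),
        to: reference_type.clone(),
    }
}
```

Because the call is replaced during planning, nothing of the original function call remains to execute, which is the sense in which the description claims zero runtime overhead.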

Are these changes tested?

Yes. New sqllogictest file cast_to_type.slt covering both functions:

  • Basic casts (string→int, string→double, int→string, int→double)
  • NULL handling
  • Same-type no-op
  • CASE expression as first argument
  • Arithmetic expression as first argument
  • Nested calls
  • Subquery as second argument
  • Column references as second argument
  • Boolean and date casts
  • Error on invalid cast (cast_to_type) vs NULL on invalid cast (try_cast_to_type)
  • Cross-column type matching

Are there any user-facing changes?

Two new SQL functions:

  • cast_to_type(expression, reference) — casts expression to the type of reference
  • try_cast_to_type(expression, reference) — same, but returns NULL on failure
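
The difference between the two functions on a failed cast can be modeled in plain Rust. This is a rough analogy under stated assumptions, not DataFusion code: `cast_str_to_i32` and `try_cast_str_to_i32` are hypothetical helpers, with `Err` standing in for a query error and `None` standing in for SQL NULL.

```rust
use std::num::ParseIntError;

/// Strict variant: a failed cast surfaces as an error, analogous to
/// `cast_to_type` failing the query.
fn cast_str_to_i32(value: &str) -> Result<i32, ParseIntError> {
    value.parse::<i32>()
}

/// Fallible variant: a failed cast becomes `None`, standing in for the NULL
/// that `try_cast_to_type` is described as returning.
fn try_cast_str_to_i32(value: &str) -> Option<i32> {
    cast_str_to_i32(value).ok()
}
```

Since the fallible variant can produce a NULL for any input, it is consistent that its output is described as always nullable.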

🤖 Generated with Claude Code

Add a `cast_to_type(expression, reference)` function that casts
the first argument to the data type of the second argument, similar
to DuckDB's cast_to_type. The second argument's type (not value)
determines the target cast type, which is useful in macros and
generic SQL where types need to be preserved dynamically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Apr 2, 2026
adriangb and others added 2 commits April 2, 2026 10:19
Add `try_cast_to_type(expression, reference)` which works like
`cast_to_type` but returns NULL on cast failure instead of erroring,
similar to the relationship between arrow_cast and arrow_try_cast.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Apr 2, 2026
@AndreaBozzo
Contributor

AndreaBozzo commented Apr 4, 2026

I like this. I just noticed we might need a few more tests on ScalarUDFImpl in the future, but there are SLTs covering that, so it could be fine as is.

@adriangb
Contributor Author

adriangb commented Apr 4, 2026

> I like this, i just noticed we might need a bit more tests on ScalarUDFImpl in the future, but there are slts on that so could be just fine as is.

Generally we prefer SLT tests. They avoid more code to compile, are easier to grok and are closer to real world usage.

@adriangb
Contributor Author

adriangb commented Apr 4, 2026

@martin-g would you mind reviewing this change?

}

fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
let nullable = args.arg_fields.iter().any(|f| f.is_nullable());
Member


This looks incorrect to me. reference is only used for its type, but its nullability is being propagated to the result schema. That means expressions like cast_to_type(42, NULL::INTEGER) become nullable in the logical plan schema.

I think cast_to_type should only inherit nullability from the first argument?

Contributor


arrow_cast and arrow_try_cast follow the same pattern, so I thought this was consistent; the optimizer likely refines nullability downstream.

Contributor Author


For DuckDB it is also nullable:

CREATE TABLE test AS SELECT cast_to_type('42', NULL::INTEGER) AS val;

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'test';
┌─────────────┬───────────┬─────────────┐
│ column_name │ data_type │ is_nullable │
│   varchar   │  varchar  │   varchar   │
├─────────────┼───────────┼─────────────┤
│ val         │ INTEGER   │ YES         │
└─────────────┴───────────┴─────────────┘

But apparently even casting a non-nullable column produces a nullable output in DuckDB. That doesn't make sense, since the cast would error rather than return NULL if it couldn't be performed:

CREATE TABLE data (x INT NOT NULL);
INSERT INTO data VALUES (1);

CREATE TABLE test AS SELECT cast(x AS INT) AS val FROM data;

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'test';
┌─────────────┬───────────┬─────────────┐
│ column_name │ data_type │ is_nullable │
│   varchar   │  varchar  │   varchar   │
├─────────────┼───────────┼─────────────┤
│ val         │ INTEGER   │ YES         │
└─────────────┴───────────┴─────────────┘

So I feel this is more of a limitation of DuckDB than anything.

DataFusion does preserve / compute nullability through similar expressions:

> create table test as select cast('42' AS INT) AS val;
0 row(s) fetched.
Elapsed 0.002 seconds.

> DESCRIBE test;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| val         | Int32     | NO          |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.000 seconds.

> drop table test;
0 row(s) fetched.
Elapsed 0.000 seconds.

> create table test as select arrow_cast('42', 'UInt16') AS val;
0 row(s) fetched.
Elapsed 0.002 seconds.

> DESCRIBE test;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| val         | UInt16    | NO          |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.000 seconds.

I changed this in 7891a1f.

Unlike cast('42' AS NULL) or arrow_cast('42', 'Null'), which both fail, cast_to_type('42', null) will succeed. My reasoning is that these expressions will be used programmatically, so callers are more likely to hit an edge case like this and want to proceed instead of failing. I'm not sure why arrow_cast('42', 'Null') fails.

The second argument (reference) is used solely for its data type, so its
nullability should not propagate to the result.  Previously
`cast_to_type(42, NULL::INTEGER)` was incorrectly marked nullable in the
schema even though the input literal is non-null.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
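
The nullability rules discussed in this thread can be sketched as follows. The actual change in 7891a1f is not shown here, so this is only a model under stated assumptions: `Field` and the three functions are hypothetical names, and the "source only" rule is inferred from the commit message above.

```rust
/// Toy field descriptor; only the nullability flag matters here.
/// Not DataFusion's real `Field` type.
#[derive(Debug, Clone, Copy)]
struct Field {
    nullable: bool,
}

/// The behavior flagged in review: any nullable argument, including the
/// type-only reference, makes the result nullable.
fn nullable_from_any_arg(args: &[Field]) -> bool {
    args.iter().any(|f| f.nullable)
}

/// Assumed shape of the fix for `cast_to_type`: only the first argument
/// (the expression being cast) contributes to the result's nullability.
fn nullable_from_source_only(args: &[Field]) -> bool {
    args.first().map_or(false, |f| f.nullable)
}

/// `try_cast_to_type` is described as always nullable, because a failed
/// cast yields NULL regardless of the inputs.
fn nullable_for_try_cast(_args: &[Field]) -> bool {
    true
}
```

For `cast_to_type(42, NULL::INTEGER)`, the first argument is a non-null literal and the second is nullable, so the first rule reports a nullable result while the second does not.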
2024-01-15

# Error on invalid cast
statement error
Member


Why is there no regex for the error message here?

Member


Let's also add a test case for invalid target type, e.g. NULL::INVALID

Member


@adriangb Only this one is left. I guess you forgot to add it.

Contributor Author


Yep sorry!

Contributor Author


Tackled both of these in 08b7edb.

adriangb and others added 12 commits April 5, 2026 17:06
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
@adriangb
Contributor Author

adriangb commented Apr 5, 2026

@martin-g any thoughts on how this should behave w.r.t. field metadata? Should we set the metadata to the metadata of the first argument, should we merge them or should we preserve only the metadata of the second argument?

Currently these functions are implemented by simplifying into Cast and TryCast, so we'd have to use the behavior of those expressions, which is to preserve the source metadata*. However, I'd argue that for these new functions, since the target is an arbitrary expression with its own metadata, it would make the most sense to merge them. That would be difficult to implement, though (we couldn't just simplify into Cast anymore), so I propose that for now we go with preserving the source's metadata; if there is a request for different behavior, that can be addressed in the future.

* TryCast is buggy: #21390
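
The three metadata options raised above can be compared in a small sketch. This is illustrative only: `Metadata` and the three functions are hypothetical, field metadata is modeled as a plain string map, and the collision rule in `merge` (source wins) is an arbitrary assumption, since the comment does not say how conflicting keys would be resolved.

```rust
use std::collections::BTreeMap;

/// Field metadata modeled as a sorted string map (an assumption made for
/// this sketch, not DataFusion's real representation).
type Metadata = BTreeMap<String, String>;

/// Option proposed for now: keep only the first argument's metadata,
/// matching what simplifying into Cast is said to do.
fn preserve_source(source: &Metadata, _reference: &Metadata) -> Metadata {
    source.clone()
}

/// Alternative: keep only the second argument's metadata.
fn preserve_reference(_source: &Metadata, reference: &Metadata) -> Metadata {
    reference.clone()
}

/// Alternative: merge both maps.
/// Assumption: on a key collision the source value wins; the opposite
/// choice would be equally defensible.
fn merge(source: &Metadata, reference: &Metadata) -> Metadata {
    let mut merged = reference.clone();
    // `extend` overwrites existing keys, so source entries take precedence.
    merged.extend(source.iter().map(|(k, v)| (k.clone(), v.clone())));
    merged
}
```

The need to pick a collision rule at all is one reason merging is a larger design decision than preserving the source's metadata.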

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>