feat: Add dy.infer_schema #294
Andreas Albert (AndreasAlbertQC) merged 13 commits into Quantco:main
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff            @@
##              main      #294   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           54        56    +2
  Lines         3121      3211   +90
=========================================
+ Hits          3121      3211   +90
|
Pull request overview
This PR adds a new dy.infer_schema() function (addressing Issue #232) that generates dataframely schema code from a Polars DataFrame. The function inspects a DataFrame's column types and null counts to produce schema class definitions with appropriate column types and nullable annotations.
Changes:
- New dataframely/_generate_schema.py module implementing infer_schema() with three return modes (print to stdout, return as string, or return as an executable Schema class), plus supporting helper functions for code generation.
- Public API export of infer_schema in dataframely/__init__.py.
- New test file tests/test_infer_schema.py covering basic types, nullable detection, datetime types, nested types, invalid identifiers, and round-trip validation.
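To make the overview concrete, here is a minimal sketch of the code-generation step, not the PR's actual implementation. It assumes the DataFrame has already been summarised into hypothetical `ColumnSummary` records (column name, dtype name, null count), that each dtype name maps one-to-one onto a `dy` column class, and that a column with at least one null gets `nullable=True`. The names `ColumnSummary` and `render_schema` are illustrative only.

```python
from typing import NamedTuple


class ColumnSummary(NamedTuple):
    """Hypothetical per-column summary; not part of dataframely."""

    name: str        # attribute name to emit (assumed to be a valid identifier)
    dtype: str       # dtype name, assumed to match a dy column class (e.g. "Int64")
    null_count: int  # number of nulls observed in the column


def render_schema(columns: list[ColumnSummary], class_name: str = "Schema") -> str:
    """Render schema source code from column summaries (illustrative sketch)."""
    lines = [f"class {class_name}(dy.Schema):"]
    for column in columns:
        # Assumption: any observed null marks the column as nullable.
        args = "nullable=True" if column.null_count > 0 else ""
        lines.append(f"    {column.name} = dy.{column.dtype}({args})")
    if len(lines) == 1:
        # Keep the generated class syntactically valid when there are no columns.
        lines.append("    pass")
    return "\n".join(lines)


print(render_schema([ColumnSummary("id", "Int64", 0), ColumnSummary("name", "String", 2)]))
```

The real function works on a Polars DataFrame directly and also handles datetime and nested types, which this sketch leaves out.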
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| dataframely/_generate_schema.py | New module with infer_schema() function and helpers for inferring schema from DataFrame columns, handling type mapping, identifier sanitization, and code generation. |
| dataframely/__init__.py | Exports infer_schema in the public API (import and __all__). |
| tests/test_infer_schema.py | Tests for string output mode across all supported types and round-trip validation via schema return mode. |
|
Hello Oliver Borchert (@borchero), is this implementation close to something mergeable?
Andreas Albert (AndreasAlbertQC)
left a comment
Hey gab23r, thanks for the PR! I think the core functionality is pretty solid, but I'd like us to tune the API a little and align the structure of the code more closely with what we usually do in this repo. Could you also add an entry to the FAQ docs page?
|
I fell down the rabbit hole of edge cases with duplicated names...
Example:
import polars as pl
import dataframely as dy

df = pl.DataFrame({
    "用户姓名": ["张三", "李四"],
    "出生年月日": ["1990-01-15", "1985-06-20"],
    "工作单位名称": ["北京科技有限公司", "上海金融集团"],
    "联系电话号码": ["13800138000", "13900139000"],
})
print(dy.infer_schema(df))
# class Schema(dy.Schema):
#     _ = dy.String(alias="用户姓名")
#     _1 = dy.String(alias="出生年月日")
#     _2 = dy.String(alias="工作单位名称")
#     _3 = dy.String(alias="联系电话号码")

class Schema(dy.Schema):
    _ = dy.String(alias="用户姓名")
    _1 = dy.String(alias="出生年月日")
    _2 = dy.String(alias="工作单位名称")
    _3 = dy.String(alias="联系电话号码")

Schema.sample().columns  # ['用户姓名', '出生年月日', '工作单位名称', '联系电话号码']
|
thanks for thinking this through! I think we are almost there :)
how about
|
|
I have replaced
df = pl.DataFrame({"column_1": 0, "$": 0})
print(dy.infer_schema(df))
# class Schema(dy.Schema):
#     column_1 = dy.Int64()
#     column_1_1 = dy.Int64(alias="$")

df = pl.DataFrame({"col name": ["test"], "col_name": ["test"], "col_name_1": ["test"]})
result = dy.infer_schema(df)
# class Schema(dy.Schema):
#     col_name = dy.String(alias="col name")
#     col_name_1 = dy.String(alias="col_name")
#     col_name_1_1 = dy.String(alias="col_name_1")
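For readers following the naming discussion, the sketch below shows one way the attribute names in these two examples could be derived. It is a guess at the behaviour, not the code from this PR: the helper names `sanitize` and `unique_attribute_names`, the ASCII-only replacement rule, the `column_<index>` fallback, and the `_<n>` collision suffix are all assumptions chosen to reproduce the outputs shown above.

```python
import keyword
import re


def sanitize(name: str, index: int) -> str:
    """Turn a column name into a candidate Python identifier (hypothetical rules)."""
    # Assumption: every run of non-ASCII-identifier characters becomes one underscore.
    candidate = re.sub(r"[^0-9a-zA-Z_]+", "_", name)
    if not candidate.strip("_"):
        # Nothing usable is left (e.g. "$"): fall back to a positional name.
        candidate = f"column_{index}"
    if candidate[0].isdigit():
        candidate = f"_{candidate}"
    if keyword.iskeyword(candidate):
        candidate = f"{candidate}_"
    return candidate


def unique_attribute_names(columns: list[str]) -> dict[str, str]:
    """Map each column name to a unique attribute name, in column order."""
    used: set[str] = set()
    mapping: dict[str, str] = {}
    for index, column in enumerate(columns):
        base = sanitize(column, index)
        candidate, suffix = base, 1
        # Append _1, _2, ... until the name no longer collides with an earlier one.
        while candidate in used:
            candidate = f"{base}_{suffix}"
            suffix += 1
        used.add(candidate)
        mapping[column] = candidate
    return mapping


print(unique_attribute_names(["column_1", "$"]))
print(unique_attribute_names(["col name", "col_name", "col_name_1"]))
```

In this scheme an `alias=` argument would be emitted whenever the attribute name differs from the original column name, which is why even the valid name "col_name" ends up with an alias once an earlier column has claimed it.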
|
Makes sense, thanks! If you could fix pre-commit and coverage one last time, I think we are good to go now :)
Andreas Albert (AndreasAlbertQC)
left a comment
After some more internal discussion with Oliver Borchert (@borchero) and Daniel Elsner (@delsner), we concluded that while we want to merge this, we are also not 100% ready to commit to the interface and functionality being stable. Therefore, we decided to move this function to a new dataframely.experimental namespace, which will allow us to be more flexible with changes in the future, while still ensuring we include it in the dataframely package.
gab23r, I took the liberty of implementing that on this branch :)
Oliver Borchert (@borchero) Daniel Elsner (@delsner) PTAL, I'd now like to merge this quickly because it's been waiting for a while.
|
Thanks for your work gab23r! :) Sorry this took a bit to get done
Merged commit 6e06c73 into Quantco:main
Fixes: #232
This adds:
- dy.infer_schema() function to generate dataframely schema code from a Polars DataFrame
- return_type parameter:
  - None (default): prints schema to stdout for quick exploration
  - "string": returns schema code as a string
  - "schema": returns an actual Schema class for direct use
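The three return modes can be illustrated with a small dispatch sketch. This is an assumption about the general shape, not the merged code: `deliver` and its `namespace` parameter are hypothetical, and a stand-in `Base` class replaces `dy.Schema` so the example runs without dataframely installed.

```python
from typing import Any, Optional


def deliver(code: str, return_type: Optional[str] = None,
            namespace: Optional[dict] = None) -> Any:
    """Hand generated schema code back in one of three ways (illustrative only)."""
    if return_type is None:
        # Default mode: print for quick exploration and return nothing.
        print(code)
        return None
    if return_type == "string":
        return code
    if return_type == "schema":
        # Execute the generated source and pick out the class it defines.
        # Assumption: the generated class is always named "Schema".
        scope = dict(namespace or {})
        exec(code, scope)
        return scope["Schema"]
    raise ValueError(f"Unsupported return_type: {return_type!r}")


class Base:
    """Stand-in for dy.Schema so the sketch is self-contained."""


generated = "class Schema(Base):\n    answer = 42"
schema_cls = deliver(generated, "schema", {"Base": Base})
print(schema_cls.answer)
```

Executing generated source is only reasonable here because the code is produced by the function itself rather than taken from untrusted input.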
Not supported (potential future enhancements)