feat: Add `dy.infer_schema` by gab23r · Pull Request #294 · Quantco/dataframely

gab23r · 2026-03-05T09:52:34Z

Fixes: #232

- Add a `dy.infer_schema()` function to generate dataframely schema code from a Polars DataFrame
- Supports three output modes via the `return_type` parameter:
  - `None` (default): prints the schema to stdout for quick exploration
  - `"string"`: returns the schema code as a string
  - `"schema"`: returns an actual `Schema` class for direct use
- Handles all Polars types, including nested types (List, Array, Struct), with proper inner nullability detection
- Automatically handles invalid Python identifiers and keywords using aliases
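
For nested columns, nullability has to be assessed at two levels: whether the list itself can be null, and whether its elements can be null. A minimal sketch of that idea on plain Python lists follows. `list_nullability` is a hypothetical helper for illustration only; it is not the function used in this PR, which works on Polars columns.

```python
from typing import Optional


def list_nullability(rows: list[Optional[list]]) -> tuple[bool, bool]:
    """Return (outer_nullable, inner_nullable) for a column of lists."""
    # Outer level: is any row itself missing?
    outer_nullable = any(row is None for row in rows)
    # Inner level: does any non-missing row contain a missing element?
    inner_nullable = any(
        item is None for row in rows if row is not None for item in row
    )
    return outer_nullable, inner_nullable


print(list_nullability([[1, 2], None, [3]]))  # null row, no null elements
print(list_nullability([[1, None], [2]]))     # no null rows, one null element
```

The two flags are independent, which is why a nested type needs a `nullable` decision for both the outer column and its inner type.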

Example:

>>> import polars as pl
>>> import dataframely as dy
>>> df = pl.DataFrame({
...     "name": ["Alice", "Bob"],
...     "age": [25, 30],
...     "score": [95.5, None],
... })
>>> dy.infer_schema(df, "PersonSchema")
class PersonSchema(dy.Schema):
    name = dy.String()
    age = dy.Int64()
    score = dy.Float64(nullable=True)
>>> schema = dy.infer_schema(df, "PersonSchema", return_type="schema")
>>> schema.is_valid(df)
True
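
The printed output above is plain Python source text. A minimal sketch of how such text could be assembled from column names, type names, and null flags is shown below. `render_schema` and its input format are assumptions made for illustration, not the PR's actual helpers.

```python
def render_schema(class_name: str, columns: dict[str, tuple[str, bool]]) -> str:
    """Render schema source code from {column: (dy type name, has_nulls)}."""
    lines = [f"class {class_name}(dy.Schema):"]
    for name, (type_name, has_nulls) in columns.items():
        # Only emit `nullable=True` when nulls were observed in the column.
        args = "nullable=True" if has_nulls else ""
        lines.append(f"    {name} = dy.{type_name}({args})")
    return "\n".join(lines)


code = render_schema(
    "PersonSchema",
    {"name": ("String", False), "age": ("Int64", False), "score": ("Float64", True)},
)
print(code)
```

Printing this text or returning it unchanged would correspond to the first two output modes described above.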

Not supported (potential future enhancements)

- Assess min/max length of string values to suggest min_length/max_length constraints
- Suggest Enum if there are fewer than 10-20 distinct string values in a column
- Suggest Categorical if there are 50-100 distinct string values in a dataframe with >100k rows
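
These heuristics are not part of this PR. As a rough sketch of what they might look like on a plain list of string values: the function name and the exact thresholds (the upper ends of the ranges above) are assumptions, and a real implementation would operate on Polars columns instead.

```python
from typing import Optional


def suggest_string_column(
    values: list[Optional[str]],
    enum_max_distinct: int = 20,          # assumed cutoff from "10-20"
    categorical_max_distinct: int = 100,  # assumed cutoff from "50-100"
    categorical_min_rows: int = 100_000,
) -> dict:
    """Suggest a column kind and length bounds for a string column."""
    present = [v for v in values if v is not None]
    n_distinct = len(set(present))
    if n_distinct == 0:
        kind = "String"  # nothing to base a suggestion on
    elif n_distinct <= enum_max_distinct:
        kind = "Enum"
    elif n_distinct <= categorical_max_distinct and len(values) > categorical_min_rows:
        kind = "Categorical"
    else:
        kind = "String"
    lengths = [len(v) for v in present]
    return {
        "kind": kind,
        "min_length": min(lengths, default=None),
        "max_length": max(lengths, default=None),
    }


print(suggest_string_column(["a", "bb", None, "a"]))
```

Observed length bounds only describe the sample at hand, so any suggested `min_length`/`max_length` would need review before being treated as a constraint.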

codecov · 2026-03-05T09:54:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (b3edd6a) to head (c712374).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #294   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           54        56    +2     
  Lines         3121      3211   +90     
=========================================
+ Hits          3121      3211   +90

Copilot

Pull request overview

This PR adds a new dy.infer_schema() function (addressing Issue #232) that generates dataframely schema code from a Polars DataFrame. The function inspects a DataFrame's column types and null counts to produce schema class definitions with appropriate column types and nullable annotations.

Changes:

- New dataframely/_generate_schema.py module implementing infer_schema() with three return modes (print to stdout, return as string, or return as an executable Schema class), plus supporting helper functions for code generation.
- Public API export of infer_schema in dataframely/__init__.py.
- New test file tests/test_infer_schema.py covering basic types, nullable detection, datetime types, nested types, invalid identifiers, and round-trip validation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `dataframely/_generate_schema.py` | New module with `infer_schema()` function and helpers for inferring schema from DataFrame columns, handling type mapping, identifier sanitization, and code generation. |
| `dataframely/__init__.py` | Exports `infer_schema` in the public API (`import` and `__all__`). |
| `tests/test_infer_schema.py` | Tests for string output mode across all supported types and round-trip validation via schema return mode. |

gab23r · 2026-03-10T10:54:28Z

Hello Oliver Borchert (@borchero), is this implementation close to something mergeable?

Andreas Albert (AndreasAlbertQC)

Hey gab23r, thanks for the PR! I think the core functionality is pretty solid, but I'd like us to tune the API a little and align the structure of the code more closely with what we usually do in this repo. Could you also add an entry to the FAQ docs page?

gab23r · 2026-03-17T14:21:35Z

I fell down the rabbit hole of edge cases with duplicated names...
I ended up with this logic:

- Replace special characters with "_".
- If the name didn't have any valid characters, name the column "_".
- If this name is not already taken, use it; otherwise, add an integer suffix so that the column name is not duplicated.

Example:

import polars as pl
import dataframely as dy

df = pl.DataFrame({
    "用户姓名": ["张三", "李四"],
    "出生年月日": ["1990-01-15", "1985-06-20"],
    "工作单位名称": ["北京科技有限公司", "上海金融集团"],
    "联系电话号码": ["13800138000", "13900139000"],
})
print(dy.infer_schema(df))
# class Schema(dy.Schema):
#     _ = dy.String(alias="用户姓名")
#     _1 = dy.String(alias="出生年月日")
#     _2 = dy.String(alias="工作单位名称")
#     _3 = dy.String(alias="联系电话号码")

class Schema(dy.Schema):
    _ = dy.String(alias="用户姓名")
    _1 = dy.String(alias="出生年月日")
    _2 = dy.String(alias="工作单位名称")
    _3 = dy.String(alias="联系电话号码")

Schema.sample().columns # ['用户姓名', '出生年月日', '工作单位名称', '联系电话号码']

Andreas Albert (AndreasAlbertQC) · 2026-03-18T07:59:29Z

thanks for thinking this through! I think we are almost there :)

if the name didn't have any valid characters, name the column "_"

how about column_0, column_1 etc? That would have two advantages:

- It would avoid having column names start with underscores, which can look like you want to define a private member.
- It would make the naming structure consistent between the first and each subsequent column (as opposed to the current _ for the first, and then _1 / _2,... for the subsequent ones).

gab23r · 2026-03-18T10:02:50Z

I have replaced _ with column_{column_index}, but I need to keep the behavior that ensures duplicated names never appear. Look at these edge cases:

df = pl.DataFrame({"column_1": 0, "$": 0})
print(dy.infer_schema(df))
# class Schema(dy.Schema):
#     column_1 = dy.Int64()
#     column_1_1 = dy.Int64(alias="$")

df = pl.DataFrame({"col name": ["test"], "col_name": ["test"], "col_name_1": ["test"]})
result = dy.infer_schema(df)
# class Schema(dy.Schema):
#     col_name = dy.String(alias="col name")
#     col_name_1 = dy.String(alias="col_name")
#     col_name_1_1 = dy.String(alias="col_name_1")
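
The naming rules discussed in this thread can be sketched in plain Python as follows. This is a reconstruction that reproduces the two edge cases above, not the code merged in the PR: `make_attribute_names` is a hypothetical name, the ASCII-only notion of "valid characters" is inferred from the examples, and the handling of leading digits and keywords is an assumption.

```python
import keyword
import re


def make_attribute_names(columns: list[str]) -> dict[str, str]:
    """Map each column name to a unique, valid Python attribute name."""
    used: set[str] = set()
    result: dict[str, str] = {}
    for index, column in enumerate(columns):
        # Replace every character outside [A-Za-z0-9_] with "_".
        base = re.sub(r"[^0-9A-Za-z_]", "_", column)
        # No letters or digits left: fall back to a positional name.
        if not re.search(r"[0-9A-Za-z]", base):
            base = f"column_{index}"
        # Assumption: guard against leading digits and Python keywords.
        if base[0].isdigit() or keyword.iskeyword(base):
            base = f"_{base}"
        # Append an integer suffix until the name is unused.
        name, suffix = base, 1
        while name in used:
            name = f"{base}_{suffix}"
            suffix += 1
        used.add(name)
        result[column] = name
    return result


print(make_attribute_names(["column_1", "$"]))
print(make_attribute_names(["col name", "col_name", "col_name_1"]))
```

Any column whose attribute name differs from its original name would then carry the original as `alias=...`, as in the outputs shown above.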

Andreas Albert (AndreasAlbertQC) · 2026-03-18T12:02:41Z

Makes sense, thanks! If you could fix pre-commit and coverage one last time, I think we are good to go now :)

Andreas Albert (AndreasAlbertQC)

After some more internal discussion with Oliver Borchert (@borchero) and Daniel Elsner (@delsner), we concluded that while we want to merge this, we are not yet 100% ready to commit to the interface and functionality being stable. Therefore, we decided to move this function to a new dataframely.experimental namespace, which will allow us to be more flexible with changes in the future, while still ensuring we include it in the dataframely package.

gab23r I took the liberty of implementing that on this branch :)

Oliver Borchert (@borchero) Daniel Elsner (@delsner) PTAL, I'd now like to merge this quickly because it's been waiting for a while.

Oliver Borchert (borchero)

Nice!

Oliver Borchert (borchero) · 2026-03-23T16:35:00Z

Thanks for your work gab23r! :) sorry this took a bit to get done

mvp infer schema

fa3b9fa

Copilot AI review requested due to automatic review settings March 5, 2026 09:52

gab23r requested review from Andreas Albert (AndreasAlbertQC), Oliver Borchert (borchero) and Daniel Elsner (delsner) as code owners March 5, 2026 09:52

Copilot started reviewing on behalf of gab23r March 5, 2026 09:53

Copilot AI reviewed Mar 5, 2026

gabriel added 2 commits March 5, 2026 11:20

increase code coverage

6c19bfa

copilot

f0e07fb

gab23r changed the title ~~Feat: Add dy.infer_schema~~ feat: Add dy.infer_schema Mar 5, 2026

github-actions bot added the enhancement New feature or request label Mar 5, 2026

pragma: no cover

7ee32cf

Andreas Albert (AndreasAlbertQC) requested changes Mar 12, 2026

gabriel added 2 commits March 16, 2026 15:09

more concise

50b723d

fix duplicated names

d6ee33c

code cov

142b036

replace _ by columns_{index}

8009321

remove comment in string

ef573f8

gabriel and others added 2 commits March 18, 2026 14:14

code cov

b1526a2

move to experimental, add docs

a454526

Andreas Albert (AndreasAlbertQC) approved these changes Mar 23, 2026

precommit

071ed89

Oliver Borchert (borchero) approved these changes Mar 23, 2026

fix

c712374

Andreas Albert (AndreasAlbertQC) enabled auto-merge (squash) March 23, 2026 16:42

Andreas Albert (AndreasAlbertQC) merged commit 6e06c73 into Quantco:main Mar 23, 2026
32 checks passed

gab23r deleted the infer-schema branch March 23, 2026 17:29
