
feat: Add dy.infer_schema #294

Merged
Andreas Albert (AndreasAlbertQC) merged 13 commits into Quantco:main from gab23r:infer-schema
Mar 23, 2026

Conversation

@gab23r
Contributor

@gab23r gab23r commented Mar 5, 2026

Fixes: #232

  • Add dy.infer_schema() function to generate dataframely schema code from a Polars DataFrame
  • Supports three output modes via return_type parameter:
    • None (default): prints schema to stdout for quick exploration
    • "string": returns schema code as a string
    • "schema": returns an actual Schema class for direct use
  • Handles all Polars types including nested types (List, Array, Struct) with proper inner nullability detection
  • Automatically handles invalid Python identifiers and keywords using aliases

This adds the following usage:

>>> import polars as pl
>>> import dataframely as dy
>>> df = pl.DataFrame({
...     "name": ["Alice", "Bob"],
...     "age": [25, 30],
...     "score": [95.5, None],
... })
>>> dy.infer_schema(df, "PersonSchema")
class PersonSchema(dy.Schema):
    name = dy.String()
    age = dy.Int64()
    score = dy.Float64(nullable=True)
>>> schema = dy.infer_schema(df, "PersonSchema", return_type="schema")
>>> schema.is_valid(df)
True

Not supported (potential future enhancements)

  • Assess min/max length of string values to suggest min_length/max_length constraints
  • Suggest Enum if there are fewer than 10-20 distinct string values in a column
  • Suggest Categorical if there are 50-100 distinct string values in a dataframe with >100k rows
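As a rough illustration of the first two ideas in the list above, the following sketch computes length bounds and an enum candidate for one string column. It is not part of this PR: the function name `suggest_string_constraints`, the `enum_threshold` default of 10, and the use of a plain Python list in place of a Polars column are all assumptions made for the example.

```python
def suggest_string_constraints(values, enum_threshold=10):
    """Suggest constraints for one string column (hypothetical helper)."""
    # Ignore nulls; this sketch assumes at least one non-null value remains.
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    distinct = sorted(set(non_null))

    suggestion = {"min_length": min(lengths), "max_length": max(lengths)}
    # Assumed cutoff: few distinct values hint at an Enum-like column.
    if len(distinct) < enum_threshold:
        suggestion["enum_values"] = distinct
    return suggestion


result = suggest_string_constraints(["a", "bb", None, "a"])
print(result)
```

A real implementation would presumably also take the row count into account, as the Categorical suggestion in the list does.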

@codecov

codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (b3edd6a) to head (c712374).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #294   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           54        56    +2     
  Lines         3121      3211   +90     
=========================================
+ Hits          3121      3211   +90     



Copilot AI left a comment


Pull request overview

This PR adds a new dy.infer_schema() function (addressing Issue #232) that generates dataframely schema code from a Polars DataFrame. The function inspects a DataFrame's column types and null counts to produce schema class definitions with appropriate column types and nullable annotations.
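The idea of turning column types and null counts into schema source code can be sketched in a few lines. This is only an illustration and not the code in this PR: it uses a plain dict of Python lists as a stand-in for a Polars DataFrame, and the names `TYPE_NAMES` and `sketch_schema_code` are hypothetical.

```python
# Hypothetical stand-in mapping from Python value types to dataframely column
# types; the actual function works from Polars dtypes instead.
TYPE_NAMES = {str: "String", int: "Int64", float: "Float64"}


def sketch_schema_code(columns, schema_name="Schema"):
    """Render schema source code from a dict of column name -> list of values."""
    lines = [f"class {schema_name}(dy.Schema):"]
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        # Assumes every column holds at least one non-null value.
        type_name = TYPE_NAMES[type(non_null[0])]
        # A column is marked nullable as soon as one null is observed.
        args = "nullable=True" if len(non_null) < len(values) else ""
        lines.append(f"    {name} = dy.{type_name}({args})")
    return "\n".join(lines)


code = sketch_schema_code(
    {"name": ["Alice", "Bob"], "age": [25, 30], "score": [95.5, None]},
    "PersonSchema",
)
print(code)
```

For the data used in the PR description, this prints the same `PersonSchema` class that is shown there.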

Changes:

  • New dataframely/_generate_schema.py module implementing infer_schema() with three return modes (print to stdout, return as string, or return as an executable Schema class), plus supporting helper functions for code generation.
  • Public API export of infer_schema in dataframely/__init__.py.
  • New test file tests/test_infer_schema.py covering basic types, nullable detection, datetime types, nested types, invalid identifiers, and round-trip validation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| dataframely/_generate_schema.py | New module with infer_schema() function and helpers for inferring schema from DataFrame columns, handling type mapping, identifier sanitization, and code generation. |
| dataframely/__init__.py | Exports infer_schema in the public API (import and __all__). |
| tests/test_infer_schema.py | Tests for string output mode across all supported types and round-trip validation via schema return mode. |

@gab23r gab23r changed the title Feat: Add dy.infer_schema feat: Add dy.infer_schema Mar 5, 2026
@github-actions github-actions bot added the enhancement New feature or request label Mar 5, 2026
@gab23r
Contributor Author

gab23r commented Mar 10, 2026

Hello Oliver Borchert (@borchero), is this implementation close to something mergeable?

Collaborator

Hey gab23r, thanks for the PR! I think the core functionality is pretty solid, but I'd like us to tune the API a little and align the structure of the code more closely with what we usually do in this repo. Could you also add an entry to the FAQ docs page?

@gab23r
Contributor Author

gab23r commented Mar 17, 2026

I fell down the rabbit hole of edge cases with duplicated names...
I ended up with this logic:

  • replace special characters with "_"
  • if the name didn't have any valid characters, name the column "_"
  • if this name is not already taken, use it; otherwise append an integer suffix so that the column name is not duplicated.

Example:

import polars as pl
import dataframely as dy

df = pl.DataFrame({
    "用户姓名": ["张三", "李四"],
    "出生年月日": ["1990-01-15", "1985-06-20"],
    "工作单位名称": ["北京科技有限公司", "上海金融集团"],
    "联系电话号码": ["13800138000", "13900139000"],
})
print(dy.infer_schema(df))
# class Schema(dy.Schema):
#     _ = dy.String(alias="用户姓名")
#     _1 = dy.String(alias="出生年月日")
#     _2 = dy.String(alias="工作单位名称")
#     _3 = dy.String(alias="联系电话号码")

class Schema(dy.Schema):
    _ = dy.String(alias="用户姓名")
    _1 = dy.String(alias="出生年月日")
    _2 = dy.String(alias="工作单位名称")
    _3 = dy.String(alias="联系电话号码")

Schema.sample().columns # ['用户姓名', '出生年月日', '工作单位名称', '联系电话号码']

@AndreasAlbertQC
Collaborator

Thanks for thinking this through! I think we are almost there :)

if the name didn't have any valid characters, name the column "_"

How about column_0, column_1, etc.? That would have two advantages:

  1. It would avoid having column names start with underscores, which can look like you want to define a private member
  2. It would make the naming structure consistent between the first and each subsequent column (as opposed to the current _ for the first, and then _1, _2, ... for the subsequent ones)

@gab23r
Contributor Author

gab23r commented Mar 18, 2026

I have replaced _ with column_{column_index}, but I need to keep the behavior that ensures duplicated names never appear. Look at these edge cases:

df = pl.DataFrame({"column_1": 0, "$": 0})
print(dy.infer_schema(df))
# class Schema(dy.Schema):
#     column_1 = dy.Int64()
#     column_1_1 = dy.Int64(alias="$")

df = pl.DataFrame({"col name": ["test"], "col_name": ["test"], "col_name_1": ["test"]})
result = dy.infer_schema(df)
# class Schema(dy.Schema):
#     col_name = dy.String(alias="col name")
#     col_name_1 = dy.String(alias="col_name")
#     col_name_1_1 = dy.String(alias="col_name_1")
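The naming rules discussed in this thread can be sketched as below. This is a minimal reconstruction from the comments and the two edge cases above, not the code merged in this PR: the function name `sketch_attribute_names` is hypothetical, the character class `[A-Za-z0-9_]` is an assumption about what counts as a valid character, and keywords or names starting with a digit are deliberately left unhandled.

```python
import re


def sketch_attribute_names(columns):
    """Map each column name to a unique attribute-style name (illustrative only)."""
    used = set()
    mapping = {}
    for index, name in enumerate(columns):
        # Rule 1: replace every character outside [A-Za-z0-9_] with "_".
        candidate = re.sub(r"[^0-9A-Za-z_]", "_", name)
        # Rule 2: nothing but underscores left, so fall back to a positional name.
        if candidate.strip("_") == "":
            candidate = f"column_{index}"
        # Rule 3: append "_1", "_2", ... until the name is unused.
        unique = candidate
        suffix = 1
        while unique in used:
            unique = f"{candidate}_{suffix}"
            suffix += 1
        used.add(unique)
        mapping[name] = unique
    return mapping


print(sketch_attribute_names(["column_1", "$"]))
# {'column_1': 'column_1', '$': 'column_1_1'}
```

Because every generated name is recorded in `used`, a later column whose own name equals an earlier generated name (such as "col_name_1" in the second edge case) is pushed to the next free suffix instead of colliding.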

@AndreasAlbertQC
Collaborator

Makes sense, thanks! If you could fix pre-commit and coverage one last time, I think we are good to go now :)

Collaborator

After some more internal discussion with Oliver Borchert (@borchero) and Daniel Elsner (@delsner), we concluded that while we want to merge this, we are not yet 100% ready to commit to the interface and functionality being stable. Therefore, we decided to move this function to a new dataframely.experimental namespace, which will allow us to be more flexible with changes in the future while still including it in the dataframely package.

gab23r I took the liberty of implementing that on this branch :)

Oliver Borchert (@borchero) Daniel Elsner (@delsner) PTAL, I'd now like to merge this quickly because it's been waiting for a while.

Member

Nice!

@borchero
Member

Thanks for your work gab23r! :) sorry this took a bit to get done

@AndreasAlbertQC Andreas Albert (AndreasAlbertQC) merged commit 6e06c73 into Quantco:main Mar 23, 2026
32 checks passed
@gab23r gab23r deleted the infer-schema branch March 23, 2026 17:29

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Generate Schema code from a dataframe

5 participants