Skip to content

Reading upper/lower bounds values with type promotions#3293

Merged
geruh merged 5 commits intoapache:mainfrom
rambleraptor:metadata-file-promote
May 1, 2026
Merged

Reading upper/lower bounds values with type promotions#3293
geruh merged 5 commits intoapache:mainfrom
rambleraptor:metadata-file-promote

Conversation

@rambleraptor
Copy link
Copy Markdown
Contributor

@rambleraptor rambleraptor commented Apr 27, 2026

Part of #3270

Rationale for this change

Thanks to @ndrluis for the suggestion!

upper_bounds and lower_bounds fields are binary. The type they represent may not be the true type if a type promotion has occurred.

We need to determine the type based on the actual type + the number of bytes.

Are these changes tested?

Unit test included

Are there any user-facing changes?

@rambleraptor rambleraptor marked this pull request as ready for review April 27, 2026 23:59
@rambleraptor rambleraptor changed the title Type promotion - metadata file reading Reading upper/lower bounds values with type promotions Apr 27, 2026
@rambleraptor
Copy link
Copy Markdown
Contributor Author

@kevinjqliu @geruh @Fokko please take a look when you can!

Copy link
Copy Markdown
Member

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising Alex. I did a quick search and the same from bytes call is in some other places like StrictMetricsEval, and somewhere in inspect would they need this as well?

Comment thread pyiceberg/expressions/visitors.py Outdated
Comment thread pyiceberg/expressions/visitors.py Outdated
Comment thread tests/expressions/test_evaluator.py Outdated
@rambleraptor
Copy link
Copy Markdown
Contributor Author

@geruh thanks for the review! I added it into StrictMetricsEval as well.

I'm torn on inspect. There's value in a human understanding that the data on disk stating that the file is of a particular type. I didn't change it, but I'm happy to be overridden on that one.

@rambleraptor rambleraptor requested a review from geruh April 29, 2026 20:11
@geruh
Copy link
Copy Markdown
Member

geruh commented Apr 30, 2026

I'm torn on inspect. There's value in a human understanding that the data on disk stating that the file is of a particular type. I didn't change it, but I'm happy to be overridden on that one.

I did a brief pass earlier so I wasn't able to give much context. Inspect methods are calling from_bytes for the bounds, and when type promotions come into play by that logic this will fail right. Because just like what you're laying down here, lower/upper_bound is the raw bytes from the manifest, and field.field_type comes from self.tbl.metadata.schema() which is the promoted type.

This can be proved by adding an integration test like:

schema = Schema(NestedField(1, "val", IntegerType(), required=False))

tbl = catalog.create_table("glue.dru", schema)

tbl.append(pa.table({"val": pa.array([10, 20, 30], type=pa.int32())}))

with tbl.update_schema() as upd:
    upd.update_column("val", field_type=LongType())
    
tbl.inspect.entries()  

running this gives me:

struct.error: unpack requires a buffer of 8 bytes

Arguably, this makes the conversions.py dispatch approach a little more appealing to me since we can fix it for everything in one place. but, would need to see or try it out. wdyt?

Comment thread pyiceberg/expressions/visitors.py Outdated
@rambleraptor rambleraptor requested a review from geruh April 30, 2026 21:16
Comment thread pyiceberg/conversions.py Outdated

@from_bytes.register(LongType)
def _(_: PrimitiveType, b: bytes) -> int:
if len(b) == 4:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just curious here when i linked the comment earlier to the java code I noticed they use < 8 any reason for the == 4 on these?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not 8 bytes, it's going to be 4 bytes. I'm not sure what kind of data would be 5-7 bytes. I changed it nonetheless just to match the Java implementation.

Comment thread tests/expressions/test_evaluator.py Outdated
Comment thread tests/expressions/test_evaluator.py Outdated
@rambleraptor rambleraptor requested a review from geruh May 1, 2026 00:41
Copy link
Copy Markdown
Member

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks @rambleraptor!!

@geruh geruh merged commit 2558dcd into apache:main May 1, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants