fix: incorrect Parquet INT96 timestamp values from ArrowReader#2301
fix: incorrect Parquet INT96 timestamp values from ArrowReader#2301blackmwk merged 13 commits intoapache:mainfrom
Conversation
|
Thanks for the feedback @emkornfield! You gave me a good idea on how I might be able to restructure this to address several of your comments. If you don't mind, I'll re-request a review probably later today. |
Sounds good, please take comments with a grain of salt, I'm fairly new the code base. |
|
Rewrote using This also eliminates the Made |
| } | ||
|
|
||
| impl ArrowSchemaVisitor for Int96CoercionVisitor<'_> { | ||
| type T = Field; |
There was a problem hiding this comment.
|
Most recent changes:
Thanks for the patience so far @blackmwk! |
|
|
||
| /// Files with embedded field IDs (branch 1 of schema resolution). | ||
| #[tokio::test] | ||
| async fn test_read_int96_timestamps_with_field_ids() { |
There was a problem hiding this comment.
As ut, I expect to see tests for the schema visitor only. I think it would be better to put these tests into reader module?
There was a problem hiding this comment.
Thanks for the review @blackmwk!
Moved integration tests from int96.rs to reader.rs
- The 5 tests that write Parquet files and read through
ArrowReadernow live in thereadertest module where they belong. - Converted
///doc comments to//to match codebase test style.
Added 11 unit tests for the schema visitor in int96.rs
test_coerce_timestamp_ns_to_us—Timestampfield ID → microsecondtest_coerce_timestamptz_ns_to_us—Timestamptzfield ID → microsecond, preserves timezonetest_no_coercion_when_iceberg_is_timestamp_ns—TimestampNs→ no changetest_no_coercion_when_iceberg_is_timestamptz_ns—TimestamptzNs→ no changetest_no_coercion_when_already_microsecond— already microsecond → no changetest_defaults_to_us_without_field_ids— no field ID metadata → falls back to microsecond (Iceberg Java behavior)test_defaults_to_us_when_iceberg_type_is_not_timestamp— field ID maps to non-timestamp Iceberg type → falls back to microsecondtest_coerce_preserves_field_metadata— field metadata survives coerciontest_coerce_timestamp_in_struct— nested struct coerciontest_coerce_timestamp_in_list— nested list coerciontest_coerce_timestamp_in_map_value— nested map value coercion
# Conflicts: # crates/iceberg/src/arrow/reader.rs
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @mbutrovich for this pr!
Which issue does this PR close?
What changes are included in this PR?
coerce_int96_timestamps()to patch the Arrow schema before reading, using arrow-rs's schema hint mechanism (ArrowReaderOptions::with_schema) to readINT96 columns at the resolution specified by the Iceberg table schema
timestamp/timestamptz→ microsecond,timestamp_ns/timestamptz_ns→ nanosecond, per the Iceberg specTimestampInt96Readerbehavior)ArrowSchemaVisitortraversalarrow/int96.rsmodule to keepreader.rsmanageablevisit_schemainarrow/schema.rspub(crate)so the coercion visitor can reuse the existing traversalAre these changes tested?
test_read_int96_timestamps_with_field_ids— files with embedded field IDs (branch 1)test_read_int96_timestamps_without_field_ids— migrated files without field IDs (branches 2/3)test_read_int96_timestamps_in_struct— INT96 inside a struct fieldtest_read_int96_timestamps_in_list— INT96 inside a list field (3-level Parquet LIST encoding)test_read_int96_timestamps_in_map— INT96 as map valuesapache/datafusion-comet#3856 and it passes with this change:
test: [DO NOT MERGE] test upstream iceberg-rust fix for #3856 datafusion-comet#3857