Skip to content

fix: incorrect Parquet INT96 timestamp values from ArrowReader#2301

Merged
blackmwk merged 13 commits intoapache:mainfrom
mbutrovich:int96
Apr 16, 2026
Merged

fix: incorrect Parquet INT96 timestamp values from ArrowReader#2301
blackmwk merged 13 commits intoapache:mainfrom
mbutrovich:int96

Conversation

@mbutrovich
Copy link
Copy Markdown
Collaborator

@mbutrovich mbutrovich commented Mar 31, 2026

Which issue does this PR close?

What changes are included in this PR?

  • Add coerce_int96_timestamps() to patch the Arrow schema before reading, using arrow-rs's schema hint mechanism (ArrowReaderOptions::with_schema) to read
    INT96 columns at the resolution specified by the Iceberg table schema
  • timestamp/timestamptz → microsecond, timestamp_ns/timestamptz_ns → nanosecond, per the Iceberg spec
  • Falls back to microsecond when no field ID is available (matching Iceberg Java's TimestampInt96Reader behavior)
  • Applied after all three schema resolution branches (with field IDs, name mapping, positional fallback) so the fix covers both native and migrated tables
  • Handles INT96 inside nested types (structs, lists, maps) via ArrowSchemaVisitor traversal
  • Visitor and tests live in a standalone arrow/int96.rs module to keep reader.rs manageable
  • Made visit_schema in arrow/schema.rs pub(crate) so the coercion visitor can reuse the existing traversal

Are these changes tested?

  • test_read_int96_timestamps_with_field_ids — files with embedded field IDs (branch 1)
  • test_read_int96_timestamps_without_field_ids — migrated files without field IDs (branches 2/3)
  • test_read_int96_timestamps_in_struct — INT96 inside a struct field
  • test_read_int96_timestamps_in_list — INT96 inside a list field (3-level Parquet LIST encoding)
  • test_read_int96_timestamps_in_map — INT96 as map values
  • All tests use dates outside the i64 nanosecond range (~1677-2262) to confirm the overflow is avoided
  • Apache DataFusion Comet used the repro test in
    apache/datafusion-comet#3856 and it passes with this change:
    test: [DO NOT MERGE] test upstream iceberg-rust fix for #3856 datafusion-comet#3857

Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
@mbutrovich
Copy link
Copy Markdown
Collaborator Author

Thanks for the feedback @emkornfield! You gave me a good idea on how I might be able to restructure this to address several of your comments. If you don't mind, I'll re-request a review probably later today.

@emkornfield
Copy link
Copy Markdown
Contributor

Thanks for the feedback @emkornfield! You gave me a good idea on how I might be able to restructure this to address several of your comments. If you don't mind, I'll re-request a review probably later today.

Sounds good, please take comments with a grain of salt, I'm fairly new the code base.

@mbutrovich mbutrovich requested a review from emkornfield March 31, 2026 19:11
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
@mbutrovich mbutrovich changed the title fix: incorrect Parquet INT96 values from ArrowReader fix: incorrect Parquet INT96 timestamp values from ArrowReader Mar 31, 2026
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
@mbutrovich
Copy link
Copy Markdown
Collaborator Author

Rewrote using ArrowSchemaVisitor, following the same pattern as MetadataStripVisitor. The visitor walks the Arrow schema and at each primitive, checks if it's Timestamp(Nanosecond) and looks up the Iceberg field by ID to determine the target resolution.

This also eliminates the parquet_schema parameter entirely — we no longer need Parquet column paths to find INT96 columns, since arrow-rs only produces Timestamp(Nanosecond) for INT96 (INT64 timestamps get the correct unit from Parquet metadata).

Made visit_schema pub(crate) so it's accessible from reader.rs.

@mbutrovich mbutrovich requested a review from blackmwk April 1, 2026 15:12
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
}

impl ArrowSchemaVisitor for Int96CoercionVisitor<'_> {
type T = Field;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be FieldRef?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to wait for #2310 for this one. The trait currently passes &Field in before_field/after_field, so getting a FieldRef requires cloning into a new Arc. Once #2310 changes the trait to pass &FieldRef, we can store Vec<FieldRef> and the push becomes a cheap Arc::clone.

@mbutrovich
Copy link
Copy Markdown
Collaborator Author

Most recent changes:

  1. Standalone module — Moved coerce_int96_timestamps, Int96CoercionVisitor, and all 5 INT96 tests to crates/iceberg/src/arrow/int96.rs. Registered in mod.rs.
  2. Concrete types — Replaced Self::T/Self::U with Field/ArrowSchema in all ArrowSchemaVisitor method signatures.
  3. Pop in after_* methods — Push in before_field/before_list_element/before_map_key/before_map_value, pop in the corresponding after_* methods. primitive/struct/list/map now peek with .last() instead of .pop().
  4. FieldRef in field_stack — Kept Vec<Field> for now since the trait passes &Field. Added // TODO(#2310) comment referencing the tracking issue.

Thanks for the patience so far @blackmwk!

@mbutrovich mbutrovich requested a review from blackmwk April 2, 2026 18:00
Comment thread crates/iceberg/src/arrow/int96.rs Outdated

/// Files with embedded field IDs (branch 1 of schema resolution).
#[tokio::test]
async fn test_read_int96_timestamps_with_field_ids() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As ut, I expect to see tests for the schema visitor only. I think it would be better to put these tests into reader module?

Copy link
Copy Markdown
Collaborator Author

@mbutrovich mbutrovich Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @blackmwk!

Moved integration tests from int96.rs to reader.rs

  • The 5 tests that write Parquet files and read through ArrowReader now live in the reader test module where they belong.
  • Converted /// doc comments to // to match codebase test style.

Added 11 unit tests for the schema visitor in int96.rs

  • test_coerce_timestamp_ns_to_usTimestamp field ID → microsecond
  • test_coerce_timestamptz_ns_to_usTimestamptz field ID → microsecond, preserves timezone
  • test_no_coercion_when_iceberg_is_timestamp_nsTimestampNs → no change
  • test_no_coercion_when_iceberg_is_timestamptz_nsTimestamptzNs → no change
  • test_no_coercion_when_already_microsecond — already microsecond → no change
  • test_defaults_to_us_without_field_ids — no field ID metadata → falls back to microsecond (Iceberg Java behavior)
  • test_defaults_to_us_when_iceberg_type_is_not_timestamp — field ID maps to non-timestamp Iceberg type → falls back to microsecond
  • test_coerce_preserves_field_metadata — field metadata survives coercion
  • test_coerce_timestamp_in_struct — nested struct coercion
  • test_coerce_timestamp_in_list — nested list coercion
  • test_coerce_timestamp_in_map_value — nested map value coercion

@mbutrovich mbutrovich requested a review from blackmwk April 15, 2026 14:25
mbutrovich and others added 2 commits April 15, 2026 14:58
# Conflicts:
#	crates/iceberg/src/arrow/reader.rs
Copy link
Copy Markdown
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich for this pr!

@blackmwk blackmwk merged commit a2f067d into apache:main Apr 16, 2026
20 checks passed
@mbutrovich mbutrovich deleted the int96 branch April 16, 2026 02:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: incorrect Parquet INT96 values from ArrowReader

3 participants