Implement PARQUET-2249: Introduce IEEE 754 total order#9619
Conversation
| // For floating point we need to compare NaN values until we encounter a non-NaN | ||
| // value which then becomes the new min/max. After this, only non-NaN values are | ||
| // evaluated. If all values are NaN, then the min/max NaNs as determined by | ||
| // IEEE 754 total order are returned. |
There was a problem hiding this comment.
This has me a bit worried. I need to do some benchmarking to make sure all the complicated NaN logic isn't killing performance.
There was a problem hiding this comment.
Agree, this method is on the hot path. I had a look at optimizing it, but could not get the compiler to generate nice auto-vectorized code for nan-handling yet. I think we can try optimizations in a followup, it would be more important to get the semantics correct first and make sure there are tests for edge cases.
If all values are NaN, then the min/max NaNs as determined by
// IEEE 754 total order are returned.
Does the current code correctly distinguish different NaN payloads according to their sign and bit patterns?
(Solved, github was hiding the changes to compare_greater in mod.rs)
|
Tests will fail until apache/parquet-testing#104 is merged. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (570f6de) to edfb9ab (merge-base) diff File an issue against this benchmark runner |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
run benchmark arrow_writer env:
BENCH_FILTER: float |
This comment was marked as outdated.
This comment was marked as outdated.
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (570f6de) to edfb9ab (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
Thanks @etseidl -- I finished my review and i think the code was easy to follow, well tested and I thought overall (really) nice. Thank you 🙏
The only thing I am concerned about is that it seems to cause some slowdown on write, as shown by the benchmarks. This may be inevitable as we are now doing more work, but it would be great if we could sort that out and try and avoid regressions before we merged this
| let nan_count = self.page_metrics.num_page_nans.unwrap_or(0); | ||
| match i64::try_from(nan_count) { | ||
| Ok(count) => Some(count), | ||
| _ => Some(i64::MAX), |
There was a problem hiding this comment.
This is probably ok -- another perhaps safer thing would be to return None if the value overflows i64
I doubt there is any actual dataset that has than i64::MAX values -- but I can imagine an invalid / adversarial input (or something that underflowed on the writer) could have such a value 🤷
| fn test_ieee754_total_order_float_only_nan() { | ||
| // Test IEEE 754 total order for f32 | ||
| // Order should be: -NaN < -Inf < -1.0 < -0.0 < +0.0 < 1.0 < +Inf < +NaN | ||
| let neg_nan1 = f32::from_bits(0xffc00000); |
There was a problem hiding this comment.
TIL there are multiple encodings of NaN: https://en.wikipedia.org/wiki/NaN
| #[test] | ||
| fn test_ieee754_total_order_float_only_nan() { | ||
| // Test IEEE 754 total order for f32 | ||
| // Order should be: -NaN < -Inf < -1.0 < -0.0 < +0.0 < 1.0 < +Inf < +NaN |
There was a problem hiding this comment.
This comment seems incorrect as the test is for different nan encodings
There was a problem hiding this comment.
Sorry, copy/paste leftovers. will fix
|
|
||
| /// Returns sort order for a physical/logical type. | ||
| /// | ||
| /// `is_type_defined` indicates whether the column order for this type is |
There was a problem hiding this comment.
Minor nit: why not just pass the actual ColumOrder enum (rather than a flag?) That would perhaps make the code more self documenting as well as allow extension in the future without breaking API changes
There was a problem hiding this comment.
There's kind of a chicken/egg thing going on here. TYPE_DEFINED_ORDER contains a SortOrder, which is not yet known when trying to create the ColumnOrder. I think this function should not be public, and instead SortOrder should be obtained via ColumnOrder::sort_order.
How do you feel about making this pub(crate)?
There was a problem hiding this comment.
Maybe a follow-up PR...I'd like to change the only other place this function is used in the crate and maybe make this private. Or deprecate this and move the guts into a private fn.
| } | ||
|
|
||
| #[test] | ||
| fn test_arrow_gh_47662() { |
There was a problem hiding this comment.
Looks like this is a reasonable behavior for this file. Here is the issue
| f16::from_f32(-5.), | ||
| f16::from_f32(-4.), | ||
| f16::from_f32(-0.), | ||
| f16::from_f32(0.), |
There was a problem hiding this comment.
why change the tests? If the min really is -0 shouldn't that be reflected?
There was a problem hiding this comment.
The min actually is 0.0. TypeDefined ordering required replacing min==0.0 with -0.0. Total ordering removes that requirement.
| f16::from_bits(0x8000), | ||
| ]; | ||
|
|
||
| fn validate_float16_metadata( |
There was a problem hiding this comment.
It is unfortunate we have ~ 3 copies of the same code with slight differences. However the effort to macroize or template the thing t handle the different floating point types is probably not worth it
|
run benchmark arrow_writer env:
BENCH_FILTER: float |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
run benchmark arrow_writer env:
BENCH_FILTER: list |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
This comment was marked as outdated.
This comment was marked as outdated.
|
run benchmark arrow_writer env: |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing total_order_514 (570f6de) to edfb9ab (merge-base) diff File an issue against this benchmark runner |
It took me a little while to figure out how to do this (see the trail of disaster comments I left on apache/arrow-rs#9619) Add instructions to the README for running specific benchmark suites with filters.
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
The only thing that looks suspicious is some of the However I think @HippoBaro also saw some potential strangeness with those runs, so maybe it is a measurement thing |
Indeed. From #9619 (comment) |
Shout out to @Xuanwo who wrote all of the |
Which issue does this PR close?
Rationale for this change
This takes the implementation done by @Xuanwo (#8158) and updates it to the new thrift format and recent changes to the original proposal (apache/parquet-format#514).
What changes are included in this PR?
Adds needed thrift structures as well as NaN counts for pages and column chunks.
Are these changes tested?
Yes, new tests added (more may be needed).
Are there any user-facing changes?
Yes. This adds the
IEEE_754_TOTAL_ORDERvariant to the publicColumnOrderenum, andTOTAL_ORDERto the publicSortOrderenum, which are technically breaking API changes. This also adds anan_countargument to the public functionColumnIndexBuilder::append.Non-breaking changes include adding
nan_countfields to theValueStatisticsandColumnIndexstructs.Behaviorally, this will now order all floating point statistics using IEEE754 total order, and all floating point columns will use the new
IEEE_754_TOTAL_ORDERcolumn order in theFileMetaData.column_orderslist. Newer readers that recognize the new column order may now safely use the statistics for chunk and page pruning. Older readers will not recognize this column order and should continue to ignore the statistics.