GH-533: Add ALP (Adaptive Lossless floating-Point) encoding specification #557
prtkgaur wants to merge 4 commits into
| ##### Header (7 bytes) | ||
| All multi-byte values are little-endian. |
| All multi-byte values are little-endian. | |
| All multi-byte values are stored in little-endian order. |
| |--------|-------|------|------|-------------| | ||
| | 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) | | ||
| | 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + bit-packing) | | ||
| | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) | |
| | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) | | |
| | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Recommended default: 10 (vector size 1024) | |
| | Offset | Field | Size | Type | Description | | ||
| |--------|-------|------|------|-------------| | ||
| | 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) | |
we think ALP-RD fits in well here? I forget what the extension point is and why we were OK keeping this field, but not version.
Changing the code to read ALP's newer bit layout would be difficult to apply across all implementations, but AlpRD is a completely new implementation.
Added a note to the field description.
| |--------|-------|------|------|-------------| | ||
| | 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) | | ||
| | 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + bit-packing) | | ||
| | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) | |
| | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) | | |
| | 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in the inclusive range: \[3, 15\]. Default: 10 (vector size 1024) | |
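A minimal sketch of reading the three header fields shown in the table above. The function name `parse_header_prefix` is hypothetical, and the remaining four bytes of the 7-byte header are not described in the quoted hunk, so they are left uninterpreted here.

```python
import struct

def parse_header_prefix(page: bytes) -> dict:
    """Read offsets 0..2 of the 7-byte ALP header (hypothetical helper)."""
    if len(page) < 7:
        raise ValueError("ALP header is 7 bytes")
    # Three consecutive uint8 fields; endianness is irrelevant for single bytes.
    compression_mode, integer_encoding, log_vector_size = struct.unpack_from("<BBB", page, 0)
    if compression_mode != 0:
        raise ValueError("compression_mode must be 0 (ALP)")
    if integer_encoding != 0:
        raise ValueError("integer_encoding must be 0 (FOR + bit-packing)")
    if not 3 <= log_vector_size <= 15:
        raise ValueError("log_vector_size must be in the inclusive range [3, 15]")
    return {
        "compression_mode": compression_mode,
        "integer_encoding": integer_encoding,
        "log_vector_size": log_vector_size,
        "vector_size": 1 << log_vector_size,
    }
```

With the default `log_vector_size` of 10 this yields a `vector_size` of 1024.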
| **Note:** The number of elements per vector and the packed data size are NOT stored | ||
| in the header. They are derived: | ||
| * Elements per vector: `vector_size` for all vectors except the last, which may be smaller. | ||
| * Packed data size: `ceil(num_elements_in_vector * bit_width / 8)`. |
bit_width isn't in the header either, so it is a little strange to call this out here?
I also highlighted this as confusing when reading the spec. I recommend removing this sentence, as the packed data size is covered in the "Vector Format" section below.
comment regarding packed data size has been removed.
| **Note:** The number of elements per vector and the packed data size are NOT stored | ||
| in the header. They are derived: | ||
| * Elements per vector: `vector_size` for all vectors except the last, which may be smaller. |
This seems like a slightly strange callout, since it is covered explicitly on line 457, and log_vector_size is stored in the header?
I agree this is redundant -- I think leaving the note is valuable clarification
Note: The number of elements per vector and the packed data size are NOT stored
in the header, they are derived
However the other items are not
Removed the packed data size bullet, kept the elements-per-vector clarification.
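The elements-per-vector rule that was kept can be sketched as follows. `num_elements` is assumed to come from outside the quoted header fields (for example the page's value count), and `vector_lengths` is a hypothetical name.

```python
def vector_lengths(num_elements: int, log_vector_size: int) -> list:
    """Element count of each vector: vector_size for all but possibly the last."""
    vector_size = 1 << log_vector_size
    full_vectors, remainder = divmod(num_elements, vector_size)
    lengths = [vector_size] * full_vectors
    if remainder:
        # Only the last vector may be smaller than vector_size.
        lengths.append(remainder)
    return lengths
```

For 2500 values and `log_vector_size = 10`, this gives two full vectors of 1024 and a final vector of 452.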
| values. Each offset gives the byte position of the corresponding vector's data, | ||
| measured from the start of the offset array itself. | ||
| The first offset equals `num_vectors * 4` (pointing just past the offset array). |
| The first offset equals `num_vectors * 4` (pointing just past the offset array). | |
| The first offset always equals `num_vectors * 4` (pointing just past the offset array). |
Let's be explicit here that we don't support padding.
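A sketch of how an offset array with no padding could be built. It assumes each offset is an unsigned 32-bit little-endian integer (the 4-byte width follows from `num_vectors * 4`, the signedness is an assumption), and `build_offset_array` is a hypothetical name.

```python
import struct

def build_offset_array(vector_byte_sizes: list) -> bytes:
    """Offsets are measured from the start of the offset array itself."""
    num_vectors = len(vector_byte_sizes)
    offsets = []
    position = num_vectors * 4  # first vector starts right after the array
    for size in vector_byte_sizes:
        offsets.append(position)
        position += size  # no padding: next vector starts where this one ends
    # Assumption: uint32, little-endian, matching the header's byte order.
    return struct.pack(f"<{num_vectors}I", *offsets)
```

Three vectors of 10, 20, and 5 bytes give offsets 12, 22, and 42.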
| Data section sizes: | ||
| | Section | Size Formula | Description | | ||
| |---------------------|-----------------------------|------------------------------| | ||
| | PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values | |
| | PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values | | |
| | PackedValues | ceil(`vector_size` * bit\_width / 8) | Bit-packed delta values | |
| |---------------------|-----------------------------|------------------------------| | ||
| | PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values | | ||
| | ExceptionPositions | num\_exceptions * 2 bytes | uint16 indices of exceptions | | ||
| | ExceptionValues | num\_exceptions * sizeof(T) | Original float/double values | |
| | ExceptionValues | num\_exceptions * sizeof(T) | Original float/double values | | |
| | ExceptionValues | num\_exceptions * sizeof(encoded type) (float=4 and double=8) | Original float/double values | |
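The three size formulas in the table can be checked with a short helper. This is a sketch only; `section_sizes` and its parameter names are hypothetical, and `value_size` is 4 for float and 8 for double as in the suggestion above.

```python
def section_sizes(num_values: int, bit_width: int, num_exceptions: int, value_size: int) -> dict:
    """Byte sizes of the data sections of one vector, per the table's formulas."""
    return {
        # ceil(num_values * bit_width / 8) using integer arithmetic
        "packed_values": (num_values * bit_width + 7) // 8,
        # one uint16 index per exception
        "exception_positions": num_exceptions * 2,
        # the original float/double value of each exception
        "exception_values": num_exceptions * value_size,
    }
```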
| The FOR-encoded deltas, bit-packed into `ceil(num_elements_in_vector * bit_width / 8)` bytes. | ||
| Values are packed from the least significant bit of each byte to the most significant bit, | ||
| in groups of 8 values, using the same bit-packing order as the |
Where does the group of 8 values come in? Wouldn't this mess up the number-of-bytes math?
Good catch — removed 'groups of 8' phrasing. Now simply references the same LSB-first packing order as RLE/Bit-Packing Hybrid.
(I think RleBitPackHybrid was in my mind at that point :) )
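To illustrate the LSB-first order without any grouping, here is a small sketch. `pack_lsb_first` is a hypothetical helper, not taken from any implementation; it treats the whole vector as one little-endian bit stream, so the byte count is exactly `ceil(n * bit_width / 8)`.

```python
def pack_lsb_first(values: list, bit_width: int) -> bytes:
    """Pack non-negative integers, filling each byte from its least significant bit."""
    accumulator = 0
    for index, value in enumerate(values):
        if value < 0 or value >> bit_width:
            raise ValueError("value does not fit in bit_width bits")
        # Value i occupies bits [i * bit_width, (i + 1) * bit_width) of the stream.
        accumulator |= value << (index * bit_width)
    num_bytes = (len(values) * bit_width + 7) // 8
    return accumulator.to_bytes(num_bytes, "little")
```

Packing the values 0 through 7 at 3 bits each gives the bytes `0x88 0xC6 0xFA`, which matches the bit-packing example in the RLE/Bit-Packing Hybrid section of the Parquet encodings document.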
| The encoding uses two separate multiplications (not a single multiplication by | ||
| `10^(e-f)`, and not division) to ensure that implementations produce identical | ||
| floating-point rounding across languages. The powers of 10 MUST be stored as | ||
| precomputed floating-point constants (i.e., literal values like `1e-3f`), not |
Why can't they be precomputed at runtime?
You're right — this shouldn't mandate literals vs runtime computation. Reworded to require that encoder and decoder use identical power-of-10 values, without prescribing how they're obtained.
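A sketch of the two-multiplication form for doubles. Several things here are assumptions: the tables are built by parsing decimal literals, the exponent range 0..18 is taken from a later review comment, Python's built-in `round` (round half to even) stands in for `fast_round`, and the function names are hypothetical.

```python
# Assumption: constants obtained by parsing the decimal literals 1e0..1e18 and 1e-0..1e-18.
POW10 = [float(f"1e{i}") for i in range(19)]
INV_POW10 = [float(f"1e-{i}") for i in range(19)]

def alp_encode(value: float, e: int, f: int) -> int:
    # Two separate multiplications, not one multiplication by 10^(e-f) and not a division.
    return round(value * POW10[e] * INV_POW10[f])

def alp_decode(encoded: int, e: int, f: int) -> float:
    # Mirror image of the encode step: multiply by 10^f, then by 10^(-e).
    return encoded * POW10[f] * INV_POW10[e]
```

With `e = 4` and `f = 3`, the value 1500.0 encodes to 15000 and decodes back to 1500.0.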
| | Type | Magic Number | Formula | | ||
| |--------|-----------------------------------|----------------------------------| | ||
| | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` | | ||
| | DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(long)((value + magic) - magic)` | |
| | DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(long)((value + magic) - magic)` | | |
| | DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(int64_t)((value + magic) - magic)` | |
| | Type | Magic Number | Formula | | ||
| |--------|-----------------------------------|----------------------------------| | ||
| | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` | |
| | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` | | |
| | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int32_t)((value + magic) - magic)` | |
| ``` | ||
| +-------------------------------------------------------------------+ | ||
| | | | ||
| | encoded = round( value * 10^e * 10^(-f) ) | |
| | encoded = round( value * 10^e * 10^(-f) ) | | |
| | encoded = fast_round( value * 10^e * 10^(-f) ) | |
alamb left a comment
Thank you @prtkgaur and @emkornfield -- I started going through this proposal as well. I haven't made it through the entire thing, but I left comments on what I have made it through so far.
@prtkgaur how would you like to address feedback on this PR? Would you like to process the comments? I would also be happy to make a PR with proposed edits to your branch if that would be better. Please just let me know
| ALP works by converting floating-point values to integers using decimal scaling, | ||
| then applying Frame of Reference (FOR) encoding and bit-packing. Values that | ||
| cannot be losslessly converted are stored as exceptions. The encoding achieves | ||
| high compression for decimal-like floating-point data (e.g., monetary values, | ||
| sensor readings) while remaining fully lossless. |
You have more summary at the end of the encoding, but I think a few more sentences in the intro would help people understand this more easily
| ALP works by converting floating-point values to integers using decimal scaling, | |
| then applying Frame of Reference (FOR) encoding and bit-packing. Values that | |
| cannot be losslessly converted are stored as exceptions. The encoding achieves | |
| high compression for decimal-like floating-point data (e.g., monetary values, | |
| sensor readings) while remaining fully lossless. | |
| ALP works by converting floating-point values to integers using decimal scaling, | |
| then applying Frame of Reference (FOR) encoding and bit-packing. Values that | |
| cannot be losslessly converted are stored as exceptions. The encoding achieves | |
| high compression for decimal-like floating-point data (e.g., monetary values, | |
| sensor readings) while remaining fully lossless. Values do not depend on | |
| each other, which enables quick random access and parallel encode/decode. |
Agreed — added sentence about value independence enabling random access and parallel encode/decode.
Incorporate review comments from emkornfield and alamb on PR apache#557:
I expect to have a second round of feedback tomorrow.
alamb left a comment
Ok, I made it through the spec again. I think it is really quite close.
Let me know what you think about the comments and what you think about the idea of starting to draft a blog post.
| | ------------------------------------- | -------------- | | ||
| | [Bit-packed (Deprecated)](#BITPACKED) | BIT_PACKED = 4 | | ||
| | Encoding | ID | Supported Types | |
Something seems strange to me about this chart. The version on main (lines 28 to 40 in 96edf77) doesn't appear in this document.
I would expect this PR that adds the new ALP encoding to add a new entry to the existing table, rather than add an entirely new table.
Maybe we need to merge up from main to this branch (I am happy to do this and I think I have the permissions to push to your branch, but I don't want to mess things up on you)
You're right, this branch was forked before the summary table was added in #550. Rebased onto upstream/master and added ALP as a row to the existing table.
| ##### Fast Rounding | ||
| The rounding function uses a "magic number" technique for branchless rounding: |
I think it would help to use terminology here that is consistent with the above.
| The rounding function uses a "magic number" technique for branchless rounding: | |
| The `fast_round` function uses a "magic number" technique for branchless rounding. | |
| `fast_round(value)` is defined as follows: |
Done. Updated to use fast_round consistently and added "is defined as follows".
| | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int32_t)((value + magic) - magic)` | | ||
| | DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(int64_t)((value + magic) - magic)` | | ||
| For negative values, the signs are reversed: `(int32_t)((value - magic) + magic)` for FLOAT, `(int64_t)((value - magic) + magic)` for DOUBLE. |
I don't understand this sentence. The C++ implementation does not seem to change sign order from what I can tell:
    /// \brief Convert a float to an int without rounding
    static inline auto FastRound(T n) -> SignedExactType {
      n = n + Constants::kMagicNumber - Constants::kMagicNumber;
      return static_cast<SignedExactType>(n);
    }
Also, the "Fast Rounding" section of the ALP paper doesn't mention sign reversal that I can find.
You're right -- the sign reversal was incorrect. The C++ implementation uses a single formula for both positive and negative values (n + magic - magic). Removed the paragraph.
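For reference, the single-formula form for DOUBLE can be sketched as below. Python floats are IEEE 754 doubles, so this only illustrates the 64-bit row of the table; the FLOAT row would need 32-bit arithmetic and is not shown. The inputs are assumed to be well below 2^51 in magnitude.

```python
MAGIC_DOUBLE = float(2**51 + 2**52)  # 6,755,399,441,055,744

def fast_round(value: float) -> int:
    """Branchless rounding: adding the magic number pushes the sum into a range
    where doubles are spaced exactly 1 apart, so the addition itself rounds
    (to nearest, ties to even); subtracting the magic number recovers the integer."""
    return int((value + MAGIC_DOUBLE) - MAGIC_DOUBLE)
```

Ties go to the even neighbour, so `fast_round(2.5)` is 2 and `fast_round(3.5)` is 4.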
| (the encoded integer of the first non-exception value, or 0 if all values | ||
| are exceptions) before FOR encoding. This keeps the FOR range tight. | ||
| ##### Frame of Reference and Bit-Packing |
This section has a worked example of FOR / bitpacking so perhaps we could write it like
| ##### Frame of Reference and Bit-Packing | |
| ##### Example: Frame of Reference and Bit-Packing |
| ##### Frame of Reference and Bit-Packing | ||
| After decimal encoding and exception substitution: |
| After decimal encoding and exception substitution: | |
| Given the following data after decimal encoding and exception substitution: |
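A sketch of the placeholder substitution and Frame of Reference steps this example section describes. It assumes the frame is the minimum of the substituted integers and that the bit width is the number of bits needed for the largest delta; both helper names are hypothetical.

```python
def substitute_placeholders(encoded: list, is_exception: list) -> list:
    """Replace exception slots with the first non-exception value (0 if there is none)."""
    placeholder = next((v for v, exc in zip(encoded, is_exception) if not exc), 0)
    return [placeholder if exc else v for v, exc in zip(encoded, is_exception)]

def for_encode(encoded: list) -> tuple:
    """Frame of Reference: subtract the minimum so the deltas are small and non-negative."""
    frame = min(encoded)  # assumption: the frame of reference is the minimum
    deltas = [v - frame for v in encoded]
    bit_width = max(deltas).bit_length()
    return frame, deltas, bit_width
```

Using the encoded integers from the worked example elsewhere in the PR (15000, a NaN exception, 25000, 3333), the substituted vector is `[15000, 15000, 25000, 3333]`, the frame is 3333, and the largest delta 21667 needs 15 bits.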
| + sum(vector_bytes for each vector) // all vectors | ||
| ``` | ||
| #### Constants Reference |
These appear to be C/C++ implementation details -- I am not sure they add any value to the spec.
| encoded size. However, ALP and Byte Stream Split can be complementary: ALP | ||
| exploits decimal structure while Byte Stream Split exploits byte-level correlation. | ||
| #### Size Calculations |
I don't think we need to add this to the Parquet spec -- it isn't required to implement the encoder/decoder for ALP.
| converted are stored separately as *exceptions*. The encoding achieves high | ||
| compression for decimal-like floating-point data (e.g., monetary values, sensor | ||
| readings) while remaining fully lossless. Each value is encoded independently, | ||
| enabling random access to individual vectors and parallel encode/decode. |
| enabling random access to individual vectors and parallel encode/decode. | |
| enabling random access to individual values and parallel encode/decode. |
| floating-point rounding across languages. Implementations must ensure that the | ||
| encoder and decoder use identical power-of-10 values for a given exponent. |
I don't understand what "use identical power-of-10 values" means. ALL encoders and decoders must use the exact same values (and floating-point arithmetic), as I understand it.
Yes, what you said. Reworded to: "All implementations MUST use the exact same floating-point arithmetic and power-of-10 constants to guarantee cross-language interoperability."
| To avoid the cost of exhaustive search on every vector, implementations | ||
| SHOULD use sampling to select up to 5 candidate (exponent, factor) | ||
| combinations (the "encoding preset") at the start of each column chunk. | ||
| Each vector then searches only those 5 candidates. |
I think SHOULD is too strong a word here. Maybe we could soften it to a suggestion.
| To avoid the cost of exhaustive search on every vector, implementations | |
| SHOULD use sampling to select up to 5 candidate (exponent, factor) | |
| combinations (the "encoding preset") at the start of each column chunk. | |
| Each vector then searches only those 5 candidates. | |
| To avoid the cost of exhaustive search on every vector, implementations | |
| can use a sampling approach. One such approach, described in the paper, is to | |
| select up to 5 candidate (exponent, factor) combinations (the "encoding preset") | |
| at the start of each column chunk, and when encoding each vector, | |
| test each of the 5 candidates for the fewest exceptions. |
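The per-vector step of the suggested approach (test each preset candidate and keep the one with the fewest exceptions) might look like the sketch below. How the preset itself is sampled is not shown. The helper names are hypothetical, Python's `round` stands in for `fast_round`, the out-of-range check on the integer type is omitted, and only doubles are handled.

```python
import math

POW10 = [float(f"1e{i}") for i in range(19)]
INV_POW10 = [float(f"1e-{i}") for i in range(19)]

def count_exceptions(values: list, e: int, f: int) -> int:
    """Number of values that do not survive encode/decode with this (e, f)."""
    exceptions = 0
    for v in values:
        negative_zero = v == 0 and math.copysign(1.0, v) < 0
        scaled = v * POW10[e] * INV_POW10[f] if math.isfinite(v) else v
        if not math.isfinite(scaled) or negative_zero:
            exceptions += 1
            continue
        encoded = round(scaled)
        if encoded * POW10[f] * INV_POW10[e] != v:
            exceptions += 1
    return exceptions

def pick_from_preset(values: list, preset: list) -> tuple:
    """Return the (exponent, factor) candidate with the fewest exceptions."""
    return min(preset, key=lambda ef: count_exceptions(values, ef[0], ef[1]))
```

For the values `[1.5, 2.25, 3.0]` and the preset `[(0, 0), (1, 0), (2, 0)]`, only `(2, 0)` encodes every value without exceptions, so it is selected.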
| readings) while remaining fully lossless. Each value is encoded independently, | ||
| enabling random access to individual vectors and parallel encode/decode. | ||
| #### Overview |
The algorithm description is so long that I think it should be moved to a separate file that we would link to here. In this file we would just keep the description that is before the Overview.
Yes this makes sense. I too thought it became long and having a separate file would be good.
Let me take a stab at it after I address the above comments and get approval. Else the comment threads on github will get lost.
Maybe similar to how it is done for bloom filter: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
I also think it would be ok if we moved the content to a separate file as a follow-on PR (after the spec change is approved), as in my mind moving the content to a separate file does not affect the content of the spec.
Add the encoding specification for ALP (encoding value 10) to Encodings.md. ALP compresses FLOAT and DOUBLE columns by converting values to integers via decimal scaling, then applying Frame of Reference encoding and bit-packing. Values that cannot be losslessly round-tripped are stored as exceptions.
The spec covers:
- Page layout: 7-byte header, offset array, compressed vectors
- Vector format: AlpInfo, ForInfo, packed values, exception data
- Encoding math: two-step multiplication for cross-language consistency
- Parameter selection, exception detection, and decoding steps
Based on the paper "ALP: Adaptive Lossless floating-Point Compression" (Afroozeh and Boncz, SIGMOD 2024). Wire format matches the C++ Arrow and Java parquet-java implementations.
Incorporate review comments from emkornfield and alamb on PR apache#557:
- Clarify no padding between vectors in offset array description
- Use 'sizeof(encoded type) (float=4 and double=8)' per reviewer suggestion
- Remove Characteristics, Size Calculations, Constants Reference sections
- Consolidate three examples into one worked example with f!=0 and exceptions
- Remove incorrect sign-reversal claim for fast_round on negative values
- Soften sampling recommendation from SHOULD to suggestion
- Fix "individual vectors" → "individual values" for random access
- Clarify power-of-10 interop as MUST requirement
- Use consistent fast_round terminology throughout
iemejia left a comment
Review focusing on spec consistency, numerical correctness under IEEE 754, and completeness for cross-language implementors.
| | Type | Magic Number | Formula | | ||
| |--------|-----------------------------------|----------------------------------| | ||
| | FLOAT | 2^22 + 2^23 = 12,582,912 | `(int32_t)((value + magic) - magic)` | | ||
| | DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(int64_t)((value + magic) - magic)` | |
The formula shown here only covers non-negative values. For negative values, the expression value + magic may fall below the binade [2^52, 2^53) (for double), entering a region where ULP = 0.5 instead of 1.0, which produces incorrect rounding.
Known implementations (C++/DuckDB, Java) use sign branching:
if value >= 0:
result = (int64_t)((value + magic) - magic)
else:
result = (int64_t)((value - magic) + magic)
While the round-trip exception check prevents data corruption (incorrectly rounded values become exceptions), omitting sign branching causes a higher exception rate for datasets with negative values, degrading compression ratios.
Suggestion: document both branches in the spec table, or at minimum add a note that implementations SHOULD use sign branching for negative values.
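One way to examine this for DOUBLE is to run both forms side by side. The sketch below is only a checking aid with hypothetical function names; it uses Python floats (IEEE 754 doubles), collects any sample inputs on which the single formula and the sign-branched formula differ, and says nothing about FLOAT or about magnitudes at or beyond 2^51.

```python
MAGIC = float(2**51 + 2**52)

def fast_round_single(value: float) -> int:
    # One formula for every sign.
    return int((value + MAGIC) - MAGIC)

def fast_round_branched(value: float) -> int:
    # Sign-branched form as written in the comment above.
    if value >= 0:
        return int((value + MAGIC) - MAGIC)
    return int((value - MAGIC) + MAGIC)

samples = [-0.5, -1.5, -2.5, -1234.56, -1000000.25]
disagreements = [v for v in samples if fast_round_single(v) != fast_round_branched(v)]
```

Extending `samples` with values from a real dataset would show whether the choice of form affects the exception rate in practice.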
| | 0 | 1500.0 | 15000.0 | 15000 | 1500.0 | No | | ||
| | 1 | NaN | - | - | - | Yes (NaN) | | ||
| | 2 | 2500.0 | 25000.0 | 25000 | 2500.0 | No | | ||
| | 3 | 333.3 | 3333.0 | 3333 | 333.3 | No | |
The value 333.3 with (exponent=4, factor=3) may not survive a round-trip under strict IEEE 754 arithmetic:
- Encode: 333.3 * 10^4 * 10^(-3) = 333.3 * 10000.0 * 0.001. Since 0.001 is not exactly representable in IEEE 754 double, the product is approximately 3333.000000000000069..., which rounds to 3333.
- Decode: 3333 * 10^3 * 10^(-4) = 3333 * 1000 * 0.0001. Since 0.0001 is not exactly representable, the product may not bit-equal 333.3.
If decode(encode(333.3)) != 333.3 at the bit level, this value should be classified as an exception. The example may be using idealized math rather than actual IEEE 754 double semantics.
Suggestion: verify this specific round-trip in an actual IEEE 754 implementation, or replace 333.3 with a value that provably round-trips (e.g., 3000.0, 500.0, or any value where the scaling produces an exact result).
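The suggested verification can be scripted. This is a sketch for doubles only, with a hypothetical `round_trips` helper, constants parsed from decimal literals, and Python's `round` standing in for `fast_round`; the comparison is done on the raw bytes so that it is a bit-level check.

```python
import struct

POW10 = [float(f"1e{i}") for i in range(19)]
INV_POW10 = [float(f"1e-{i}") for i in range(19)]

def round_trips(value: float, e: int, f: int) -> bool:
    """True if decode(encode(value)) reproduces value bit for bit."""
    encoded = round(value * POW10[e] * INV_POW10[f])
    decoded = encoded * POW10[f] * INV_POW10[e]
    # Compare the 8-byte representations rather than using ==.
    return struct.pack("<d", decoded) == struct.pack("<d", value)
```

Calling `round_trips(333.3, 4, 3)` answers the question for the value in the example; the other rows (1500.0 and 2500.0) can be checked the same way.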
| | NaN | `NaN` | Cannot convert to integer | | ||
| | Infinity | `+Inf`, `-Inf` | Cannot convert to integer | | ||
| | Negative zero | `-0.0` | Would become `+0.0` after encoding | | ||
| | Out of range | value * 10^e > INT32\_MAX | Exceeds target integer limits | |
This condition is under-specified in two ways:
- For DOUBLE, the target integer type is int64, not int32. The condition should reference INT64_MAX for doubles.
- The actual safe casting limit is not exactly INT32_MAX/INT64_MAX but the largest floating-point value that can be safely converted to the target integer type without undefined behavior. For float→int32 this is approximately 2147483520.0f (not 2147483647), and for double→int64 approximately 9223372036854774784.0 (not 9223372036854775807). These differ because not all integers near the max are exactly representable in the source float/double type.
Suggestion: either specify the exact limits for each type, or reword to: "the scaled value exceeds the range that can be losslessly represented in the target integer type (int32 for FLOAT, int64 for DOUBLE)."
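The two limits quoted above can be derived rather than memorised: each is the largest float or double strictly below 2^31 or 2^63. A sketch with a hypothetical helper that steps one representable value down by decrementing the bit pattern (valid for positive finite inputs):

```python
import struct

def largest_below(x: float, fmt: str) -> float:
    """Previous representable value below positive x; fmt is 'f' (float32) or 'd' (float64)."""
    int_fmt = {"f": "<I", "d": "<Q"}[fmt]
    (bits,) = struct.unpack(int_fmt, struct.pack("<" + fmt, x))
    # For positive finite values, the next lower value has bit pattern bits - 1.
    return struct.unpack("<" + fmt, struct.pack(int_fmt, bits - 1))[0]

FLOAT_TO_INT32_LIMIT = largest_below(2.0**31, "f")   # 2147483520.0
DOUBLE_TO_INT64_LIMIT = largest_below(2.0**63, "d")  # 9223372036854774784.0
```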
| The encoding uses two separate multiplications (not a single multiplication by | ||
| `10^(e-f)`, and not division) to ensure that implementations produce identical | ||
| floating-point results. All implementations MUST use the exact same floating-point | ||
| arithmetic and power-of-10 constants to guarantee cross-language interoperability. |
This requirement is not actionable without specifying the actual constant values. Powers of 10 like 10^(-3) are not exactly representable in IEEE 754, and different methods of computing them (e.g., 1.0/1000.0 vs the compile-time literal 1e-3 vs pow(10, -3)) can produce different bit patterns.
Suggestion: either (a) provide a table of required constants with their exact IEEE 754 hex representations, or (b) specify that constants MUST match the values produced by the standard decimal-to-binary conversion of the literals 1e0, 1e1, ..., 1e18 and 1e-0, 1e-1, ..., 1e-18 as defined by IEEE 754-2008 §5.12.2. This makes the requirement unambiguous and testable across languages.
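As an illustration of option (a), a table of bit patterns for the DOUBLE constants could be generated as below. This is a sketch, not a proposed normative table: it relies on Python's `float()` performing correctly rounded decimal-to-binary conversion, and the names are hypothetical.

```python
import struct

def double_hex(x: float) -> str:
    """Big-endian hex of the IEEE 754 binary64 bit pattern."""
    return struct.pack(">d", x).hex()

# Keys are the exponent i; values are the bit patterns of 1e+i and 1e-i.
POW10_HEX = {i: double_hex(float(f"1e{i}")) for i in range(19)}
INV_POW10_HEX = {i: double_hex(float(f"1e-{i}")) for i in range(19)}
```

For example, `INV_POW10_HEX[3]` is `3f50624dd2f1a9fc`, the bit pattern of the double nearest to 0.001. An implementation in another language could compare its own constants against such a table in a unit test.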
| ##### Parameter Selection | ||
| The encoder selects the (exponent, factor) pair that minimizes exceptions. |
In practice, minimizing exception count alone is not optimal. A parameter pair with slightly more exceptions but a much smaller bit-width (tighter FOR range) can produce smaller output. The actual optimization target should be estimated encoded size, accounting for both bit-width and exception overhead.
Since this is an encoder-only concern (decoders are agnostic to the selection strategy), the spec should clarify that any valid (e, f) pair produces a correct encoding — the choice only affects compression ratio. Suggested rewording:
The encoder SHOULD select the (exponent, factor) pair that produces the smallest encoded output. A simple heuristic is to minimize exception count; a more precise approach accounts for both bit-width and exception overhead.
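A rough size estimate that accounts for both bit width and exception overhead could look like the sketch below. It is an assumption-laden illustration: per-vector metadata is ignored, the integer range check is omitted, Python's `round` stands in for `fast_round`, only doubles are handled, and `estimated_bytes` is a hypothetical name.

```python
import math

POW10 = [float(f"1e{i}") for i in range(19)]
INV_POW10 = [float(f"1e-{i}") for i in range(19)]

def estimated_bytes(values: list, e: int, f: int, value_size: int = 8) -> int:
    """Packed bytes plus exception bytes for one vector under a given (e, f)."""
    encoded, num_exceptions = [], 0
    for v in values:
        ok = math.isfinite(v) and not (v == 0 and math.copysign(1.0, v) < 0)
        scaled = v * POW10[e] * INV_POW10[f] if ok else 0.0
        if ok and math.isfinite(scaled):
            enc = round(scaled)
            if enc * POW10[f] * INV_POW10[e] == v:
                encoded.append(enc)
                continue
        num_exceptions += 1
    # Exceptions take a placeholder equal to an existing value, so they do not widen the range.
    bit_width = (max(encoded) - min(encoded)).bit_length() if encoded else 0
    packed = (len(values) * bit_width + 7) // 8
    # Each exception costs a 2-byte position plus the original value.
    return packed + num_exceptions * (2 + value_size)
```

For `[1.5, 2.25, 3.0]`, the pair `(2, 0)` costs 3 bytes with no exceptions, while `(0, 0)` packs into 0 bits but pays 20 bytes for two exceptions.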
Add the encoding specification for ALP (encoding value 10) to Encodings.md. ALP compresses FLOAT and DOUBLE columns by converting values to integers via decimal scaling, then applying Frame of Reference encoding and bit-packing. Values that cannot be losslessly round-tripped are stored as exceptions.
See rendered preview here: https://github.com/prtkgaur/parquet-format/blob/alpEncoding/Encodings.md#adaptive-lossless-floating-point-alp--10
The spec covers:
Based on the paper "ALP: Adaptive Lossless floating-Point Compression" (Afroozeh and Boncz, SIGMOD 2024). Wire format matches the C++ Arrow and Java parquet-java implementations.
Rationale for this change
What changes are included in this PR?
Do these changes have PoC implementations?