GH-533: Add ALP (Adaptive Lossless floating-Point) encoding specification by prtkgaur · Pull Request #557 · apache/parquet-format

prtkgaur · 2026-03-11T04:03:57Z

Add the encoding specification for ALP (encoding value 10) to Encodings.md. ALP compresses FLOAT and DOUBLE columns by converting values to integers via decimal scaling, then applying Frame of Reference encoding and bit-packing. Values that cannot be losslessly round-tripped are stored as exceptions.

Closes [Proposal] Add ALP encoding support in parquet file format #533

See rendered preview here: https://github.com/prtkgaur/parquet-format/blob/alpEncoding/Encodings.md#adaptive-lossless-floating-point-alp--10

The spec covers:

- Page layout: 7-byte header, offset array, compressed vectors
- Vector format: AlpInfo, ForInfo, packed values, exception data
- Encoding math: two-step multiplication for cross-language consistency
- Parameter selection, exception detection, and decoding steps

Based on the paper "ALP: Adaptive Lossless floating-Point Compression" (Afroozeh and Boncz, SIGMOD 2024). Wire format matches the C++ Arrow and Java parquet-java implementations.

Rationale for this change

What changes are included in this PR?

Do these changes have PoC implementations?

emkornfield · 2026-03-24T23:24:34Z

+
+##### Header (7 bytes)
+
+All multi-byte values are little-endian.


Suggested change

All multi-byte values are little-endian.

All multi-byte values are stored in little-endian order.

emkornfield · 2026-03-24T23:25:47Z

+|--------|-------|------|------|-------------|
+| 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
+| 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + bit-packing) |
+| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |


emkornfield · 2026-03-24T23:26:59Z

+
+| Offset | Field | Size | Type | Description |
+|--------|-------|------|------|-------------|
+| 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |


Do we think ALP-RD fits in well here? I forget what the extension point is and why we were OK keeping this field, but not version.

Changing the code to read a newer bit layout for ALP would be difficult to apply across all implementations, but ALP-RD is a completely new implementation.

Added a note to the field description.

emkornfield · 2026-03-24T23:27:36Z

+|--------|-------|------|------|-------------|
+| 0 | compression_mode | 1 byte | uint8 | Compression mode (must be 0 = ALP) |
+| 1 | integer_encoding | 1 byte | uint8 | Integer encoding (must be 0 = FOR + bit-packing) |
+| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |


emkornfield · 2026-03-24T23:28:58Z

+**Note:** The number of elements per vector and the packed data size are NOT stored
+in the header. They are derived:
+* Elements per vector: `vector_size` for all vectors except the last, which may be smaller.
+* Packed data size: `ceil(num_elements_in_vector * bit_width / 8)`.


bit_width isn't in the header either, so it is a little strange to call this out here?

I also highlighted this as confusing when reading the spec. I recommend removing this sentence, as the packed data size is covered in the "Vector Format" section below.

The comment regarding packed data size has been removed.

emkornfield · 2026-03-24T23:30:01Z

+
+**Note:** The number of elements per vector and the packed data size are NOT stored
+in the header. They are derived:
+* Elements per vector: `vector_size` for all vectors except the last, which may be smaller.


This seems like a slightly strange callout, since it is covered explicitly on line 457. And log_vector_size is stored in the header?

I agree this is redundant -- I think leaving the note is valuable clarification

Note: The number of elements per vector and the packed data size are NOT stored
in the header, they are derived

However the other items are not

Removed the packed data size bullet, kept the elements-per-vector clarification.
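A minimal sketch of the elements-per-vector rule discussed in this thread. It assumes the total number of values in the page is already known to the reader; the name `num_values` and the helper `vector_lengths` are hypothetical and do not come from the spec.

```python
def vector_lengths(num_values: int, log_vector_size: int) -> list[int]:
    """Return the element count of every vector in a page.

    All vectors hold `vector_size` elements except the last, which may be
    smaller. `num_values` is assumed to be supplied by the caller.
    """
    if not 3 <= log_vector_size <= 15:
        raise ValueError("log_vector_size must be in [3, 15]")
    vector_size = 1 << log_vector_size
    full_vectors, remainder = divmod(num_values, vector_size)
    lengths = [vector_size] * full_vectors
    if remainder:
        lengths.append(remainder)  # the last, shorter vector
    return lengths


print(vector_lengths(2500, 10))  # [1024, 1024, 452]
```

Only `log_vector_size` has to be read from the header; the per-vector counts follow from it.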

emkornfield · 2026-03-24T23:31:06Z

+values. Each offset gives the byte position of the corresponding vector's data,
+measured from the start of the offset array itself.
+
+The first offset equals `num_vectors * 4` (pointing just past the offset array).


Suggested change

The first offset equals `num_vectors * 4` (pointing just past the offset array).

The first offset always equals `num_vectors * 4` (pointing just past the offset array).

Let's be explicit here that we don't support padding.
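To illustrate the no-padding rule, here is a rough sketch that builds an offset array from per-vector byte sizes. It assumes each offset is a 4-byte unsigned little-endian integer (inferred from `num_vectors * 4` and the little-endian statement in the header section); the helper name `build_offset_array` is hypothetical.

```python
import struct


def build_offset_array(vector_byte_sizes: list[int]) -> bytes:
    """Serialize offsets measured from the start of the offset array.

    Assumption: offsets are uint32, little-endian, with no padding between
    vectors, so each offset is the previous offset plus the previous size.
    """
    num_vectors = len(vector_byte_sizes)
    position = num_vectors * 4  # first vector starts just past the array
    offsets = []
    for size in vector_byte_sizes:
        offsets.append(position)
        position += size
    return struct.pack(f"<{num_vectors}I", *offsets)


raw = build_offset_array([100, 80, 37])
print(struct.unpack("<3I", raw))  # (12, 112, 192)
```

With padding disallowed, a reader can also recover each vector's byte length as the difference between consecutive offsets.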

emkornfield · 2026-03-24T23:33:04Z

+Data section sizes:
+| Section             | Size Formula                | Description                  |
+|---------------------|-----------------------------|------------------------------|
+| PackedValues        | ceil(N * bit\_width / 8)    | Bit-packed delta values      |


emkornfield · 2026-03-24T23:35:09Z

+|---------------------|-----------------------------|------------------------------|
+| PackedValues        | ceil(N * bit\_width / 8)    | Bit-packed delta values      |
+| ExceptionPositions  | num\_exceptions * 2 bytes   | uint16 indices of exceptions |
+| ExceptionValues     | num\_exceptions * sizeof(T) | Original float/double values |


emkornfield · 2026-03-24T23:37:48Z

+
+The FOR-encoded deltas, bit-packed into `ceil(num_elements_in_vector * bit_width / 8)` bytes.
+Values are packed from the least significant bit of each byte to the most significant bit,
+in groups of 8 values, using the same bit-packing order as the


Where does the group of 8 values come in? Wouldn't this mess up the number-of-bytes math?

Good catch — removed 'groups of 8' phrasing. Now simply references the same LSB-first packing order as RLE/Bit-Packing Hybrid.

(I think RleBitPackHybrid was in my mind at that point :) )
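A small sketch of LSB-first bit-packing with the `ceil(n * bit_width / 8)` size rule, written without any grouping of 8 values. The function names are hypothetical; the byte pattern for the values 0 through 7 at width 3 matches the example given for the RLE/Bit-Packing Hybrid encoding.

```python
def pack_lsb_first(values: list[int], bit_width: int) -> bytes:
    """Pack unsigned values back to back, least significant bit first."""
    acc = 0
    for i, value in enumerate(values):
        if value >> bit_width:  # also rejects negative values
            raise ValueError(f"{value} does not fit in {bit_width} bits")
        acc |= value << (i * bit_width)
    num_bytes = (len(values) * bit_width + 7) // 8  # ceil(n * bit_width / 8)
    return acc.to_bytes(num_bytes, "little")


def unpack_lsb_first(data: bytes, count: int, bit_width: int) -> list[int]:
    """Inverse of pack_lsb_first for `count` values."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


packed = pack_lsb_first(list(range(8)), 3)
print(packed.hex())  # 88c6fa
```

Because the size is computed from the exact bit count, a vector of 5 values at width 3 occupies 2 bytes rather than a full 3-byte group.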

emkornfield · 2026-03-24T23:38:55Z

+The encoding uses two separate multiplications (not a single multiplication by
+`10^(e-f)`, and not division) to ensure that implementations produce identical
+floating-point rounding across languages. The powers of 10 MUST be stored as
+precomputed floating-point constants (i.e., literal values like `1e-3f`), not


Why can't they be precomputed at runtime?

You're right — this shouldn't mandate literals vs runtime computation. Reworded to require that encoder and decoder use identical power-of-10 values, without prescribing how they're obtained.

emkornfield · 2026-03-24T23:39:39Z

+| Type   | Magic Number                      | Formula                          |
+|--------|-----------------------------------|----------------------------------|
+| FLOAT  | 2^22 + 2^23 = 12,582,912         | `(int)((value + magic) - magic)` |
+| DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(long)((value + magic) - magic)` |


emkornfield · 2026-03-24T23:39:46Z

+
+| Type   | Magic Number                      | Formula                          |
+|--------|-----------------------------------|----------------------------------|
+| FLOAT  | 2^22 + 2^23 = 12,582,912         | `(int)((value + magic) - magic)` |


emkornfield · 2026-03-24T23:40:13Z

+```
+-------------------------------------------------------------------+
+|                                                                   |
+|   encoded = round( value  *  10^e  *  10^(-f) )                  |


Suggested change

| encoded = round( value * 10^e * 10^(-f) ) |

| encoded = fast_round( value * 10^e * 10^(-f) ) |

alamb

Thank you @prtkgaur and @emkornfield -- I started going through this proposal as well. I haven't made it through the entire thing, but I left comments on what I have made it through so far.

@prtkgaur how would you like to address feedback on this PR? Would you like to process the comments? I would also be happy to make a PR with proposed edits to your branch if that would be better. Please just let me know.

alamb · 2026-04-28T13:25:41Z

+ALP works by converting floating-point values to integers using decimal scaling,
+then applying Frame of Reference (FOR) encoding and bit-packing. Values that
+cannot be losslessly converted are stored as exceptions. The encoding achieves
+high compression for decimal-like floating-point data (e.g., monetary values,
+sensor readings) while remaining fully lossless.


You have more summary at the end of the encoding, but I think a few more sentences in the intro would help people understand this more easily

Suggested change

ALP works by converting floating-point values to integers using decimal scaling,
then applying Frame of Reference (FOR) encoding and bit-packing. Values that
cannot be losslessly converted are stored as exceptions. The encoding achieves
high compression for decimal-like floating-point data (e.g., monetary values,
sensor readings) while remaining fully lossless.

ALP works by converting floating-point values to integers using decimal scaling,
then applying Frame of Reference (FOR) encoding and bit-packing. Values that
cannot be losslessly converted are stored as exceptions. The encoding achieves
high compression for decimal-like floating-point data (e.g., monetary values,
sensor readings) while remaining fully lossless. Values do not depend on
each other, which enables quick random access and parallel encode/decode.

Agreed — added sentence about value independence enabling random access and parallel encode/decode.

alamb · 2026-04-29T12:59:59Z

+
+**Note:** The number of elements per vector and the packed data size are NOT stored
+in the header. They are derived:
+* Elements per vector: `vector_size` for all vectors except the last, which may be smaller.


I agree this is redundant -- I think leaving the note is valuable clarification

Note: The number of elements per vector and the packed data size are NOT stored
in the header, they are derived

However the other items are not

alamb · 2026-04-29T13:00:59Z

+**Note:** The number of elements per vector and the packed data size are NOT stored
+in the header. They are derived:
+* Elements per vector: `vector_size` for all vectors except the last, which may be smaller.
+* Packed data size: `ceil(num_elements_in_vector * bit_width / 8)`.


I also highlighted this as confusing when reading the spec. I recommend removing this sentence, as the packed data size is covered in the "Vector Format" section below.

Incorporate review comments from emkornfield and alamb on PR apache#557:

alamb · 2026-04-30T18:35:19Z

I expect to have a second round of feedback tomorrow

alamb

Ok, I made it through the spec again. I think it is really quite close

Let me know what you think about the comments and what you think about the idea of starting to draft a blog post.

alamb · 2026-05-01T14:47:25Z

-| ------------------------------------- | -------------- |
-| [Bit-packed (Deprecated)](#BITPACKED) | BIT_PACKED = 4 |
-
+| Encoding | ID | Supported Types |


Something seems strange to me about this chart. The version on main

parquet-format/Encodings.md

Lines 28 to 40 in 96edf77

### Supported Encodings

For details on current implementation status, see the [Implementation Status](https://parquet.apache.org/docs/file-format/implementationstatus/#encodings) page.

| Encoding type | Encoding enum | Supported Types |
| ------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------- |
| [Plain](#PLAIN) | PLAIN = 0 | All Physical Types |
| [Dictionary Encoding](#DICTIONARY) | PLAIN_DICTIONARY = 2 (Deprecated) <br> RLE_DICTIONARY = 8 | All Physical Types |
| [Run Length Encoding / Bit-Packing Hybrid](#RLE) | RLE = 3 | BOOLEAN, Dictionary Indices |
| [Delta Encoding](#DELTAENC) | DELTA_BINARY_PACKED = 5 | INT32, INT64 |
| [Delta-length byte array](#DELTALENGTH) | DELTA_LENGTH_BYTE_ARRAY = 6 | BYTE_ARRAY |
| [Delta Strings](#DELTASTRING) | DELTA_BYTE_ARRAY = 7 | BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY |
| [Byte Stream Split](#BYTESTREAMSPLIT) | BYTE_STREAM_SPLIT = 9 | INT32, INT64, FLOAT, DOUBLE, FIXED_LEN_BYTE_ARRAY |

Doesn't appear in this document.

I would expect this PR that adds the new ALP encoding to add a new entry to the existing table, rather than add an entirely new table.

Maybe we need to merge up from main to this branch (I am happy to do this and I think I have the permissions to push to your branch, but I don't want to mess things up on you)

You're right, this branch was forked before the summary table was added in #550. Rebased onto upstream/master and added ALP as a row to the existing table.

alamb · 2026-05-01T14:54:00Z

+
+##### Fast Rounding
+
+The rounding function uses a "magic number" technique for branchless rounding:


I think it would help to use terminology here that is consistent with the above.

Suggested change

The rounding function uses a "magic number" technique for branchless rounding:

The `fast_round` function uses a "magic number" technique for branchless rounding.

`fast_round(value)` is defined as follows:

Done. Updated to use fast_round consistently and added "is defined as follows".

alamb · 2026-05-01T15:01:03Z

+| FLOAT  | 2^22 + 2^23 = 12,582,912         | `(int32_t)((value + magic) - magic)` |
+| DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(int64_t)((value + magic) - magic)` |
+
+For negative values, the signs are reversed: `(int32_t)((value - magic) + magic)` for FLOAT, `(int64_t)((value - magic) + magic)` for DOUBLE.


I don't understand this sentence. The C++ implementation does not seem to change sign order from what I can tell:

https://github.com/apache/arrow/pull/48345/changes#diff-f9ab708cab94060b4067fff0a6739e9c3751b450422115663b2bd0badfcc748bR487-R490

    /// \brief Convert a float to an int without rounding
    static inline auto FastRound(T n) -> SignedExactType {
      n = n + Constants::kMagicNumber - Constants::kMagicNumber;
      return static_cast<SignedExactType>(n);
    }

Also the "Fast Rounding" section of the ALP paper doesn't mention sign reversal that I can find

You're right -- the sign reversal was incorrect. The C++ implementation uses a single formula for both positive and negative values (n + magic - magic). Removed the paragraph.
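For readers who want to try the single-formula `fast_round` discussed here, the following is a minimal sketch for the DOUBLE row of the table only. It relies on Python floats being IEEE 754 doubles under the default round-to-nearest-even mode; the constant name `MAGIC_DOUBLE` is hypothetical.

```python
MAGIC_DOUBLE = float(2**51 + 2**52)  # 6,755,399,441,055,744


def fast_round(value: float) -> int:
    """Round a double to an integer with one add and one subtract.

    Adding the magic number moves the value into a range where doubles are
    spaced exactly 1 apart, so the addition itself performs the rounding.
    """
    return int((value + MAGIC_DOUBLE) - MAGIC_DOUBLE)


for v in (2.4, 2.5, 3.5, -1.5, -7.6):
    print(v, fast_round(v))
# 2.4 2
# 2.5 2
# 3.5 4
# -1.5 -2
# -7.6 -8
```

Ties go to the even integer because the rounding is done by the floating-point addition. For inputs with magnitude below 2^51 the sum stays inside [2^52, 2^53), which is why the same expression handles the negative samples above.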

alamb · 2026-05-01T15:06:23Z

+(the encoded integer of the first non-exception value, or 0 if all values
+are exceptions) before FOR encoding. This keeps the FOR range tight.
+
+##### Frame of Reference and Bit-Packing


This section has a worked example of FOR / bitpacking so perhaps we could write it like

Suggested change

##### Frame of Reference and Bit-Packing

##### Example: Frame of Reference and Bit-Packing

alamb · 2026-05-01T15:06:37Z

+
+##### Frame of Reference and Bit-Packing
+
+After decimal encoding and exception substitution:


Suggested change

After decimal encoding and exception substitution:

Given the following data after decimal encoding and exception substitution:
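The exception substitution and Frame of Reference steps can be sketched as follows. This is an illustration under stated assumptions, not the spec's reference algorithm: exceptions are marked with `None`, the reference is taken to be the minimum substituted value, and the function name `for_encode` is hypothetical. The vector is assumed to be non-empty.

```python
from typing import Optional


def for_encode(encoded: list[Optional[int]]) -> tuple[int, int, list[int]]:
    """Substitute exceptions, then compute FOR reference, bit width, deltas.

    `None` marks an exception slot. Exceptions are replaced by the first
    non-exception value (or 0 if every value is an exception) so that they
    do not widen the FOR range.
    """
    placeholder = next((v for v in encoded if v is not None), 0)
    substituted = [placeholder if v is None else v for v in encoded]
    reference = min(substituted)  # assumption: reference = minimum value
    deltas = [v - reference for v in substituted]
    bit_width = max(deltas).bit_length()
    return reference, bit_width, deltas


print(for_encode([15000, None, 25000, 3333]))
# (3333, 15, [11667, 11667, 21667, 0])
```

The input uses the encoded integers quoted later in this review (15000, a NaN exception, 25000, 3333); the exception slot takes the placeholder 15000 and the largest delta, 21667, needs 15 bits.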

alamb · 2026-05-01T15:10:34Z

+           + sum(vector_bytes for each vector)   // all vectors
+```
+
+#### Constants Reference


This appears to be C/++ implementation details -- I am not sure it adds any value to the spec

alamb · 2026-05-01T15:11:19Z

+encoded size. However, ALP and Byte Stream Split can be complementary: ALP
+exploits decimal structure while Byte Stream Split exploits byte-level correlation.
+
+#### Size Calculations


I don't think we need to add this to the Parquet spec -- it isn't required to implement the encoder/decoder for ALP.

alamb · 2026-05-01T15:12:17Z

+converted are stored separately as *exceptions*. The encoding achieves high
+compression for decimal-like floating-point data (e.g., monetary values, sensor
+readings) while remaining fully lossless. Each value is encoded independently,
+enabling random access to individual vectors and parallel encode/decode.


Suggested change

enabling random access to individual vectors and parallel encode/decode.

enabling random access to individual values and parallel encode/decode.

alamb · 2026-05-01T15:21:46Z

+floating-point rounding across languages. Implementations must ensure that the
+encoder and decoder use identical power-of-10 values for a given exponent.


I don't understand what "use identical power-of-10 values" means. As I understand it, ALL encoders and decoders must use the exact same values (and floating-point arithmetic).

Yes, what you said. Reworded to: "All implementations MUST use the exact same floating-point arithmetic and power-of-10 constants to guarantee cross-language interoperability."
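A rough sketch of the two-step multiplication for DOUBLE values, with a round-trip check. It assumes the power-of-10 tables are obtained by parsing decimal literals (one possible way to pin down the constants, not something the spec text quoted here mandates) and that the decode formula is `encoded * 10^f * 10^(-e)`, as used in the worked example in this review. All names are hypothetical.

```python
# Assumption: constants come from correctly rounded decimal literals.
POW10 = [float(f"1e{i}") for i in range(19)]
NEG_POW10 = [float(f"1e-{i}") for i in range(19)]
MAGIC_DOUBLE = float(2**51 + 2**52)


def fast_round(value: float) -> int:
    return int((value + MAGIC_DOUBLE) - MAGIC_DOUBLE)


def alp_encode(value: float, e: int, f: int) -> int:
    # Two separate multiplications, evaluated left to right.
    return fast_round(value * POW10[e] * NEG_POW10[f])


def alp_decode(encoded: int, e: int, f: int) -> float:
    return float(encoded) * POW10[f] * NEG_POW10[e]


def round_trips(value: float, e: int, f: int) -> bool:
    """True if the value survives encode/decode; otherwise it is an exception.

    NaN, infinities and -0.0 must be filtered out before calling this.
    """
    return alp_decode(alp_encode(value, e, f), e, f) == value


print(alp_encode(1500.0, 4, 3))         # 15000
print(alp_decode(15000, 4, 3))          # 1500.0
print(round_trips(333.3, 4, 3))
```

Running the last line is a quick way to check any candidate example value under real double arithmetic rather than idealized decimal math.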

alamb · 2026-05-01T15:26:06Z

+To avoid the cost of exhaustive search on every vector, implementations
+SHOULD use sampling to select up to 5 candidate (exponent, factor)
+combinations (the "encoding preset") at the start of each column chunk.
+Each vector then searches only those 5 candidates.


I think SHOULD is too strong a word here. Maybe we could soften it to a suggestion

Suggested change

To avoid the cost of exhaustive search on every vector, implementations
SHOULD use sampling to select up to 5 candidate (exponent, factor)
combinations (the "encoding preset") at the start of each column chunk.
Each vector then searches only those 5 candidates.

To avoid the cost of exhaustive search on every vector, implementations
can use a sampling approach. One such approach, described in the paper, is to
select up to 5 candidate (exponent, factor) combinations (the "encoding preset")
at the start of each column chunk, and when encoding each vector,
test each of the 5 candidates for the fewest exceptions.

pitrou · 2026-05-05T20:34:34Z

+readings) while remaining fully lossless. Each value is encoded independently,
+enabling random access to individual vectors and parallel encode/decode.
+
+#### Overview


The algorithm description is so long that I think it should be moved to a separate file that we would link to from here. In this file we would just keep the description that comes before the Overview.

Yes this makes sense. I too thought it became long and having a separate file would be good.
Let me take a stab at it after I address the above comments and get approval. Else the comment threads on github will get lost.

Maybe similar to how it is done for bloom filter: https://github.com/apache/parquet-format/blob/master/BloomFilter.md

I also think it would be OK if we moved the content to a separate file in a follow-on PR (after the spec change is approved), since in my mind moving the content to a separate file does not affect the content of the spec.

Incorporate review comments from emkornfield and alamb on PR apache#557:

- Clarify no padding between vectors in offset array description
- Use 'sizeof(encoded type) (float=4 and double=8)' per reviewer suggestion
- Remove Characteristics, Size Calculations, Constants Reference sections
- Consolidate three examples into one worked example with f!=0 and exceptions
- Remove incorrect sign-reversal claim for fast_round on negative values
- Soften sampling recommendation from SHOULD to suggestion
- Fix "individual vectors" → "individual values" for random access
- Clarify power-of-10 interop as MUST requirement
- Use consistent fast_round terminology throughout

iemejia

Review focusing on spec consistency, numerical correctness under IEEE 754, and completeness for cross-language implementors.

iemejia · 2026-05-23T11:11:32Z

+| Type   | Magic Number                      | Formula                          |
+|--------|-----------------------------------|----------------------------------|
+| FLOAT  | 2^22 + 2^23 = 12,582,912         | `(int32_t)((value + magic) - magic)` |
+| DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(int64_t)((value + magic) - magic)` |


The formula shown here only covers non-negative values. For negative values, the expression value + magic may fall below the binade [2^52, 2^53) (for double), entering a region where ULP = 0.5 instead of 1.0, which produces incorrect rounding.

Known implementations (C++/DuckDB, Java) use sign branching:

    if value >= 0:
        result = (int64_t)((value + magic) - magic)
    else:
        result = (int64_t)((value - magic) + magic)

While the round-trip exception check prevents data corruption (incorrectly rounded values become exceptions), omitting sign branching causes a higher exception rate for datasets with negative values, degrading compression ratios.

Suggestion: document both branches in the spec table, or at minimum add a note that implementations SHOULD use sign branching for negative values.

iemejia · 2026-05-23T11:11:32Z

+| 0     | 1500.0  | 15000.0                | 15000   | 1500.0                             | No         |
+| 1     | NaN     | -                      | -       | -                                  | Yes (NaN)  |
+| 2     | 2500.0  | 25000.0                | 25000   | 2500.0                             | No         |
+| 3     | 333.3   | 3333.0                 | 3333    | 333.3                              | No         |


The value 333.3 with (exponent=4, factor=3) may not survive a round-trip under strict IEEE 754 arithmetic:

Encode: 333.3 * 10^4 * 10^(-3) = 333.3 * 10000.0 * 0.001. Since 0.001 is not exactly representable in IEEE 754 double, the product is approximately 3333.000000000000069..., which rounds to 3333.

Decode: 3333 * 10^3 * 10^(-4) = 3333 * 1000 * 0.0001. Since 0.0001 is not exactly representable, the product may not bit-equal 333.3.

If decode(encode(333.3)) != 333.3 at the bit level, this value should be classified as an exception. The example may be using idealized math rather than actual IEEE 754 double semantics.

Suggestion: verify this specific round-trip in an actual IEEE 754 implementation, or replace 333.3 with a value that provably round-trips (e.g., 3000.0, 500.0, or any value where the scaling produces an exact result).

iemejia · 2026-05-23T11:11:32Z

+| NaN                | `NaN`                      | Cannot convert to integer        |
+| Infinity           | `+Inf`, `-Inf`             | Cannot convert to integer        |
+| Negative zero      | `-0.0`                     | Would become `+0.0` after encoding |
+| Out of range       | value * 10^e > INT32\_MAX  | Exceeds target integer limits    |


This condition is under-specified in two ways:

For DOUBLE, the target integer type is int64, not int32. The condition should reference INT64_MAX for doubles.

The actual safe casting limit is not exactly INT32_MAX / INT64_MAX but the largest floating-point value that can be safely converted to the target integer type without undefined behavior. For float→int32 this is approximately 2147483520.0f (not 2147483647), and for double→int64 approximately 9223372036854774784.0 (not 9223372036854775807). These differ because not all integers near the max are exactly representable in the source float/double type.

Suggestion: either specify the exact limits for each type, or reword to: "the scaled value exceeds the range that can be losslessly represented in the target integer type (int32 for FLOAT, int64 for DOUBLE)."
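As an illustration of the exception categories in the table under review, here is a sketch for DOUBLE values. It uses the double-to-int64 limit quoted in this comment (9223372036854774784.0, which equals 2^63 - 1024); applying the same bound to negative values, the function name, and the returned labels are assumptions made for the example.

```python
import math
from typing import Optional

# Largest double below 2**63, as quoted in the review comment.
MAX_SAFE_INT64_AS_DOUBLE = 9223372036854774784.0


def exception_reason(value: float, e: int) -> Optional[str]:
    """Return why `value` must be stored as an exception, or None.

    Only the special-value and range checks are covered; a value that
    passes can still become an exception if it fails the round-trip check.
    """
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity"
    if value == 0.0 and math.copysign(1.0, value) < 0:
        return "Negative zero"
    scaled = value * float(f"1e{e}")
    # Assumption: the bound is applied symmetrically to negative values.
    if abs(scaled) > MAX_SAFE_INT64_AS_DOUBLE:
        return "Out of range"
    return None


for v in (1500.0, float("nan"), -0.0, 1e18):
    print(v, exception_reason(v, 4))
```

The sample run classifies 1500.0 as encodable and the other three values as exceptions.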

iemejia · 2026-05-23T11:11:32Z

+The encoding uses two separate multiplications (not a single multiplication by
+`10^(e-f)`, and not division) to ensure that implementations produce identical
+floating-point results. All implementations MUST use the exact same floating-point
+arithmetic and power-of-10 constants to guarantee cross-language interoperability.


This requirement is not actionable without specifying the actual constant values. Powers of 10 like 10^(-3) are not exactly representable in IEEE 754, and different methods of computing them (e.g., 1.0/1000.0 vs the compile-time literal 1e-3 vs pow(10, -3)) can produce different bit patterns.

Suggestion: either (a) provide a table of required constants with their exact IEEE 754 hex representations, or (b) specify that constants MUST match the values produced by the standard decimal-to-binary conversion of the literals 1e0, 1e1, ..., 1e18 and 1e-0, 1e-1, ..., 1e-18 as defined by IEEE 754-2008 §5.12.2. This makes the requirement unambiguous and testable across languages.
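One way to make option (b) testable is to print the bit patterns that decimal-literal conversion produces. The sketch below assumes the exponent range 0 to 18 mentioned in this comment and uses Python's `float.hex`, which shows the exact IEEE 754 double value; the table names are hypothetical.

```python
# Constants obtained by parsing decimal literals, which Python converts
# with correct rounding to the nearest double.
POW10 = [float(f"1e{i}") for i in range(19)]
NEG_POW10 = [float(f"1e-{i}") for i in range(19)]

for i in (1, 3):
    print(i, POW10[i].hex(), NEG_POW10[i].hex())
# 1 0x1.4000000000000p+3 0x1.999999999999ap-4
# 3 0x1.f400000000000p+9 0x1.0624dd2f1a9fcp-10
```

An implementation in another language could compare its own constants against such a listing to confirm that both sides use bit-identical values.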

iemejia · 2026-05-23T11:11:32Z

+
+##### Parameter Selection
+
+The encoder selects the (exponent, factor) pair that minimizes exceptions.


In practice, minimizing exception count alone is not optimal. A parameter pair with slightly more exceptions but a much smaller bit-width (tighter FOR range) can produce smaller output. The actual optimization target should be estimated encoded size, accounting for both bit-width and exception overhead.

Since this is an encoder-only concern (decoders are agnostic to the selection strategy), the spec should clarify that any valid (e, f) pair produces a correct encoding — the choice only affects compression ratio. Suggested rewording:

The encoder SHOULD select the (exponent, factor) pair that produces the smallest encoded output. A simple heuristic is to minimize exception count; a more precise approach accounts for both bit-width and exception overhead.
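The size-based heuristic suggested above might look like the sketch below for DOUBLE values. The cost model is an assumption built from the section sizes quoted earlier in this review (bit-packed values, plus 2 bytes of position and 8 bytes of value per exception); fixed per-vector metadata is ignored because it is the same for every candidate. All names are hypothetical.

```python
import math

POW10 = [float(f"1e{i}") for i in range(19)]
NEG_POW10 = [float(f"1e-{i}") for i in range(19)]
MAGIC_DOUBLE = float(2**51 + 2**52)


def fast_round(value: float) -> int:
    return int((value + MAGIC_DOUBLE) - MAGIC_DOUBLE)


def estimated_size(values: list[float], e: int, f: int) -> int:
    """Estimated bytes for one vector encoded with (exponent=e, factor=f)."""
    kept = []
    num_exceptions = 0
    for v in values:
        special = (math.isnan(v) or math.isinf(v)
                   or (v == 0.0 and math.copysign(1.0, v) < 0))
        if not special:
            enc = fast_round(v * POW10[e] * NEG_POW10[f])
            if float(enc) * POW10[f] * NEG_POW10[e] == v:
                kept.append(enc)
                continue
        num_exceptions += 1  # special value or failed round trip
    bit_width = (max(kept) - min(kept)).bit_length() if kept else 0
    packed_bytes = (len(values) * bit_width + 7) // 8
    return packed_bytes + num_exceptions * (2 + 8)


def select_parameters(values, candidates):
    """Pick the candidate (e, f) pair with the smallest estimated size."""
    return min(candidates, key=lambda ef: estimated_size(values, *ef))


data = [1.5, 2.25, 100.0, 3.75]
print(select_parameters(data, [(0, 0), (1, 0), (2, 0)]))  # (2, 0)
```

Any pair the function returns still yields a correct encoding; the choice only changes how many bytes the vector occupies.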

alamb changed the title ~~Add ALP (Adaptive Lossless floating-Point) encoding specification~~ GH-533: Add ALP (Adaptive Lossless floating-Point) encoding specification Mar 11, 2026

This was referenced Mar 11, 2026

GH-533: Adaptive Lossless Floating-Point (ALP) Encoding #548

Closed

[WIp] ALP encoder/decoder support apache/arrow-rs#9372

Draft

emkornfield reviewed Mar 24, 2026

alamb reviewed Apr 29, 2026

prtkgaur pushed a commit to prtkgaur/parquet-format that referenced this pull request Apr 29, 2026

Address review feedback on ALP encoding specification

c8ce8a7

Incorporate review comments from emkornfield and alamb on PR apache#557:

alamb mentioned this pull request Apr 30, 2026

[Proposal] Add ALP encoding support in parquet file format #533

Open

alamb mentioned this pull request May 1, 2026

Blog on ALP apache/parquet-site#175

Open

alamb reviewed May 1, 2026

alamb mentioned this pull request May 3, 2026

[Parquet] Prototype ALP encoding apache/arrow-rs#8748

Open

pitrou reviewed May 5, 2026

alamb mentioned this pull request May 9, 2026

Align structure names with spec sdf-jkl/arrow-rs#5

Merged

prtkgaur mentioned this pull request May 13, 2026

GH-48701: [C++][Parquet] Add ALPpd encoding apache/arrow#48345

Open

sfc-gh-pgaur added 2 commits May 14, 2026 00:47

Address review feedback on ALP encoding specification

69aaf62

Incorporate review comments from emkornfield and alamb on PR apache#557:

sfc-gh-pgaur added 2 commits May 14, 2026 00:48

Address remaining review feedback on ALP spec

095a0e5

- Clarify no padding between vectors in offset array description
- Use 'sizeof(encoded type) (float=4 and double=8)' per reviewer suggestion

prtkgaur force-pushed the alpEncoding branch from 2d8f409 to ccb6674 Compare May 14, 2026 00:50

iemejia reviewed May 23, 2026


Suggested change

| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Default: 10 (vector size 1024) |

| 2 | log_vector_size | 1 byte | uint8 | log2(vector\_size). Must be in \[3, 15\]. Recommended default: 10 (vector size 1024) |

Suggested change

| PackedValues | ceil(N * bit\_width / 8) | Bit-packed delta values |

| PackedValues | ceil(`vector_size` * bit\_width / 8) | Bit-packed delta values |

Suggested change

| ExceptionValues | num\_exceptions * sizeof(T) | Original float/double values |

| ExceptionValues | num\_exceptions * sizeof(encoded type) (float=4 and double=8) | Original float/double values |

Suggested change

| DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(long)((value + magic) - magic)` |

| DOUBLE | 2^51 + 2^52 = 6,755,399,441,055,744 | `(int64_t)((value + magic) - magic)` |

Suggested change

| FLOAT | 2^22 + 2^23 = 12,582,912 | `(int)((value + magic) - magic)` |

| FLOAT | 2^22 + 2^23 = 12,582,912 | `(int32_t)((value + magic) - magic)` |
