[Bug] percentile: DataFusion quantizes interpolation weight to 6 decimal places #4719

Description

@andygrove

Describe the bug

Comet's native percentile aggregate (PR #4542) maps to DataFusion's percentile_cont, which computes the linear interpolation weight with a quantization step:

const INTERPOLATION_PRECISION: f64 = 1_000_000.0;
let fraction = index - (lower_index as f64);
let scaled = (fraction * INTERPOLATION_PRECISION) as usize;
let weight = scaled as f64 / INTERPOLATION_PRECISION;
let interpolated_f = lower_f + (upper_f - lower_f) * weight;

This truncates the interpolation weight to 6 decimal places. Spark's exact Percentile interpolates with the full-precision fraction, (position - lower) * higherValue + (higher - position) * lowerValue, so an interpolated value can differ from Spark by up to roughly |upper - lower| * 1e-6.
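To make the size of the discrepancy concrete, here is a minimal sketch that puts the quantized weight next to a full-precision one. The function names `quantized_interp` and `exact_interp` are hypothetical and exist only for this illustration. The first mirrors the snippet quoted above; the second follows the Spark formula as described in this issue, assuming higher = lower + 1.

```rust
/// Scale factor mirroring the INTERPOLATION_PRECISION constant quoted above.
const INTERPOLATION_PRECISION: f64 = 1_000_000.0;

/// Hypothetical helper: interpolation with the weight truncated to 6 decimal
/// places, following the quantization step quoted in this issue.
fn quantized_interp(lower: f64, upper: f64, fraction: f64) -> f64 {
    let scaled = (fraction * INTERPOLATION_PRECISION) as usize;
    let weight = scaled as f64 / INTERPOLATION_PRECISION;
    lower + (upper - lower) * weight
}

/// Hypothetical helper: full-precision interpolation in the Spark-style form
/// (higher - position) * lowerValue + (position - lower) * higherValue,
/// where fraction = position - lower and higher = lower + 1.
fn exact_interp(lower: f64, upper: f64, fraction: f64) -> f64 {
    (1.0 - fraction) * lower + fraction * upper
}

fn main() {
    // A fraction with a seventh significant decimal digit and a wide gap
    // between neighbours makes the truncation visible.
    let (lower, upper, fraction) = (0.0, 1_000_000.0, 0.1234567);
    let q = quantized_interp(lower, upper, fraction);
    let e = exact_interp(lower, upper, fraction);
    println!("quantized = {q}, exact = {e}, |diff| = {}", (e - q).abs());
    println!("bound     = {}", (upper - lower).abs() * 1e-6);
}
```

With a fraction such as 0.5, which fits in 6 decimal places, the two helpers agree. The gap only opens up when the fraction carries digits beyond the sixth decimal place.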

Affected versions

Spark 3.4 / 3.5 / 4.0 / 4.1, wherever percentile(col, p) (or median, or percentile_cont ... WITHIN GROUP) maps to the native path.

Impact

Minor. The difference appears only when the fractional part of p * (n - 1) is not representable in 6 decimal places, and it is bounded by |upper - lower| * 1e-6. The cases tested in percentile.sql match Spark exactly.

Possible fix

Either contribute a higher-precision (or unquantized) interpolation upstream to DataFusion's percentile_cont, or implement a Comet-specific accumulator that matches Spark's interpolation exactly.
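As a rough illustration of the unquantized option, the sketch below computes a percentile over an already-sorted slice using the full-precision fraction, in the Spark-style form quoted earlier in this issue. It is not DataFusion's or Comet's accumulator API. The function name `percentile_exact`, the `Option` return type, and the handling of empty input and out-of-range p are all assumptions made for the example.

```rust
/// Hypothetical helper: percentile over a sorted slice with no quantization
/// of the interpolation weight. Returns None for empty input or when p is
/// outside [0, 1] (an assumption for this sketch, not Spark's error handling).
fn percentile_exact(sorted: &[f64], p: f64) -> Option<f64> {
    if sorted.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }
    // Position in the sorted values, as in the issue text: p * (n - 1).
    let position = p * (sorted.len() - 1) as f64;
    let lower = position.floor();
    let higher = position.ceil();
    let lower_value = sorted[lower as usize];
    if lower == higher {
        // The position falls exactly on an element, so nothing to interpolate.
        return Some(lower_value);
    }
    let higher_value = sorted[higher as usize];
    // Full-precision weights; no scaling to 1e6 and back.
    Some((higher - position) * lower_value + (position - lower) * higher_value)
}

fn main() {
    let values = [1.0, 2.0, 3.0, 4.0];
    println!("{:?}", percentile_exact(&values, 0.5));
    println!("{:?}", percentile_exact(&values, 1.0));
}
```

A real fix would also need to cover the parts this sketch leaves out, such as grouping, null handling, and non-f64 input types.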

Surfaced by the percentile audit accompanying #4542.
