Describe the bug
Comet's native percentile aggregate (PR #4542) maps to DataFusion's percentile_cont, which computes the linear interpolation weight with a quantization step:
const INTERPOLATION_PRECISION: f64 = 1_000_000.0;
let fraction = index - (lower_index as f64);
let scaled = (fraction * INTERPOLATION_PRECISION) as usize;
let weight = scaled as f64 / INTERPOLATION_PRECISION;
let interpolated_f = lower_f + (upper_f - lower_f) * weight;
The interpolation weight is therefore truncated to 6 decimal places. Spark's exact Percentile interpolates with the full-precision fraction, (position - lower) * higherValue + (higher - position) * lowerValue, so any result that falls between two adjacent values can differ from Spark by up to roughly (upper - lower) * 1e-6.
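To make the gap concrete, here is a minimal standalone sketch that puts the two interpolation forms side by side. The quantized function follows the snippet quoted above; the exact function follows the Spark formula as described in this issue. The function names and the input values are hypothetical, chosen only for illustration, and neither function is taken from the DataFusion or Spark source.

```rust
/// Granularity of the quantized weight, as in the snippet quoted in this issue.
const INTERPOLATION_PRECISION: f64 = 1_000_000.0;

/// Quantized interpolation: the weight is truncated to 6 decimal places.
/// (Hypothetical helper name; the body mirrors the quoted snippet.)
fn interpolate_quantized(lower: f64, upper: f64, fraction: f64) -> f64 {
    let scaled = (fraction * INTERPOLATION_PRECISION) as usize;
    let weight = scaled as f64 / INTERPOLATION_PRECISION;
    lower + (upper - lower) * weight
}

/// Full-precision interpolation in the form described for Spark:
/// (position - lower) * higherValue + (higher - position) * lowerValue.
/// For adjacent indices, (higher - position) equals 1 - fraction.
fn interpolate_exact(lower: f64, upper: f64, fraction: f64) -> f64 {
    fraction * upper + (1.0 - fraction) * lower
}

fn main() {
    // Assumed inputs: a fraction of 1/3 cannot be written in 6 decimal places,
    // and a wide gap between neighbours makes the difference visible.
    let (lower, upper) = (0.0_f64, 3_000_000.0_f64);
    let fraction = 1.0_f64 / 3.0;

    let quantized = interpolate_quantized(lower, upper, fraction);
    let exact = interpolate_exact(lower, upper, fraction);
    let bound = (upper - lower) / INTERPOLATION_PRECISION;

    println!("quantized = {quantized}");
    println!("exact     = {exact}");
    println!("diff      = {} (bound {bound})", (exact - quantized).abs());
}
```

With a fraction such as 0.5, which fits in 6 decimal places, both functions return the same value, which is consistent with the existing tests matching Spark.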
Affected versions
Spark 3.4 / 3.5 / 4.0 / 4.1, wherever percentile(col, p) (or median, or percentile_cont ... WITHIN GROUP) maps to the native path.
Impact
Minor. The difference only appears when p * (n - 1) has a fractional part not representable in 6 decimal places, and is bounded by (upper - lower) * 1e-6. The cases tested in percentile.sql match Spark exactly.
Possible fix
Either contribute a higher-precision (or unquantized) interpolation upstream to DataFusion's percentile_cont, or implement a Comet-specific accumulator that matches Spark's interpolation exactly.
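As a rough illustration of the second option, the sketch below computes an exact percentile over an already sorted slice using the full-precision formula described above. It is a simplification under stated assumptions: `percentile_exact` is a hypothetical name, the input is assumed to be sorted and free of nulls, and a real accumulator would also need to handle grouping, merging of partial state, and non-floating-point input types.

```rust
/// Hypothetical helper: exact percentile over a sorted slice, interpolating
/// with the unquantized fraction as this issue describes for Spark.
/// Returns None for empty input or for p outside [0, 1] (including NaN).
fn percentile_exact(sorted: &[f64], p: f64) -> Option<f64> {
    if sorted.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }

    // Position in the sorted data: p * (n - 1).
    let position = p * (sorted.len() - 1) as f64;
    let lower = position.floor();
    let higher = position.ceil();

    let lower_value = sorted[lower as usize];
    let higher_value = sorted[higher as usize];

    // The position lands exactly on an element: no interpolation needed.
    if lower == higher {
        return Some(lower_value);
    }

    // Full-precision interpolation, no truncation of the weight.
    Some((position - lower) * higher_value + (higher - position) * lower_value)
}

fn main() {
    let values = [1.0, 2.0, 3.0, 4.0];
    // The median of four values falls halfway between the two middle ones.
    println!("{:?}", percentile_exact(&values, 0.5));
}
```

Keeping the weight as a plain f64 removes the 1e-6 quantization step, at the cost of maintaining a separate code path from DataFusion's percentile_cont.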
Surfaced by the percentile audit accompanying #4542.