Introduce version-specific behavior in Spark expressions #21698

@andygrove

Description

Is your feature request related to a problem or challenge?

Many Spark expressions have different behavior across Spark versions. This is especially the case when comparing Spark 3.x and 4.x where there are many breaking changes.

I think it is important to start addressing this in the Spark-compatible expressions in DataFusion.

FWIW, the approach we take in Comet is that each expression has a getSupportLevel method that can return Compatible, Incompatible(reason), or Unsupported. These methods are context-aware based on the Spark version, Spark configuration (e.g. whether ANSI mode is enabled), and the specific arguments being passed to the expression. Here is an example:

override def getSupportLevel(expr: Reverse): SupportLevel = {
  if (containsBinary(expr.child.dataType)) {
    Incompatible(Some("reverse on array containing binary is not supported"))
  } else {
    Compatible(None)
  }
}

We also have a shim layer where we can implement different code per Spark version. This was necessary for Comet because there are API changes in Spark between versions and we need to compile per-version. It is simpler for DataFusion because we can just pass the Spark version (or some other flags) into the expression constructor.
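As a rough illustration of that last point, here is a minimal Rust sketch of what a version-aware support check could look like in DataFusion. The names here (SupportLevel, SparkCompatContext, reverse_support_level) are hypothetical and not existing DataFusion APIs; this just mirrors the Comet getSupportLevel example above with the Spark version passed in as plain data rather than via a shim layer.

```rust
/// Hypothetical sketch only: none of these types exist in DataFusion today.
#[derive(Debug, PartialEq)]
pub enum SupportLevel {
    Compatible,
    Incompatible(String),
    Unsupported,
}

/// Context that could be passed into a Spark-compatible expression's
/// constructor instead of compiling per Spark version.
pub struct SparkCompatContext {
    /// (major, minor), e.g. (3, 5) or (4, 0)
    pub spark_version: (u32, u32),
    /// Whether ANSI mode is enabled in the Spark configuration.
    pub ansi_mode: bool,
}

/// Mirrors the Comet `reverse` example: arrays containing binary are
/// flagged incompatible; everything else is compatible.
pub fn reverse_support_level(
    ctx: &SparkCompatContext,
    contains_binary: bool,
) -> SupportLevel {
    if contains_binary {
        SupportLevel::Incompatible(
            "reverse on array containing binary is not supported".to_string(),
        )
    } else {
        // Version- or config-specific branches would go here, e.g.
        // matching on ctx.spark_version or ctx.ansi_mode for an
        // expression whose semantics changed between 3.x and 4.x.
        let _ = (ctx.spark_version, ctx.ansi_mode);
        SupportLevel::Compatible
    }
}
```

The point of the sketch is that because DataFusion does not need per-version compilation, the version check can be ordinary runtime data on the expression rather than a compile-time shim.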

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request)