[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution#55947

Draft
zhengruifeng wants to merge 8 commits into apache:master from zhengruifeng:SC-229895-connect-col-tests

@zhengruifeng zhengruifeng commented May 18, 2026

What changes were proposed in this pull request?

Add a new SparkConnectColumnResolutionTests class in python/pyspark/sql/tests/connect/test_connect_column.py that pins Spark Connect's DataFrame column resolution behavior across both modes of spark.sql.analyzer.strictDataFrameColumnResolution, with a Classic baseline alongside each test.

The class is placed in a Connect-only suite intentionally: keeping it out of Classic-shared mixins prevents it from being removed as "diverging from Classic" during routine cleanup.

Focused per-pattern tests (14). Each test runs the same pattern against self.spark (Classic), Connect strict, and Connect lenient, and asserts the observed behavior:

  • Shadowing - Classic raises (MISSING_ATTRIBUTES), Connect strict raises (CANNOT_RESOLVE_DATAFRAME_COLUMN), Connect lenient succeeds via name-based fallback: test_resolve_after_chained_withcolumn_shadow, test_resolve_after_select_alias_shadow, test_resolve_after_agg_alias_shadow.
  • Removal - all three modes raise (column gone by name and by id): test_resolve_after_withcolumnrenamed, test_resolve_after_drop.
  • Pass-through (filter/sort/distinct) - all three modes produce the same rows: test_resolve_through_filter, test_resolve_through_sort, test_resolve_through_distinct.
  • Attribute-id propagation (groupBy/intersect/pivot/temp-view roundtrip) - all three modes succeed; Connect's plan-id resolver propagates the attribute through these operators: test_resolve_after_groupby_count, test_resolve_after_pivot, test_resolve_after_intersect, test_resolve_after_subquery_view.
  • Union divergence - Classic succeeds; Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both strict and lenient modes (no name fallback for set-op outputs): test_resolve_after_union.
  • Self-join - both Classic and Connect raise (ambiguous reference): test_resolve_self_join_alias.
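
The Connect outcomes listed above can be approximated with a small toy model. This is only a sketch inferred from the behavior described in this PR, not Spark's resolver: `Plan`, `find_plan`, `resolve`, and the numeric ids are hypothetical names invented for illustration, and Classic is not modeled. The assumption is that a tagged reference first locates its plan id in the tree, then checks whether the attribute id is still visible at the root, and falls back to a name lookup only in lenient mode.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class CannotResolveDataFrameColumn(Exception):
    """Stand-in for the CANNOT_RESOLVE_DATAFRAME_COLUMN error class."""


@dataclass
class Plan:
    """Toy logical plan node (hypothetical, not a Spark class)."""
    plan_id: int
    output: Dict[str, int]  # column name -> attribute id
    children: List["Plan"] = field(default_factory=list)
    leaf_for_lookup: bool = False  # True: the plan-id walk does not descend


def find_plan(root: Plan, plan_id: int) -> Optional[Plan]:
    """Depth-first search for the node carrying the tagged plan id."""
    if root.plan_id == plan_id:
        return root
    if root.leaf_for_lookup:
        return None
    for child in root.children:
        hit = find_plan(child, plan_id)
        if hit is not None:
            return hit
    return None


def resolve(root: Plan, plan_id: int, name: str, strict: bool) -> int:
    """Resolve a tagged reference such as df["c"] against the plan `root`."""
    tagged = find_plan(root, plan_id)
    if tagged is None:
        # Tagged plan is unreachable: fail before any name-based fallback.
        raise CannotResolveDataFrameColumn(name)
    attr_id = tagged.output.get(name)
    if attr_id is not None and attr_id in root.output.values():
        return attr_id  # the attribute id survived up to the root
    if not strict and name in root.output:
        return root.output[name]  # lenient mode: name-based fallback
    raise CannotResolveDataFrameColumn(name)


def outcome(root: Plan, plan_id: int, name: str, strict: bool) -> str:
    """Return "ok" or "error" so the cases are easy to tabulate."""
    try:
        resolve(root, plan_id, name, strict)
        return "ok"
    except CannotResolveDataFrameColumn:
        return "error"


base = Plan(1, {"id": 10, "c": 11})
shadowed = Plan(2, {"id": 10, "c": 12}, [base])  # withColumn("c", ...) yields a new attribute
filtered = Plan(3, {"id": 10, "c": 11}, [base])  # filter passes attributes through
dropped = Plan(4, {"id": 10}, [base])            # drop("c") removes name and id
other = Plan(6, {"id": 30, "c": 31})
unioned = Plan(5, {"id": 20, "c": 21}, [base, other], leaf_for_lookup=True)
```

In this model, `outcome(shadowed, 1, "c", strict=True)` is an error while the lenient call succeeds by name, `filtered` resolves in both modes, and `dropped` and `unioned` fail in both modes, which mirrors the Connect columns of the list above.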

Mixed-surface layered DataFrame programs (3). Each chains 4-5 DataFrame transformations (semi-joins standing in for SQL EXISTS/IN, window functions, cube aggregations, UDFs, and struct field access), then layers a shadowing operation on top and references the original column through a tagged df["c"] at the outermost select:

  • test_layered_semijoin_groupby_window_shadow - filter, semi-join, groupBy/agg, window functions; tagged select after groupBy. All three modes succeed.
  • test_layered_struct_semijoin_cube_ntile_shadow - filter, semi-join, struct field access, cube aggregation, NTILE window; tagged select after a withColumn shadow. Classic and Connect strict raise; Connect lenient succeeds.
  • test_layered_window_window_udf_shadow - filter, running-total window, per-partition max window, UDF wrap; tagged select after a withColumn shadow. Classic and Connect strict raise; Connect lenient succeeds.

Why are the changes needed?

#55531 added the strictDataFrameColumnResolution config and one shadowing test. This PR widens the coverage to enumerate Connect-vs-Classic behavior across shadowing variants, attribute-id-propagating operators (groupBy, intersect, pivot, temp-view roundtrip), set ops, self-joins, and multi-operator layered pipelines. Future tightening of Connect's column resolution will surface a clear test failure rather than a silent regression.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Ran the new suite locally with python/run-tests --testnames "pyspark.sql.tests.connect.test_connect_column SparkConnectColumnResolutionTests": all 17 tests pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7

…esolution

Add parity tests in test_connect_column.py and layered programs in
test_parity_dataframe.py to lock in known Connect/Classic behavior
differences in DataFrame column resolution. Tests cover shadowing,
pass-through, aggregation, pivot, set ops, self-join, subquery-as-table
patterns under both strict and non-strict modes of
spark.sql.analyzer.strictDataFrameColumnResolution, plus three
mixed-surface layered programs that combine filters, joins, aggregations,
set ops, window functions, UDFs and temp views.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
…sic baselines

Move the focused parity tests and the layered programs into a dedicated
SparkConnectColumnResolutionTests class in test_connect_column.py. Each
test now starts with a Classic baseline against self.spark to document
how Spark Classic handles the same pattern, followed by Connect strict
and Connect lenient blocks. test_parity_dataframe.py is reverted to its
prior state.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
Replace the three shallow layered programs with Reyden-influenced
fixtures that mirror the patterns referenced in SC-229895
(reyden/query-tests/golden-files/layered-query-tests):

  - 4-level subquery chain with windows, HAVING and correlated EXISTS
  - CTE chain with GROUPING SETS, NTILE, struct field access and
    correlated IN
  - Self-join via SQL with windowed running totals, correlated EXISTS
    and UDF wrapping

Each program builds the deeply layered base via spark.sql(), then
layers DataFrame-API shadowing on top with a tagged df["c"] reference
at the outermost select. Classic and Connect strict raise; Connect
lenient succeeds via name-based fallback.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
@zhengruifeng zhengruifeng changed the title from "[TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution" to "[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution" May 18, 2026
@zhengruifeng zhengruifeng marked this pull request as ready for review May 18, 2026 05:00
The Connect-specific AnalysisException subclasses the base one, so a
single import covers both. The assertRaisesRegex regex still pins the
Connect-specific CANNOT_RESOLVE_DATAFRAME_COLUMN error class.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
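
A minimal sketch of why a single import suffices in the commit above, using stand-in classes rather than the real PySpark exceptions: `AnalysisException`, `ConnectAnalysisException`, `select_shadowed_column`, and the message text are all hypothetical. It shows that `assertRaisesRegex` accepts a subclass of the expected exception type, while the regex still distinguishes one error class from another.

```python
import unittest


class AnalysisException(Exception):
    """Stand-in for the base analysis exception."""


class ConnectAnalysisException(AnalysisException):
    """Stand-in for the Connect-specific subclass."""


def select_shadowed_column():
    # Illustrative message; only the bracketed error class matters here.
    raise ConnectAnalysisException("[CANNOT_RESOLVE_DATAFRAME_COLUMN] column c")


class SubclassCatchDemo(unittest.TestCase):
    def test_base_class_catches_connect_error(self):
        # Expecting the base class also accepts the subclass.
        with self.assertRaisesRegex(AnalysisException, "CANNOT_RESOLVE_DATAFRAME_COLUMN"):
            select_shadowed_column()

    def test_regex_rejects_other_error_class(self):
        # Right exception type, wrong error class: the inner assertion fails.
        with self.assertRaises(AssertionError):
            with self.assertRaisesRegex(AnalysisException, "CANNOT_RESOLVE_DATAFRAME_COLUMN"):
                raise AnalysisException("[MISSING_ATTRIBUTES] column c")


def run_demo() -> unittest.TestResult:
    """Run the demo suite programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SubclassCatchDemo)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling `run_demo().wasSuccessful()` reports whether both checks held.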
@zhengruifeng zhengruifeng marked this pull request as draft May 18, 2026 05:53
…y assertions

- Rewrite the three mixed-surface layered tests to use only the DataFrame API
  (chained transformations, semi-joins, Window functions, cube, UDFs and
  struct field access). spark.sql() is no longer used in the layered tests.
- After running the suite locally, correct several assertions that assumed
  Classic-vs-Connect divergence where none exists: groupBy, intersect,
  pivot, and temp-view-roundtrip all preserve attribute-id propagation
  through Connect's plan-id resolver, so the tagged reference resolves in
  both strict and lenient modes.
- For union the divergence does hold: Classic succeeds but Connect raises
  CANNOT_RESOLVE_DATAFRAME_COLUMN in both strict and lenient modes (the
  name-based fallback is not triggered for set-op outputs).

Generated-by: Claude Code (Anthropic), claude-opus-4-7
Rename df / df1 / df2 variables that refer to Connect DataFrames in
SparkConnectColumnResolutionTests to cdf / cdf1 / cdf2, matching the
existing sdf convention for Classic. Drop the per-test pyspark.sql.types
imports from the layered tests; the needed types are already imported
at the top of the file.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
Based on the resolution logic in
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala,
add four tests covering behaviors not previously exercised:

- test_resolve_cross_dataframe_illegal_reference: the documented
  df1.select(df2.a) case where df2's plan id is not in df1's plan tree;
  fails with CANNOT_RESOLVE_DATAFRAME_COLUMN before any name-based
  fallback can run.
- test_resolve_df_star: plan-id-tagged star expansion via
  UnresolvedDataFrameStar; succeeds in both Classic and Connect modes.
- test_resolve_self_join_withcolumnrenamed: the documented self-join
  disambiguation example where the right-side candidate is filtered out
  by the rename projection above it.
- test_resolve_sort_missing_attr_recovery: the documented
  df.select(df.v).sort(df.id) case where Sort's reference is recovered
  by resolveExprsAndAddMissingAttrs adding id back to the upstream
  projection, in both strict and lenient modes.

Tighten comments on two existing tests:

- test_resolve_after_union: cite that Union is treated as a leaf node
  when walking the plan tree for plan-id resolution.
- test_resolve_self_join_alias: cite AMBIGUOUS_COLUMN_REFERENCE as the
  specific failure mode.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
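
The recovery exercised by test_resolve_sort_missing_attr_recovery in the commit above can be pictured with a data-level analogy. This is a sketch on plain Python dictionaries, not Catalyst code, and `select_then_sort` is a hypothetical helper: the sort key that the select dropped is added back to the projection, the rows are sorted, and the result is trimmed to the columns originally requested.

```python
from typing import Dict, List

Row = Dict[str, int]


def select_then_sort(rows: List[Row], select_cols: List[str], sort_col: str) -> List[Row]:
    """Mimic df.select(...).sort(...) when the sort key was not selected."""
    # 1. Widen the projection with the sort key if the select dropped it.
    widened = select_cols if sort_col in select_cols else select_cols + [sort_col]
    projected = [{c: r[c] for c in widened} for r in rows]
    # 2. Sort on the key that is now available.
    ordered = sorted(projected, key=lambda r: r[sort_col])
    # 3. Trim back to the columns the caller asked for.
    return [{c: r[c] for c in select_cols} for r in ordered]


rows = [{"id": 3, "v": 30}, {"id": 1, "v": 50}, {"id": 2, "v": 40}]
result = select_then_sort(rows, ["v"], "id")
# result == [{"v": 50}, {"v": 40}, {"v": 30}]
```

The output keeps only `v` but is ordered by `id`, which is the shape of result the df.select(df.v).sort(df.id) case is described as producing.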