[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution by zhengruifeng · Pull Request #55947 · apache/spark

zhengruifeng · 2026-05-18T04:04:57Z

What changes were proposed in this pull request?

Add a new SparkConnectColumnResolutionTests class in python/pyspark/sql/tests/connect/test_connect_column.py that pins Spark Connect's DataFrame column resolution behavior across both modes of spark.sql.analyzer.strictDataFrameColumnResolution, with a Classic baseline alongside each test.
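For readers unfamiliar with toggling the flag per test, here is a minimal sketch of a context manager that sets spark.sql.analyzer.strictDataFrameColumnResolution for one block and restores the previous value afterwards. The helper name `column_resolution_mode` is hypothetical and not taken from this PR; the tests themselves may rely on an existing helper from PySpark's test utilities instead. The sketch assumes only that the session exposes `conf.get(key, default)`, `conf.set` and `conf.unset`.

```python
from contextlib import contextmanager

STRICT_KEY = "spark.sql.analyzer.strictDataFrameColumnResolution"


@contextmanager
def column_resolution_mode(spark, strict):
    """Temporarily set the strict-resolution flag, then restore the old value.

    Hypothetical helper: `spark` is any object whose `conf` offers
    get(key, default), set(key, value) and unset(key).
    """
    previous = spark.conf.get(STRICT_KEY, None)
    spark.conf.set(STRICT_KEY, "true" if strict else "false")
    try:
        yield spark
    finally:
        if previous is None:
            # The key was not explicitly set before, so remove our override.
            spark.conf.unset(STRICT_KEY)
        else:
            spark.conf.set(STRICT_KEY, previous)
```

A test would then wrap each Connect block, for example `with column_resolution_mode(connect_session, strict=True): ...`, so the strict and lenient assertions cannot leak configuration into each other.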

The class is placed in a Connect-only suite intentionally: keeping it out of Classic-shared mixins prevents it from being removed as "diverging from Classic" during routine cleanup.

Focused per-pattern tests (14). Each test runs the same pattern against self.spark (Classic), Connect strict, and Connect lenient, and asserts the observed behavior:

Shadowing - Classic raises (MISSING_ATTRIBUTES), Connect strict raises (CANNOT_RESOLVE_DATAFRAME_COLUMN), Connect lenient succeeds via name-based fallback: test_resolve_after_chained_withcolumn_shadow, test_resolve_after_select_alias_shadow, test_resolve_after_agg_alias_shadow.
Removal - all three modes raise (column gone by name and by id): test_resolve_after_withcolumnrenamed, test_resolve_after_drop.
Pass-through (filter/sort/distinct) - all three modes produce the same rows: test_resolve_through_filter, test_resolve_through_sort, test_resolve_through_distinct.
Attribute-id propagation (groupBy/intersect/pivot/temp-view roundtrip) - all three modes succeed; Connect's plan-id resolver propagates the attribute through these operators: test_resolve_after_groupby_count, test_resolve_after_pivot, test_resolve_after_intersect, test_resolve_after_subquery_view.
Union divergence - Classic succeeds; Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both strict and lenient modes (no name fallback for set-op outputs): test_resolve_after_union.
Self-join - both Classic and Connect raise (ambiguous reference): test_resolve_self_join_alias.
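The Connect outcomes listed above can be illustrated with a small pure-Python toy model. This is a sketch under stated assumptions, not Spark's resolver: every name in it (`Plan`, `Attr`, `resolve`, `find_plan`, and so on) is hypothetical, it models only the Connect strict and lenient modes (not Classic, and not self-joins), and it encodes the observed behavior as three simple rules: a tagged reference resolves if its attribute survives to the top of the plan, lenient mode falls back to a name lookup when the attribute is shadowed, and nothing below a union is searched.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_plan_ids = count(1)
_attr_ids = count(1)


class ResolutionError(Exception):
    """Stand-in for an analysis error; the message carries the error class."""


@dataclass(frozen=True)
class Attr:
    name: str
    attr_id: int


@dataclass
class Plan:
    op: str
    output: List[Attr]
    children: List["Plan"] = field(default_factory=list)
    plan_id: int = field(default_factory=lambda: next(_plan_ids))


def relation(*names: str) -> Plan:
    return Plan("relation", [Attr(n, next(_attr_ids)) for n in names])


def with_column(child: Plan, name: str) -> Plan:
    # A same-named column is replaced by a fresh attribute (shadowing).
    kept = [a for a in child.output if a.name != name]
    return Plan("withColumn", kept + [Attr(name, next(_attr_ids))], [child])


def drop(child: Plan, name: str) -> Plan:
    return Plan("drop", [a for a in child.output if a.name != name], [child])


def filter_rows(child: Plan) -> Plan:
    # Pass-through operator: the output attributes are unchanged.
    return Plan("filter", list(child.output), [child])


def union(left: Plan, right: Plan) -> Plan:
    return Plan("union", list(left.output), [left, right])


def find_plan(plan: Plan, plan_id: int) -> Optional[Plan]:
    if plan.plan_id == plan_id:
        return plan
    if plan.op == "union":  # toy rule: do not search below a union
        return None
    for child in plan.children:
        hit = find_plan(child, plan_id)
        if hit is not None:
            return hit
    return None


def resolve(top: Plan, source: Plan, name: str, strict: bool) -> Attr:
    """Resolve the tagged reference source[name] against the plan `top`."""
    node = find_plan(top, source.plan_id)
    if node is None:
        # Plan id not reachable: no name fallback in either mode.
        raise ResolutionError("CANNOT_RESOLVE_DATAFRAME_COLUMN")
    tagged = next((a for a in node.output if a.name == name), None)
    if tagged is not None and tagged in top.output:
        return tagged
    if not strict:
        by_name = next((a for a in top.output if a.name == name), None)
        if by_name is not None:
            return by_name
    raise ResolutionError("CANNOT_RESOLVE_DATAFRAME_COLUMN")


base = relation("id", "c")
shadowed = with_column(base, "c")
# Lenient mode falls back to the shadowing column of the same name.
print(resolve(shadowed, base, "c", strict=False))
# A pass-through operator keeps the original attribute, so strict mode succeeds.
print(resolve(filter_rows(base), base, "c", strict=True))
```

In this model, `resolve(shadowed, base, "c", strict=True)` raises, `resolve(drop(base, "c"), base, "c", ...)` raises in both modes because the column is gone by id and by name, and a reference into either input of `union(...)` raises in both modes because the plan id is never found.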

Mixed-surface layered DataFrame programs (3). Each chains 4-5 DataFrame transformations - semi-joins (for SQL EXISTS/IN), Window functions, cube aggregations, UDFs and struct field access - then layers a shadowing operation on top with a tagged df["c"] reference at the outermost select:

test_layered_semijoin_groupby_window_shadow - filter, semi-join, groupBy/agg, window functions; tagged select after groupBy. All three modes succeed.
test_layered_struct_semijoin_cube_ntile_shadow - filter, semi-join, struct field access, cube aggregation, NTILE window; tagged select after a withColumn shadow. Classic and Connect strict raise; Connect lenient succeeds.
test_layered_window_window_udf_shadow - filter, running-total window, per-partition max window, UDF wrap; tagged select after a withColumn shadow. Classic and Connect strict raise; Connect lenient succeeds.
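To make the shape of these layered programs concrete, here is a minimal PySpark sketch of one such pipeline. It is not copied from the test file: the function name, column names and data are hypothetical, and it is shorter than the programs described above (filter, semi-join, one running-total window, then a withColumn shadow and a tagged select).

```python
def layered_shadow_program(spark):
    """Build a layered pipeline that ends in a tagged df["c"] reference.

    Hypothetical example: `spark` is an existing Classic or Connect session.
    """
    # Imported lazily so the module can be loaded without pyspark installed.
    from pyspark.sql import Window, functions as F

    df = spark.createDataFrame(
        [(1, "a", 10), (2, "a", 20), (3, "b", 30)], ["id", "k", "c"]
    )
    keys = spark.createDataFrame([("a",)], ["k"])

    running = Window.partitionBy("k").orderBy("id")
    layered = (
        df.filter(F.col("c") > 0)
        .join(keys, on="k", how="left_semi")  # DataFrame form of SQL EXISTS/IN
        .withColumn("running_total", F.sum("c").over(running))
        .withColumn("c", F.col("c") * 2)  # shadows the original column c
    )
    # Tagged reference to the original, now shadowed, column.
    return layered.select(df["c"])
```

Going by the results reported for the shadowing tests, a program of this shape would be expected to raise on Classic and on Connect strict and to succeed on Connect lenient. My understanding is that Classic analyzes eagerly, so the error would surface at the final `select`, while Connect defers analysis until the plan is sent to the server (for example on `collect()`); treat that timing as an assumption rather than something this PR states.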

Why are the changes needed?

#55531 added the strictDataFrameColumnResolution config and one shadowing test. This PR widens the coverage to enumerate Connect-vs-Classic behavior across shadowing variants, attribute-id-propagating operators (groupBy, intersect, pivot, temp-view roundtrip), set ops, self-joins, and multi-operator layered pipelines. Future tightening of Connect's column resolution will surface a clear test failure rather than a silent regression.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Ran the new suite locally with python/run-tests --testnames "pyspark.sql.tests.connect.test_connect_column SparkConnectColumnResolutionTests": all 17 tests pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7

…esolution

Add parity tests in test_connect_column.py and layered programs in test_parity_dataframe.py to lock in known Connect/Classic behavior differences in DataFrame column resolution. Tests cover shadowing, pass-through, aggregation, pivot, set ops, self-join, subquery-as-table patterns under both strict and non-strict modes of spark.sql.analyzer.strictDataFrameColumnResolution, plus three mixed-surface layered programs that combine filters, joins, aggregations, set ops, window functions, UDFs and temp views.

Generated-by: Claude Code (Anthropic), claude-opus-4-7

…sic baselines

Move the focused parity tests and the layered programs into a dedicated SparkConnectColumnResolutionTests class in test_connect_column.py. Each test now starts with a Classic baseline against self.spark to document how Spark Classic handles the same pattern, followed by Connect strict and Connect lenient blocks. test_parity_dataframe.py is reverted to its prior state.

Generated-by: Claude Code (Anthropic), claude-opus-4-7

Replace the three shallow layered programs with Reyden-influenced fixtures that mirror the patterns referenced in SC-229895 (reyden/query-tests/golden-files/layered-query-tests):
- 4-level subquery chain with windows, HAVING and correlated EXISTS
- CTE chain with GROUPING SETS, NTILE, struct field access and correlated IN
- Self-join via SQL with windowed running totals, correlated EXISTS and UDF wrapping

Each program builds the deeply layered base via spark.sql(), then layers DataFrame-API shadowing on top with a tagged df["c"] reference at the outermost select. Classic and Connect strict raise; Connect lenient succeeds via name-based fallback.

Generated-by: Claude Code (Anthropic), claude-opus-4-7

Generated-by: Claude Code (Anthropic), claude-opus-4-7

The Connect-specific AnalysisException subclasses the base one, so a single import covers both. The assertRaisesRegex regex still pins the Connect-specific CANNOT_RESOLVE_DATAFRAME_COLUMN error class.

Generated-by: Claude Code (Anthropic), claude-opus-4-7
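The reasoning in this commit message can be shown with plain unittest and stand-in exception classes. This is only an illustration: the two classes below are hypothetical substitutes for PySpark's base and Connect-specific AnalysisException, and the error messages are made up. It shows that catching the base class also accepts the subclass, while the regex keeps the assertion tied to one error class.

```python
import unittest


class AnalysisException(Exception):
    """Stand-in for the base analysis exception."""


class ConnectAnalysisException(AnalysisException):
    """Stand-in for the Connect-specific subclass."""


def resolve_tagged_column():
    # Hypothetical failure raised by the Connect side.
    raise ConnectAnalysisException("[CANNOT_RESOLVE_DATAFRAME_COLUMN] column c")


class ExampleResolutionTest(unittest.TestCase):
    def test_base_class_accepts_connect_error(self):
        # The base class in the assertion also matches the subclass instance.
        with self.assertRaisesRegex(
            AnalysisException, "CANNOT_RESOLVE_DATAFRAME_COLUMN"
        ):
            resolve_tagged_column()

    def test_regex_rejects_other_error_class(self):
        # Right exception type, wrong error class: the inner assertion fails.
        with self.assertRaises(AssertionError):
            with self.assertRaisesRegex(
                AnalysisException, "CANNOT_RESOLVE_DATAFRAME_COLUMN"
            ):
                raise AnalysisException("[MISSING_ATTRIBUTES] column c")
```

The second test is the reason the regex matters: without it, a Classic-style MISSING_ATTRIBUTES failure would satisfy the same `assertRaises(AnalysisException)` check.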

…y assertions

- Rewrite the three mixed-surface layered tests to use only the DataFrame API (chained transformations, semi-joins, Window functions, cube, UDFs and struct field access). spark.sql() is no longer used in the layered tests.
- After running the suite locally, correct several assertions that assumed Classic-vs-Connect divergence where none exists: groupBy, intersect, pivot, and temp-view-roundtrip all preserve attribute-id propagation through Connect's plan-id resolver, so the tagged reference resolves in both strict and lenient modes.
- For union the divergence does hold: Classic succeeds but Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both strict and lenient modes (the name-based fallback is not triggered for set-op outputs).

Generated-by: Claude Code (Anthropic), claude-opus-4-7

Rename df / df1 / df2 variables that refer to Connect DataFrames in SparkConnectColumnResolutionTests to cdf / cdf1 / cdf2, matching the existing sdf convention for Classic. Drop the per-test pyspark.sql.types imports from the layered tests; the needed types are already imported at the top of the file.

Generated-by: Claude Code (Anthropic), claude-opus-4-7

Based on the resolution logic in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala, add four tests covering behaviors not previously exercised:
- test_resolve_cross_dataframe_illegal_reference: the documented df1.select(df2.a) case where df2's plan id is not in df1's plan tree; fails with CANNOT_RESOLVE_DATAFRAME_COLUMN before any name-based fallback can run.
- test_resolve_df_star: plan-id-tagged star expansion via UnresolvedDataFrameStar; succeeds in both Classic and Connect modes.
- test_resolve_self_join_withcolumnrenamed: the documented self-join disambiguation example where the right-side candidate is filtered out by the rename projection above it.
- test_resolve_sort_missing_attr_recovery: the documented df.select(df.v).sort(df.id) case where Sort's reference is recovered by resolveExprsAndAddMissingAttrs adding id back to the upstream projection, in both strict and lenient modes.

Tighten comments on two existing tests:
- test_resolve_after_union: cite that Union is treated as a leaf node when walking the plan tree for plan-id resolution.
- test_resolve_self_join_alias: cite AMBIGUOUS_COLUMN_REFERENCE as the specific failure mode.

Generated-by: Claude Code (Anthropic), claude-opus-4-7

zhengruifeng added 4 commits May 18, 2026 04:04

Drop external-project reference from layered test section comment

69c9af2

Generated-by: Claude Code (Anthropic), claude-opus-4-7

zhengruifeng changed the title ~~[TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution~~ [SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution May 18, 2026

zhengruifeng marked this pull request as ready for review May 18, 2026 05:00

zhengruifeng marked this pull request as draft May 18, 2026 05:53

zhengruifeng added 3 commits May 18, 2026 06:39

zhengruifeng wants to merge 8 commits into apache:master from zhengruifeng:SC-229895-connect-col-tests
