[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution #55947
Draft
zhengruifeng wants to merge 8 commits into
…esolution
Add parity tests in test_connect_column.py and layered programs in test_parity_dataframe.py to lock in known Connect/Classic behavior differences in DataFrame column resolution. Tests cover shadowing, pass-through, aggregation, pivot, set ops, self-join, subquery-as-table patterns under both strict and non-strict modes of spark.sql.analyzer.strictDataFrameColumnResolution, plus three mixed-surface layered programs that combine filters, joins, aggregations, set ops, window functions, UDFs and temp views.
Generated-by: Claude Code (Anthropic), claude-opus-4-7
…sic baselines
Move the focused parity tests and the layered programs into a dedicated SparkConnectColumnResolutionTests class in test_connect_column.py. Each test now starts with a Classic baseline against self.spark to document how Spark Classic handles the same pattern, followed by Connect strict and Connect lenient blocks. test_parity_dataframe.py is reverted to its prior state.
Generated-by: Claude Code (Anthropic), claude-opus-4-7
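Running strict and lenient blocks inside one test means the config has to be flipped and then restored. The following is a minimal sketch of that pattern in plain Python, not code from this PR: FakeConf, conf_override and observe_modes are hypothetical names, the dict-backed conf only stands in for a real session conf, and the mapping of "true" to strict mode is an assumption.

```python
from contextlib import contextmanager

STRICT_KEY = "spark.sql.analyzer.strictDataFrameColumnResolution"


class FakeConf:
    """Hypothetical dict-backed stand-in for a session's runtime conf."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)


@contextmanager
def conf_override(conf, key, value):
    """Set key to value for the duration of the block, then restore it."""
    missing = object()
    old = conf.get(key, missing)
    conf.set(key, value)
    try:
        yield
    finally:
        if old is missing:
            conf.unset(key)
        else:
            conf.set(key, old)


def observe_modes(conf, pattern):
    """Run the same pattern in both modes and record what happened."""
    outcomes = {}
    # Assumption: "true" selects strict mode and "false" selects lenient mode.
    for label, value in (("strict", "true"), ("lenient", "false")):
        with conf_override(conf, STRICT_KEY, value):
            try:
                pattern()
                outcomes[label] = "ok"
            except Exception as exc:
                outcomes[label] = type(exc).__name__
    return outcomes
```

Restoring in a finally block keeps one test's mode from leaking into the next, even when the pattern under test raises.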
Replace the three shallow layered programs with Reyden-influenced
fixtures that mirror the patterns referenced in SC-229895
(reyden/query-tests/golden-files/layered-query-tests):
- 4-level subquery chain with windows, HAVING and correlated EXISTS
- CTE chain with GROUPING SETS, NTILE, struct field access and
correlated IN
- Self-join via SQL with windowed running totals, correlated EXISTS
and UDF wrapping
Each program builds the deeply layered base via spark.sql(), then
layers DataFrame-API shadowing on top with a tagged df["c"] reference
at the outermost select. Classic and Connect strict raise; Connect
lenient succeeds via name-based fallback.
Generated-by: Claude Code (Anthropic), claude-opus-4-7
The Connect-specific AnalysisException subclasses the base one, so a single import covers both. The assertRaisesRegex regex still pins the Connect-specific CANNOT_RESOLVE_DATAFRAME_COLUMN error class.
Generated-by: Claude Code (Anthropic), claude-opus-4-7
…y assertions
- Rewrite the three mixed-surface layered tests to use only the DataFrame API (chained transformations, semi-joins, Window functions, cube, UDFs and struct field access). spark.sql() is no longer used in the layered tests.
- After running the suite locally, correct several assertions that assumed Classic-vs-Connect divergence where none exists: groupBy, intersect, pivot, and temp-view-roundtrip all preserve attribute-id propagation through Connect's plan-id resolver, so the tagged reference resolves in both strict and lenient modes.
- For union the divergence does hold: Classic succeeds but Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both strict and lenient modes (the name-based fallback is not triggered for set-op outputs).
Generated-by: Claude Code (Anthropic), claude-opus-4-7
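The outcomes listed in these commit notes (shadowing fails in strict mode and falls back by name in lenient mode, while union fails in both) can be mimicked with a small toy model. This is a sketch in plain Python under stated assumptions, not Spark's resolver: Node, relation, with_column, union, find and resolve are hypothetical names, and the rule that the plan-id lookup does not descend below a union node is a modelling assumption.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional

_plan_ids = count(1)
_attr_ids = count(100)


class CannotResolveDataFrameColumn(Exception):
    """Toy stand-in for the CANNOT_RESOLVE_DATAFRAME_COLUMN error class."""


@dataclass
class Node:
    output: Dict[str, int]  # column name -> attribute id
    children: List["Node"] = field(default_factory=list)
    is_union: bool = False
    plan_id: int = field(default_factory=lambda: next(_plan_ids))


def relation(*names: str) -> Node:
    return Node({n: next(_attr_ids) for n in names})


def with_column(child: Node, name: str) -> Node:
    # Shadowing: the new column gets a fresh attribute id under the same name.
    out = dict(child.output)
    out[name] = next(_attr_ids)
    return Node(out, [child])


def union(left: Node, right: Node) -> Node:
    # Fresh output ids are an assumption; they do not affect the lookup below.
    return Node({n: next(_attr_ids) for n in left.output}, [left, right], is_union=True)


def find(node: Node, plan_id: int) -> Optional[Node]:
    if node.plan_id == plan_id:
        return node
    if node.is_union:
        return None  # modelling assumption: the walk stops at a union
    for child in node.children:
        hit = find(child, plan_id)
        if hit is not None:
            return hit
    return None


def resolve(root: Node, df: Node, name: str, strict: bool) -> int:
    """Resolve the tagged reference df[name] against the plan rooted at root."""
    target = find(root, df.plan_id)
    if target is None or name not in target.output:
        # Plan id not reachable: fail before any name-based fallback.
        raise CannotResolveDataFrameColumn(name)
    attr = target.output[name]
    if attr in root.output.values():
        return attr  # attribute id survived to the top of the plan
    if not strict and name in root.output:
        return root.output[name]  # lenient mode: name-based fallback
    raise CannotResolveDataFrameColumn(name)
```

In this model, resolving a shadowed column raises when strict is True and returns the shadowing column's id when strict is False, while a reference to either input of a union raises in both modes.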
Rename df / df1 / df2 variables that refer to Connect DataFrames in SparkConnectColumnResolutionTests to cdf / cdf1 / cdf2, matching the existing sdf convention for Classic. Drop the per-test pyspark.sql.types imports from the layered tests; the needed types are already imported at the top of the file.
Generated-by: Claude Code (Anthropic), claude-opus-4-7
Based on the resolution logic in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala add four tests covering behaviors not previously exercised:
- test_resolve_cross_dataframe_illegal_reference: the documented df1.select(df2.a) case where df2's plan id is not in df1's plan tree; fails with CANNOT_RESOLVE_DATAFRAME_COLUMN before any name-based fallback can run.
- test_resolve_df_star: plan-id-tagged star expansion via UnresolvedDataFrameStar; succeeds in both Classic and Connect modes.
- test_resolve_self_join_withcolumnrenamed: the documented self-join disambiguation example where the right-side candidate is filtered out by the rename projection above it.
- test_resolve_sort_missing_attr_recovery: the documented df.select(df.v).sort(df.id) case where Sort's reference is recovered by resolveExprsAndAddMissingAttrs adding id back to the upstream projection, in both strict and lenient modes.
Tighten comments on two existing tests:
- test_resolve_after_union: cite that Union is treated as a leaf node when walking the plan tree for plan-id resolution.
- test_resolve_self_join_alias: cite AMBIGUOUS_COLUMN_REFERENCE as the specific failure mode.
Generated-by: Claude Code (Anthropic), claude-opus-4-7
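The sort recovery case, df.select(df.v).sort(df.id), is easier to follow with the widening step spelled out. The sketch below imitates the idea on plain lists of dicts and is not Spark's implementation: select_then_sort is a hypothetical helper, and the final trim back to the selected columns is this sketch's assumption about how the original output schema is preserved.

```python
from typing import Dict, List

Row = Dict[str, object]


def select_then_sort(rows: List[Row], select_cols: List[str], sort_col: str) -> List[Row]:
    """Imitate select(...).sort(...) when the sort key was dropped by the select."""
    # Step 1: add the missing sort key back to the upstream projection.
    widened = list(select_cols)
    if sort_col not in widened:
        widened.append(sort_col)
    projected = [{c: row[c] for c in widened} for row in rows]

    # Step 2: sort on the recovered key.
    ordered = sorted(projected, key=lambda row: row[sort_col])

    # Step 3 (assumption): trim back so the caller still sees only select_cols.
    return [{c: row[c] for c in select_cols} for row in ordered]
```

Calling select_then_sort(rows, ["v"], "id") returns rows holding only v, ordered by the id values that the select had dropped.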
What changes were proposed in this pull request?
Add a new SparkConnectColumnResolutionTests class in python/pyspark/sql/tests/connect/test_connect_column.py that pins Spark Connect's DataFrame column resolution behavior across both modes of spark.sql.analyzer.strictDataFrameColumnResolution, with a Classic baseline alongside each test.
The class is placed in a Connect-only suite intentionally: keeping it out of Classic-shared mixins prevents it from being removed as "diverging from Classic" during routine cleanup.
Focused per-pattern tests (14). Each test runs the same pattern against self.spark (Classic), Connect strict and Connect lenient and asserts the observed behavior:
- Shadowing: Classic raises (MISSING_ATTRIBUTES), Connect strict raises (CANNOT_RESOLVE_DATAFRAME_COLUMN), Connect lenient succeeds via name-based fallback: test_resolve_after_chained_withcolumn_shadow, test_resolve_after_select_alias_shadow, test_resolve_after_agg_alias_shadow.
- Rename and drop: test_resolve_after_withcolumnrenamed, test_resolve_after_drop.
- Pass-through operators: test_resolve_through_filter, test_resolve_through_sort, test_resolve_through_distinct.
- Attribute-id-propagating operators, where the tagged reference resolves in both strict and lenient modes: test_resolve_after_groupby_count, test_resolve_after_pivot, test_resolve_after_intersect, test_resolve_after_subquery_view.
- Union: Classic succeeds, but Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both strict and lenient modes (no name fallback for set-op outputs): test_resolve_after_union.
- Self-join: test_resolve_self_join_alias.
Mixed-surface layered DataFrame programs (3). Each chains 4-5 DataFrame transformations (semi-joins for SQL EXISTS/IN, Window functions, cube aggregations, UDFs and struct field access), then layers a shadowing operation on top with a tagged df["c"] reference at the outermost select:
- test_layered_semijoin_groupby_window_shadow: filter, semi-join, groupBy/agg, window functions; tagged select after groupBy. All three modes succeed.
- test_layered_struct_semijoin_cube_ntile_shadow: filter, semi-join, struct field access, cube aggregation, NTILE window; tagged select after a withColumn shadow. Classic and Connect strict raise; Connect lenient succeeds.
- test_layered_window_window_udf_shadow: filter, running-total window, per-partition max window, UDF wrap; tagged select after a withColumn shadow. Classic and Connect strict raise; Connect lenient succeeds.
Why are the changes needed?
#55531 added the strictDataFrameColumnResolution config and one shadowing test. This PR widens the coverage to enumerate Connect-vs-Classic behavior across shadowing variants, attribute-id-propagating operators (groupBy, intersect, pivot, temp-view roundtrip), set ops, self-joins, and multi-operator layered pipelines. Future tightening of Connect's column resolution will surface a clear test failure rather than a silent regression.
Does this PR introduce any user-facing change?
No. Test-only change.
How was this patch tested?
Ran the new suite locally with python/run-tests --testnames "pyspark.sql.tests.connect.test_connect_column SparkConnectColumnResolutionTests": all 17 tests pass.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), claude-opus-4-7