Skip to content

[SPARK-46625][SQL] Place IDENTIFIER placeholder in command name slot#55949

Open
cloud-fan wants to merge 14 commits into
apache:masterfrom
cloud-fan:parser-identifier-cte-placement
Open

[SPARK-46625][SQL] Place IDENTIFIER placeholder in command name slot#55949
cloud-fan wants to merge 14 commits into
apache:masterfrom
cloud-fan:parser-identifier-cte-placement

Conversation

@cloud-fan
Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan commented May 18, 2026

What changes were proposed in this pull request?

Root-cause fix for SPARK-46625. Supersedes #55706.

Push PlanWithUnresolvedIdentifier into the command's identifier slot at parse time, for every parser-built command, instead of wrapping the whole command. With this change, CTESubstitution sees the real CTEInChildren command directly and places WithCTE on the command's children by construction -- the invalid WithCTE(InsertIntoStatement, ...) / WithCTE(CreateTableAsSelect, ...) shape never appears.

Parser placement (AstBuilder):

  • AstBuilder.withInsertInto, visitCreateTable, visitReplaceTable build the command directly. A new helper buildWriteTableSlot returns NamedRelation and serves both InsertIntoStatement.table and OverwriteByExpression.table (whose slot is typed NamedRelation). CTAS/RTAS use single-arg withIdentClause to put the placeholder in the name slot.
  • visitCacheTable's AS path uses a new helper buildCacheTableAsSelectName to build the temp-view name as an Expression. The non-AS path uses single-arg withIdentClause.

CacheTableAsSelect.tempViewName: String -> Expression:

The last identifier slot still expressed as a plain String is now an Expression. The parser produces a non-null string Literal for direct identifiers and IDENTIFIER('literal'), or an ExpressionWithUnresolvedIdentifier for IDENTIFIER(<non-literal>). The single-part invariant is validated at parse time for literal cases and at materialization for the non-literal case. CheckAnalysis enforces the post-analysis invariant that tempViewName is a non-null string Literal, and tempViewNameString extracts the string for execution. The expression slot is naturally visited by transformExpressions, so ResolveIdentifierClause's existing ExpressionWithUnresolvedIdentifier branch resolves it without any new code path.

Placeholder shape (unresolved.scala):

  • PlanWithUnresolvedIdentifier mixes in NamedRelation so it can occupy OverwriteByExpression.table (typed NamedRelation) directly.
  • It does not extend CTEInChildren. With the in-slot placement plus the CacheTableAsSelect refactor, no parser caller places the placeholder as the substitution root of a WITH ... <command> subtree, so a CTEInChildren safety net has no reachable path under the current grammar.

Materialization (ResolveIdentifierClause):

InsertIntoStatement.table and V2WriteCommand.table are non-child LogicalPlan slots (child = query), so the default resolveOperatorsUp traversal never visits placeholders inside them. Two special-cases recurse explicitly:

  • case i: InsertIntoStatement if i.table.isInstanceOf[PlanWithUnresolvedIdentifier] => ...
  • case w: V2WriteCommand if w.table.isInstanceOf[PlanWithUnresolvedIdentifier] => ...

Each case extracts the placeholder with a single asInstanceOf at the top of the body and inlines the identifierExpr.resolved && childrenResolved check, returning the unchanged command when not yet ready.

The V2WriteCommand match dispatches via the abstract withNewTable(NamedRelation) and pattern-matches the materialized result as NamedRelation, throwing an internal error otherwise. Only OverwriteByExpression is parser-built with a placeholder in table today; matching the trait keeps the rule consistent for any future analyzer-built node in the same shape.

Tree-pattern propagation (statements.scala, v2Commands.scala):

InsertIntoStatement and V2WriteCommand override getDefaultTreePatternBits to union table.treePatternBits, so containsPattern(...) pruning correctly reports patterns (PARAMETER, PLAN_WITH_UNRESOLVED_IDENTIFIER) living in table.

Parameter binding (parameters.scala):

BindParameters.bind pattern-matches both InsertIntoStatement and V2WriteCommand to recurse into table. Without this, INSERT ... IDENTIFIER(:p) and INSERT INTO REPLACE WHERE ... IDENTIFIER(:p) under spark.sql.legacy.parameterSubstitution.constantsOnly=true would fail to bind the parameter.

CreateTableAsSelect.name / ReplaceTableAsSelect.name:

Already children via V2CreateTableAsSelectPlan.childrenToAnalyze, so no extra traversal hook is needed for them.

The post-hoc WithCTE(c: CTEInChildren, _) => c.withCTEDefs(cteDefs) collapse in ResolveIdentifierClause from #55706 is removed entirely -- no command shape reaches the analyzer needing it.

Why are the changes needed?

WITH t AS (...)
INSERT [INTO|OVERWRITE] TABLE IDENTIFIER('t')
SELECT * FROM t

previously produced an analysed plan with WithCTE wrapping the InsertIntoStatement -- a structurally invalid shape. Plug-in datasources that re-analyse the InsertIntoStatement's query subtree throw NoSuchElementException: key not found. Zero-day issue since SPARK-46625.

#55706 fixed the symptom by collapsing WithCTE(c: CTEInChildren, defs) after the fact in ResolveIdentifierClause. That works but encodes an undo for a placement that should never have happened, and turned out to be unsafe: after a later analyzer rewrite (e.g. RewriteDeleteFromTable rewriting WithCTE(DeleteFromTable, defs) into WithCTE(ReplaceData, defs)), the collapse would push WithCTE into ReplaceData.query and orphan the CTE refs in ReplaceData.condition / groupFilterCondition, breaking 9 v2 DML rCTE tests on CI. The review on #55706 (#55706 (comment)) asked for the root-cause fix; this PR implements it.

Credit to @stevomitric for surfacing in the original PR thread that the parser-level placement also has to work for legacy parameter substitution -- that's why this PR adds the targeted BindParameters and tree-pattern-bits handling for InsertIntoStatement.table and V2WriteCommand.table.

Does this PR introduce any user-facing change?

No behavior change, but a minor timing change in visitCreateTable / visitReplaceTable: CTAS/RTAS validation errors (Schema may not be specified in a CTAS statement, Partition column types may not be specified..., Constraints may not be specified...) for non-literal identifiers now fire at parse time rather than at identifier resolution. Same error messages and same ctx; only the moment of throwing moves earlier. Fail-fast improvement, not a regression.

How was this patch tested?

New tests in ParametersSuite:

  • WITH ... INSERT OVERWRITE TABLE IDENTIFIER(:p) SELECT ... FROM cte
  • WITH ... INSERT INTO IDENTIFIER(:p) SELECT ... FROM cte
  • CREATE TABLE IDENTIFIER(:p) AS WITH ... SELECT ... FROM cte
  • INSERT IDENTIFIER(:p) under legacy parameter substitution (covers spark.sql.legacy.parameterSubstitution.constantsOnly=true, which exercises the BindParameters recursion into InsertIntoStatement.table)
  • WITH ... INSERT INTO IDENTIFIER(:p) REPLACE WHERE ... -- parser (asserts the placeholder lives in OverwriteByExpression.table and no WithCTE(CTEInChildren, _) shape survives CTESubstitution)
  • BindParameters recurses into OverwriteByExpression.table (rule-level test; full analysis would require a v2 catalog)
  • CACHE TABLE IDENTIFIER(...) AS WITH ... SELECT ... -- parser (asserts tempViewName holds ExpressionWithUnresolvedIdentifier and no WithCTE(CTEInChildren, _) shape survives CTESubstitution)
  • REPLACE TABLE IDENTIFIER(...) AS WITH ... SELECT ... -- parser (mirrors the CTAS test for RTAS)

Each CTE test asserts that no WithCTE(CTEInChildren, _) shape leaks through analysis.

DDLParserSuite.CACHE TABLE updated to construct CacheTableAsSelect with Literal("t") instead of "t" for the new Expression slot.

Existing suites run clean: ParametersSuite, DeltaBased/GroupBased Delete/Update/Merge suites (including the rCTE-with-DML tests that were the original CI failures), DataSourceV2DataFrameSuite, SQLViewSuite, IdentifierClauseParserSuite, AnalysisSuite / PlanParserSuite / AnalysisErrorSuite and siblings, DataSourceV2SQLSuiteV1Filter, InsertSuite, CTEInlineSuite / CTEHintSuite, CachedTableSuite, CreateTableAsSelectSuite / DDLParserSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Claude Code, Opus 4.7)

cloud-fan added 6 commits May 18, 2026 05:18
Push `PlanWithUnresolvedIdentifier` into the identifier slot of write
commands at parse time (`InsertIntoStatement.table`,
`CreateTableAsSelect.name`, `ReplaceTableAsSelect.name`) instead of
wrapping the entire command. `CTESubstitution` then sees the real
`CTEInChildren` and places `WithCTE` on the command's children by
construction, fixing the invalid `WithCTE(InsertIntoStatement, ...)` /
`WithCTE(CreateTableAsSelect, ...)` shape produced for queries like
`WITH t AS (...) INSERT INTO IDENTIFIER('t') SELECT * FROM t`.

`InsertIntoStatement.table` is a non-child `LogicalPlan` field
(`child = query`), which would prevent tree-pattern propagation and
`BindParameters.bind` from reaching placeholders inside it (breaking
legacy parameter substitution mode `INSERT ... IDENTIFIER(:p)`). Add a
small `innerPlans` / `withNewInnerPlans` hook on `LogicalPlan`, override
it on `InsertIntoStatement`, and wire it into
`LogicalPlan.getDefaultTreePatternBits`, `BindParameters.bind`, and
`ResolveIdentifierClause`. `CreateTableAsSelect.name` /
`ReplaceTableAsSelect.name` are already children via
`V2CreateTableAsSelectPlan.childrenToAnalyze`, so no hook is needed for
those.

A narrow post-hoc `WithCTE(c: CTEInChildren, _)` collapse is kept in
`ResolveIdentifierClause` for the two call sites where the identifier
slot's type prevents pushing the placeholder in directly:
`OverwriteByExpression.table: NamedRelation` (INSERT REPLACE WHERE) and
`CacheTableAsSelect.tempViewName: String`.

Supersedes apache#55706. Credit to stevomitric for surfacing the
legacy-mode `BindParameters` issue in the original PR thread.

Co-authored-by: Isaac
Replace the generic `innerPlans` / `withNewInnerPlans` hooks on
`LogicalPlan` (and the corresponding override in
`LogicalPlan.getDefaultTreePatternBits`) with direct special-cases for
`InsertIntoStatement`:

- `InsertIntoStatement` overrides `getDefaultTreePatternBits` itself to
  add `table.treePatternBits`.
- `BindParameters.bind` pattern-matches `InsertIntoStatement` to recurse
  into `table`.
- `ResolveIdentifierClause.apply0` pattern-matches
  `InsertIntoStatement(p: PlanWithUnresolvedIdentifier, ...)` to
  materialize the placeholder in the `table` slot.

`InsertIntoStatement.table` is the only non-child `LogicalPlan` slot
that needs this traversal, so the abstraction was paying generality
cost for one user.

Co-authored-by: Isaac
The innerPlans abstraction was dropped in the previous commit
(d17669d) in favor of an InsertIntoStatement special-case, but the
test-level comment still referenced the removed hook. Update the comment
to describe the actual mechanism: the BindParameters.bind special-case
plus the getDefaultTreePatternBits override on InsertIntoStatement.

Co-authored-by: Isaac
…Defs

The post-hoc `WithCTE(c: CTEInChildren, _) => c.withCTEDefs(cteDefs)`
collapse in `ResolveIdentifierClause.apply0` was firing indiscriminately
on any `WithCTE(CTEInChildren, _)` shape, not just the ones produced by
materializing a `PlanWithUnresolvedIdentifier`.

After `RewriteDeleteFromTable` (and `RewriteUpdateTable`) rewrites a
`DeleteFromTable` (non-`CTEInChildren`, wrapped in `WithCTE` by
`CTESubstitution.withCTEDefs`) into a `ReplaceData`/`WriteDelta`
(`CTEInChildren`), the next fixed-point iteration of
`ResolveIdentifierClause` ran the collapse: `WithCTE` got pushed into
the `query` child only, but `ReplaceData.condition` /
`groupFilterCondition` (non-child expression slots) still held the CTE
refs — now dangling. `InlineCTE.buildCTEMap` then threw
`NoSuchElementException` on the dangling ref's `cteId`. The failure
reproduced as 9 CI test failures across the v2 DML rCTE suites
(DeltaBased / GroupBased Delete/Update + recursive CTE).

Restrict the collapse to safe cases: only collapse when no cteDef is
referenced from a non-child expression slot of the wrapped command.
This still handles the intended `OverwriteByExpression` (REPLACE WHERE)
and `CacheTableAsSelect` cases — those don't normally have outer CTE
refs in their non-child slots — and leaves the `WithCTE(ReplaceData,
_)` shape intact post-rewrite.

Co-authored-by: Isaac
…apse

The prior fix narrowed the post-hoc `WithCTE(CTEInChildren, _) => c.withCTEDefs(cteDefs)`
collapse so it would not corrupt `WithCTE(ReplaceData, _)` shapes produced by
`RewriteDeleteFromTable`. The collapse itself was still a fallback for the two write
commands the parser refactor in 59a0637 could not yet handle natively
(`OverwriteByExpression` whose `table` is typed `NamedRelation`, and
`CacheTableAsSelect` whose name is `String`).

Eliminate the fallback entirely:

- Mix `NamedRelation` into `PlanWithUnresolvedIdentifier` so the placeholder can occupy
  `OverwriteByExpression.table` directly at parse time, mirroring the
  `InsertIntoStatement.table` treatment from 59a0637. `AstBuilder` for REPLACE WHERE
  builds the table slot via the new `buildWriteTableSlotNamedRelation` helper instead of
  wrapping the whole command.

- `ResolveIdentifierClause` materializes the slot via a new `OverwriteByExpression(p:
  PlanWithUnresolvedIdentifier, ...)` case, parallel to the existing
  `InsertIntoStatement` case. `OverwriteByExpression` gains a
  `getDefaultTreePatternBits` override (so `containsPattern` propagation reaches into
  `table`), and `BindParameters.bind` recurses into `table` (so legacy parameter
  substitution `INSERT ... REPLACE WHERE ... IDENTIFIER(:p)` resolves).

- Mix `CTEInChildren` into `PlanWithUnresolvedIdentifier` so the remaining
  `CacheTableAsSelect` call site — where the placeholder still wraps the whole command
  because `tempViewName: String` cannot hold a `LogicalPlan` — gets the right treatment
  at substitution time: `CTESubstitution.withCTEDefs` dispatches on the placeholder's
  own `withCTEDefs`, which pushes `WithCTE` into the placeholder's `children` (the query
  slot). The empty-children case falls back to wrapping the placeholder itself so
  cteDefs are not silently lost.

With the placeholder in the correct slot at parse time (or pushed into children at
substitution time for the `CacheTableAsSelect` case), no `WithCTE(CTEInChildren, _)`
shape ever reaches the analyzer for a freshly-materialized command, and the post-hoc
collapse can be deleted along with its safety helper.

Co-authored-by: Isaac
@cloud-fan cloud-fan force-pushed the parser-identifier-cte-placement branch from 24c3a68 to 1cf623a Compare May 18, 2026 12:48
cloud-fan added 7 commits May 18, 2026 13:15
…hildren

- Replace asInstanceOf[NamedRelation] in ResolveIdentifierClause and BindParameters
  with explicit pattern matches that throw a clear internal-error if the materialized
  table is not a NamedRelation.
- Convert the unreachable children.isEmpty branch in
  PlanWithUnresolvedIdentifier.withCTEDefs into an assert. The previous fallback
  WithCTE(this, cteDefs) would re-introduce the structurally invalid
  WithCTE(<command>, _) shape this PR is eliminating.
- Drop the leftover val resolved = ...; resolved wrapping in
  ResolveIdentifierClause.apply0.
- Add a ParametersSuite test that exercises BindParameters' OverwriteByExpression
  branch (and the getDefaultTreePatternBits PARAMETER-bit propagation) without
  needing a v2 catalog.

Co-authored-by: Isaac
Reword the block comment on `PlanWithUnresolvedIdentifier.withCTEDefs`: the
prior text claimed the two-arg `withIdentClause` form always supplies a
non-empty `otherPlans` (false -- `visitCacheTable` passes `Nil` when there is
no AS query) and contained a "this PR is eliminating" phrasing that doesn't
read well post-merge. Restate the reachability argument in steady-state terms
based on the grammar.

Add a parser-level test mirroring the existing REPLACE WHERE one for the RTAS
path: asserts the placeholder lands in `ReplaceTableAsSelect.name` and that no
`WithCTE(CTEInChildren, _)` shape survives `CTESubstitution`. Parser-only
because running RTAS through full analysis requires a v2 catalog.

Co-authored-by: Isaac
`OverwriteByExpression` is the only `V2WriteCommand` parser-built with a
placeholder in `table` today, but every `V2WriteCommand` shares the same
`child = query` shape, so `table` is a non-child slot for all of them. Match
the trait in the related analyzer rules so the invariant stays uniform if any
future analyzer-built node lands a placeholder in the same slot:

- Move the `getDefaultTreePatternBits` override from `OverwriteByExpression`
  up to the `V2WriteCommand` trait.
- Widen the `ResolveIdentifierClause` materialization case from
  `OverwriteByExpression` to `V2WriteCommand`, dispatching via
  `w.withNewTable(...)`.
- Widen the `BindParameters.bind` table-slot recursion the same way.

Behavior is unchanged today since `OverwriteByExpression` is still the only
parser-built `V2WriteCommand` carrying a placeholder.

Co-authored-by: Isaac
… safety

Tighten the assertion in `PlanWithUnresolvedIdentifier.withCTEDefs` from
`children.nonEmpty` to `children.length == 1` to encode the actual invariant
under which the override is safe.

Rewrite the comment to explain *why* extending `CTEInChildren` is safe: with
exactly one child the placeholder acts as a transparent stand-in for its
eventual command, and wrapping the single child with `WithCTE` reproduces the
same shape the command's own `withCTEDefs` would produce (e.g. for
`CacheTableAsSelect`, identical to `copy(plan = WithCTE(plan, cteDefs))`).
Spell out the two ways a non-single-child case would break the equivalence so
the assertion's failure mode is self-explanatory.

Update the class-level Scaladoc accordingly.

Co-authored-by: Isaac
…CTEInChildren

`CacheTableAsSelect.tempViewName` was the last identifier slot still expressed
as a plain `String`, which forced `visitCacheTable` to use the wrap-the-whole-
command form of `withIdentClause` -- the same shape that motivated the
`WithCTE(<command>, _)` workaround chain in SPARK-46625.

Change `tempViewName: String` to `tempViewName: Expression`. The parser produces
either a non-null string `Literal` (for direct identifiers and
`IDENTIFIER('literal')`) or an `ExpressionWithUnresolvedIdentifier` (for
`IDENTIFIER(<non-literal>)`). The expression slot is naturally visited by
`transformExpressions`, so `ResolveIdentifierClause`'s existing
`ExpressionWithUnresolvedIdentifier` branch resolves it without any new code
path. `CheckAnalysis` enforces the post-analysis invariant that the slot is a
non-null string literal, and `CacheTableAsSelect.tempViewNameString` extracts
the string for execution.

With this change, no parser caller places a `PlanWithUnresolvedIdentifier` as
the substitution root of a `WITH ... <command>` subtree. The
`CTEInChildren` extension on `PlanWithUnresolvedIdentifier` therefore has no
reachable code path under the current grammar and is removed along with its
`withCTEDefs` override. The class-level Scaladoc is updated accordingly.

Add a parser-level test for
`CACHE TABLE IDENTIFIER(<non-literal>) AS WITH ... SELECT ...` asserting the
expression placeholder lives in the name slot and no `WithCTE(CTEInChildren, _)`
shape survives `CTESubstitution`. Update `DDLParserSuite` to construct
`CacheTableAsSelect` with `Literal("t")` instead of `"t"`.

Co-authored-by: Isaac
…Slot return type; comment fixes

- ResolveIdentifierClause: replace the 9-field positional `InsertIntoStatement`
  pattern with the typed-guard shape already used for `V2WriteCommand`, so the
  rule survives field additions/reorderings on `InsertIntoStatement`.
- AstBuilder: type `buildWriteTableSlot` to return `NamedRelation` directly
  (both branches already do); delete the casting wrapper
  `buildWriteTableSlotNamedRelation` and use `buildWriteTableSlot` at the
  `OverwriteByExpression` callsite.
- Comments: `CacheTableAsSelect.name` -> `tempViewName` (field name correction)
  in `AstBuilder.visitCacheTable` and the `buildCacheTableAsSelectName`
  Scaladoc; fix the example SQL in `BindParameters` from
  `INSERT INTO REPLACE WHERE ... IDENTIFIER(:p)` to
  `INSERT INTO IDENTIFIER(:p) REPLACE WHERE ...`.

Co-authored-by: Isaac
If `tempViewName` is still an unresolved `ExpressionWithUnresolvedIdentifier`
(e.g. `CACHE TABLE IDENTIFIER(<unresolved-col>) AS SELECT ...`), the case
should not fire -- otherwise it throws an internal error and pre-empts the
catch-all `LogicalPlan` case from raising the proper `UNRESOLVED_COLUMN`
user-facing error.

Co-authored-by: Isaac
@cloud-fan
Copy link
Copy Markdown
Contributor Author

cc @peter-toth @gengliangwang

Copy link
Copy Markdown
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

SPARK-46625 is fixed at the root: instead of wrapping the entire write/CTAS/RTAS/CACHE-TABLE command in PlanWithUnresolvedIdentifier and post-hoc collapsing WithCTE(CTEInChildren, _) after materialization (the predecessor PR #55706's approach, which broke under later analyzer rewrites like RewriteDeleteFromTable), this PR pushes the placeholder directly into the command's identifier slot at parse time. With the command itself sitting at the top, CTESubstitution.withCTEDefs immediately sees a CTEInChildren and places WithCTE on the command's children by construction — the invalid WithCTE(InsertIntoStatement, ...) shape becomes structurally unreachable.

Prior state. The parser's withIdentClause(ctx, otherPlans, builder) unconditionally wrapped the produced command in a PlanWithUnresolvedIdentifier for non-literal IDENTIFIER. For write/CTAS/RTAS commands, this put the placeholder at the substitution root of a WITH ... <command> subtree. Because PlanWithUnresolvedIdentifier is not a CTEInChildren, CTESubstitution.withCTEDefs fell into the default WithCTE(p, cteDefs) branch, wrapping the placeholder. When ResolveIdentifierClause later materialized into a real CTEInChildren command, the surviving WithCTE was now wrapping a CTEInChildren — the invalid shape that breaks plug-in datasources re-analyzing InsertIntoStatement.query with NoSuchElementException. #55706's after-the-fact collapse worked for direct DML but broke after RewriteDeleteFromTable rewrote WithCTE(DeleteFromTable, defs) into WithCTE(ReplaceData, defs), where the collapse pushed WithCTE into ReplaceData.query and orphaned CTE refs in condition / groupFilterCondition.

Design approach. Mirror what DELETE/UPDATE/MERGE already do — place the placeholder inside the command's identifier slot so the command sits at the top. Concretely:

  • InsertIntoStatement.table / OverwriteByExpression.table: new helper buildWriteTableSlot returns UnresolvedRelation for literal/direct identifiers and a PlanWithUnresolvedIdentifier for non-literal IDENTIFIER. PlanWithUnresolvedIdentifier now mixes in NamedRelation so it can occupy the NamedRelation-typed OverwriteByExpression.table slot directly.
  • CreateTableAsSelect.name / ReplaceTableAsSelect.name: single-arg withIdentClause(ctx, parts => UnresolvedIdentifier(parts)). These are already children via V2CreateTableAsSelectPlan.childrenToAnalyze.
  • CacheTableAsSelect.tempViewName: String → Expression: a string Literal for direct identifiers and IDENTIFIER('literal'), an ExpressionWithUnresolvedIdentifier for IDENTIFIER(<non-literal>). Avoids the wrap-whole-command form.

Key design decisions.

  • PlanWithUnresolvedIdentifier extends NamedRelation — required for OverwriteByExpression.table: NamedRelation to typecheck. name returns the SQL text of the identifier expression as a placeholder string.
  • PlanWithUnresolvedIdentifier does not extend CTEInChildren — with in-slot placement no parser caller places the placeholder as the substitution root, so the trait would have no reachable use.
  • Two analyzer rules need explicit recursion for non-child table slots: ResolveIdentifierClause (materialization) and BindParameters (parameter binding). InsertIntoStatement and V2WriteCommand override getDefaultTreePatternBits to union table.treePatternBits so pruning correctly reaches these nodes.
  • CheckAnalysis validates the post-analysis shape of CacheTableAsSelect.tempViewName (must be a non-null string Literal); tempViewNameString extracts the string at execution time.

Implementation sketch. Touches parser (AstBuilder.withInsertInto, visitCreateTable, visitReplaceTable, visitCacheTable; new helpers buildWriteTableSlot and buildCacheTableAsSelectName); plan nodes (PlanWithUnresolvedIdentifier, InsertIntoStatement, V2WriteCommand for tree-pattern bits; CacheTableAsSelect for the new Expression slot); analyzer rules (ResolveIdentifierClause, BindParameters for explicit recursion; CheckAnalysis for invariant validation); strategy (DataSourceV2Strategy calls tempViewNameString). The post-hoc WithCTE collapse from #55706 is removed.

General

The PR description references a helper buildWriteTableSlotNamedRelation ("for OverwriteByExpression.table"), but the code only adds one helper, buildWriteTableSlot, which returns NamedRelation and serves both InsertIntoStatement.table and OverwriteByExpression.table. The description also shows the ResolveIdentifierClause case as case i @ InsertIntoStatement(p: PlanWithUnresolvedIdentifier, ...) ..., but the actual code uses isInstanceOf/asInstanceOf (see inline below). Worth aligning the description with the code so future readers don't search for a non-existent helper or pattern.


final override val nodePatterns: Seq[TreePattern] = Seq(PLAN_WITH_UNRESOLVED_IDENTIFIER)

override def name: String = identifierExpr.sql
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PlanWithUnresolvedIdentifier is now reachable as a NamedRelation via two error paths I traced: SparkStrategies.extractTableNameForError (sql/core/.../SparkStrategies.scala:1135) and the r: NamedRelation => toSQLId(r.name) fallback in QueryCompilationErrors.scala:3635. Both will render the SQL text of the unresolved expression (e.g. IDENTIFIER(:p) or concat('a', 'b')) as the "table name" in error messages. That's a reasonable fallback, but worth a one-line comment on this override noting that identifierExpr.sql is intentionally an unresolved-placeholder stand-in for those error paths.

Comment on lines +79 to +88
case i: InsertIntoStatement if i.table.isInstanceOf[PlanWithUnresolvedIdentifier] && {
val p = i.table.asInstanceOf[PlanWithUnresolvedIdentifier]
p.identifierExpr.resolved && p.childrenResolved
} =>
val p = i.table.asInstanceOf[PlanWithUnresolvedIdentifier]
if (referredTempVars.isDefined) {
referredTempVars.get ++= collectTemporaryVariablesInLogicalPlan(p)
}
i.copy(table = executor.execute(p.planBuilder.apply(
IdentifierResolution.evalIdentifierExpr(p.identifierExpr), p.children)))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description shows this as case i @ InsertIntoStatement(p: PlanWithUnresolvedIdentifier, ...) ..., but the code uses isInstanceOf in the guard and asInstanceOf in both the guard and the body (the cast appears twice). Either update the description, or extract the cast once — e.g. guard with i.table.isInstanceOf[PlanWithUnresolvedIdentifier], then do the identifierExpr.resolved && childrenResolved check and the materialization in the body, returning i unchanged when not yet resolved. The same applies to the V2WriteCommand case below.

// that the name slot holds the expression placeholder and no `WithCTE(CTEInChildren, _)` shape
// survives `CTESubstitution` (running the cache through full analysis would require the temp
// view machinery, so this is a parser-level test).
test("SPARK-46625: CACHE TABLE IDENTIFIER(...) AS WITH ... SELECT ... parser") {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End-to-end coverage gap: the CACHE TABLE / REPLACE TABLE / INSERT INTO IDENTIFIER REPLACE WHERE tests are all parser-level. For CACHE TABLE specifically an e2e test doesn't need v2-catalog or temp-view machinery beyond what SharedSparkSession already provides — CACHE TABLE IDENTIFIER('temp_v') AS WITH ... SELECT ... should resolve and execute. That would exercise the tempViewNameString extraction in DataSourceV2Strategy:781 and the new CheckAnalysis invariant case for CacheTableAsSelect, neither of which any test currently reaches.

…e2e cache test

- Refactor ResolveIdentifierClause: lift the PlanWithUnresolvedIdentifier cast
  out of a compound guard into a single asInstanceOf at the top of the case
  body; resolution check moves to an inner if/else with `i`/`w` returned
  unchanged when not yet resolved. Same shape for InsertIntoStatement and
  V2WriteCommand cases.
- Document `PlanWithUnresolvedIdentifier.name = identifierExpr.sql` as the
  error-path placeholder used by SparkStrategies.extractTableNameForError and
  the `r: NamedRelation` fallback in QueryCompilationErrors.
- Add an end-to-end ParametersSuite test for CACHE TABLE IDENTIFIER(:p) AS
  WITH ... SELECT ..., exercising the tempViewNameString extraction in
  DataSourceV2Strategy and the CheckAnalysis invariant case for
  CacheTableAsSelect.tempViewName.

Co-authored-by: Isaac
Copy link
Copy Markdown
Contributor

@peter-toth peter-toth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only one nit and seems like there is a linter issue.

messageParameters = Map("name" -> "IDENTIFIER", "expr" -> p.identifierExpr.sql)
)

case c: CacheTableAsSelect if c.tempViewName.resolved =>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if c.tempViewName.resolved guard was added in commit 90f09a9 specifically to keep this invariant check from pre-empting the proper UNRESOLVED_COLUMN user-facing error when IDENTIFIER(<unresolved-col>) itself fails to resolve. The comment documents the intent well, but there's no test that pins this behavior — a future change that drops the guard (e.g. case c: CacheTableAsSelect => without the guard) would silently swap the user-facing error for a SparkException internal error, with nothing in CI to catch it.

Worth adding a regression test alongside the SPARK-46625 cache-table tests in ParametersSuite. Rough shape:

test("SPARK-46625: CACHE TABLE IDENTIFIER(<unresolved-col>) reports UNRESOLVED_COLUMN, not an internal error") {
  val ex = intercept[AnalysisException] {
    spark.sql(
      """CACHE TABLE IDENTIFIER(unresolved_col) AS SELECT 1 AS a""".stripMargin)
  }
  assert(ex.getCondition.startsWith("UNRESOLVED_COLUMN"))
  // and explicitly NOT the internal-error condition the invariant case throws
  assert(!ex.getMessage.contains("CacheTableAsSelect.tempViewName must be"))
}

That pins both halves of the design — the guard fires for unresolved IDENTIFIERs (so the catch-all LogicalPlan case below produces UNRESOLVED_COLUMN), AND the case below it stays the catch-all, not pre-empted by this branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants