[SPARK-46625][SQL] Place IDENTIFIER placeholder in command name slot#55949
[SPARK-46625][SQL] Place IDENTIFIER placeholder in command name slot#55949cloud-fan wants to merge 14 commits into
Conversation
Push `PlanWithUnresolvedIdentifier` into the identifier slot of write
commands at parse time (`InsertIntoStatement.table`,
`CreateTableAsSelect.name`, `ReplaceTableAsSelect.name`) instead of
wrapping the entire command. `CTESubstitution` then sees the real
`CTEInChildren` and places `WithCTE` on the command's children by
construction, fixing the invalid `WithCTE(InsertIntoStatement, ...)` /
`WithCTE(CreateTableAsSelect, ...)` shape produced for queries like
`WITH t AS (...) INSERT INTO IDENTIFIER('t') SELECT * FROM t`.
`InsertIntoStatement.table` is a non-child `LogicalPlan` field
(`child = query`), which would prevent tree-pattern propagation and
`BindParameters.bind` from reaching placeholders inside it (breaking
legacy parameter substitution mode `INSERT ... IDENTIFIER(:p)`). Add a
small `innerPlans` / `withNewInnerPlans` hook on `LogicalPlan`, override
it on `InsertIntoStatement`, and wire it into
`LogicalPlan.getDefaultTreePatternBits`, `BindParameters.bind`, and
`ResolveIdentifierClause`. `CreateTableAsSelect.name` /
`ReplaceTableAsSelect.name` are already children via
`V2CreateTableAsSelectPlan.childrenToAnalyze`, so no hook is needed for
those.
A narrow post-hoc `WithCTE(c: CTEInChildren, _)` collapse is kept in
`ResolveIdentifierClause` for the two call sites where the identifier
slot's type prevents pushing the placeholder in directly:
`OverwriteByExpression.table: NamedRelation` (INSERT REPLACE WHERE) and
`CacheTableAsSelect.tempViewName: String`.
Supersedes apache#55706. Credit to stevomitric for surfacing the
legacy-mode `BindParameters` issue in the original PR thread.
Co-authored-by: Isaac
Replace the generic `innerPlans` / `withNewInnerPlans` hooks on `LogicalPlan` (and the corresponding override in `LogicalPlan.getDefaultTreePatternBits`) with direct special-cases for `InsertIntoStatement`: - `InsertIntoStatement` overrides `getDefaultTreePatternBits` itself to add `table.treePatternBits`. - `BindParameters.bind` pattern-matches `InsertIntoStatement` to recurse into `table`. - `ResolveIdentifierClause.apply0` pattern-matches `InsertIntoStatement(p: PlanWithUnresolvedIdentifier, ...)` to materialize the placeholder in the `table` slot. `InsertIntoStatement.table` is the only non-child `LogicalPlan` slot that needs this traversal, so the abstraction was paying generality cost for one user. Co-authored-by: Isaac
The innerPlans abstraction was dropped in the previous commit (d17669d) in favor of an InsertIntoStatement special-case, but the test-level comment still referenced the removed hook. Update the comment to describe the actual mechanism: the BindParameters.bind special-case plus the getDefaultTreePatternBits override on InsertIntoStatement. Co-authored-by: Isaac
…Defs The post-hoc `WithCTE(c: CTEInChildren, _) => c.withCTEDefs(cteDefs)` collapse in `ResolveIdentifierClause.apply0` was firing indiscriminately on any `WithCTE(CTEInChildren, _)` shape, not just the ones produced by materializing a `PlanWithUnresolvedIdentifier`. After `RewriteDeleteFromTable` (and `RewriteUpdateTable`) rewrites a `DeleteFromTable` (non-`CTEInChildren`, wrapped in `WithCTE` by `CTESubstitution.withCTEDefs`) into a `ReplaceData`/`WriteDelta` (`CTEInChildren`), the next fixed-point iteration of `ResolveIdentifierClause` ran the collapse: `WithCTE` got pushed into the `query` child only, but `ReplaceData.condition` / `groupFilterCondition` (non-child expression slots) still held the CTE refs — now dangling. `InlineCTE.buildCTEMap` then threw `NoSuchElementException` on the dangling ref's `cteId`. The failure reproduced as 9 CI test failures across the v2 DML rCTE suites (DeltaBased / GroupBased Delete/Update + recursive CTE). Restrict the collapse to safe cases: only collapse when no cteDef is referenced from a non-child expression slot of the wrapped command. This still handles the intended `OverwriteByExpression` (REPLACE WHERE) and `CacheTableAsSelect` cases — those don't normally have outer CTE refs in their non-child slots — and leaves the `WithCTE(ReplaceData, _)` shape intact post-rewrite. Co-authored-by: Isaac
…apse The prior fix narrowed the post-hoc `WithCTE(CTEInChildren, _) => c.withCTEDefs(cteDefs)` collapse so it would not corrupt `WithCTE(ReplaceData, _)` shapes produced by `RewriteDeleteFromTable`. The collapse itself was still a fallback for the two write commands the parser refactor in 59a0637 could not yet handle natively (`OverwriteByExpression` whose `table` is typed `NamedRelation`, and `CacheTableAsSelect` whose name is `String`). Eliminate the fallback entirely: - Mix `NamedRelation` into `PlanWithUnresolvedIdentifier` so the placeholder can occupy `OverwriteByExpression.table` directly at parse time, mirroring the `InsertIntoStatement.table` treatment from 59a0637. `AstBuilder` for REPLACE WHERE builds the table slot via the new `buildWriteTableSlotNamedRelation` helper instead of wrapping the whole command. - `ResolveIdentifierClause` materializes the slot via a new `OverwriteByExpression(p: PlanWithUnresolvedIdentifier, ...)` case, parallel to the existing `InsertIntoStatement` case. `OverwriteByExpression` gains a `getDefaultTreePatternBits` override (so `containsPattern` propagation reaches into `table`), and `BindParameters.bind` recurses into `table` (so legacy parameter substitution `INSERT ... REPLACE WHERE ... IDENTIFIER(:p)` resolves). - Mix `CTEInChildren` into `PlanWithUnresolvedIdentifier` so the remaining `CacheTableAsSelect` call site — where the placeholder still wraps the whole command because `tempViewName: String` cannot hold a `LogicalPlan` — gets the right treatment at substitution time: `CTESubstitution.withCTEDefs` dispatches on the placeholder's own `withCTEDefs`, which pushes `WithCTE` into the placeholder's `children` (the query slot). The empty-children case falls back to wrapping the placeholder itself so cteDefs are not silently lost. With the placeholder in the correct slot at parse time (or pushed into children at substitution time for the `CacheTableAsSelect` case), no `WithCTE(CTEInChildren, _)` shape ever reaches the analyzer for a freshly-materialized command, and the post-hoc collapse can be deleted along with its safety helper. Co-authored-by: Isaac
24c3a68 to
1cf623a
Compare
…hildren - Replace asInstanceOf[NamedRelation] in ResolveIdentifierClause and BindParameters with explicit pattern matches that throw a clear internal-error if the materialized table is not a NamedRelation. - Convert the unreachable children.isEmpty branch in PlanWithUnresolvedIdentifier.withCTEDefs into an assert. The previous fallback WithCTE(this, cteDefs) would re-introduce the structurally invalid WithCTE(<command>, _) shape this PR is eliminating. - Drop the leftover val resolved = ...; resolved wrapping in ResolveIdentifierClause.apply0. - Add a ParametersSuite test that exercises BindParameters' OverwriteByExpression branch (and the getDefaultTreePatternBits PARAMETER-bit propagation) without needing a v2 catalog. Co-authored-by: Isaac
Reword the block comment on `PlanWithUnresolvedIdentifier.withCTEDefs`: the prior text claimed the two-arg `withIdentClause` form always supplies a non-empty `otherPlans` (false -- `visitCacheTable` passes `Nil` when there is no AS query) and contained a "this PR is eliminating" phrasing that doesn't read well post-merge. Restate the reachability argument in steady-state terms based on the grammar. Add a parser-level test mirroring the existing REPLACE WHERE one for the RTAS path: asserts the placeholder lands in `ReplaceTableAsSelect.name` and that no `WithCTE(CTEInChildren, _)` shape survives `CTESubstitution`. Parser-only because running RTAS through full analysis requires a v2 catalog. Co-authored-by: Isaac
`OverwriteByExpression` is the only `V2WriteCommand` parser-built with a placeholder in `table` today, but every `V2WriteCommand` shares the same `child = query` shape, so `table` is a non-child slot for all of them. Match the trait in the related analyzer rules so the invariant stays uniform if any future analyzer-built node lands a placeholder in the same slot: - Move the `getDefaultTreePatternBits` override from `OverwriteByExpression` up to the `V2WriteCommand` trait. - Widen the `ResolveIdentifierClause` materialization case from `OverwriteByExpression` to `V2WriteCommand`, dispatching via `w.withNewTable(...)`. - Widen the `BindParameters.bind` table-slot recursion the same way. Behavior is unchanged today since `OverwriteByExpression` is still the only parser-built `V2WriteCommand` carrying a placeholder. Co-authored-by: Isaac
… safety Tighten the assertion in `PlanWithUnresolvedIdentifier.withCTEDefs` from `children.nonEmpty` to `children.length == 1` to encode the actual invariant under which the override is safe. Rewrite the comment to explain *why* extending `CTEInChildren` is safe: with exactly one child the placeholder acts as a transparent stand-in for its eventual command, and wrapping the single child with `WithCTE` reproduces the same shape the command's own `withCTEDefs` would produce (e.g. for `CacheTableAsSelect`, identical to `copy(plan = WithCTE(plan, cteDefs))`). Spell out the two ways a non-single-child case would break the equivalence so the assertion's failure mode is self-explanatory. Update the class-level Scaladoc accordingly. Co-authored-by: Isaac
…CTEInChildren
`CacheTableAsSelect.tempViewName` was the last identifier slot still expressed
as a plain `String`, which forced `visitCacheTable` to use the wrap-the-whole-
command form of `withIdentClause` -- the same shape that motivated the
`WithCTE(<command>, _)` workaround chain in SPARK-46625.
Change `tempViewName: String` to `tempViewName: Expression`. The parser produces
either a non-null string `Literal` (for direct identifiers and
`IDENTIFIER('literal')`) or an `ExpressionWithUnresolvedIdentifier` (for
`IDENTIFIER(<non-literal>)`). The expression slot is naturally visited by
`transformExpressions`, so `ResolveIdentifierClause`'s existing
`ExpressionWithUnresolvedIdentifier` branch resolves it without any new code
path. `CheckAnalysis` enforces the post-analysis invariant that the slot is a
non-null string literal, and `CacheTableAsSelect.tempViewNameString` extracts
the string for execution.
With this change, no parser caller places a `PlanWithUnresolvedIdentifier` as
the substitution root of a `WITH ... <command>` subtree. The
`CTEInChildren` extension on `PlanWithUnresolvedIdentifier` therefore has no
reachable code path under the current grammar and is removed along with its
`withCTEDefs` override. The class-level Scaladoc is updated accordingly.
Add a parser-level test for
`CACHE TABLE IDENTIFIER(<non-literal>) AS WITH ... SELECT ...` asserting the
expression placeholder lives in the name slot and no `WithCTE(CTEInChildren, _)`
shape survives `CTESubstitution`. Update `DDLParserSuite` to construct
`CacheTableAsSelect` with `Literal("t")` instead of `"t"`.
Co-authored-by: Isaac
…Slot return type; comment fixes - ResolveIdentifierClause: replace the 9-field positional `InsertIntoStatement` pattern with the typed-guard shape already used for `V2WriteCommand`, so the rule survives field additions/reorderings on `InsertIntoStatement`. - AstBuilder: type `buildWriteTableSlot` to return `NamedRelation` directly (both branches already do); delete the casting wrapper `buildWriteTableSlotNamedRelation` and use `buildWriteTableSlot` at the `OverwriteByExpression` callsite. - Comments: `CacheTableAsSelect.name` -> `tempViewName` (field name correction) in `AstBuilder.visitCacheTable` and the `buildCacheTableAsSelectName` Scaladoc; fix the example SQL in `BindParameters` from `INSERT INTO REPLACE WHERE ... IDENTIFIER(:p)` to `INSERT INTO IDENTIFIER(:p) REPLACE WHERE ...`. Co-authored-by: Isaac
If `tempViewName` is still an unresolved `ExpressionWithUnresolvedIdentifier` (e.g. `CACHE TABLE IDENTIFIER(<unresolved-col>) AS SELECT ...`), the case should not fire -- otherwise it throws an internal error and pre-empts the catch-all `LogicalPlan` case from raising the proper `UNRESOLVED_COLUMN` user-facing error. Co-authored-by: Isaac
gengliangwang
left a comment
There was a problem hiding this comment.
Summary
SPARK-46625 is fixed at the root: instead of wrapping the entire write/CTAS/RTAS/CACHE-TABLE command in PlanWithUnresolvedIdentifier and post-hoc collapsing WithCTE(CTEInChildren, _) after materialization (the predecessor PR #55706's approach, which broke under later analyzer rewrites like RewriteDeleteFromTable), this PR pushes the placeholder directly into the command's identifier slot at parse time. With the command itself sitting at the top, CTESubstitution.withCTEDefs immediately sees a CTEInChildren and places WithCTE on the command's children by construction — the invalid WithCTE(InsertIntoStatement, ...) shape becomes structurally unreachable.
Prior state. The parser's withIdentClause(ctx, otherPlans, builder) unconditionally wrapped the produced command in a PlanWithUnresolvedIdentifier for non-literal IDENTIFIER. For write/CTAS/RTAS commands, this put the placeholder at the substitution root of a WITH ... <command> subtree. Because PlanWithUnresolvedIdentifier is not a CTEInChildren, CTESubstitution.withCTEDefs fell into the default WithCTE(p, cteDefs) branch, wrapping the placeholder. When ResolveIdentifierClause later materialized into a real CTEInChildren command, the surviving WithCTE was now wrapping a CTEInChildren — the invalid shape that breaks plug-in datasources re-analyzing InsertIntoStatement.query with NoSuchElementException. #55706's after-the-fact collapse worked for direct DML but broke after RewriteDeleteFromTable rewrote WithCTE(DeleteFromTable, defs) into WithCTE(ReplaceData, defs), where the collapse pushed WithCTE into ReplaceData.query and orphaned CTE refs in condition / groupFilterCondition.
Design approach. Mirror what DELETE/UPDATE/MERGE already do — place the placeholder inside the command's identifier slot so the command sits at the top. Concretely:
InsertIntoStatement.table/OverwriteByExpression.table: new helperbuildWriteTableSlotreturnsUnresolvedRelationfor literal/direct identifiers and aPlanWithUnresolvedIdentifierfor non-literal IDENTIFIER.PlanWithUnresolvedIdentifiernow mixes inNamedRelationso it can occupy theNamedRelation-typedOverwriteByExpression.tableslot directly.CreateTableAsSelect.name/ReplaceTableAsSelect.name: single-argwithIdentClause(ctx, parts => UnresolvedIdentifier(parts)). These are already children viaV2CreateTableAsSelectPlan.childrenToAnalyze.CacheTableAsSelect.tempViewName: String → Expression: a stringLiteralfor direct identifiers andIDENTIFIER('literal'), anExpressionWithUnresolvedIdentifierforIDENTIFIER(<non-literal>). Avoids the wrap-whole-command form.
Key design decisions.
PlanWithUnresolvedIdentifier extends NamedRelation— required forOverwriteByExpression.table: NamedRelationto typecheck.namereturns the SQL text of the identifier expression as a placeholder string.PlanWithUnresolvedIdentifierdoes not extendCTEInChildren— with in-slot placement no parser caller places the placeholder as the substitution root, so the trait would have no reachable use.- Two analyzer rules need explicit recursion for non-child
tableslots:ResolveIdentifierClause(materialization) andBindParameters(parameter binding).InsertIntoStatementandV2WriteCommandoverridegetDefaultTreePatternBitsto uniontable.treePatternBitsso pruning correctly reaches these nodes. CheckAnalysisvalidates the post-analysis shape ofCacheTableAsSelect.tempViewName(must be a non-null stringLiteral);tempViewNameStringextracts the string at execution time.
Implementation sketch. Touches parser (AstBuilder.withInsertInto, visitCreateTable, visitReplaceTable, visitCacheTable; new helpers buildWriteTableSlot and buildCacheTableAsSelectName); plan nodes (PlanWithUnresolvedIdentifier, InsertIntoStatement, V2WriteCommand for tree-pattern bits; CacheTableAsSelect for the new Expression slot); analyzer rules (ResolveIdentifierClause, BindParameters for explicit recursion; CheckAnalysis for invariant validation); strategy (DataSourceV2Strategy calls tempViewNameString). The post-hoc WithCTE collapse from #55706 is removed.
General
The PR description references a helper buildWriteTableSlotNamedRelation ("for OverwriteByExpression.table"), but the code only adds one helper, buildWriteTableSlot, which returns NamedRelation and serves both InsertIntoStatement.table and OverwriteByExpression.table. The description also shows the ResolveIdentifierClause case as case i @ InsertIntoStatement(p: PlanWithUnresolvedIdentifier, ...) ..., but the actual code uses isInstanceOf/asInstanceOf (see inline below). Worth aligning the description with the code so future readers don't search for a non-existent helper or pattern.
|
|
||
| final override val nodePatterns: Seq[TreePattern] = Seq(PLAN_WITH_UNRESOLVED_IDENTIFIER) | ||
|
|
||
| override def name: String = identifierExpr.sql |
There was a problem hiding this comment.
PlanWithUnresolvedIdentifier is now reachable as a NamedRelation via two error paths I traced: SparkStrategies.extractTableNameForError (sql/core/.../SparkStrategies.scala:1135) and the r: NamedRelation => toSQLId(r.name) fallback in QueryCompilationErrors.scala:3635. Both will render the SQL text of the unresolved expression (e.g. IDENTIFIER(:p) or concat('a', 'b')) as the "table name" in error messages. That's a reasonable fallback, but worth a one-line comment on this override noting that identifierExpr.sql is intentionally an unresolved-placeholder stand-in for those error paths.
| case i: InsertIntoStatement if i.table.isInstanceOf[PlanWithUnresolvedIdentifier] && { | ||
| val p = i.table.asInstanceOf[PlanWithUnresolvedIdentifier] | ||
| p.identifierExpr.resolved && p.childrenResolved | ||
| } => | ||
| val p = i.table.asInstanceOf[PlanWithUnresolvedIdentifier] | ||
| if (referredTempVars.isDefined) { | ||
| referredTempVars.get ++= collectTemporaryVariablesInLogicalPlan(p) | ||
| } | ||
| i.copy(table = executor.execute(p.planBuilder.apply( | ||
| IdentifierResolution.evalIdentifierExpr(p.identifierExpr), p.children))) |
There was a problem hiding this comment.
The PR description shows this as case i @ InsertIntoStatement(p: PlanWithUnresolvedIdentifier, ...) ..., but the code uses isInstanceOf in the guard and asInstanceOf in both the guard and the body (the cast appears twice). Either update the description, or extract the cast once — e.g. guard with i.table.isInstanceOf[PlanWithUnresolvedIdentifier], then do the identifierExpr.resolved && childrenResolved check and the materialization in the body, returning i unchanged when not yet resolved. The same applies to the V2WriteCommand case below.
| // that the name slot holds the expression placeholder and no `WithCTE(CTEInChildren, _)` shape | ||
| // survives `CTESubstitution` (running the cache through full analysis would require the temp | ||
| // view machinery, so this is a parser-level test). | ||
| test("SPARK-46625: CACHE TABLE IDENTIFIER(...) AS WITH ... SELECT ... parser") { |
There was a problem hiding this comment.
End-to-end coverage gap: the CACHE TABLE / REPLACE TABLE / INSERT INTO IDENTIFIER REPLACE WHERE tests are all parser-level. For CACHE TABLE specifically an e2e test doesn't need v2-catalog or temp-view machinery beyond what SharedSparkSession already provides — CACHE TABLE IDENTIFIER('temp_v') AS WITH ... SELECT ... should resolve and execute. That would exercise the tempViewNameString extraction in DataSourceV2Strategy:781 and the new CheckAnalysis invariant case for CacheTableAsSelect, neither of which any test currently reaches.
…e2e cache test - Refactor ResolveIdentifierClause: lift the PlanWithUnresolvedIdentifier cast out of a compound guard into a single asInstanceOf at the top of the case body; resolution check moves to an inner if/else with `i`/`w` returned unchanged when not yet resolved. Same shape for InsertIntoStatement and V2WriteCommand cases. - Document `PlanWithUnresolvedIdentifier.name = identifierExpr.sql` as the error-path placeholder used by SparkStrategies.extractTableNameForError and the `r: NamedRelation` fallback in QueryCompilationErrors. - Add an end-to-end ParametersSuite test for CACHE TABLE IDENTIFIER(:p) AS WITH ... SELECT ..., exercising the tempViewNameString extraction in DataSourceV2Strategy and the CheckAnalysis invariant case for CacheTableAsSelect.tempViewName. Co-authored-by: Isaac
peter-toth
left a comment
There was a problem hiding this comment.
LGTM, only one nit and seems like there is a linter issue.
| messageParameters = Map("name" -> "IDENTIFIER", "expr" -> p.identifierExpr.sql) | ||
| ) | ||
|
|
||
| case c: CacheTableAsSelect if c.tempViewName.resolved => |
There was a problem hiding this comment.
The if c.tempViewName.resolved guard was added in commit 90f09a9 specifically to keep this invariant check from pre-empting the proper UNRESOLVED_COLUMN user-facing error when IDENTIFIER(<unresolved-col>) itself fails to resolve. The comment documents the intent well, but there's no test that pins this behavior — a future change that drops the guard (e.g. case c: CacheTableAsSelect => without the guard) would silently swap the user-facing error for a SparkException internal error, with nothing in CI to catch it.
Worth adding a regression test alongside the SPARK-46625 cache-table tests in ParametersSuite. Rough shape:
test("SPARK-46625: CACHE TABLE IDENTIFIER(<unresolved-col>) reports UNRESOLVED_COLUMN, not an internal error") {
val ex = intercept[AnalysisException] {
spark.sql(
"""CACHE TABLE IDENTIFIER(unresolved_col) AS SELECT 1 AS a""".stripMargin)
}
assert(ex.getCondition.startsWith("UNRESOLVED_COLUMN"))
// and explicitly NOT the internal-error condition the invariant case throws
assert(!ex.getMessage.contains("CacheTableAsSelect.tempViewName must be"))
}That pins both halves of the design — the guard fires for unresolved IDENTIFIERs (so the catch-all LogicalPlan case below produces UNRESOLVED_COLUMN), AND the case below it stays the catch-all, not pre-empted by this branch.
What changes were proposed in this pull request?
Root-cause fix for SPARK-46625. Supersedes #55706.
Push
PlanWithUnresolvedIdentifierinto the command's identifier slot at parse time, for every parser-built command, instead of wrapping the whole command. With this change,CTESubstitutionsees the realCTEInChildrencommand directly and placesWithCTEon the command's children by construction -- the invalidWithCTE(InsertIntoStatement, ...)/WithCTE(CreateTableAsSelect, ...)shape never appears.Parser placement (
AstBuilder):AstBuilder.withInsertInto,visitCreateTable,visitReplaceTablebuild the command directly. A new helperbuildWriteTableSlotreturnsNamedRelationand serves bothInsertIntoStatement.tableandOverwriteByExpression.table(whose slot is typedNamedRelation). CTAS/RTAS use single-argwithIdentClauseto put the placeholder in thenameslot.visitCacheTable's AS path uses a new helperbuildCacheTableAsSelectNameto build the temp-view name as anExpression. The non-AS path uses single-argwithIdentClause.CacheTableAsSelect.tempViewName: String -> Expression:The last identifier slot still expressed as a plain
Stringis now anExpression. The parser produces a non-null stringLiteralfor direct identifiers andIDENTIFIER('literal'), or anExpressionWithUnresolvedIdentifierforIDENTIFIER(<non-literal>). The single-part invariant is validated at parse time for literal cases and at materialization for the non-literal case.CheckAnalysisenforces the post-analysis invariant thattempViewNameis a non-null stringLiteral, andtempViewNameStringextracts the string for execution. The expression slot is naturally visited bytransformExpressions, soResolveIdentifierClause's existingExpressionWithUnresolvedIdentifierbranch resolves it without any new code path.Placeholder shape (
unresolved.scala):PlanWithUnresolvedIdentifiermixes inNamedRelationso it can occupyOverwriteByExpression.table(typedNamedRelation) directly.CTEInChildren. With the in-slot placement plus theCacheTableAsSelectrefactor, no parser caller places the placeholder as the substitution root of aWITH ... <command>subtree, so aCTEInChildrensafety net has no reachable path under the current grammar.Materialization (
ResolveIdentifierClause):InsertIntoStatement.tableandV2WriteCommand.tableare non-childLogicalPlanslots (child = query), so the defaultresolveOperatorsUptraversal never visits placeholders inside them. Two special-cases recurse explicitly:case i: InsertIntoStatement if i.table.isInstanceOf[PlanWithUnresolvedIdentifier] => ...case w: V2WriteCommand if w.table.isInstanceOf[PlanWithUnresolvedIdentifier] => ...Each case extracts the placeholder with a single
asInstanceOfat the top of the body and inlines theidentifierExpr.resolved && childrenResolvedcheck, returning the unchanged command when not yet ready.The
V2WriteCommandmatch dispatches via the abstractwithNewTable(NamedRelation)and pattern-matches the materialized result asNamedRelation, throwing an internal error otherwise. OnlyOverwriteByExpressionis parser-built with a placeholder intabletoday; matching the trait keeps the rule consistent for any future analyzer-built node in the same shape.Tree-pattern propagation (
statements.scala,v2Commands.scala):InsertIntoStatementandV2WriteCommandoverridegetDefaultTreePatternBitsto uniontable.treePatternBits, socontainsPattern(...)pruning correctly reports patterns (PARAMETER,PLAN_WITH_UNRESOLVED_IDENTIFIER) living intable.Parameter binding (
parameters.scala):BindParameters.bindpattern-matches bothInsertIntoStatementandV2WriteCommandto recurse intotable. Without this,INSERT ... IDENTIFIER(:p)andINSERT INTO REPLACE WHERE ... IDENTIFIER(:p)underspark.sql.legacy.parameterSubstitution.constantsOnly=truewould fail to bind the parameter.CreateTableAsSelect.name/ReplaceTableAsSelect.name:Already children via
V2CreateTableAsSelectPlan.childrenToAnalyze, so no extra traversal hook is needed for them.The post-hoc
WithCTE(c: CTEInChildren, _) => c.withCTEDefs(cteDefs)collapse inResolveIdentifierClausefrom #55706 is removed entirely -- no command shape reaches the analyzer needing it.Why are the changes needed?
previously produced an analysed plan with
WithCTEwrapping theInsertIntoStatement-- a structurally invalid shape. Plug-in datasources that re-analyse theInsertIntoStatement's query subtree throwNoSuchElementException: key not found. Zero-day issue since SPARK-46625.#55706 fixed the symptom by collapsing
WithCTE(c: CTEInChildren, defs)after the fact inResolveIdentifierClause. That works but encodes an undo for a placement that should never have happened, and turned out to be unsafe: after a later analyzer rewrite (e.g.RewriteDeleteFromTablerewritingWithCTE(DeleteFromTable, defs)intoWithCTE(ReplaceData, defs)), the collapse would pushWithCTEintoReplaceData.queryand orphan the CTE refs inReplaceData.condition/groupFilterCondition, breaking 9 v2 DML rCTE tests on CI. The review on #55706 (#55706 (comment)) asked for the root-cause fix; this PR implements it.Credit to @stevomitric for surfacing in the original PR thread that the parser-level placement also has to work for legacy parameter substitution -- that's why this PR adds the targeted
BindParametersand tree-pattern-bits handling forInsertIntoStatement.tableandV2WriteCommand.table.Does this PR introduce any user-facing change?
No behavior change, but a minor timing change in
visitCreateTable/visitReplaceTable: CTAS/RTAS validation errors (Schema may not be specified in a CTAS statement,Partition column types may not be specified...,Constraints may not be specified...) for non-literal identifiers now fire at parse time rather than at identifier resolution. Same error messages and samectx; only the moment of throwing moves earlier. Fail-fast improvement, not a regression.How was this patch tested?
New tests in
ParametersSuite:WITH ... INSERT OVERWRITE TABLE IDENTIFIER(:p) SELECT ... FROM cteWITH ... INSERT INTO IDENTIFIER(:p) SELECT ... FROM cteCREATE TABLE IDENTIFIER(:p) AS WITH ... SELECT ... FROM cteINSERT IDENTIFIER(:p) under legacy parameter substitution(coversspark.sql.legacy.parameterSubstitution.constantsOnly=true, which exercises theBindParametersrecursion intoInsertIntoStatement.table)WITH ... INSERT INTO IDENTIFIER(:p) REPLACE WHERE ... -- parser(asserts the placeholder lives inOverwriteByExpression.tableand noWithCTE(CTEInChildren, _)shape survivesCTESubstitution)BindParameters recurses into OverwriteByExpression.table(rule-level test; full analysis would require a v2 catalog)CACHE TABLE IDENTIFIER(...) AS WITH ... SELECT ... -- parser(assertstempViewNameholdsExpressionWithUnresolvedIdentifierand noWithCTE(CTEInChildren, _)shape survivesCTESubstitution)REPLACE TABLE IDENTIFIER(...) AS WITH ... SELECT ... -- parser(mirrors the CTAS test for RTAS)Each CTE test asserts that no
WithCTE(CTEInChildren, _)shape leaks through analysis.DDLParserSuite.CACHE TABLEupdated to constructCacheTableAsSelectwithLiteral("t")instead of"t"for the newExpressionslot.Existing suites run clean:
ParametersSuite,DeltaBased/GroupBased Delete/Update/Mergesuites (including the rCTE-with-DML tests that were the original CI failures),DataSourceV2DataFrameSuite,SQLViewSuite,IdentifierClauseParserSuite,AnalysisSuite/PlanParserSuite/AnalysisErrorSuiteand siblings,DataSourceV2SQLSuiteV1Filter,InsertSuite,CTEInlineSuite/CTEHintSuite,CachedTableSuite,CreateTableAsSelectSuite/DDLParserSuite.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude (Claude Code, Opus 4.7)