[SPARK-56870][SDP] Implement SCD1 Batch Processor; Extend Microbatch with CDC Metadata by AnishMahto · Pull Request #55970 · apache/spark

AnishMahto · 2026-05-19T02:32:53Z

Approved AutoCDC SPIP: https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7

This is a stacked PR. Review incremental diff here: AnishMahto/spark@SPARK-56856-SCD1-microbatch-deduplication...SPARK-56870-extend-microbatch-with-cdc-metadata

Preamble:

The SCD type 1 flow is a foreachBatch streaming query on an input change-data-feed, and is responsible for reconciling the incoming change data onto some target table that follows SCD1 replication semantics.

SCD1 flows also maintain an "auxiliary" table that tracks the state of events that arrive early or out of order. Each microbatch must reconcile against this auxiliary table as well, and update its state appropriately for future microbatches.

Extend Microbatch with CDC Metadata:

After deduplication, every incoming row can be classified as either a delete event or an upsert event (the two are mutually exclusive), and there is at most one row per key.

If we identify a row as a delete event, remember its sequencing as its deleteSequence. If we identify a row as an upsert event, remember its sequencing as its upsertSequence. That is, deleteSequence/upsertSequence encode both the sequencing for the row as well as the row classification (delete or upsert).

We need to persist this encoded information now, because in future stages we may drop the columns that deleteCondition needed to do the classification in the first place, depending on which columns were selected by ChangeArgs.columnSelection.
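The encoding described above can be sketched in plain Scala, without Spark. This is an illustrative model only, not the code in this PR: `CdcMetadataSketch`, `CdcMetadata`, `classify`, and the `isDelete` flag (standing in for the evaluated deleteCondition) are hypothetical names, and the sequencing value is assumed to be a `Long`. Only the field names deleteSequence and upsertSequence come from the description.

```scala
// Hypothetical model of the per-row CDC metadata; not the PR's implementation.
object CdcMetadataSketch {

  /** Exactly one of the two fields is defined for any classified row. */
  final case class CdcMetadata(
      deleteSequence: Option[Long],
      upsertSequence: Option[Long]) {
    // Mutual exclusivity: a row is a delete or an upsert, never both or neither.
    require(
      deleteSequence.isDefined != upsertSequence.isDefined,
      "exactly one of deleteSequence / upsertSequence must be set")

    /** The row classification is recoverable from which field is populated. */
    def isDelete: Boolean = deleteSequence.isDefined

    /** The sequencing value is recoverable regardless of the classification. */
    def sequence: Long = deleteSequence.orElse(upsertSequence).get
  }

  /**
   * Classifies one deduplicated row.
   * `isDelete` is assumed to be the already-evaluated delete condition for the row.
   */
  def classify(isDelete: Boolean, sequencing: Long): CdcMetadata =
    if (isDelete) CdcMetadata(Some(sequencing), None)
    else CdcMetadata(None, Some(sequencing))
}
```

Because both the classification and the sequencing can be read back from the metadata alone, later stages in this model no longer need the columns that the delete condition originally inspected.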

Where is the CDC Metadata stored?

Within the microbatch, we append a _cdc_metadata struct column that stores the deleteSequence and upsertSequence.

This _cdc_metadata column will eventually also land in the persisted target and auxiliary tables, which are the artifacts of an AutoCDC flow. This column represents operational metadata that the AutoCDC flow has tagged a row with, and is necessary for out-of-order correctness of the SCD decomposition.

Users will not be able to opt out of persisting this column in the target table using ChangeArgs.columnSelection, as it is necessary for correctness. The column will not have a stable public contract, and users should make no assumptions about its contents.

szehon-ho

Review of the incremental diff on top of #55969 (extend microbatch with CDC metadata). Overall this looks good to merge with minor nits.

What looks good

The delete/upsert encoding in _cdc_metadata matches the SPIP story: mutually exclusive deleteSequence / upsertSequence, persisted before columnSelection can drop deleteCondition columns.
resolvedSequencingType at processor construction is the right split (flow setup vs per-microbatch work); the Int→Long cast test and incompatible cast test are valuable.
Reserved-column conflict uses conf.resolver and CaseSensitivityLabels — consistent with session case sensitivity.
constructCdcMetadataCol driven off cdcMetadataColSchema with ordered fields is clean; companion constants keep tests readable.
AUTOCDC_RESERVED_COLUMN_NAME_CONFLICT / SQLSTATE 42710 is appropriate.
Test coverage for classification, no delete condition, column ordering, cast success/failure, and reserved-name conflict is solid.

Incremental diff is focused and stacks cleanly on #55836 + #55969.

szehon-ho

Re-reviewed the incremental diff on #55969. CDC metadata encoding and resolvedSequencingType casting look correct; reserved-column validation and tests LGTM.

Left three inline nits (comment wording, extend input contract, reserved-prefix scope) — all non-blocking. Approved.

szehon-ho · 2026-05-21T01:00:18Z

+      resolvedSequencingType = LongType
+    )
+
+    // Mutual-exclusivity invariant: each row's _cdc_metadata struct has exactly one of


Nit (non-blocking): this comment still says _cdc_metadata; the implementation uses Scd1BatchProcessor.cdcMetadataColName (__spark_autocdc_metadata). Consider aligning the comment with the constant so future readers are not confused.

szehon-ho · 2026-05-21T01:00:19Z

+   * The returned dataframe has all of the columns in the input microbatch + the CDC metadata
+   * column.
+   */
+  def extendMicrobatchRowsWithCdcMetadata(microbatchDf: DataFrame): DataFrame = {


Nit (non-blocking): deduplicateMicrobatch documents a validated microbatch (non-null, orderable sequencing). Consider mirroring that contract here — e.g. rename microbatchDf → validatedMicrobatch and add a scaladoc @param — so foreachBatch wiring keeps the same precondition for both steps.

szehon-ho · 2026-05-21T01:00:19Z

+    )
+  }
+
+  private def validateCdcMetadataColumnNotPresent(microbatchDf: DataFrame): Unit = {


Nit (non-blocking): this PR only guards __spark_autocdc_metadata on the microbatch (__spark_autocdc_winning_row is covered in #55969 for dedup). When target/auxiliary schemas are wired, worth applying the same reserved-prefix policy there in a follow-up so user columns cannot collide with persisted AutoCDC metadata.
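The microbatch guard discussed in this thread can be sketched in plain Scala as follows. This is a simplified illustration under stated assumptions, not the PR's `validateCdcMetadataColumnNotPresent`: `ReservedColumnSketch` and `conflictingColumn` are hypothetical names, and the two resolvers are local stand-ins for the session resolver (a `(String, String) => Boolean` name comparison), modeled here as exact match versus case-insensitive match.

```scala
// Hypothetical sketch of a reserved-column check; not the PR's implementation.
object ReservedColumnSketch {

  /** A name comparison function, mirroring the shape of a session resolver. */
  type Resolver = (String, String) => Boolean

  // Local stand-ins for case-sensitive and case-insensitive resolution.
  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  /** Reserved metadata column name mentioned in the review comments. */
  val ReservedName: String = "__spark_autocdc_metadata"

  /**
   * Returns the first user column that resolves to the reserved name, if any.
   * A caller would raise a reserved-column-conflict error when this is defined.
   */
  def conflictingColumn(columns: Seq[String], resolver: Resolver): Option[String] =
    columns.find(c => resolver(c, ReservedName))
}
```

In this model, a column named `__SPARK_AUTOCDC_METADATA` conflicts only under case-insensitive resolution, which is why the check follows the session's case sensitivity rather than a fixed string comparison.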

AnishMahto force-pushed the SPARK-56870-extend-microbatch-with-cdc-metadata branch from 552e33c to 9a566ff on May 19, 2026 17:10

AnishMahto changed the title ~~[SPARK-56870][SDP] Extend Microbatch with CDC Metadata~~ [SPARK-56870][SDP] SCD1 Extend Microbatch with CDC Metadata May 19, 2026

AnishMahto changed the title ~~[SPARK-56870][SDP] SCD1 Extend Microbatch with CDC Metadata~~ [SPARK-56870][SDP] Implement SCD1 Batch Processor; Extend Microbatch with CDC Metadata May 19, 2026

szehon-ho reviewed May 19, 2026


AnishMahto force-pushed the SPARK-56870-extend-microbatch-with-cdc-metadata branch from 9a566ff to f9c2aed on May 19, 2026 23:06

AnishMahto added 16 commits May 20, 2026 20:35

Implement deduplicateMicrobatch

ada1bdb

indenting cleanup

a1a0e7b

schema comment

434f6ad

casing

022a95c

linting

f92f1e3

PR feedback

04a38f2

use reserved __spark_autocdc* prefix

19d9040

Add deduplicate test when row contains nested columns

1e8b86c

PR feedback

0b498a0

validation

a0d1198

buff scaladoc

9a2c28f

use spark resolver

c95144d

lingint

415f90b

rebase conflict

c1da259

PR feedback

5338205

rebase conflicts

02473ba

AnishMahto force-pushed the SPARK-56870-extend-microbatch-with-cdc-metadata branch from f9c2aed to 02473ba on May 20, 2026 21:08

AnishMahto requested a review from szehon-ho May 20, 2026 21:09

szehon-ho approved these changes May 21, 2026

AnishMahto wants to merge 16 commits into apache:master from AnishMahto:SPARK-56870-extend-microbatch-with-cdc-metadata
