[SPARK-56856][SDP] Implement SCD1 Batch Processor; Microbatch Deduplication by AnishMahto · Pull Request #55969 · apache/spark

AnishMahto · 2026-05-19T02:28:58Z

Approved AutoCDC SPIP: https://lists.apache.org/thread/j6sj9wo9odgdpgzlxtvhoy7szs0jplf7

This is a stacked PR. Review incremental diff here: AnishMahto/spark@SPARK-56838-introduce-ChangeArgs...SPARK-56856-SCD1-microbatch-deduplication

Preamble:

The SCD type 1 (SCD1) flow is a foreachBatch streaming query over an input change data feed. It reconciles the incoming change data into a target table that follows SCD1 replication semantics.

SCD1 flows also maintain an "auxiliary" table that tracks state for events received early and out of order. Each microbatch must reconcile against this auxiliary table as well, and update its state for future microbatches.

Microbatch Deduplication:

The first step of microbatch reconciliation for SCD1 is deduplicating the microbatch such that there is a single row per key.

Since SCD1 only maintains the latest state per key from the change data source, within a microbatch only the row with the latest sequencing per key matters; all other rows for that key are dropped.
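To make the "latest sequencing per key" rule concrete, here is a minimal sketch that models it with plain Scala collections rather than Spark DataFrames. It is not the PR's implementation: `ChangeRow`, its fields, and `DedupModel` are hypothetical names chosen for illustration, and sequencing is assumed to be a non-null `Long`.

```scala
// Plain-Scala model of SCD1 microbatch deduplication (not the Spark code in this PR).
// `ChangeRow` and its fields are hypothetical and exist only for this illustration.
final case class ChangeRow(key: String, sequence: Long, value: String)

object DedupModel {
  /** Keep, for every key, only the row with the largest sequence value. */
  def deduplicate(microbatch: Seq[ChangeRow]): Seq[ChangeRow] =
    microbatch
      .groupBy(_.key)                       // one group per key
      .values
      .map(rows => rows.maxBy(_.sequence))  // winner = latest sequencing in the group
      .toSeq
      .sortBy(_.key)                        // sorted only so the output order is stable
}
```

For a batch holding sequences 1, 3 and 2 for key `a` and a single row for key `b`, `DedupModel.deduplicate` returns two rows: the sequence-3 row for `a` and the lone row for `b`.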

szehon-ho

Review of the incremental SCD1 microbatch dedup diff (on top of #55836). A few nits below.

szehon-ho · 2026-05-19T20:44:17Z

+   *
+   * The schema of the returned dataframe matches the schema of the microbatch exactly.
+   */
+  def deduplicateMicrobatch(microbatchDf: DataFrame): DataFrame = {


The scaladoc documents tie-breaking and null sequencing behavior; consider adding tests for:

Equal sequencing for the same key — even a lightweight test that documents non-determinism (or runs twice) would lock in the contract.

Null sequencing — max_by has subtle null ordering (see DataFrameAggregateSuite "max_by"); worth defining expected CDC behavior or asserting we reject nulls upstream.

Single row per key (no-op) — cheap sanity check that one input row passes through unchanged.

Not blocking if you prefer to add these when merge logic lands.

Added these tests. FYI, I'm going to add microbatch validation that disallows null sequencing when I put together the foreachBatch body in https://issues.apache.org/jira/browse/SPARK-56953.
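The three cases discussed in this thread can be sketched against a plain-Scala model. This is an illustration only, not the tests added in the PR: `RawEvent`, `SequencedEvent`, and `DedupContracts` are hypothetical names, and the rejection of null sequencing is modeled with `Option` and `Either` as one possible shape for the upstream validation mentioned above. It assumes the contract leaves the winner among equal-sequence rows unspecified, so the tie check only asserts what must hold for any winner.

```scala
// Hypothetical model types; the real microbatch is a Spark DataFrame.
final case class RawEvent(key: String, sequence: Option[Long], payload: String)
final case class SequencedEvent(key: String, sequence: Long, payload: String)

object DedupContracts {
  /** Reject a batch if any row lacks sequencing; otherwise return the typed rows. */
  def requireSequenced(batch: Seq[RawEvent]): Either[String, Seq[SequencedEvent]] = {
    val sequenced = batch.collect { case RawEvent(k, Some(s), p) => SequencedEvent(k, s, p) }
    val missing = batch.size - sequenced.size
    if (missing == 0) Right(sequenced)
    else Left(s"$missing row(s) have null sequencing")
  }

  /** Latest row per key. Which row wins on equal sequencing is treated as unspecified. */
  def latestPerKey(batch: Seq[SequencedEvent]): Map[String, SequencedEvent] =
    batch.groupBy(_.key).map { case (key, rows) => key -> rows.maxBy(_.sequence) }
}

object DedupContractChecks {
  def run(): Unit = {
    // Single row per key: passes through unchanged.
    val single = SequencedEvent("a", 1L, "x")
    assert(DedupContracts.latestPerKey(Seq(single)) == Map("a" -> single))

    // Equal sequencing: assert only what every valid winner satisfies.
    val tied = DedupContracts.latestPerKey(
      Seq(SequencedEvent("a", 5L, "x"), SequencedEvent("a", 5L, "y")))
    assert(tied("a").sequence == 5L && Set("x", "y").contains(tied("a").payload))

    // Null sequencing: rejected before deduplication runs.
    assert(DedupContracts.requireSequenced(Seq(RawEvent("a", None, "x"))).isLeft)
  }
}
```

Scala's `maxBy` happens to be deterministic on ties, but the tie check deliberately does not depend on that, since the same guarantee is not assumed for the Spark aggregation.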

szehon-ho · 2026-05-19T20:44:17Z

+        .toImmutableArraySeq
+
+    microbatchDf
+      .groupBy(changeArgs.keys.map(k => F.col(k.quoted)): _*)


If changeArgs.keys is empty, groupBy() collapses the entire microbatch into a single group (one output row). Worth guarding with require(changeArgs.keys.nonEmpty, ...) here or validating at ChangeArgs construction in the registration PR.

Yes, we will validate against empty keys at ChangeArgs construction once we get to AutoCDC flow registration within the SDP engine; both SCD1 and SCD2 semantics would break with an empty key set.
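A small plain-Scala sketch of the failure mode and the guard discussed here. The names are hypothetical: `ChangeArgsSketch` stands in for the real `ChangeArgs` (whose fields are not shown in this thread), and rows are modeled as `Map[String, String]` instead of DataFrame rows.

```scala
// Hypothetical stand-in for ChangeArgs; the real class may carry different fields.
final case class ChangeArgsSketch(keys: Seq[String], sequenceBy: String) {
  // Fail fast at construction, as proposed for flow registration.
  require(keys.nonEmpty, "AutoCDC flows need at least one key column")
}

object EmptyKeyDemo {
  type Row = Map[String, String]

  /** Number of groups produced when grouping rows by the given key columns. */
  def groupCount(rows: Seq[Row], keys: Seq[String]): Int =
    rows.groupBy(row => keys.map(k => row(k))).size

  val rows: Seq[Row] = Seq(
    Map("id" -> "1", "seq" -> "10"),
    Map("id" -> "2", "seq" -> "11"),
    Map("id" -> "1", "seq" -> "12")
  )
}
```

With `keys = Seq("id")` the sample rows fall into two groups. With an empty key list every row maps to the same empty grouping key, so the whole batch collapses into one group, which is the behavior the `require` guard is meant to rule out.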

szehon-ho · 2026-05-19T20:44:17Z

+  def deduplicateMicrobatch(microbatchDf: DataFrame): DataFrame = {
+    // The `max_by` API can only return a single column, so pack/unpack the entire row into a
+    // temporary column before and after the `max_by` operation.
+    val winningRowCol = OutOfOrderCdcMergeUtils.tempColName("__winning_row")


tempColName generates a fresh UUID on every deduplicateMicrobatch call, so the logical plan column name differs across invocations. Fine for correctness; just a heads-up if you later add plan-golden / EXPLAIN tests — you may want a stable internal name with a collision-safe prefix instead. Non-blocking.

Good callout. I see that Spark CDC uses the "__spark_cdc" reserved prefix, so I'm choosing to adopt "__spark_autocdc" as the reserved prefix for SDP system column names.
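The trade-off between the two naming schemes can be sketched in plain Scala. Everything below is illustrative: `SystemColumnNames` and its methods are hypothetical, the exact format produced by `tempColName` is not shown in this thread, and checking user columns against the reserved prefix is an assumption about how collisions might be prevented, not something this PR is confirmed to do.

```scala
import java.util.{Locale, UUID}

object SystemColumnNames {
  // Reserved prefix named in the reply above.
  val ReservedPrefix = "__spark_autocdc"

  /** Fresh name on every call: hard to collide with, but unstable across invocations. */
  def randomTempName(base: String): String =
    s"${base}_${UUID.randomUUID().toString.replace("-", "")}"

  /** Stable name under the reserved prefix: identical on every call. */
  def reservedName(suffix: String): String = s"${ReservedPrefix}_$suffix"

  /** User columns that start with the reserved prefix (assumed to be disallowed). */
  def conflictingColumns(userColumns: Seq[String]): Seq[String] =
    userColumns.filter(_.toLowerCase(Locale.ROOT).startsWith(ReservedPrefix))
}
```

A stable name such as `SystemColumnNames.reservedName("winning_row")` keeps plan output identical between runs, which suits golden-file or EXPLAIN tests; collision safety then has to come from reserving the prefix rather than from randomness.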

AnishMahto added 10 commits May 12, 2026 21:02

Introduce ChangeArgs

8b08cbe

linting

202f3a5

reorder error condition

4ac75e7

PR feedback

11606c5

linting

d1a38e6

PR feedback

bbe5335

buff error message and revert to case class

95ca0e1

test UnqualifiedColumnName('col')

481ca9f

minor test buff

0126659

address PR feedbak

ac15be5

AnishMahto changed the title ~~[SPARK-56856][SDP] SCD1 Microbatch Deduplication~~ [SPARK-56856][SDP] Implement SCD1 Batch Processor; Microbatch Deduplication May 19, 2026

AnishMahto added 6 commits May 19, 2026 18:10

PR feedback

436ff0a

Implement deduplicateMicrobatch

875f0b1

indenting cleanup

08ea9f4

schema comment

cf3ec82

casing

21d4ffe

linting

2ff07f4

szehon-ho reviewed May 19, 2026

PR feedback

76d775d

AnishMahto force-pushed the SPARK-56856-SCD1-microbatch-deduplication branch from 5deb653 to 76d775d Compare May 19, 2026 21:38

szehon-ho mentioned this pull request May 19, 2026

[SPARK-56870][SDP] Implement SCD1 Batch Processor; Extend Microbatch with CDC Metadata #55970

Open

use reserved __spark_autocdc* prefix

8790a2d

AnishMahto requested a review from szehon-ho May 19, 2026 21:52

Add deduplicate test when row contains nested columns

5c0c0f8

AnishMahto wants to merge 19 commits into apache:master from AnishMahto:SPARK-56856-SCD1-microbatch-deduplication

2 participants
