
[bug] Emit REMOVE in HudiDataFileExtractor when CLEAN deletes previous file version before incremental sync #824

Open
vinishjail97 wants to merge 1 commit into apache:main from vinishjail97:fix/hudi-clean-removes-before-xtable-sync

vinishjail97 commented May 20, 2026

What is the purpose of the pull request

Fixes a bug in HudiDataFileExtractor (source side) where XTable's incremental Hudi sync can leave a stale ADD / data-file reference in the target table's metadata, causing FILE_NOT_FOUND on read. This affects any target format (Delta and Iceberg), because the missing REMOVE belongs to the format-agnostic InternalFilesDiff produced by the source.

RCA (one line)

Hudi CLEAN deletes an old Parquet file before XTable's next incremental sync runs, so HudiDataFileExtractor.getUpdatesToPartition cannot see the deleted file via getAllBaseFiles() and skips emitting a REMOVE for it. That leaves a stale reference in the target's metadata (an ADD in the Delta log, a data-file entry in the Iceberg manifest) pointing to a non-existent file. This PR recovers the deleted file's path from the previous commit's metadata in the Hudi timeline, located via HoodieWriteStat.prevCommit, so the REMOVE is emitted correctly.
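
The failure mode can be illustrated with a simplified model of the base-file loop. This is a sketch only, not the XTable code: `BaseFile`, `Diff`, `diffFileGroup`, and the file paths are hypothetical stand-ins, and the loop assumes the base files of a file group arrive newest first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CleanRaceSketch {
  // Hypothetical stand-in for a Hudi base file: its path and the commit that wrote it.
  static final class BaseFile {
    final String path;
    final String commitTime;

    BaseFile(String path, String commitTime) {
      this.path = path;
      this.commitTime = commitTime;
    }
  }

  // Hypothetical stand-in for the ADD/REMOVE result of one file group.
  static final class Diff {
    final List<String> added = new ArrayList<>();
    final List<String> removed = new ArrayList<>();
  }

  // Simplified model of the loop; baseFiles is assumed to be ordered newest first.
  static Diff diffFileGroup(List<BaseFile> baseFiles, String instantToSync) {
    Diff diff = new Diff();
    boolean newBaseFileAdded = false;
    for (BaseFile file : baseFiles) {
      if (file.commitTime.equals(instantToSync)) {
        newBaseFileAdded = true;
        diff.added.add(file.path);
      } else if (newBaseFileAdded) {
        // The previous version is only seen here if it still exists on storage.
        diff.removed.add(file.path);
        break;
      }
    }
    return diff;
  }

  public static void main(String[] args) {
    BaseFile v1 = new BaseFile("p/f1_001.parquet", "001");
    BaseFile v2 = new BaseFile("p/f1_002.parquet", "002");

    // Sync runs before CLEAN: both versions are visible, so a REMOVE is emitted.
    Diff beforeClean = diffFileGroup(Arrays.asList(v2, v1), "002");
    // CLEAN ran first: only the new version is visible, so no REMOVE is emitted.
    Diff afterClean = diffFileGroup(Arrays.asList(v2), "002");

    System.out.println("before clean: removed=" + beforeClean.removed);
    System.out.println("after clean:  removed=" + afterClean.removed);
  }
}
```

In the second call the ADD for the new file is still produced, but the list of removed files is empty, which is the stale-reference condition described above.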

Brief change log

  • HudiDataFileExtractor.getUpdatesToPartition now also takes HoodieTimeline and List<HoodieWriteStat> so it can build a fileId → prevCommit map.
  • After the base-file loop, when an ADD was emitted but no REMOVE was, look up the previous commit's metadata via HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(prevInstant).get(), ...) and emit the REMOVE from HoodieWriteStat.getPath().
  • Gracefully degrades (logs a warning and skips) if the previous commit has been archived or the read fails.
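
The recovery step in the change log can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `WriteStat` is a hypothetical stand-in for the three HoodieWriteStat fields involved, the active timeline is modeled as a plain map from commit time to write stats (the PR reads it through HoodieCommitMetadata.fromBytes), and an absent key models an archived commit.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class RecoverRemovedFileSketch {
  // Hypothetical stand-in for the HoodieWriteStat fields used by the fix.
  static final class WriteStat {
    final String fileId;
    final String path;
    final String prevCommit; // null here means "no previous version" (encoding is an assumption)

    WriteStat(String fileId, String path, String prevCommit) {
      this.fileId = fileId;
      this.path = path;
      this.prevCommit = prevCommit;
    }
  }

  // Builds the fileId -> prevCommit map for the commit being synced.
  static Map<String, String> prevCommitByFileId(List<WriteStat> stats) {
    Map<String, String> result = new HashMap<>();
    for (WriteStat stat : stats) {
      if (stat.prevCommit != null) {
        result.put(stat.fileId, stat.prevCommit);
      }
    }
    return result;
  }

  // Returns the path of the previous file version, or empty if it cannot be recovered.
  static Optional<String> recoverRemovedFile(
      String fileId, String prevCommit, Map<String, List<WriteStat>> activeTimeline) {
    List<WriteStat> prevStats = activeTimeline.get(prevCommit);
    if (prevStats == null) {
      // Previous commit is no longer in the active timeline: warn and skip the REMOVE.
      System.err.println("WARN: commit " + prevCommit + " unavailable; no REMOVE for " + fileId);
      return Optional.empty();
    }
    for (WriteStat stat : prevStats) {
      if (stat.fileId.equals(fileId)) {
        return Optional.of(stat.path);
      }
    }
    return Optional.empty(); // fileId not present in the previous commit's metadata
  }

  public static void main(String[] args) {
    Map<String, List<WriteStat>> timeline = new HashMap<>();
    timeline.put("001", Arrays.asList(new WriteStat("f1", "p/f1_001.parquet", null)));
    List<WriteStat> current = Arrays.asList(new WriteStat("f1", "p/f1_002.parquet", "001"));

    Map<String, String> prev = prevCommitByFileId(current);
    System.out.println(recoverRemovedFile("f1", prev.get("f1"), timeline));

    timeline.remove("001"); // model archival of the previous commit
    System.out.println(recoverRemovedFile("f1", prev.get("f1"), timeline));
  }
}
```

Returning an empty `Optional` in both the archived and the missing-fileId cases keeps the caller's handling uniform: it emits a REMOVE only when a path is present, which mirrors the "log a warning and skip" behavior listed above.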

Verify this pull request

This change added tests and can be verified as follows:

  • Added TestHudiDataFileExtractor with three unit tests for recoverRemovedFile: previous commit present, previous commit archived, fileId not in previous commit metadata.
  • Added ITConversionController.testIncrementalSyncEmitsRemoveWhenHudiCleanRunsBeforeSync, an end-to-end regression test that reproduces the race (insert → sync → upsert → insert → clean → incremental sync) and asserts Delta equivalence at 120 records.

…e incremental sync

When Hudi CLEAN physically removes an old base file from storage before XTable
processes the corresponding UPSERT commit incrementally, getAllBaseFiles() only
returns the new file. The else-if branch in getUpdatesToPartition that emits a
REMOVE for the previous version never fires, leaving a stale ADD in the downstream
Delta log that causes FILE_NOT_FOUND errors on read.

Fix: after the base-file loop, if an ADD was emitted but no REMOVE was emitted
and HoodieWriteStat.prevCommit indicates a prior version exists, recover the old
file path by reading the previous commit's metadata from the visible Hudi timeline
and emit a REMOVE. Gracefully degrades (logs warning, skips) if the previous
commit has been archived.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>