Skip to content

feat(transaction): add RowDelta transaction action for row-level modi…#2203

Open
wirybeaver wants to merge 4 commits intoapache:mainfrom
wirybeaver:xuanyili/rowdeltaCoW
Open

feat(transaction): add RowDelta transaction action for row-level modi…#2203
wirybeaver wants to merge 4 commits intoapache:mainfrom
wirybeaver:xuanyili/rowdeltaCoW

Conversation

@wirybeaver
Copy link
Copy Markdown

@wirybeaver wirybeaver commented Mar 3, 2026

Fixes #2202, part of epic #2201.

Background

Iceberg's RowDelta is the fundamental transaction action backing MERGE INTO, UPDATE, and DELETE SQL operations. Unlike FastAppend (which only adds new files), RowDelta can both add new data files and mark existing ones as deleted in a single atomic snapshot.

Iceberg supports two strategies for row-level mutations:

  • Copy-on-Write (CoW): The engine reads affected data files, applies the mutation in-memory, writes new data files, and atomically marks the originals as deleted. Readers always see clean, fully-materialized data files. Trade-off: high write amplification, zero read overhead.
  • Merge-on-Read (MoR): Instead of rewriting files, the engine appends small "delete files" (position deletes or equality deletes). Readers must merge data files with delete files at scan time. Trade-off: low write amplification, higher read overhead.

What this PR does

Implements RowDeltaAction in the transaction layer with Copy-on-Write support:

Method Purpose
add_data_files() Add new/rewritten data files (INSERT rows, or CoW rewritten files)
remove_data_files() Mark existing data files as deleted (CoW mode)
add_delete_files() Stubbed API for future Merge-on-Read support
validate_from_snapshot() Optimistic concurrency: reject commit if table has advanced
set_snapshot_properties() Attach custom key/value metadata to the snapshot summary

How the snapshot is built

  1. All existing manifests are carried forward to the new snapshot, except manifests that contain any of the removed data files.
  2. A new manifest is written for the added data files.
  3. The snapshot operation type is determined by what changed:
    • Append — only new files added, nothing removed
    • Overwrite — data files removed (CoW update/delete), or both added and removed

What is intentionally deferred

add_delete_files() is API-only scaffolding. Full MoR support requires:

  1. Writing a separate delete manifest (Avro file with content = Deletes)
  2. Correct sequence number propagation so delete files only apply to data files written before them
  3. Reader-side delete file merge at scan time

MoR solution seems to already being contributed by other folks. I am willing to make further contribution for MoR if necessary.

Java reference

@wirybeaver wirybeaver marked this pull request as draft March 3, 2026 10:22
@wirybeaver wirybeaver marked this pull request as ready for review March 4, 2026 07:07
Copy link
Copy Markdown
Contributor

@jdockerty jdockerty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks!

The implementation refer to the official Iceberg Java implementation (RowDelta API).

I'm not sure whether you want to link to the specific commit of iceberg that you're referring to here as well. That way, anyone doing some digging around in PRs at a later date will know where this came from. Not a huge deal though 👍


The tests here are really useful to demonstrate this functionality as well.

The CI issues are for typos, but it is a false-positive because MOR is standing for Merge-On-Read. So you'll need to flag that as okay.


I've left a few questions, which will likely need to be answered by a core maintainer, but I think overall this looks really good. So it is ready for a full maintainer review 💯

Comment thread crates/iceberg/src/transaction/mod.rs Outdated
Comment on lines +449 to +455
assert!(result.is_err());

// Verify the error message mentions snapshot validation
if let Err(e) = result {
assert!(
e.to_string().contains("stale snapshot") || e.to_string().contains("Cannot commit")
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific error you can assert to here with result.unwrap_err?

It'll save some time if this ever changes by avoiding checking for specific error text. I don't think it is much of an issue if not though.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion — changed to unwrap_err().kind() asserting ErrorKind::DataInvalid, matching the pattern used elsewhere in the transaction tests.

Comment on lines +272 to +273
#[cfg(test)]
mod tests {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are some really great tests, good stuff 🎉

if delete_entries.is_empty() {
return Err(Error::new(
ErrorKind::PreconditionFailed,
"No delete entries found when write a delete manifest file",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"No delete entries found when write a delete manifest file",
"No delete entries found when writing a delete manifest file",

I believe this is the intention?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the error message now reads "writing a delete manifest file".

Comment on lines +58 to +59
/// Delete files to add (reserved for future MOR mode support)
added_delete_files: Vec<DataFile>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great callout 👍


let snapshot_producer = SnapshotProducer::new(
table,
self.commit_uuid.unwrap_or_else(Uuid::now_v7),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like some parts of the code use Uuid::v4 and others are using v7.

Perhaps a question to core maintainers as to whether this matters much? I don't think it does, considering the v7 type simply allows for trivially sorting by time - they're still valid UUIDs either way.

Maybe something to keep in mind.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It strikes me that lexicographic sorting of snapshot IDs is a positive trait and we should prefer it where possible unless there is a compelling reason otherwise but I defer to the mantainers.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Uuid::now_v7 intentionally — v7 is time-ordered, which gives manifest file names a natural chronological sort order without extra metadata. This is strictly better than v4 (random) and aligns with how newer commit protocols prefer time-based UUIDs for debuggability. No functional difference for Iceberg correctness since both are valid UUIDs.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — using Uuid::now_v7 here for exactly that reason. Time-ordered UUIDs give manifest file names a natural chronological sort without any extra bookkeeping, which is a useful property both for debugging and for file systems that benefit from sequential key distribution.

);

// Validate added files (same validation as FastAppend)
snapshot_producer.validate_added_data_files()?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, this is only looking at validating the self.added_data_files, but we're also including some DataFiles here for self.removed_data_files.

Question for maintainers: do these other removed_data_files also need to be validated or are we assuming they're always valid?

It seems to me like the Java impl has logic for validating/skipping delete validation

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed_data_files are not validated because they are files already tracked by the table — they were validated at the time they were originally committed. Re-validating them here would be redundant. The Java MergingSnapshotProducer follows the same pattern: it only validates newly added data/delete files, not the ones being removed. Added a comment in the code to make this reasoning explicit.

///
/// Logic matches Java implementation in BaseRowDelta:
/// - Only adds data files (no deletes, no removes) → Append
/// - Only adds delete files → Delete
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Operation::Delete variant is missing from this function.

Do I understand rightly that this is reserved for future delete files? Or was this missed out

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Operation::Delete is reserved for the Merge-on-Read path and is not yet returned. In Java, BaseRowDelta.operation() returns Operation.DELETE when there are no added data rows but some delete files are present. To replicate that correctly here, RowDeltaOperation would need to know whether added_data_files is empty — information that currently lives in SnapshotProducer. Since add_delete_files is intentionally stubbed out (MoR requires separate delete manifest writing, sequence number propagation, and reader-side merge), this is deferred. Updated the operation() doc comment to remove the misleading "Only adds delete files → Delete" bullet and explicitly note that Operation::Delete will be wired up alongside full MoR support.

@github-actions
Copy link
Copy Markdown
Contributor

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Apr 16, 2026
…fications

This commit implements the core transaction infrastructure for MERGE INTO,
UPDATE, and DELETE operations in Apache Iceberg-Rust. Based on the official
Iceberg Java implementation (RowDelta API).

**New file: `crates/iceberg/src/transaction/row_delta.rs`**
- RowDeltaAction: Transaction action supporting both data file additions
  and deletions in a single snapshot
- add_data_files(): Add new data files (inserts/rewrites in COW mode)
- remove_data_files(): Mark data files as deleted (COW mode)
- add_delete_files(): Reserved for future Merge-on-Read (MOR) support
- validate_from_snapshot(): Conflict detection for concurrent modifications
- RowDeltaOperation: Implements SnapshotProduceOperation trait
  - Determines operation type (Append/Delete/Overwrite) based on changes
  - Generates DELETED manifest entries for removed files
  - Carries forward existing manifests for unchanged data

**Modified: `crates/iceberg/src/transaction/mod.rs`**
- Add row_delta() method to Transaction API
- Export row_delta module

**Modified: `crates/iceberg/src/transaction/snapshot.rs`**
- Add write_delete_manifest() to write DELETED manifest entries
- Update manifest_file() to process delete entries from SnapshotProduceOperation
- Update validation to allow delete-only operations

Comprehensive unit tests with ~85% coverage:
- test_row_delta_add_only: Pure append operation
- test_row_delta_remove_only: Delete-only operation
- test_row_delta_add_and_remove: COW update (remove old, add new)
- test_row_delta_with_snapshot_properties: Custom snapshot properties
- test_row_delta_validate_from_snapshot: Snapshot validation logic
- test_row_delta_empty_action: Empty operation error handling
- test_row_delta_incompatible_partition_value: Partition validation

All existing tests pass (1135 passed; 0 failed).

Copy-on-Write (COW) Strategy:
- For row-level modifications: read target files, apply changes,
  write new files, mark old files deleted
- For inserts: write new data files
- Merge-on-Read (MOR) with delete files is reserved for future optimization

References:
- Java implementation: org.apache.iceberg.RowDelta, BaseRowDelta
- Based on implementation plan for MERGE INTO support
@wirybeaver wirybeaver force-pushed the xuanyili/rowdeltaCoW branch from 1a3944c to 5654cf9 Compare April 16, 2026 02:43
- Fix test_row_delta_validate_from_snapshot to assert ErrorKind::DataInvalid
  directly rather than matching against error message strings
- Correct operation() doc comment: remove inaccurate "Only adds delete files → Delete"
  bullet; add explicit note that Operation::Delete is deferred until MoR is wired up
- Add comment explaining why removed_data_files are not validated (already-committed
  files, matches Java MergingSnapshotProducer behavior)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
@wirybeaver
Copy link
Copy Markdown
Author

Thanks for the thorough reviews @jdockerty and @DAlperin! Summarizing what's been updated since the initial review:

Code fixes (latest commits after the initial submission):

  • existing_manifest(): now properly excludes manifests containing deleted files rather than carrying all of them forward. In CoW mode, when a data file is rewritten, any manifest referencing the old file must be excluded from the new snapshot — otherwise the table would still expose the old file to readers.
  • delete_entries(): builds ManifestEntry with ManifestStatus::Deleted and the correct snapshot ID.
  • write_delete_manifest error message typo fixed: "write" → "writing".
  • test_row_delta_validate_from_snapshot: changed to unwrap_err().kind() asserting ErrorKind::DataInvalid instead of string matching.
  • operation() doc comment: removed the inaccurate "Only adds delete files → Delete" bullet; added explicit note that Operation::Delete is deferred until Merge-on-Read is wired up.
  • Added inline comment explaining why removed_data_files are not validated (they are already-committed table files; matches Java MergingSnapshotProducer behavior).

PR description has been rewritten for reviewers unfamiliar with the CoW/MoR distinction — includes a background section, method table, and links to the specific Java classes this port is based on.

unwrap_err() requires T: Debug on the Ok type (ActionCommit), which is
not derived. Use a match instead to extract and assert the error kind.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
@github-actions github-actions bot removed the stale label Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement RowDeltaAction transaction action for row-level modifications for CoW

3 participants