fix: dynamic_partition_overwrite builds per-spec delete predicates after partition spec evolution#3149
Conversation
Fixes apache#3148

When a table has undergone partition spec evolution, its snapshot may contain manifests written under different partition_spec_ids. Previously, dynamic_partition_overwrite built the delete predicate using only the current spec, causing the manifest evaluator to incorrectly skip manifests from older specs, leaving stale data files silently behind.

The fix builds the delete predicate per historical spec present in the snapshot, projecting the new data files' partition values into each spec's coordinate space before evaluating.

Regression tests added covering:
- Mixed-spec snapshot (manifests from both spec-0 and spec-1)
- Overwrite of a partition that only exists in spec-0 manifests (silent data duplication case)
```python
# only the fields that spec knows about, matched against the
# corresponding positions in the new data files' partition records.
snapshot = self.table_metadata.snapshot_by_name(branch or MAIN_BRANCH)
if snapshot is not None:
```
I think this logic is a bit wonky: if the branch doesn't exist, we silently fall back to the main branch.
```python
# corresponding positions in the new data files' partition records.
snapshot = self.table_metadata.snapshot_by_name(branch or MAIN_BRANCH)
if snapshot is not None:
    spec_ids_in_snapshot = {m.partition_spec_id for m in snapshot.manifests(io=self._table.io)}
```
Do we need to collect these upfront? This is pretty expensive, as it will do IO to pull all the manifests.
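One way to address this comment is to skip the upfront scan entirely and build each per-spec predicate lazily, during the single pass that already iterates manifests. A hypothetical sketch under that assumption (the names here are illustrative, not pyiceberg's):

```python
# Hypothetical sketch: build each spec's predicate at most once, on first
# encounter, so no extra IO pass is needed just to enumerate spec ids.
from typing import Callable, Dict, Iterable, Iterator, Tuple

def evaluate_lazily(
    manifests: Iterable,                       # objects with .partition_spec_id
    build_predicate: Callable[[int], object],  # predicate builder per spec id
) -> Iterator[Tuple[object, object]]:
    cache: Dict[int, object] = {}
    for manifest in manifests:
        spec_id = manifest.partition_spec_id
        if spec_id not in cache:  # first time this spec id is seen
            cache[spec_id] = build_predicate(spec_id)
        yield manifest, cache[spec_id]
```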
```
Expected after overwrite:
- Only new A rows: values [888, 999]
- All B rows untouched: values [10, 11, 200]
- Total: 5 rows

Bug (pre-fix): spec-0 A manifests are skipped by the evaluator,
leaving stale A rows (1, 2, 3) in the table -> 8 rows total.
```
Hmm. I'm not sure if we can make this change. Existing users might be used to the existing behavior; when we change this, it will drop many more rows than before.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
Rationale

While reviewing PR #3011 (manifest pruning optimization), I identified a correctness gap when tables have undergone partition spec evolution.

When `dynamic_partition_overwrite` is called on a table with mixed `partition_spec_id`s in its snapshot, the delete predicate was built using only the current partition spec. This caused `inclusive_projection` to fail silently when evaluating older manifests: the predicate contained field references (e.g. `region`) that have no corresponding partition field in the old spec, causing the manifest evaluator to skip those manifests entirely. The result is silent data duplication: stale rows from old-spec manifests are never deleted.
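The failure mode described above can be modeled in a few lines. This is a deliberately simplified, hypothetical sketch of the evaluator's behavior, not the actual pyiceberg code:

```python
# Hypothetical model of the bug: a naive evaluator that treats a predicate
# field unknown to a manifest's spec as "cannot match" will silently skip
# manifests written under older specs.

def manifest_may_match(predicate: dict, spec_fields: set) -> bool:
    """Return True if the manifest may contain rows matching the predicate."""
    for field in predicate:
        if field not in spec_fields:
            return False  # BUG: old-spec manifests get skipped here
    return True

spec0_fields = {"day"}            # original spec: partitioned by day only
spec1_fields = {"day", "region"}  # evolved spec: day + region

# Delete predicate built from the *current* spec only:
delete_pred = {"day": "2024-01-01", "region": "us"}

assert manifest_may_match(delete_pred, spec1_fields) is True
assert manifest_may_match(delete_pred, spec0_fields) is False  # stale rows survive
```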
Changes

`pyiceberg/table/__init__.py`: `dynamic_partition_overwrite` now iterates over all `partition_spec_id`s present in the current snapshot and builds a per-spec delete predicate, projecting the new data files' partition values into each historical spec's coordinate space before evaluating.

`tests/integration/test_manifest_pruning_spec_evolution.py`: two regression tests added:
- Mixed-spec snapshot (manifests from both spec-0 and spec-1)
- Overwrite of a partition that only exists in spec-0 manifests (silent data duplication case: no exception raised, wrong rows survive)
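The per-spec projection described above can be sketched as follows. This is a simplified illustration under assumed names (`per_spec_predicates`, plain dict predicates), not the actual pyiceberg implementation:

```python
# Hypothetical sketch of the fix: instead of one predicate from the current
# spec, build one predicate per historical spec, restricted to the partition
# fields that spec actually knows about.

def per_spec_predicates(partition_values: dict, specs: dict) -> dict:
    """Project new files' partition values into each spec's field set.

    specs maps spec_id -> set of partition field names in that spec.
    Returns spec_id -> predicate containing only that spec's fields.
    """
    return {
        spec_id: {f: v for f, v in partition_values.items() if f in fields}
        for spec_id, fields in specs.items()
    }

specs = {0: {"day"}, 1: {"day", "region"}}
new_partition = {"day": "2024-01-01", "region": "us"}

preds = per_spec_predicates(new_partition, specs)
assert preds[0] == {"day": "2024-01-01"}  # spec-0 manifests matched again
assert preds[1] == {"day": "2024-01-01", "region": "us"}
```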
Are these changes tested?
Yes — two new integration tests using the SQLite in-memory catalog, no external
services required.
Are there any user-facing changes?
Yes — `dynamic_partition_overwrite` now correctly deletes all matching rows across all historical partition specs, fixing silent data duplication on evolved tables.
Related