
perf: Hoist table_metadata at remaining repeat-access sites in snapshot update#3301

Open
rynewang wants to merge 3 commits into apache:main from rynewang:perf/hoist-table-metadata-snapshot-loops

Conversation

@rynewang

Summary

Follow-up to #2674. Transaction.table_metadata replays all staged updates via model_copy(deep=True) on every access, so reading it (or spec()/schema() derived from it) repeatedly within a single snapshot-producer method is redundant deep-copy work.

#2674 hoisted the property access in _summary(); this PR extends the same pattern to three more call sites in pyiceberg/table/update/snapshot.py that still read the property more than once per invocation.

Changes

  • _SnapshotProducer._summary: hoist spec()/schema() out of the per-data-file loop (they are invariant across files; still called 2× per file before this change)
  • _DeleteFiles._compute_deletes: hoist table_metadata/schema once at method entry (was 3 accesses — two via self.schema() for the metrics evaluators and one direct for snapshot_by_id)
  • _MergeAppendFiles.__init__: 3 consecutive self._transaction.table_metadata.properties accesses → 1

All hoists are at method entry. Nothing inside these methods stages a transaction update (the AddSnapshotUpdate is staged by the caller after _commit() returns), so table_metadata is invariant for the duration of each method.
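To make the pattern concrete, here is a minimal self-contained sketch of the before/after shape, using hypothetical stand-in classes (`Metadata`, `Transaction`, `summary_before`, `summary_after` are illustrative names, not the real pyiceberg types):

```python
import copy

class Metadata:
    # Hypothetical stand-in for TableMetadata.
    def __init__(self):
        self.properties = {"write.format.default": "parquet"}

class Transaction:
    # Minimal sketch: each table_metadata access deep-copies, mimicking
    # the model_copy(deep=True) replay of staged updates.
    def __init__(self):
        self._meta = Metadata()
        self.accesses = 0

    @property
    def table_metadata(self):
        self.accesses += 1
        return copy.deepcopy(self._meta)

def summary_before(txn, data_files):
    # Pre-hoist shape: two property reads on every loop iteration.
    return [(txn.table_metadata.properties, txn.table_metadata.properties)
            for _ in data_files]

def summary_after(txn, data_files):
    # Hoisted shape: metadata is invariant for the method, so read it once
    # at entry and reuse the local inside the loop.
    metadata = txn.table_metadata
    return [(metadata.properties, metadata.properties) for _ in data_files]
```

With 50 files, the pre-hoist version pays for 100 deep copies while the hoisted version pays for one; the loop body is otherwise unchanged.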

Not touched here: the new_manifest_writer(self.spec(id)) calls inside per-manifest loops in _write_delete_manifest / _compute_deletes / _OverwriteFiles._existing_manifests also trigger 2–3 property accesses per iteration via the schema()/spec()/new_manifest_writer() helpers. Those loops are O(partition-groups or rewritten-manifests) rather than O(files), and fixing them cleanly would mean changing the helper signatures — happy to do that in a follow-up if there's interest.

Testing

New test_snapshot_producer_bounded_metadata_access wraps Transaction.table_metadata with a call counter and asserts:

  • _summary() access count is identical for 10 vs 100 appended files (independent of N), and ≤ 2
  • _MergeAppendFiles.__init__ makes exactly 1 more access than _FastAppendFiles.__init__ (was 3 before this change — verified the test fails with the production diff reverted)

The test constructs _FastAppendFiles / _MergeAppendFiles directly rather than going through the public append path, since the public path writes manifest Avro files; the property-access count being measured is the behaviour under test and doesn't require I/O.
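A self-contained sketch of the counting approach (all names here are hypothetical stand-ins, not the real pyiceberg classes or the actual test): spy on the underlying metadata-building function with `wraps=` instead of wrapping the property itself.

```python
from types import SimpleNamespace
from unittest import mock

class Metadata:
    # Hypothetical stand-in for TableMetadata.
    properties = {"write.format.default": "parquet"}

# Stand-in for the module-level helper that performs the expensive
# staged-update replay in the real code.
_updates = SimpleNamespace(update_table_metadata=lambda base: base)

class Transaction:
    # Minimal sketch: every property access pays the replay cost.
    @property
    def table_metadata(self):
        return _updates.update_table_metadata(Metadata())

def summary(txn, n_files):
    # Hoisted implementation: one metadata read regardless of n_files.
    metadata = txn.table_metadata
    return [metadata.properties for _ in range(n_files)]

def count_metadata_accesses(fn, *args):
    # Spy on the underlying function with wraps= rather than patching the
    # property's fget, which would trip mypy's attr-defined check.
    with mock.patch.object(_updates, "update_table_metadata",
                           wraps=_updates.update_table_metadata) as spy:
        fn(*args)
    return spy.call_count
```

The assertion shape then mirrors the bullets above: the count for 10 files equals the count for 100, because the hoisted method reads the property a constant number of times.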

Existing tests in tests/table/test_snapshots.py pass.

Motivation

For appends/deletes/overwrites touching large numbers of files or manifests, the per-iteration property access dominates wall-clock (each access replays the staged-updates list through pydantic model_copy). This keeps the cost constant per method call.
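A rough cost model makes the scaling explicit (an assumption-level sketch, treating one staged-update replay step as the unit of work; `metadata_copy_work` is an illustrative name):

```python
def metadata_copy_work(property_accesses: int, staged_updates: int) -> int:
    # Rough model (assumption): each table_metadata access replays every
    # staged update through a deep copy, so total work scales with both
    # the number of accesses and the number of staged updates.
    return property_accesses * staged_updates

n_files, staged = 10_000, 4
per_file = metadata_copy_work(2 * n_files, staged)  # spec()+schema() per file
hoisted = metadata_copy_work(2, staged)             # two reads at method entry
assert per_file // hoisted == n_files               # hoisting removes the O(files) factor
```

The point of the model is only that the pre-hoist cost carries an O(files) factor which the hoist eliminates; absolute constants depend on pydantic's copy cost and the metadata size.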

…ot update

Follow-up to apache#2674. Transaction.table_metadata replays all staged
updates via model_copy on every access; this applies the apache#2674
hoist pattern to three more sites in snapshot.py that still read
the property more than once per invocation:

- _SnapshotProducer._summary: hoist spec()/schema() out of the
  per-data-file loop
- _DeleteFiles._compute_deletes: hoist table_metadata/schema once
  (was 3 accesses at method entry)
- _MergeAppendFiles.__init__: 3 consecutive .properties reads -> 1

Adds a regression test asserting _summary() access count is
independent of file count and _MergeAppendFiles.__init__ adds
exactly one access over its superclass.

Avoids mypy attr-defined on Transaction.table_metadata.fget by
counting calls to the underlying update_table_metadata function
(the actual expensive operation) via mock.patch with wraps.
