Skip to content

Roll added manifests at commit.manifest.target-size-bytes in fast-append#3304

Closed
rynewang wants to merge 1 commit intoapache:mainfrom
rynewang:perf/rolling-manifest-fast-append
Closed

Roll added manifests at commit.manifest.target-size-bytes in fast-append#3304
rynewang wants to merge 1 commit intoapache:mainfrom
rynewang:perf/rolling-manifest-fast-append

Conversation

@rynewang
Copy link
Copy Markdown

Summary

_FastAppendFiles currently writes every added data file into a single manifest. commit.manifest.target-size-bytes (default 8 MiB) is already defined and honoured by _MergeAppendFiles when bin-packing existing manifests, but the fast-append path never reads it — so a large append produces one arbitrarily large manifest. Java Iceberg's RollingManifestWriter rolls over at the target on the fast-append path too; this brings pyiceberg to parity.

Change

_SnapshotProducer._manifests() now:

  • Writes the first added-manifest inline until writer.tell() reaches commit.manifest.target-size-bytes, which yields an exact entries-per-manifest count for this append's entry shape.
  • Chunks the remaining _added_data_files by that count and submits each chunk to the existing ExecutorFactory thread pool. Encoding is GIL-bound, but block compression (zlib) and the output-file write both release the GIL, so one chunk's encode overlaps with earlier chunks' compress/upload.
  • Submits _write_delete_manifest and _existing_manifests to the pool before starting the inline first chunk, so they still run concurrently.

Small appends (everything fits under the target) take the same code path and produce one manifest, same as before.

Why multiple manifests matter downstream

  • Planners (Trino/Athena/Spark) read manifests in parallel; one multi-GB manifest is a single-threaded decode bottleneck.
  • The manifest-list carries per-manifest partition bounds, so predicates can skip whole manifests without opening them — which is impossible with one monolith.
  • Delete/overwrite rewrites any manifest containing an affected file; with one large manifest, a single-row delete rewrites the whole thing.

Testing

  • test_fast_append_rolls_added_manifests_at_target_size: creates a table with commit.manifest.target-size-bytes=4096, appends 200 data files, asserts _manifests() returns more than one manifest, all entries accounted for, and each manifest is within a small multiple of the target.
  • test_fast_append_single_manifest_when_under_target: 3 files → one manifest.
  • Existing tests/table/test_snapshots.py and tests/utils/test_manifest.py pass unchanged.

Relationship to #3303

Independent — this PR only touches snapshot.py, #3303 only touches pyiceberg/avro/. The thread-pool overlap here helps with or without the Cython encoder (pure-Python encode still overlaps with compress/upload); with #3303 the encode floor drops and the overlap ratio improves.

_FastAppendFiles previously wrote every added data file into a single
manifest, regardless of commit.manifest.target-size-bytes. Java's
RollingManifestWriter honours this property on the fast-append path
too; this change brings pyiceberg to parity.

The first manifest is written inline until it reaches the target,
which yields an exact entries-per-manifest count; remaining chunks
are then submitted to the existing ExecutorFactory pool so the
GIL-bound encode of chunk K overlaps with the compress/upload
(both GIL-releasing) of earlier chunks. _write_delete_manifest and
_existing_manifests are submitted first so they run concurrently
with the inline first chunk.

Small appends (under the target in one manifest) take the same
code path and produce a single manifest as before.

Multiple small manifests let query planners read them in parallel,
prune whole manifests via the manifest-list partition bounds
without opening them, and keep delete/overwrite rewrites bounded
to the affected manifest instead of the full set.
Copy link
Copy Markdown
Member

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening @rynewang, this is definitely a problem as pyiceberg needs manifest rolling support like the other sub-projects. However, there is an open pr that starts that work in #3009 that still needs review. The approach taken there is the ground work that mirrors the logic in Java's RollingManifestWriter. Once this lands, then we can hook it into our manifest writing logic.

@rynewang
Copy link
Copy Markdown
Author

rynewang commented May 1, 2026

Thanks! Great to know it's already being handled. closing this PR.

@rynewang rynewang closed this May 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants