feat(exporter): skip GCS upload when object CRC32C is unchanged #5338
Open
SanskaarUndale21 wants to merge 1 commit into
Before this change, the exporter uploaded every output file on every run regardless of whether the content had changed. Since all.zip and other outputs are now reproducible (google#3491), unchanged files would accumulate redundant object generations in the bucket, making it harder for downstream consumers to detect real updates.

The writer now calls ReadObjectAttrs before each GCS write and computes the CRC32C of the outgoing data using the Castagnoli polynomial (the same algorithm GCS uses for its stored checksums). If the checksums match, the upload is skipped and an info log is emitted. New objects (ErrNotFound) and any transient attr-read errors fall through to the normal upload path so the exporter remains correct under all conditions.

Tests verify the three cases: same content is skipped, changed content is uploaded, and brand-new objects are always created.

Fixes google#3513
Summary
Fixes #3513.
Before this change, the exporter uploaded every output file on every run regardless of whether the content had changed. Since all.zip and other generated files became reproducible in #3491, unchanged runs would still create new object generations in the bucket -- making it harder for downstream consumers to detect real updates and generating unnecessary storage churn.
Approach: before each GCS write, the writer now fetches the existing object's attributes via `ReadObjectAttrs` and computes the CRC32C checksum of the outgoing data using the Castagnoli polynomial (the same algorithm GCS itself uses). If the checksums match, the upload is skipped and an info log is emitted. Three edge cases fall through to the normal upload path:

- new object (`ErrNotFound`) -- upload always proceeds
- transient attribute-read error -- upload proceeds

The local-write path (used for dev/test via `-upload-to-gcs=false`) is unchanged, since writes there carry no network cost.

Changes

- `go/cmd/exporter/writer.go`: add `gcsContentUnchanged` helper and call it before `WriteObject`
- `go/cmd/exporter/writer_test.go`: new test file covering skip-on-unchanged, upload-on-changed, upload-of-new-object, and multi-file skip

Test plan

- `TestWriter_GCS_SkipsUnchangedContent` -- same data, generation must not increment
- `TestWriter_GCS_UploadsChangedContent` -- different data, generation must increment
- `TestWriter_GCS_UploadsNewObject` -- no pre-existing object, must be created at generation 1
- `TestWriter_GCS_SkipsMultipleUnchanged` -- three unchanged files, all generations stable