feat(exporter): skip GCS upload when object CRC32C is unchanged #5338

Open
SanskaarUndale21 wants to merge 1 commit into google:master from SanskaarUndale21:fix/exporter-crc-skip-upload


SanskaarUndale21 commented May 11, 2026

Summary

Fixes #3513.

Before this change, the exporter uploaded every output file on every run regardless of whether the content had changed. Although all.zip and other generated files became reproducible in #3491, a run with no content changes still created a new object generation for every file, making it harder for downstream consumers to detect real updates and generating unnecessary storage churn.

Approach: before each GCS write, the writer now fetches the existing object's attributes via ReadObjectAttrs and computes the CRC32C checksum of the outgoing data using the Castagnoli polynomial (the same algorithm GCS itself uses). If the checksums match, the upload is skipped and an info log is emitted. Three cases fall through to the normal upload path:

  • Object does not exist yet (ErrNotFound) -- upload always proceeds
  • Any transient error reading attrs -- upload proceeds, a warning is logged
  • Checksums differ -- upload proceeds as usual

The local-write path (used for dev/test via -upload-to-gcs=false) is unchanged, since writes there carry no network cost.

Changes

  • go/cmd/exporter/writer.go: add gcsContentUnchanged helper and call it before WriteObject
  • go/cmd/exporter/writer_test.go: new test file covering skip-on-unchanged, upload-on-changed, upload-of-new-object, and multi-file skip

Test plan

  • TestWriter_GCS_SkipsUnchangedContent -- same data, generation must not increment
  • TestWriter_GCS_UploadsChangedContent -- different data, generation must increment
  • TestWriter_GCS_UploadsNewObject -- no pre-existing object, must be created at generation 1
  • TestWriter_GCS_SkipsMultipleUnchanged -- three unchanged files, all generations stable
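
The generation assertions in this plan can be pictured with a small in-memory model. This is a sketch under assumptions, not the contents of writer_test.go: `memBucket`, `object`, and `write` are hypothetical names, and the counter that starts at 1 mirrors what the tests appear to rely on rather than how production GCS assigns generation numbers.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// crc32cTable is the standard library table for the Castagnoli polynomial.
var crc32cTable = crc32.MakeTable(crc32.Castagnoli)

// object is a hypothetical in-memory record of one stored object.
type object struct {
	crc32c     uint32
	generation int64
}

// memBucket is a hypothetical fake bucket keyed by object path.
type memBucket map[string]object

// write stores data unless the existing object's checksum already matches.
// It reports whether an upload happened; each upload bumps the generation.
func (b memBucket) write(path string, data []byte) bool {
	sum := crc32.Checksum(data, crc32cTable)
	if old, ok := b[path]; ok && old.crc32c == sum {
		return false // unchanged: skip, generation stays the same
	}
	// A missing key yields the zero value, so a new object gets generation 1.
	b[path] = object{crc32c: sum, generation: b[path].generation + 1}
	return true
}

func main() {
	b := memBucket{}
	for _, data := range []string{"v1", "v1", "v2"} {
		uploaded := b.write("all.zip", []byte(data))
		fmt.Printf("data=%q uploaded=%v generation=%d\n",
			data, uploaded, b["all.zip"].generation)
	}
}
```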

Before this change, the exporter uploaded every output file on every
run regardless of whether the content had changed. Since all.zip and
other outputs are now reproducible (google#3491), unchanged files would
accumulate redundant object generations in the bucket, making it
harder for downstream consumers to detect real updates.

The writer now calls ReadObjectAttrs before each GCS write and
computes the CRC32C of the outgoing data using the Castagnoli
polynomial (the same algorithm GCS uses for its stored checksums).
If the checksums match, the upload is skipped and an info log is
emitted. New objects (ErrNotFound) and any transient attr-read
errors fall through to the normal upload path so the exporter
remains correct under all conditions.

Tests verify the three cases: same content is skipped, changed
content is uploaded, and brand-new objects are always created.

Fixes google#3513

Development

Successfully merging this pull request may close these issues.

Exporter should check the crc hash before uploading
