[SPARK-56934][INFRA] Make build_infra_images_cache workflow error tolerant #55972

Closed
zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:build-infra-images-continue-on-error

@zhengruifeng commented on May 19, 2026

### What changes were proposed in this pull request?

Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:

- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.
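
For illustration, the pattern described above might look roughly like the fragment below. It is a sketch, not the actual diff: the step ids (`infra_base`, `docs`), the second build context, the `@v6` tag (the real workflow pins a SHA), and the reduction to two build steps are all assumptions, and registry login and Buildx setup are omitted.

```yaml
# Sketch only: ids, tags, and the second context are hypothetical.
jobs:
  main:
    runs-on: ubuntu-latest
    steps:
      - name: Build and push
        id: infra_base                      # hypothetical step id
        continue-on-error: true             # a failure here no longer aborts the job
        uses: docker/build-push-action@v6   # the real workflow pins a commit SHA
        with:
          context: ./dev/infra/
          push: true

      - name: Build and push (Documentation)
        id: docs                            # hypothetical step id
        continue-on-error: true
        uses: docker/build-push-action@v6
        with:
          context: ./path/to/docs-image/    # hypothetical path
          push: true

      - name: Fail if any image build failed
        if: always()
        env:
          # `outcome` is the step result before continue-on-error is applied
          INFRA_BASE: ${{ steps.infra_base.outcome }}
          DOCS: ${{ steps.docs.outcome }}
        run: |
          failed=0
          for pair in "infra-base=$INFRA_BASE" "docs=$DOCS"; do
            echo "$pair"                    # print each build step's outcome
            case "$pair" in *=failure) failed=1 ;; esac
          done
          exit "$failed"                    # non-zero if any build failed
```

The summary step reads `outcome` rather than `conclusion` because, for a step with `continue-on-error: true`, `conclusion` is reported as `success` even when the step itself failed.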

### Why are the changes needed?

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

@zhengruifeng changed the title from "[INFRA] Make build_infra_images_cache workflow error tolerant" to "[SPARK-56934][INFRA] Make build_infra_images_cache workflow error tolerant" on May 19, 2026
@zhengruifeng marked this pull request as ready for review on May 19, 2026 02:46
@zhengruifeng requested review from HyukjinKwon, LuciferYang and dongjoon-hyun and removed request for LuciferYang on May 19, 2026 02:50
@zhengruifeng
Contributor Author

also cc @gaogaotiantian

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Yes, this is much better.

Thank you, @zhengruifeng and @HyukjinKwon .

cc @peter-toth

@zhengruifeng
Contributor Author

Merged to master. I will make a separate PR for 4.x.

@zhengruifeng deleted the build-infra-images-continue-on-error branch on May 20, 2026 01:12
zhengruifeng added a commit that referenced this pull request May 20, 2026
…r tolerant

Backport of #55972 to `branch-4.x`. Cherry-picked from master commit 83b2d07; resolved a trivial conflict on the `docker/build-push-action` SHA pin (kept the version already in use on `branch-4.x`).

### What changes were proposed in this pull request?

Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:

- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.

### Why are the changes needed?

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

Closes #56004 from zhengruifeng/build-infra-images-continue-on-error-4.x.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request May 20, 2026
…r tolerant

Backport of #55972 to `branch-4.2`. Cherry-picked from master commit 83b2d07 and adapted for `branch-4.2`:

- Kept the `docker/build-push-action` SHA pin already in use on `branch-4.2` (the master change was on a newer SHA).
- Also applied `continue-on-error: true` to the `Build and push (PySpark with Python 3.10)` step, which is 4.2-only.
- Added a corresponding `pyspark-python-310` entry to the final "Fail if any image build failed" summary step.

### What changes were proposed in this pull request?

Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:

- Add `continue-on-error: true` to each `Build and push` step (13 on `branch-4.2`) so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.

### Why are the changes needed?

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 13 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

Closes #56005 from zhengruifeng/build-infra-images-continue-on-error-4.2.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>