[SPARK-56934][INFRA] Make build_infra_images_cache workflow error tolerant #55972
Closed
zhengruifeng wants to merge 1 commit into
### What changes were proposed in this pull request?

Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:

- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.

### Why are the changes needed?

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's outcome.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)
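For readers who have not seen the pattern, the two changes described above might look roughly like the excerpt below. This is a minimal sketch, not the actual diff: the job name, the step ids (`build_base`, `build_other`), the tags, the second build context, and the `@v6` action reference are all assumptions made for illustration (the real workflow pins `docker/build-push-action` to a commit SHA and has 12 build steps).

```yaml
# Hypothetical excerpt of a workflow in the style of build_infra_images_cache.yml.
# Only two of the build steps are shown; ids, tags, and paths are placeholders.
jobs:
  main:
    runs-on: ubuntu-latest
    steps:
      - name: Build and push
        id: build_base                     # assumed id, referenced by the final step
        uses: docker/build-push-action@v6  # assumption: the real file pins a SHA
        continue-on-error: true            # a failure here no longer skips later steps
        with:
          context: ./dev/infra/
          push: true
          tags: ghcr.io/example/base-cache:latest    # placeholder tag

      - name: Build and push (other image)
        id: build_other                    # assumed id
        uses: docker/build-push-action@v6
        continue-on-error: true
        with:
          context: ./path/to/other-image/  # placeholder path
          push: true
          tags: ghcr.io/example/other-cache:latest   # placeholder tag

      - name: Fail if any image build failed
        if: always()
        run: |
          failed=0
          # Print one step's outcome and remember whether it failed.
          check() {
            echo "$1: $2"
            if [ "$2" = "failure" ]; then failed=1; fi
          }
          check base  "${{ steps.build_base.outcome }}"
          check other "${{ steps.build_other.outcome }}"
          exit "$failed"
```

The final step reads `outcome` rather than `conclusion` because, for a step with `continue-on-error: true`, `conclusion` is reported as `success` even when the step failed, while `outcome` keeps the original result.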
zhengruifeng (Contributor, Author) commented:
also cc @gaogaotiantian
HyukjinKwon approved these changes May 19, 2026
dongjoon-hyun approved these changes May 19, 2026
peter-toth approved these changes May 19, 2026
gaogaotiantian approved these changes May 19, 2026
zhengruifeng (Contributor, Author) commented:
merged to master, will make separate PR for 4.x
This was referenced May 20, 2026
zhengruifeng added a commit that referenced this pull request May 20, 2026

…r tolerant

Backport of #55972 to `branch-4.x`. Cherry-picked from master commit 83b2d07; resolved a trivial conflict on the `docker/build-push-action` SHA pin (kept the version already in use on `branch-4.x`).

### What changes were proposed in this pull request?

Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:

- Add `continue-on-error: true` to each of the 12 `Build and push` steps so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.

### Why are the changes needed?

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 12 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

Closes #56004 from zhengruifeng/build-infra-images-continue-on-error-4.x.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request May 20, 2026

…r tolerant

Backport of #55972 to `branch-4.2`. Cherry-picked from master commit 83b2d07 and adapted for `branch-4.2`:

- Kept the `docker/build-push-action` SHA pin already in use on `branch-4.2` (the master change was on a newer SHA).
- Also applied `continue-on-error: true` to the `Build and push (PySpark with Python 3.10)` step, which is 4.2-only.
- Added a corresponding `pyspark-python-310` entry to the final "Fail if any image build failed" summary step.

### What changes were proposed in this pull request?

Make the `build_infra_images_cache.yml` workflow tolerant of individual image build failures:

- Add `continue-on-error: true` to each `Build and push` step (13 on `branch-4.2`) so a failure in one does not abort the remaining builds. In particular, a failure of the base `./dev/infra/` image build should no longer prevent the other image builds from running.
- Add a final "Fail if any image build failed" step that runs with `if: always()`, prints each build step's `outcome`, and exits non-zero if any was `failure`.

### Why are the changes needed?

Today, a single image build failure aborts the workflow immediately, leaving the remaining cache layers stale until someone re-triggers the job. This is especially impactful when the first step (the `./dev/infra/` base image) fails, because every subsequent image build is then skipped on that run. With this change every image still gets a chance to build and refresh its cache on each run, while the overall workflow still fails if any image build did not succeed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(...)"`). Verified all 13 build steps received `continue-on-error: true` and that the final aggregator step references every build step's `outcome`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

Closes #56005 from zhengruifeng/build-infra-images-continue-on-error-4.2.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>