-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Make restart pod more flexible to different failure scenarios #4340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Hello! Thank you for your contribution. Please review our contribution guidelines to understand the project's testing and code conventions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the pod failure handling logic in the EphemeralRunner controller to be more flexible and general. Instead of checking for specific failure reasons like "Evicted" or "OutOf...", the new approach inspects the runner container's termination state to determine the appropriate action.
Key Changes:
- Generalized pod failure handling based on container state rather than specific pod reasons
- Added logic to treat pod failures with runner container exit code 0 as successful completions (handling sidecar failures gracefully)
- Enhanced decision logic to consider both job assignment status and container termination state when determining whether to retry pods or delete runners
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| controllers/actions.github.com/ephemeralrunner_controller.go | Replaced specific reason-based failure handling (Evicted, OutOf...) with generalized container state inspection. Added logic to treat exit code 0 as success even when pod fails. Enhanced documentation with inline comments. |
| controllers/actions.github.com/ephemeralrunner_controller_test.go | Added three new test cases covering: pod failure without container state (with job assigned), pod failure with exit code 0 (treating as success), and pod failure with exit code 1 (without job, triggering retry). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| log.Error( | ||
| errors.New("ephemeral runner container has failed, with runner container exit code non-zero"), | ||
| "Ephemeral runner container has failed, and runner container termination exit code is non-zero", | ||
| "containerTerminatedState", cs.State.Terminated, | ||
| ) |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message passed to errors.New on line 342 and the log message on line 343 are redundant and convey essentially the same information. Consider simplifying this by either removing the errors.New wrapper and just using log.Error with a message, or making the messages more distinct if they serve different purposes.
| ephemeralRunnerTimeout, | ||
| ephemeralRunnerInterval, | ||
| ).Should(BeTrue(), "Pod should be re-created") | ||
| }) |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new pod failure handling logic introduced in this PR handles the case where a pod fails with no container status and no job assigned (it should retry the pod by deleting it and letting it be recreated). However, there's no test case covering this specific scenario. Consider adding a test that sets pod.Status.Phase to PodFailed with pod.Status.ContainerStatuses = nil and no JobID set, then verifies that the pod is deleted and recreated (similar to the test on line 347, but without setting container statuses).
| ephemeralRunnerTimeout, | ||
| ephemeralRunnerInterval, | ||
| ).Should(BeTrue(), "Ephemeral runner should eventually be deleted") | ||
| }) |
Copilot
AI
Dec 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a test case for the scenario where the pod fails but the runner container exits with code 0 and no job is assigned. This would verify that the "treat as success" logic works correctly even when JobID is not set. Currently, the test on line 302 only covers this scenario when a job is assigned.
Instead of testing known retryable errors (Evicted, OutOf...), this change checks for pod failed state. If the pod is in the failed state, that means that all containers exited.
If all containers are done, we need to inspect the state. If the state of the runner is not yet updated, the we delete the pod as failed unless it was tagged with the job. If it is tagged with the job, we delete the ephemeral runner.
Otherwise, we inspect the runner container state. The runner container exit code 0 means that the runner successfully executed the job. If the pod state is set to failed, but the runner's exit code is 0, we treat that exit code as successful. Let's say that the side-car exited with 1 after runner exited with 0. It would still mean that the runner successfully executed the job, which is the most important part of this check.
Otherwise, we would either remove the ephemeral runner if it was tagged with the job, or retry the pod with backoff.
Fixes #4148