Make restart pod more flexible to different failure scenarios #4340

nikola-jokic · 2025-12-17T15:04:03Z

Instead of testing known retryable errors (Evicted, OutOf...), this change checks for pod failed state. If the pod is in the failed state, that means that all containers exited.

If all containers are done, we need to inspect the state. If the state of the runner is not yet updated, the we delete the pod as failed unless it was tagged with the job. If it is tagged with the job, we delete the ephemeral runner.

Otherwise, we inspect the runner container state. The runner container exit code 0 means that the runner successfully executed the job. If the pod state is set to failed, but the runner's exit code is 0, we treat that exit code as successful. Let's say that the side-car exited with 1 after runner exited with 0. It would still mean that the runner successfully executed the job, which is the most important part of this check.

Otherwise, we would either remove the ephemeral runner if it was tagged with the job, or retry the pod with backoff.

Fixes #4148

github-actions · 2025-12-17T15:04:20Z

Hello! Thank you for your contribution.

Please review our contribution guidelines to understand the project's testing and code conventions.

Copilot

Pull request overview

This PR refactors the pod failure handling logic in the EphemeralRunner controller to be more flexible and general. Instead of checking for specific failure reasons like "Evicted" or "OutOf...", the new approach inspects the runner container's termination state to determine the appropriate action.

Key Changes:

Generalized pod failure handling based on container state rather than specific pod reasons
Added logic to treat pod failures with runner container exit code 0 as successful completions (handling sidecar failures gracefully)
Enhanced decision logic to consider both job assignment status and container termination state when determining whether to retry pods or delete runners

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
controllers/actions.github.com/ephemeralrunner_controller.go	Replaced specific reason-based failure handling (Evicted, OutOf...) with generalized container state inspection. Added logic to treat exit code 0 as success even when pod fails. Enhanced documentation with inline comments.
controllers/actions.github.com/ephemeralrunner_controller_test.go	Added three new test cases covering: pod failure without container state (with job assigned), pod failure with exit code 0 (treating as success), and pod failure with exit code 1 (without job, triggering retry).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-17T15:11:32Z

controllers/actions.github.com/ephemeralrunner_controller.go

+		log.Error(
+			errors.New("ephemeral runner container has failed, with runner container exit code non-zero"),
+			"Ephemeral runner container has failed, and runner container termination exit code is non-zero",
+			"containerTerminatedState", cs.State.Terminated,
+		)


The error message passed to errors.New on line 342 and the log message on line 343 are redundant and convey essentially the same information. Consider simplifying this by either removing the errors.New wrapper and just using log.Error with a message, or making the messages more distinct if they serve different purposes.

Copilot · 2025-12-17T15:11:33Z

controllers/actions.github.com/ephemeralrunner_controller_test.go

+				ephemeralRunnerTimeout,
+				ephemeralRunnerInterval,
+			).Should(BeTrue(), "Pod should be re-created")
+		})


The new pod failure handling logic introduced in this PR handles the case where a pod fails with no container status and no job assigned (it should retry the pod by deleting it and letting it be recreated). However, there's no test case covering this specific scenario. Consider adding a test that sets pod.Status.Phase to PodFailed with pod.Status.ContainerStatuses = nil and no JobID set, then verifies that the pod is deleted and recreated (similar to the test on line 347, but without setting container statuses).

Copilot · 2025-12-17T15:11:33Z

controllers/actions.github.com/ephemeralrunner_controller_test.go

+				ephemeralRunnerTimeout,
+				ephemeralRunnerInterval,
+			).Should(BeTrue(), "Ephemeral runner should eventually be deleted")
+		})


Consider adding a test case for the scenario where the pod fails but the runner container exits with code 0 and no job is assigned. This would verify that the "treat as success" logic works correctly even when JobID is not set. Currently, the test on line 302 only covers this scenario when a job is assigned.

Make restart pod more flexible to different failure scenarios

d776f27

Copilot AI review requested due to automatic review settings December 17, 2025 15:04

nikola-jokic requested review from a team, mumoshu, rentziass and toast-gear as code owners December 17, 2025 15:04

nikola-jokic added the gha-runner-scale-set Related to the gha-runner-scale-set mode label Dec 17, 2025

Copilot started reviewing on behalf of nikola-jokic December 17, 2025 15:04 View session

Copilot AI reviewed Dec 17, 2025

View reviewed changes

nikola-jokic mentioned this pull request Dec 17, 2025

EphemeralRunner left stuck Running after node drain/pod termination #4148

Open

4 tasks

nikola-jokic added 2 commits December 17, 2025 17:11

add more tests

44819fe

fix test

9c40aba

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make restart pod more flexible to different failure scenarios #4340

Make restart pod more flexible to different failure scenarios #4340

nikola-jokic commented Dec 17, 2025

Uh oh!

github-actions bot commented Dec 17, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Dec 17, 2025

Uh oh!

Copilot AI Dec 17, 2025

Uh oh!

Copilot AI Dec 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Make restart pod more flexible to different failure scenarios #4340

Are you sure you want to change the base?

Make restart pod more flexible to different failure scenarios #4340

Conversation

nikola-jokic commented Dec 17, 2025

Uh oh!

github-actions bot commented Dec 17, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants