Skip to content

Conversation

@nikola-jokic
Copy link
Collaborator

Instead of testing known retryable errors (Evicted, OutOf...), this change checks for pod failed state. If the pod is in the failed state, that means that all containers exited.

If all containers are done, we need to inspect the state. If the state of the runner is not yet updated, the we delete the pod as failed unless it was tagged with the job. If it is tagged with the job, we delete the ephemeral runner.

Otherwise, we inspect the runner container state. The runner container exit code 0 means that the runner successfully executed the job. If the pod state is set to failed, but the runner's exit code is 0, we treat that exit code as successful. Let's say that the side-car exited with 1 after runner exited with 0. It would still mean that the runner successfully executed the job, which is the most important part of this check.

Otherwise, we would either remove the ephemeral runner if it was tagged with the job, or retry the pod with backoff.

Fixes #4148

Copilot AI review requested due to automatic review settings December 17, 2025 15:04
@nikola-jokic nikola-jokic added the gha-runner-scale-set Related to the gha-runner-scale-set mode label Dec 17, 2025
@github-actions
Copy link
Contributor

Hello! Thank you for your contribution.

Please review our contribution guidelines to understand the project's testing and code conventions.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the pod failure handling logic in the EphemeralRunner controller to be more flexible and general. Instead of checking for specific failure reasons like "Evicted" or "OutOf...", the new approach inspects the runner container's termination state to determine the appropriate action.

Key Changes:

  • Generalized pod failure handling based on container state rather than specific pod reasons
  • Added logic to treat pod failures with runner container exit code 0 as successful completions (handling sidecar failures gracefully)
  • Enhanced decision logic to consider both job assignment status and container termination state when determining whether to retry pods or delete runners

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
controllers/actions.github.com/ephemeralrunner_controller.go Replaced specific reason-based failure handling (Evicted, OutOf...) with generalized container state inspection. Added logic to treat exit code 0 as success even when pod fails. Enhanced documentation with inline comments.
controllers/actions.github.com/ephemeralrunner_controller_test.go Added three new test cases covering: pod failure without container state (with job assigned), pod failure with exit code 0 (treating as success), and pod failure with exit code 1 (without job, triggering retry).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +341 to +345
log.Error(
errors.New("ephemeral runner container has failed, with runner container exit code non-zero"),
"Ephemeral runner container has failed, and runner container termination exit code is non-zero",
"containerTerminatedState", cs.State.Terminated,
)
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message passed to errors.New on line 342 and the log message on line 343 are redundant and convey essentially the same information. Consider simplifying this by either removing the errors.New wrapper and just using log.Error with a message, or making the messages more distinct if they serve different purposes.

Copilot uses AI. Check for mistakes.
ephemeralRunnerTimeout,
ephemeralRunnerInterval,
).Should(BeTrue(), "Pod should be re-created")
})
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new pod failure handling logic introduced in this PR handles the case where a pod fails with no container status and no job assigned (it should retry the pod by deleting it and letting it be recreated). However, there's no test case covering this specific scenario. Consider adding a test that sets pod.Status.Phase to PodFailed with pod.Status.ContainerStatuses = nil and no JobID set, then verifies that the pod is deleted and recreated (similar to the test on line 347, but without setting container statuses).

Copilot uses AI. Check for mistakes.
ephemeralRunnerTimeout,
ephemeralRunnerInterval,
).Should(BeTrue(), "Ephemeral runner should eventually be deleted")
})
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a test case for the scenario where the pod fails but the runner container exits with code 0 and no job is assigned. This would verify that the "treat as success" logic works correctly even when JobID is not set. Currently, the test on line 302 only covers this scenario when a job is assigned.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gha-runner-scale-set Related to the gha-runner-scale-set mode

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EphemeralRunner left stuck Running after node drain/pod termination

2 participants