
Conversation


@loverustfs loverustfs commented Dec 20, 2025

Link

rustfs/rustfs#1193


Summary

This PR adds a configurable “Pod Deletion Policy When Node Is Down” option to the RustFS operator, inspired by Longhorn’s setting. It automates cleanup of terminating Pods stuck on an unreachable node, enabling controllers (especially StatefulSets) to recreate replacement Pods without manual kubectl delete --force.


Motivation / Problem

When a node becomes NotReady/Unknown (or disappears), Pods may get stuck in Terminating and never complete graceful shutdown. In StatefulSet-based workloads, this can prevent timely recovery because the replacement Pod may not be created until the old terminating Pod object is removed.

Operators/admins currently must manually force delete the stuck Pod to unblock recovery.


What this PR does

New Tenant configuration

Adds a new Tenant spec field:

  • spec.podDeletionPolicyWhenNodeIsDown

Supported values (a rough Rust sketch follows this list):

  • DoNothing (default)
  • Delete (normal delete)
  • ForceDelete (delete with gracePeriodSeconds=0)
  • Longhorn-like variants (force delete terminating pods on down nodes, filtered by controller type):
    • DeleteStatefulSetPod
    • DeleteDeploymentPod (Deployment pods are typically owned by ReplicaSet)
    • DeleteBothStatefulSetAndDeploymentPod
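
A rough sketch of how this field and its values could be modeled in the Tenant spec. The type and variant names below simply mirror the values listed above; the actual identifiers and derives in tenant.rs (e.g. JsonSchema for the CRD schema) may differ.

```rust
use serde::{Deserialize, Serialize};

/// Deletion policy applied to terminating Pods whose node is down.
/// Variant names follow the PR description; the real type in tenant.rs may differ.
#[derive(Clone, Debug, Default, PartialEq, Eq, Serialize, Deserialize)]
pub enum PodDeletionPolicyWhenNodeIsDown {
    /// Leave stuck terminating Pods alone (default).
    #[default]
    DoNothing,
    /// Normal delete of the stuck Pod.
    Delete,
    /// Delete with gracePeriodSeconds=0.
    ForceDelete,
    /// Force delete only Pods owned by a StatefulSet.
    DeleteStatefulSetPod,
    /// Force delete only Pods owned by a Deployment (via its ReplicaSet).
    DeleteDeploymentPod,
    /// Force delete Pods owned by either controller type.
    DeleteBothStatefulSetAndDeploymentPod,
}
```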

Reconcile behavior

During reconciliation, if the policy is enabled (not DoNothing), the operator takes the following steps (a rough sketch follows the list):

  1. Lists tenant-owned Pods via label selector rustfs.tenant=<tenant-name>
  2. Only targets Pods already in Terminating (metadata.deletionTimestamp != None)
  3. Resolves the hosting Node (pod.spec.nodeName) and treats it as “down” when:
    • Node Ready != True (False / Unknown), or
    • Node object is not found (404)
  4. Applies the configured deletion policy (normal delete or force delete)
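
A minimal sketch of the node-down check and force delete described above, using kube-rs. The function names here are illustrative only and are not necessarily those used in k8s.rs or reconcile.rs.

```rust
use k8s_openapi::api::core::v1::{Node, Pod};
use kube::{
    api::{Api, DeleteParams},
    Client,
};

/// Treat a node as "down" when its Ready condition is not "True"
/// (i.e. False or Unknown) or the Node object no longer exists (404).
async fn node_is_down(client: &Client, node_name: &str) -> Result<bool, kube::Error> {
    let nodes: Api<Node> = Api::all(client.clone());
    match nodes.get(node_name).await {
        Ok(node) => {
            let ready = node
                .status
                .and_then(|s| s.conditions)
                .unwrap_or_default()
                .into_iter()
                .find(|c| c.type_ == "Ready")
                .map(|c| c.status == "True")
                .unwrap_or(false);
            Ok(!ready)
        }
        // Node object not found: treat the node as down.
        Err(kube::Error::Api(e)) if e.code == 404 => Ok(true),
        Err(e) => Err(e),
    }
}

/// Force delete a stuck terminating Pod (gracePeriodSeconds=0).
async fn force_delete_pod(pods: &Api<Pod>, name: &str) -> Result<(), kube::Error> {
    let dp = DeleteParams {
        grace_period_seconds: Some(0),
        ..DeleteParams::default()
    };
    pods.delete(name, &dp).await.map(|_| ())
}
```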

RBAC changes

The operator ClusterRole is extended with read access to Nodes:

  • nodes: get, list, watch

This is required to check Node readiness.
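
For reference, the added rule would look roughly like this in clusterrole.yaml (exact placement and indentation depend on the existing manifest):

```yaml
# Read-only access to Nodes so the operator can check the Ready condition.
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
```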


Example YAML

apiVersion: rustfs.com/v1alpha1
kind: Tenant
metadata:
  name: my-tenant
spec:
  podDeletionPolicyWhenNodeIsDown: DeleteStatefulSetPod
  pools:
    - name: pool-0
      servers: 4
      # ...

Safety / Caveats

  • Force deleting Pods can have data consistency implications depending on the storage backend and workload semantics.
  • Default remains DoNothing.
  • The operator intentionally acts only on Pods already in Terminating to keep behavior conservative and avoid deleting healthy running Pods.

Tests

Unit tests added/updated to validate:

  • Node down detection based on Node Ready condition (True vs False/Unknown)
  • Controller-kind filtering for Longhorn-like variants (StatefulSet vs ReplicaSet ownership)

All tests pass:

  • cargo test
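
As a rough illustration of the controller-kind filtering those tests exercise (the helper and test names below are made up, not necessarily those in the PR), the check boils down to inspecting the Pod's controller ownerReference:

```rust
use k8s_openapi::api::core::v1::Pod;
use k8s_openapi::apimachinery::pkg::apis::meta::v1::{ObjectMeta, OwnerReference};

/// Kind of the controller that owns the Pod ("StatefulSet", "ReplicaSet", ...), if any.
fn controller_kind(pod: &Pod) -> Option<String> {
    pod.metadata
        .owner_references
        .as_ref()?
        .iter()
        .find(|o| o.controller.unwrap_or(false))
        .map(|o| o.kind.clone())
}

#[test]
fn statefulset_owned_pod_is_detected() {
    let pod = Pod {
        metadata: ObjectMeta {
            owner_references: Some(vec![OwnerReference {
                kind: "StatefulSet".into(),
                controller: Some(true),
                ..OwnerReference::default()
            }]),
            ..ObjectMeta::default()
        },
        ..Pod::default()
    };
    assert_eq!(controller_kind(&pod).as_deref(), Some("StatefulSet"));
}
```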

Implementation notes (files changed)

  • reconcile.rs
  • k8s.rs
  • tenant.rs
  • tenant.yaml
  • clusterrole.yaml

Checklist

  • Adds CRD field + schema update
  • Adds reconcile logic and events
  • Updates RBAC for Node reads
  • Adds unit tests
  • cargo test passes

- Reformat imports to follow standard order
- Make cleanup_stuck_terminating_pods_on_down_nodes call single-line
- Reformat assert! statements in tests to multi-line for better readability
@loverustfs loverustfs enabled auto-merge December 20, 2025 03:46
@loverustfs
Collaborator Author

Hey @shahab96 ,

Thanks a lot.

I submitted this bug report (https://github.com/rustfs/rustfs/issues/1193) to Gemini, and here are its suggested fixes.
I'm unsure whether they're feasible. If they aren't, you can close this issue.

@shahab96
Collaborator

> Hey @shahab96 ,
>
> Thanks a lot.
>
> I submitted this bug report (https://github.com/rustfs/rustfs/issues/1193) to Gemini, and here are its suggested fixes.
> I'm unsure whether they're feasible. If they aren't, you can close this issue.

Totally feasible and useful changes here! Just some minor comments I had.

The one for the if statement can be totally ignored honestly. I think all we'd need here is to emit a log when the node is detected to be down. That's all actually.

Thank you for helping us out with the operator!

@loverustfs
Collaborator Author

> > Hey @shahab96 ,
> > Thanks a lot.
> > I submitted this bug report (https://github.com/rustfs/rustfs/issues/1193) to Gemini, and here are its suggested fixes.
> > I'm unsure whether they're feasible. If they aren't, you can close this issue.
>
> Totally feasible and useful changes here! Just some minor comments I had.
>
> The one for the if statement can be totally ignored honestly. I think all we'd need here is to emit a log when the node is detected to be down. That's all actually.
>
> Thank you for helping us out with the operator!

No problem.

Thank you for helping to make the RustFS operator better!

@loverustfs
Collaborator Author

Hey @shahab96 ,

Update submitted.

- use tracing::{debug, error};
+ use tracing::{debug, error, warn};

+        let pod_name = pod.name_any();
+        warn!(
+            "Node {} is detected down. Pod {} is terminating on it.",
+            node_name, pod_name
+        );

@bestgopher bestgopher disabled auto-merge December 20, 2025 13:17
@bestgopher bestgopher enabled auto-merge December 20, 2025 13:17
@bestgopher bestgopher disabled auto-merge December 20, 2025 13:18
@bestgopher bestgopher enabled auto-merge December 20, 2025 13:18
@bestgopher bestgopher added this pull request to the merge queue Dec 20, 2025
Merged via the queue into main with commit b00c9e1 Dec 20, 2025
2 checks passed