docs: partial recovery — add guide for host failure recovery and orphaned instance cleanup #261
📝 Walkthrough

This pull request introduces partial recovery functionality for handling host failures in quorum-intact clusters. It adds comprehensive recovery documentation, implements a new CleanupInstance activity to remove orphaned database instances, integrates cleanup into host removal workflows, and enhances Docker task filtering for ready nodes.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 179-183: Replace the placeholder comments with concrete psql
commands that show how to insert test data on the primary and verify it on the
replica; for example, add a psql INSERT on n1 targeting the example database and
test_table (e.g., INSERT INTO test_table (id, value) VALUES (...)) and a psql
SELECT on n2 to confirm the replicated row (e.g., SELECT * FROM test_table WHERE
value = 'recovery_test'); ensure you include host/port/user flags (-h, -p, -U)
and use the same database name (example) and table name (test_table) as shown in
the suggested improvement so readers can run the steps end-to-end.
- Around line 413-416: Replace the placeholder comments "# Insert on n1" and "#
Verify on n3" with concrete psql commands: on the primary node run a psql
command to INSERT a test row into the example database (e.g., use psql -h host-1
-p 5432 -U admin -d example -c "INSERT INTO test_table (id, value) VALUES
(...);"), and on the restored node run a psql SELECT to verify the row exists
(e.g., psql -h host-3 -p 5432 -U admin -d example -c "SELECT * FROM test_table
WHERE value = '...';"); ensure the SQL matches the test_table schema and uses a
unique value (e.g., 'full_recovery_test') so the verification step is
unambiguous.
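For both docs comments above, the concrete commands could look like this (a sketch only; the host names `host-1`/`host-2`/`host-3`, port 5432, the `admin` user, and the literal id values are assumptions borrowed from the suggestions and may differ from the final guide):

```bash
# Quorum-intact check (guide lines ~179-183): insert on the primary (n1 on host-1),
# verify on the replica (n2 on host-2).
psql -h host-1 -p 5432 -U admin -d example \
  -c "INSERT INTO test_table (id, value) VALUES (100, 'recovery_test');"
psql -h host-2 -p 5432 -U admin -d example \
  -c "SELECT * FROM test_table WHERE value = 'recovery_test';"

# Full-recovery check (guide lines ~413-416): insert on n1, verify on the restored node (n3 on host-3).
psql -h host-1 -p 5432 -U admin -d example \
  -c "INSERT INTO test_table (id, value) VALUES (200, 'full_recovery_test');"
psql -h host-3 -p 5432 -U admin -d example \
  -c "SELECT * FROM test_table WHERE value = 'full_recovery_test';"
```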
In `@server/internal/orchestrator/swarm/postgres_service.go`:
- Around line 171-185: The timeout branch currently falls through and lets
ServiceRemove proceed even though scale-down may not have completed; update the
isTimeoutError(err) case in the same function (the scale-down/ServiceRemove
sequence) to return a descriptive error instead of continuing, e.g., return an
error indicating the scale-down timed out (wrap the original err), so callers
must handle the timeout (until the TODO RemovingHosts/context distinction is
implemented); reference isTimeoutError, the scale-down logic, and the subsequent
ServiceRemove call when making the change.
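A rough sketch of what that branch could look like (the `PostgresService` receiver, `waitForScaleDown`, and the `p.docker` field are hypothetical stand-ins for the real identifiers in `postgres_service.go`; only `isTimeoutError` and `ServiceRemove` come from the comment above):

```go
// deleteService sketches the adjusted flow around the timeout branch.
func (p *PostgresService) deleteService(ctx context.Context, serviceID string) error {
	if err := p.waitForScaleDown(ctx, serviceID); err != nil {
		if isTimeoutError(err) {
			// Don't fall through to ServiceRemove while instances may still be running;
			// surface the timeout until the RemovingHosts/context distinction is implemented.
			return fmt.Errorf("timed out waiting for service %s to scale down: %w", serviceID, err)
		}
		return err
	}
	// Only remove the service once the scale-down has actually completed.
	if err := p.docker.ServiceRemove(ctx, serviceID); err != nil {
		return fmt.Errorf("removing service %s: %w", serviceID, err)
	}
	return nil
}
```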
🧹 Nitpick comments (1)
server/internal/workflows/activities/cleanup_instance.go (1)
27-31: `MaxAttempts: 1` means no retries on transient failures.

With `MaxAttempts: 1`, any transient failure (network blip, etcd leader election) will fail the cleanup permanently. Consider allowing a small number of retries with backoff for resilience.

Optional: Allow limited retries

```diff
 options := workflow.ActivityOptions{
 	Queue: utils.HostQueue(a.Config.HostID),
 	RetryOptions: workflow.RetryOptions{
-		MaxAttempts: 1,
+		MaxAttempts: 3,
 	},
 }
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 271-273: Fix the typo and endpoint mismatches: remove the extra
slash in the curl invocation used to set JOIN_TOKEN so it calls
"/v1/cluster/join-token" (the line containing JOIN_TOKEN="$(curl
http://host-1:3000//v1/cluster/join-token)") and update the summary table
entries so they exactly match the routes used earlier—use
"/v1/cluster/join-token" for fetching the join token and "/v1/cluster/join" for
joining the cluster (also correct the same occurrences noted around lines
388-390).
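The corrected fetch would then read (single slash before `v1`, matching the route used earlier in the guide):

```bash
JOIN_TOKEN="$(curl http://host-1:3000/v1/cluster/join-token)"
```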
In `@server/internal/docker/docker.go`:
- Around line 424-449: The checkNodesUnavailable function currently treats only
NodeStateDown and NodeStateUnknown as unavailable; update the availability check
in Docker.checkNodesUnavailable to also treat swarm.NodeStateDisconnected as
unavailable (i.e., include node.Status.State == swarm.NodeStateDisconnected in
the conditional alongside NodeStateDown and NodeStateUnknown) and update the
function comment to reflect "(status 'down', 'unknown' or 'disconnected')";
ensure you reference the NodeList call and the loop over nodes in
checkNodesUnavailable when making the change.
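A sketch of the adjusted check, assuming the method lists nodes via the Docker API client (`github.com/docker/docker/api/types` and `.../api/types/swarm`) and that a single down node is enough to report unavailability; the `d.client` field name and the any-vs-all semantics are assumptions, not the file's actual code:

```go
func (d *Docker) checkNodesUnavailable(ctx context.Context, nodeIDs map[string]struct{}) (bool, error) {
	nodes, err := d.client.NodeList(ctx, types.NodeListOptions{})
	if err != nil {
		return false, fmt.Errorf("listing swarm nodes: %w", err)
	}
	for _, node := range nodes {
		if _, stuck := nodeIDs[node.ID]; !stuck {
			continue
		}
		// Treat 'down', 'unknown' and 'disconnected' as unavailable.
		switch node.Status.State {
		case swarm.NodeStateDown, swarm.NodeStateUnknown, swarm.NodeStateDisconnected:
			return true, nil
		}
	}
	return false, nil
}
```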
🧹 Nitpick comments (1)
docs/disaster-recovery/partial-recovery.md (1)
131-136: Add language identifiers to output blocks (markdownlint MD040).

The example output fences are missing a language. Adding `text` (or `console`) will satisfy MD040.

Example tweak

````diff
-```
+```text
 ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
 959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
 l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
````

````diff
-```
+```text
 docker swarm join --token SWMTKN-1-xxx...xxx 10.0.0.1:2377
````

Also applies to: 222-224
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@server/internal/docker/docker.go`:
- Around line 424-432: The current loop swallows errors from
checkNodesUnavailable by continuing, which hides Docker API failures; change the
behavior in the block that calls d.checkNodesUnavailable(ctx, stuckNodeIDs) to
return the error (or wrap it with context) instead of continue so upstream sees
the real failure; update the call site that currently checks unavailable, err :=
d.checkNodesUnavailable(...) to handle err by returning fmt.Errorf("checking
node availability: %w", err) (or similar) and keep the existing
ErrNodeUnavailable return path when unavailable is true.
🧹 Nitpick comments (1)
docs/disaster-recovery/partial-recovery.md (1)
131-136: Add language identifiers to fenced output blocks (MD040).

Use `text` (or similar) to satisfy linting.

📝 Example fix

````diff
-```
+```text
 ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
 959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
 l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable
````

Also applies to: 222-224
server/internal/docker/docker.go (outdated)
```go
// If we have tasks that should be stopping, check if their nodes are unavailable
if len(stuckNodeIDs) > 0 {
	unavailable, err := d.checkNodesUnavailable(ctx, stuckNodeIDs)
	if err != nil {
		continue
	}
	if unavailable {
		return fmt.Errorf("%w: tasks cannot stop because their nodes are down", ErrNodeUnavailable)
	}
```
Don’t swallow node-list errors.
Continuing on checkNodesUnavailable errors hides persistent Docker API failures and turns them into a timeout instead of a root-cause error.
🔧 Suggested fix
```diff
-	if err != nil {
-		continue
-	}
+	if err != nil {
+		return fmt.Errorf("failed to check node availability: %w", err)
+	}
```
server/internal/docker/docker.go (outdated)
```go
}

// Build a set of node IDs for tasks that should be stopping but aren't
stuckNodeIDs := make(map[string]struct{})
```
Thanks for pinning down how this times out. I think we can improve this change a little bit. The way this is written, we're always going to call checkNodesUnavailable, so there's little point in making it conditional.
Instead, maybe we could add a method that returns a ds.Set of all node IDs where Status.State == swarm.NodeStateReady, like:
```go
func (d *Docker) readyNodeIDs(ctx context.Context) (ds.Set[string], error)
```

And call it up here before we loop over the tasks. Then, we change our loop to skip tasks that aren't on a ready node:

```go
for _, t := range tasks {
	if !readyNodes.Has(t.NodeID) {
		continue
	}
	// ...
}
```

That way, they don't count toward our running or stopping totals. We'll still time out if we're trying to run a task on an unavailable node, which is appropriate, and we'll exit early without having to return an error if we're trying to stop a task on an unavailable node. Then, you won't need the sentinel error or the empty case in postgres_service.go.
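A sketch of that helper under the same assumptions (that `ds.Set` is the project's generic set with an `Add`/`Has` API and a `NewSet` constructor, and that `d.client` is the Docker API client used elsewhere in docker.go):

```go
func (d *Docker) readyNodeIDs(ctx context.Context) (ds.Set[string], error) {
	nodes, err := d.client.NodeList(ctx, types.NodeListOptions{})
	if err != nil {
		return nil, fmt.Errorf("listing swarm nodes: %w", err)
	}
	ready := ds.NewSet[string]()
	for _, node := range nodes {
		if node.Status.State == swarm.NodeStateReady {
			ready.Add(node.ID)
		}
	}
	return ready, nil
}
```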
Summary
This PR adds documentation for partial failure recovery (quorum intact) and improves node unavailability detection during force host removal operations.
Changes
- New disaster recovery guide for partial failure scenarios (`docs/disaster-recovery/partial-recovery.md`)
- New `ErrNodeUnavailable` sentinel error for precise node unavailability detection
- Updated `WaitForService` to detect stuck tasks due to unavailable nodes instead of relying on generic timeouts
- Updated `postgres_service.go` to use the new sentinel error during service deletion
- New `CleanupInstance` activity to clean up orphaned instance records when force-removing hosts
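For reference, the sentinel-error pattern the list above describes amounts to roughly this (a sketch; the caller shape in `postgres_service.go` is an assumption, and per the review comment it may be dropped in favour of filtering tasks by ready nodes):

```go
// ErrNodeUnavailable is wrapped by WaitForService when stuck tasks sit on down nodes.
var ErrNodeUnavailable = errors.New("node unavailable")

// removePostgresService sketches how the deletion path can branch on the sentinel;
// wait and remove are hypothetical stand-ins for the real dependencies.
func removePostgresService(ctx context.Context, wait, remove func(context.Context, string) error, serviceID string) error {
	if err := wait(ctx, serviceID); err != nil && !errors.Is(err, ErrNodeUnavailable) {
		return fmt.Errorf("waiting for service %s: %w", serviceID, err)
	}
	// Either the service settled or its nodes are unreachable; proceed with removal.
	return remove(ctx, serviceID)
}
```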
Testing

Tested with failed host host-3 and quorum state Intact (2 healthy hosts remained).

Phase 1: Failed Host Removal (Verified)
✅ Step 1.1: Force Remove Host from Control Plane
Executed:
```bash
curl -X DELETE http://192.168.105.3:3000/v1/hosts/host-3?force=true
```

Observed:
Verified via:
Matches documented behavior: forced host removal + automatic database cleanup
✅ Step 1.3: Docker Swarm Cleanup
On a healthy manager (host-1):
Observed:
Confirms documented Swarm cleanup procedure
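The documented cleanup on a healthy manager is typically of this shape (a sketch only; the node name comes from the `docker node ls` output shown in the guide, and the exact steps run here are not reproduced above):

```bash
docker node ls
docker node demote lima-host-3      # only if the failed node was a manager
docker node rm --force lima-host-3
```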
Phase 2: Reduced-Capacity Operation (Verified)
✅ Step 2.1: Host Status Verification
```bash
curl http://192.168.105.3:3000/v1/hosts
```

Observed:
✅ Step 2.2: Database Health Verification
```bash
curl http://192.168.105.3:3000/v1/databases/storefront
```

Observed:
Database state:
availableInstances:
n1 on host-1
n2 on host-2
No references to host-3
✅ Step 2.3: Data Replication Verification
Executed:
Observed:
Confirms cluster remained fully operational with reduced capacity
Phase 3: Restored Host Preparation (Verified)
✅ Step 3.1: Clean Old Control Plane Data
On restored host-3:
Observed:
✅ Step 3.2: Rejoin Docker Swarm (Manager)
From host-1:
```bash
docker swarm join-token manager
```

On host-3:

```bash
docker swarm join --token <TOKEN> 192.168.104.1:2377
```

Observed:
✅ Step 3.4: Redeploy Control Plane Stack
```bash
docker stack deploy -c /tmp/stack.yaml control-plane
```

Observed:
Verified via:
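(The exact verification commands are not shown above; they would typically be something like the following sketch:)

```bash
docker stack services control-plane
docker service ls
```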
Phase 4: Control Plane Cluster Rejoin (Verified)
✅ Step 4.1: Generate Join Token
```bash
curl http://192.168.105.3:3000/v1/cluster/join-token
```

Response included:

- token
- server_url

✅ Step 4.2: Join Cluster from Restored Host
Observed:
✅ Step 4.3: Host Verification
```bash
curl http://192.168.105.3:3000/v1/hosts
```

Observed:
- host-3 present
- healthy

Phase 5: Database Capacity Restoration (Verified)
✅ Step 5.1: Add Database Node Back
Observed:
✅ Step 5.3: Final Replication Verification
Executed:
- n3
- n2

Observed:
Verification Summary:
Checklist
PLAT-313