docs: add complete failure recovery guide for lost hosts #264
Open
tsivaprasad wants to merge 1 commit into PLAT-313-recovery-approach-when-swarm-quorum-is-lost from
Summary
This PR adds documentation for recovering from complete host failure scenarios, where a host is entirely lost and must be recreated from scratch.
Changes
Add docs/disaster-recovery/full-recovery.md with a step-by-step guide covering:
Phase 1: Remove the failed host from Control Plane and Docker Swarm
Phase 2: Verify cluster is operating with reduced capacity
Phase 3: Provision and configure new host infrastructure
Phase 4: Deploy Control Plane service on new host
Phase 5: Join Control Plane cluster and restore database capacity
Phase 6: Post-recovery verification
Phase 1: Failed Host Removal (Verified)
✅ Step 1.1: Force Remove Host from Control Plane
Executed:
curl -X DELETE "http://192.168.105.3:3000/v1/hosts/host-3?force=true"
Observed:
remove_host task created and completed
Automatic database update task triggered for storefront
All instances and subscriptions belonging to host-3 were removed
Control Plane remained available throughout
Verified via:
Matches documented behavior: forced host removal + automatic database cleanup
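The exact verification call is elided above; a minimal sketch of one way to confirm the removal, reusing only the GET /v1/hosts endpoint already shown in this report:

```bash
# Wait until host-3 disappears from the Control Plane hosts listing.
while curl -s http://192.168.105.3:3000/v1/hosts | grep -q '"host-3"'; do
  sleep 2
done
echo "host-3 removed from Control Plane"
```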
✅ Step 1.3: Docker Swarm Cleanup
On a healthy manager (host-1); see the command sketch after this step:
Observed:
Failed node removed from Swarm
Swarm manager quorum preserved
Remaining managers stayed Ready
Confirms documented Swarm cleanup procedure
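The cleanup commands themselves are not captured above; a plausible sequence using the standard Docker CLI would be:

```bash
# If the failed node was a manager, demote it first so it gives up its
# raft membership, then force-remove it from the Swarm.
docker node demote host-3
docker node rm --force host-3

# Confirm the remaining managers are still Ready and quorum is intact.
docker node ls
```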
Phase 2: Reduced-Capacity Operation (Verified)
✅ Step 2.1: Host Status Verification
curl http://192.168.105.3:3000/v1/hosts
Observed:
Only host-1 and host-2 listed
Both hosts healthy
etcd quorum intact
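For a scriptable version of this check, the response can be filtered with jq; the field names here are an assumption about the /v1/hosts response shape:

```bash
# Print each host's name and status; expect only host-1 and host-2,
# both healthy (field names are assumed, adjust to the real schema).
curl -s http://192.168.105.3:3000/v1/hosts | jq '.[] | {name, status}'
```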
✅ Step 2.2: Database Health Verification
curl http://192.168.105.3:3000/v1/databases/storefront
Observed:
Database state: available
Instances:
n1 on host-1
n2 on host-2
No references to host-3
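The database state can be checked the same way (again assuming the field name implied by the observations above):

```bash
# Expect "available", with instances only on host-1 and host-2.
curl -s http://192.168.105.3:3000/v1/databases/storefront | jq '.state'
```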
✅ Step 2.3: Data Replication Verification
Executed:
Inserted data on n2
Verified visibility on n1
Observed:
Writes succeeded
Data replicated correctly
Confirms cluster remained fully operational with reduced capacity
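The exact SQL is not included in the report; a minimal smoke test along these lines would demonstrate the same thing (the recovery_check table, user, and connection endpoints are hypothetical, and the table is assumed to already belong to a Spock replication set):

```bash
# On n2: insert a marker row.
psql -h <n2-host> -U postgres -d storefront \
  -c "INSERT INTO recovery_check (note) VALUES ('phase-2 smoke test');"

# On n1: the row should appear once Spock has replicated it.
psql -h <n1-host> -U postgres -d storefront \
  -c "SELECT count(*) FROM recovery_check;"
```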
Phase 3: Provision New Host
✅ Step 3.1: Create New Host
The new host was created as a Lima VM using the provisioning script, with the prerequisites installed.
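For reference, a Lima host can be brought up along these lines (the template file and VM name are hypothetical; the actual provisioning script is not shown here):

```bash
# Start a new Lima VM named host-3 from a template that installs Docker
# and the other prerequisites, then open a shell in it.
limactl start --name=host-3 ./host-template.yaml
limactl shell host-3
```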
✅ Step 3.2: Rejoin Docker Swarm (Manager)
From host-1:
docker swarm join-token manager
On host-3:
docker swarm join --token <TOKEN> 192.168.104.1:2377
Observed:
Host rejoined Swarm successfully as manager
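This can be confirmed from any manager with the standard Docker CLI:

```bash
# host-3 should be listed with MANAGER STATUS "Reachable".
docker node ls
```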
Phase 4: Deploy Control Plane Service
✅ Step 4.1: Prepare Data Directory
On host-3 (i.e. the newly created host):
sudo mkdir -p /data/control-plane
✅ Step 4.2: Deploy Control Plane Stack
On the leader (i.e. host-1):
docker stack deploy -c /tmp/stack.yaml control-plane
Observed:
Control Plane service started on host-3
Service reached Running state
Verified via:
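The verification output is truncated here; standard commands to confirm the placement would be:

```bash
# The stack's tasks should include one in the Running state on host-3.
docker stack ps control-plane
docker service ls
```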
Phase 5: Join Control Plane Cluster
✅ Step 5.1: Generate Join Token
curl http://192.168.105.3:3000/v1/cluster/join-token
Response included:
token
server_url
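For scripting the next step, both fields can be captured with jq (the token and server_url field names come from the response above):

```bash
RESPONSE=$(curl -s http://192.168.105.3:3000/v1/cluster/join-token)
JOIN_TOKEN=$(echo "$RESPONSE" | jq -r '.token')
SERVER_URL=$(echo "$RESPONSE" | jq -r '.server_url')
```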
✅ Step 5.2: Join the Cluster
Observed:
Host successfully joined Control Plane cluster
✅ Step 5.3: Host Verification
curl http://192.168.105.3:3000/v1/hosts
Observed:
host-3 present
Status: healthy
✅ Step 5.4: Update Database with New Node
Observed:
New instance storefront-n3-* created
Patroni + Spock configured automatically
Database state transitioned modifying → available
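The request used for this step is not shown; purely as a hypothetical sketch, an update against the database resource might look like the following (the verb, path, and body shape are all assumptions, not the real API contract):

```bash
# Hypothetical: ask the Control Plane to place a third instance now that
# host-3 is back; the actual request is not captured in this report.
curl -X PATCH http://192.168.105.3:3000/v1/databases/storefront \
  -H 'Content-Type: application/json' \
  -d '{"instances": 3}'
```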
Phase 6: Post-Recovery Verification
✅ Step 6.1: Verify Data Replication
Executed:
Inserted data on n3
Verified data on n2
Observed:
Data fully consistent across all three nodes
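Mirroring the Phase 2 smoke test, consistency can be spot-checked on every node (same hypothetical table and connection names as before, with n1/n2/n3 assumed to be resolvable hostnames):

```bash
# Every node should report the same row count once replication settles.
for node in n1 n2 n3; do
  echo -n "$node: "
  psql -h "$node" -U postgres -d storefront -tAc \
    "SELECT count(*) FROM recovery_check;"
done
```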
Checklist
PLAT-313