
docs: add complete failure recovery guide for lost hosts #264

Open

tsivaprasad wants to merge 1 commit into PLAT-313-recovery-approach-when-swarm-quorum-is-lost from PLAT-313-full-recovery-approach

Conversation

tsivaprasad (Contributor) commented Feb 6, 2026

Summary
This PR adds documentation for recovering from a complete host failure, where a host is entirely lost and must be recreated from scratch.

Changes
Add docs/disaster-recovery/full-recovery.md with a step-by-step guide covering:
Phase 1: Remove the failed host from Control Plane and Docker Swarm
Phase 2: Verify cluster is operating with reduced capacity
Phase 3: Provision and configure new host infrastructure
Phase 4: Deploy Control Plane service on new host
Phase 5: Join Control Plane cluster and restore database capacity
Phase 6: Post-recovery verification

Phase 1: Failed Host Removal (Verified)
✅ Step 1.1: Force Remove Host from Control Plane

Executed:

curl -X DELETE http://192.168.105.3:3000/v1/hosts/host-3?force=true

Observed:

remove_host task created and completed
Automatic database update task triggered for storefront
All instances and subscriptions belonging to host-3 were removed
Control Plane remained available throughout
Verified via:

curl /v1/hosts/host-3/tasks/<task-id>
curl /v1/databases/storefront/tasks/<task-id>

Matches documented behavior: forced host removal + automatic database cleanup
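
The task endpoints above can also be polled until the task settles. A minimal sketch, assuming the task JSON exposes a state field whose terminal value is "completed" (neither name is confirmed by this PR):

# Poll the remove_host task until it reports completion; .state and
# "completed" are assumed names for the status field and its terminal value.
until curl -s http://192.168.105.3:3000/v1/hosts/host-3/tasks/<task-id> \
    | jq -e '.state == "completed"' >/dev/null; do
  sleep 2
done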

✅ Step 1.3: Docker Swarm Cleanup

On a healthy manager (host-1):

docker node ls
docker node demote lima-host-3
docker node rm lima-host-3 --force

Observed:

Failed node removed from Swarm
Swarm manager quorum preserved
Remaining managers stayed Ready
Confirms documented Swarm cleanup procedure
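
To make the quorum check explicit, the surviving managers can be listed with their manager status; the expected output shows only host-1 and host-2 (as Leader/Reachable) and no lima-host-3 entry:

# Print hostname, status, and manager role for every remaining Swarm node.
docker node ls --format 'table {{.Hostname}}\t{{.Status}}\t{{.ManagerStatus}}'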

Phase 2: Reduced-Capacity Operation (Verified)
✅ Step 2.1: Host Status Verification
curl http://192.168.105.3:3000/v1/hosts

Observed:

Only host-1 and host-2 listed
Both hosts healthy
etcd quorum intact
✅ Step 2.2: Database Health Verification
curl http://192.168.105.3:3000/v1/databases/storefront

Observed:

Database state: available

Instances:
n1 on host-1
n2 on host-2

No references to host-3
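
The absence of host-3 can also be checked programmatically. A small sketch, assuming the database response exposes an instances array with name and host_id fields (the exact JSON shape is not shown in this PR):

# List where each instance is placed; host-3 should not appear.
curl -s http://192.168.105.3:3000/v1/databases/storefront \
  | jq '.instances[] | {name, host_id}'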

✅ Step 2.3: Data Replication Verification

Executed:

Inserted data on n2
Verified visibility on n1
Observed:

Writes succeeded
Data replicated correctly
Confirms cluster remained fully operational with reduced capacity
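
A replication spot-check of this kind can be scripted roughly as below; the node addresses, user, and recovery_check table are illustrative placeholders, not details from this PR:

# Write a marker row via n2, then read it back via n1 (assumes the
# recovery_check table already exists on the database).
psql -h <n2-address> -U <user> -d storefront \
  -c "INSERT INTO recovery_check(note) VALUES ('phase-2-check');"
psql -h <n1-address> -U <user> -d storefront \
  -c "SELECT count(*) FROM recovery_check WHERE note = 'phase-2-check';"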

Phase 3: Provision New Host
✅ Step 3.1: Create New Host

Created a new Lima host with all prerequisites using the provisioning script (a rough manual equivalent is sketched below).
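
The provisioning script is not part of this PR; as a rough manual equivalent, a Lima VM can be created from the stock Docker template. The template choice and instance name here are assumptions, and the real script may install further prerequisites:

# Create the replacement VM; Lima names the guest "lima-<instance>", which is
# why the Swarm node shows up as lima-host-3. Any extra prerequisites from the
# provisioning script still need to be applied afterwards.
limactl start --name=host-3 template://docker
limactl shell host-3 docker version   # sanity-check the Docker daemon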

✅ Step 3.2: Rejoin Docker Swarm (Manager)

From host-1:
docker swarm join-token manager

On host-3:
docker swarm join --token <TOKEN> 192.168.104.1:2377

Observed:

Host rejoined Swarm successfully as manager
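
Before moving on, the new manager's reachability can be double-checked from host-1:

# Expect lima-host-3 to be Ready, with its manager status reported as reachable.
docker node ls
docker node inspect lima-host-3 --format '{{ .ManagerStatus.Reachability }}'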

Phase 4: Deploy Control Plane Service
✅ Step 4.1: Prepare Data Directory

On host-3 (the newly created host):

sudo mkdir -p /data/control-plane

✅ Step 4.2: Deploy Control Plane Stack

On the Swarm leader (host-1):
docker stack deploy -c /tmp/stack.yaml control-plane

Observed:

Control Plane service started on host-3
Service reached Running state
Verified via:

  docker service ps control-plane_host-3
  docker service logs control-plane_host-3
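
The stack file itself is not included in this PR. A hypothetical minimal shape, pinning the service for the rebuilt node to lima-host-3 and mounting the directory prepared in Step 4.1, might look like the sketch below; the image, container mount path, and overall service layout are assumptions:

# Hypothetical /tmp/stack.yaml sketch; only the placement constraint and the
# /data/control-plane host path follow from the steps in this guide.
cat > /tmp/stack.yaml <<'EOF'
version: "3.8"
services:
  host-3:
    image: <control-plane-image>
    volumes:
      - /data/control-plane:/data
    deploy:
      placement:
        constraints:
          - node.hostname == lima-host-3
EOF
docker stack deploy -c /tmp/stack.yaml control-plane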

Phase 5: Join Control Plane Cluster
✅ Step 5.1: Generate Join Token
curl http://192.168.105.3:3000/v1/cluster/join-token

Response included:

token
server_url
✅ Step 5.2: Join the Cluster

curl -X POST http://192.168.105.5:3000/v1/cluster/join \
  -H 'Content-Type:application/json' \
  --data '<join-token-json>'

Observed:

Host successfully joined Control Plane cluster
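
Steps 5.1 and 5.2 can be chained so the join-token response is forwarded to the new host unchanged; this assumes, as the <join-token-json> placeholder suggests, that /v1/cluster/join accepts the /v1/cluster/join-token response body as-is:

# Fetch the join token from the existing cluster endpoint and post it to the
# new host's Control Plane (addresses as used in Steps 5.1 and 5.2).
curl -s http://192.168.105.3:3000/v1/cluster/join-token \
  | curl -X POST http://192.168.105.5:3000/v1/cluster/join \
      -H 'Content-Type: application/json' \
      --data @-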
✅ Step 5.3: Host Verification
curl http://192.168.105.3:3000/v1/hosts

Observed:

host-3 present
Status: healthy

✅ Step 5.4: Update Database with New Node

curl -X POST http://192.168.105.3:3000/v1/databases/storefront \
  -H 'Content-Type:application/json' \
  --data '{
    "spec": {
      "nodes": [
        {"name":"n1","host_ids":["host-1"]},
        {"name":"n2","host_ids":["host-2"]},
        {"name":"n3","host_ids":["host-3"]}
      ]
    }
  }'

Observed:

New instance storefront-n3-* created
Patroni + Spock configured automatically
Database state transitioned modifying → available
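
The modifying → available transition can be waited on rather than checked by hand; .state and the "available" value are assumptions based on the fields observed earlier:

# Block until the storefront database settles back to "available" after the
# node update; .state is an assumed name for the status field.
until curl -s http://192.168.105.3:3000/v1/databases/storefront \
    | jq -e '.state == "available"' >/dev/null; do
  sleep 5
done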
Phase 6: Post-Recovery Verification
✅ Step 6.1: Verify Data Replication

Executed:

Inserted data on n3
Verified data on n2
Observed:

Data fully consistent across all three nodes
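
A simple way to confirm agreement across all three nodes is to compare a row count on each instance, reusing the hypothetical recovery_check table from the Phase 2 sketch (addresses and credentials remain placeholders):

# The three counts should match once replication has caught up.
for node in <n1-address> <n2-address> <n3-address>; do
  psql -h "$node" -U <user> -d storefront -t \
    -c "SELECT count(*) FROM recovery_check;"
done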

Checklist

PLAT-313

coderabbitai bot commented Feb 6, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



tsivaprasad marked this pull request as ready for review February 9, 2026 04:25