
docs: add complete failure recovery guide for lost hosts #264

Open

tsivaprasad wants to merge 1 commit into PLAT-313-recovery-approach-when-swarm-quorum-is-lost from PLAT-313-full-recovery-approach

Conversation

tsivaprasad (Contributor) commented Feb 6, 2026

Summary
This PR adds documentation for recovering from a complete host failure, where a host is entirely lost and must be recreated from scratch.

Changes
Add docs/disaster-recovery/full-recovery.md with a step-by-step guide covering:
Phase 1: Remove the failed host from Control Plane and Docker Swarm
Phase 2: Verify cluster is operating with reduced capacity
Phase 3: Provision and configure new host infrastructure
Phase 4: Deploy Control Plane service on new host
Phase 5: Join Control Plane cluster and restore database capacity
Phase 6: Post-recovery verification

Phase 1: Failed Host Removal (Verified)
✅ Step 1.1: Force Remove Host from Control Plane

Executed:

curl -X DELETE http://192.168.105.3:3000/v1/hosts/host-3?force=true

Observed:

remove_host task created and completed
Automatic database update task triggered for storefront
All instances and subscriptions belonging to host-3 were removed
Control Plane remained available throughout
Verified via:

curl /v1/hosts/host-3/tasks/<task-id>
curl /v1/databases/storefront/tasks/<task-id>

Matches documented behavior: forced host removal + automatic database cleanup
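
The task endpoints above can also be polled until the task settles. A minimal sketch, assuming the task JSON exposes a state field whose terminal value is "completed" (neither name is confirmed by this PR):

# Poll the remove_host task until it reports completion; .state and
# "completed" are assumed names for the status field and its terminal value.
until curl -s http://192.168.105.3:3000/v1/hosts/host-3/tasks/<task-id> \
    | jq -e '.state == "completed"' >/dev/null; do
  sleep 2
done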

✅ Step 1.3: Docker Swarm Cleanup

On a healthy manager (host-1):

docker node ls
docker node demote lima-host-3
docker node rm lima-host-3 --force

Observed:

Failed node removed from Swarm
Swarm manager quorum preserved
Remaining managers stayed Ready
Confirms documented Swarm cleanup procedure
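
To make the quorum check explicit, the surviving managers can be listed with their manager status; the expected output shows only host-1 and host-2 (as Leader/Reachable) and no lima-host-3 entry:

# Print hostname, status, and manager role for every remaining Swarm node.
docker node ls --format 'table {{.Hostname}}\t{{.Status}}\t{{.ManagerStatus}}'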

Phase 2: Reduced-Capacity Operation (Verified)
✅ Step 2.1: Host Status Verification
curl http://192.168.105.3:3000/v1/hosts

Observed:

Only host-1 and host-2 listed
Both hosts healthy
etcd quorum intact
✅ Step 2.2: Database Health Verification
curl http://192.168.105.3:3000/v1/databases/storefront

Observed:

Database state: available

Instances:
n1 on host-1
n2 on host-2

No references to host-3
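
The absence of host-3 can also be checked programmatically. A small sketch, assuming the database response exposes an instances array with name and host_id fields (the exact JSON shape is not shown in this PR):

# List where each instance is placed; host-3 should not appear.
curl -s http://192.168.105.3:3000/v1/databases/storefront \
  | jq '.instances[] | {name, host_id}'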

✅ Step 2.3: Data Replication Verification

Executed:

Inserted data on n2
Verified visibility on n1
Observed:

Writes succeeded
Data replicated correctly
Confirms cluster remained fully operational with reduced capacity
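
A replication spot-check of this kind can be scripted roughly as below; the node addresses, user, and recovery_check table are illustrative placeholders, not details from this PR:

# Write a marker row via n2, then read it back via n1 (assumes the
# recovery_check table already exists on the database).
psql -h <n2-address> -U <user> -d storefront \
  -c "INSERT INTO recovery_check(note) VALUES ('phase-2-check');"
psql -h <n1-address> -U <user> -d storefront \
  -c "SELECT count(*) FROM recovery_check WHERE note = 'phase-2-check';"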

Phase 3: Provision New Host
✅ Step 3.1: Create New Host

Created a new Lima host with all prerequisites using the provisioning script (a rough manual equivalent is sketched below).
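
The provisioning script is not part of this PR; as a rough manual equivalent, a Lima VM can be created from the stock Docker template. The template choice and instance name here are assumptions, and the real script may install further prerequisites:

# Create the replacement VM; Lima names the guest "lima-<instance>", which is
# why the Swarm node shows up as lima-host-3. Any extra prerequisites from the
# provisioning script still need to be applied afterwards.
limactl start --name=host-3 template://docker
limactl shell host-3 docker version   # sanity-check the Docker daemon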

✅ Step 3.2: Rejoin Docker Swarm (Manager)

From host-1:
docker swarm join-token manager

On host-3:
docker swarm join --token <TOKEN> 192.168.104.1:2377

Observed:

Host rejoined Swarm successfully as manager
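
Before moving on, the new manager's reachability can be double-checked from host-1:

# Expect lima-host-3 to be Ready, with its manager status reported as reachable.
docker node ls
docker node inspect lima-host-3 --format '{{ .ManagerStatus.Reachability }}'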

Phase 4: Deploy Control Plane Service
✅ Step 4.1: Prepare Data Directory

On host-3 (the newly created host):

sudo mkdir -p /data/control-plane

✅ Step 4.2: Deploy Control Plane Stack

On the Swarm leader (host-1):
docker stack deploy -c /tmp/stack.yaml control-plane

Observed:

Control Plane service started on host-3
Service reached Running state
Verified via:

  docker service ps control-plane_host-3
  docker service logs control-plane_host-3
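
The stack file itself is not included in this PR. A hypothetical minimal shape, pinning the service for the rebuilt node to lima-host-3 and mounting the directory prepared in Step 4.1, might look like the sketch below; the image, container mount path, and overall service layout are assumptions:

# Hypothetical /tmp/stack.yaml sketch; only the placement constraint and the
# /data/control-plane host path follow from the steps in this guide.
cat > /tmp/stack.yaml <<'EOF'
version: "3.8"
services:
  host-3:
    image: <control-plane-image>
    volumes:
      - /data/control-plane:/data
    deploy:
      placement:
        constraints:
          - node.hostname == lima-host-3
EOF
docker stack deploy -c /tmp/stack.yaml control-plane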

Phase 5: Join Control Plane Cluster
✅ Step 5.1: Generate Join Token
curl http://192.168.105.3:3000/v1/cluster/join-token

Response included:

token
server_url
✅ Step 5.2: Join the Cluster

curl -X POST http://192.168.105.5:3000/v1/cluster/join \
  -H 'Content-Type:application/json' \
  --data '<join-token-json>'

Observed:

Host successfully joined Control Plane cluster
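
Steps 5.1 and 5.2 can be chained so the join-token response is forwarded to the new host unchanged; this assumes, as the <join-token-json> placeholder suggests, that /v1/cluster/join accepts the /v1/cluster/join-token response body as-is:

# Fetch the join token from the existing cluster endpoint and post it to the
# new host's Control Plane (addresses as used in Steps 5.1 and 5.2).
curl -s http://192.168.105.3:3000/v1/cluster/join-token \
  | curl -X POST http://192.168.105.5:3000/v1/cluster/join \
      -H 'Content-Type: application/json' \
      --data @-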
✅ Step 5.3: Host Verification
curl http://192.168.105.3:3000/v1/hosts

Observed:

host-3 present
Status: healthy

✅ Step 5.4: Update Database with New Node

curl -X POST http://192.168.105.3:3000/v1/databases/storefront \
  -H 'Content-Type:application/json' \
  --data '{
    "spec": {
      "nodes": [
        {"name":"n1","host_ids":["host-1"]},
        {"name":"n2","host_ids":["host-2"]},
        {"name":"n3","host_ids":["host-3"]}
      ]
    }
  }'

Observed:

New instance storefront-n3-* created
Patroni + Spock configured automatically
Database state transitioned modifying → available
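
The modifying → available transition can be waited on rather than checked by hand; .state and the "available" value are assumptions based on the fields observed earlier:

# Block until the storefront database settles back to "available" after the
# node update; .state is an assumed name for the status field.
until curl -s http://192.168.105.3:3000/v1/databases/storefront \
    | jq -e '.state == "available"' >/dev/null; do
  sleep 5
done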
Phase 6: Post-Recovery Verification
✅ Step 6.1: Verify Data Replication

Executed:

Inserted data on n3
Verified data on n2
Observed:

Data fully consistent across all three nodes
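
A simple way to confirm agreement across all three nodes is to compare a row count on each instance, reusing the hypothetical recovery_check table from the Phase 2 sketch (addresses and credentials remain placeholders):

# The three counts should match once replication has caught up.
for node in <n1-address> <n2-address> <n3-address>; do
  psql -h "$node" -U <user> -d storefront -t \
    -c "SELECT count(*) FROM recovery_check;"
done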

Checklist

PLAT-313

coderabbitai bot commented Feb 6, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



tsivaprasad marked this pull request as ready for review February 9, 2026 04:25