# Zero-downtime VPS migration playbook

This playbook is for the stack in this bundle:
- 3 PostgreSQL/Patroni nodes
- 3 etcd members, colocated on the DB nodes
- 1 proxy host running HAProxy, pgAdmin, PostgREST, and Caddy
- WireGuard between all hosts

## Very important rules

For **PostgreSQL/Patroni replicas**:
- add the new replica first
- let it fully join and catch up
- only then remove the old replica

For **etcd membership** when replacing a permanently removed or failed member in a 3-member cluster:
- remove the old member first
- then add the new member

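This difference is deliberate: PostgreSQL redundancy and etcd quorum are not the same problem. An extra PostgreSQL replica only adds redundancy, so adding before removing is safe. Adding a fourth etcd member raises the quorum from 2 to 3, so if the old member is already down and the new one fails to start, the cluster loses quorum.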
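
The quorum arithmetic behind the etcd rule can be sketched in a few lines of shell. This is only an illustration: the helper names `quorum` and `spare_failures` are made up for this example and are not part of the bundle.

```bash
# Quorum arithmetic for an etcd cluster (illustrative helpers, not part of this bundle).

# Majority needed for a cluster with N configured members.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# How many more members can fail before quorum is lost,
# given N configured members of which H are currently healthy.
# A negative result means quorum is already lost.
spare_failures() {
  echo $(( $2 - $(quorum "$1") ))
}

# Replacing a failed member in a 3-member cluster (2 healthy members left):
echo "as-is:        quorum=$(quorum 3) spare=$(spare_failures 3 2)"   # quorum=2 spare=0
echo "add first:    quorum=$(quorum 4) spare=$(spare_failures 4 2)"   # quorum=3 spare=-1
echo "remove first: quorum=$(quorum 2) spare=$(spare_failures 2 2)"   # quorum=2 spare=0
```

With "add first", the cluster has 4 configured members and only 2 healthy ones until the new member actually starts. With "remove first", the 2 healthy members keep quorum throughout, even if the new member is slow to come up.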

## Tools in this bundle

- `scripts/cluster-node.sh`
- generated per-node configs from `scripts/generate-configs.sh`

## Scenario A: Replace a replica with a new VPS

Example: replace `db3` with `db4`

1. Provision the new VPS.
2. Install Docker and WireGuard, and add `/etc/hosts` entries so `db1`, `db2`, `db3`, and `db4` resolve correctly.
3. Generate files for the new node, or copy an existing node directory and update:
   - `NODE_NAME=db4`
   - `WG_IP=<new wg ip>`
4. Bring up the new host on WireGuard and verify that it can ping the existing cluster nodes.
5. Start the new node as a PostgreSQL replica:

```bash
./scripts/cluster-node.sh add-replica db4 10.20.0.14 root@db4.example.com
```

6. Confirm the new node appears in Patroni and is healthy:

```bash
./scripts/cluster-node.sh list
```

7. Add `db4` to the HAProxy backend and reload HAProxy.
8. Stop and retire the old replica:

```bash
./scripts/cluster-node.sh remove-replica db3 root@db3.example.com
```

9. If `db3` also hosted etcd and is going away permanently, replace the etcd member too (see Scenario C).
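
The WireGuard check in step 4 could be scripted roughly as follows. This is a sketch, not something shipped in the bundle: `check_peers` is a hypothetical helper, and the `-W` timeout flag assumes the Linux iputils `ping`.

```bash
# Preflight sketch: check that every cluster peer answers over WireGuard.
# Host names follow the db1..db4 example and rely on the /etc/hosts entries from step 2.

# Prints one line per peer and returns non-zero if any peer is unreachable.
check_peers() {
  local failed=0 host
  for host in "$@"; do
    # -c 1: send one echo request; -W 2: wait at most 2 seconds for the reply.
    if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
      echo "ok   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}

# Usage on the new host (db4), before running add-replica:
#   check_peers db1 db2 db3
```

Running it before `add-replica` catches a broken tunnel or a missing `/etc/hosts` entry before Patroni starts a base backup.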

## Scenario B: Replace the current primary

Example: current primary is `db1`, new node is `db4`

1. Add `db4` as a new replica and wait for it to join.
2. Switch over from the old primary to `db4`:

```bash
./scripts/cluster-node.sh switchover db4
```

3. Confirm `db4` is now leader.
4. Remove `db1`, which is now a replica:

```bash
./scripts/cluster-node.sh remove-replica db1 root@db1.example.com
```

5. Update HAProxy if needed.
6. If `db1` also hosted etcd and is being retired permanently, replace the etcd member too (see Scenario C).
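
One way to confirm the new leader in step 3 without reading the table by eye is Patroni's REST API, where `GET /leader` answers 200 only on the node holding the leader lock. The sketch below assumes the REST API listens on Patroni's default port 8008 and is reachable over WireGuard; neither is confirmed by this bundle, and `is_leader` is a hypothetical helper.

```bash
# Returns 0 if the Patroni node at HOST reports that it holds the leader lock.
# Assumption: Patroni's REST API is reachable on PORT (default 8008).
is_leader() {
  local host="$1" port="${2:-8008}" code
  # Print only the HTTP status code; fall back to 000 if the request fails.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 \
    "http://$host:$port/leader") || code=000
  [ "$code" = "200" ]
}

# Usage after the switchover:
#   is_leader db4 && echo "db4 is the leader"
#   is_leader db1 || echo "db1 no longer holds the leader lock"
```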

## Scenario C: Replace an etcd member on a retired host

Example: old etcd member `db3`, replacement `db4`

1. First make sure PostgreSQL redundancy is preserved.
   - If `db4` will also be a DB node, add it as a PostgreSQL replica first.
2. Remove the old etcd member and add the new one:

```bash
./scripts/cluster-node.sh replace-etcd-member db3 db4 10.20.0.14
```

3. Use the `member add` output to configure etcd on `db4`.
4. Bring up etcd on `db4` with `initial-cluster-state=existing`.
5. After etcd is healthy, bring up Patroni/PostgreSQL on `db4` if it is not already running from step 1.
6. Confirm etcd membership:

```bash
./scripts/cluster-node.sh etcd-members
```
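
For step 3, `etcdctl member add` prints a short confirmation followed by `ETCD_NAME`, `ETCD_INITIAL_CLUSTER`, `ETCD_INITIAL_ADVERTISE_PEER_URLS`, and `ETCD_INITIAL_CLUSTER_STATE` assignments. The sketch below turns that output into an env file for `db4`. It assumes `replace-etcd-member` passes the `member add` output through unchanged, which this playbook does not confirm; the helper names and file names are hypothetical.

```bash
# Keep only the ETCD_* assignments printed by `etcdctl member add`.
# Reads the saved command output on stdin, writes env-file lines on stdout.
extract_etcd_env() {
  grep -E '^ETCD_[A-Z_]+='
}

# Succeeds only if the env lines on stdin tell etcd to join an existing cluster.
joins_existing_cluster() {
  grep -Eq '^ETCD_INITIAL_CLUSTER_STATE="?existing"?$'
}

# Usage (file names are placeholders):
#   ./scripts/cluster-node.sh replace-etcd-member db3 db4 10.20.0.14 | tee member-add.out
#   extract_etcd_env < member-add.out > etcd-db4.env
#   joins_existing_cluster < etcd-db4.env || echo "refusing to start: state is not 'existing'"
```

Checking the cluster state before starting etcd guards against accidentally bootstrapping `db4` as a brand-new single-member cluster.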

## Recommended full order when replacing one VPS

If the host runs both PostgreSQL and etcd, use this order:

1. Build new VPS
2. Connect WireGuard
3. Add new PostgreSQL replica
4. Wait for sync
5. Update HAProxy
6. Switchover if the old host is primary
7. Remove old PostgreSQL node
8. Remove old etcd member
9. Add new etcd member
10. Bring up etcd on the new host with the correct membership values, if not already done
11. Verify Patroni and etcd health
12. Destroy old VPS
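
Several of these steps are really "wait until a check passes" (step 4 in particular). A minimal polling helper might look like the sketch below. `wait_until` is a hypothetical name, and the check you pass to it, such as the `replica_caught_up` placeholder in the usage comment, is something you would have to supply yourself.

```bash
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while :; do
    # Stop as soon as the check passes.
    "$@" && return 0
    # Give up after the last allowed attempt, without sleeping again.
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage (replica_caught_up is a placeholder for your own check):
#   wait_until 60 5 replica_caught_up db4 || { echo "db4 did not catch up"; exit 1; }
```

Putting a hard limit on the wait keeps a stalled base backup from silently blocking the rest of the migration.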

## Verification checklist

### Patroni

```bash
./scripts/cluster-node.sh list
```

You want to see:
- exactly one `Leader`
- all expected replicas present
- no old host still listed once it is retired
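
These three checks can be automated against the table output. The sketch below assumes that `cluster-node.sh list` prints the standard `patronictl list` table (pipe-separated columns `Member | Host | Role | ...`); if the script formats its output differently, the column numbers need adjusting. Both helper names are hypothetical.

```bash
# Count rows whose Role column is exactly "Leader".
# Reads a patronictl-style table on stdin.
count_leaders() {
  awk -F'|' 'NF > 3 { role = $4; gsub(/^ +| +$/, "", role)
                      if (role == "Leader") n++ }
             END { print n + 0 }'
}

# Succeeds if NAME appears in the Member column of the table on stdin.
has_member() {
  awk -F'|' -v name="$1" 'NF > 3 { m = $2; gsub(/^ +| +$/, "", m)
                                   if (m == name) found = 1 }
                          END { exit !found }'
}

# Usage after retiring db3:
#   out=$(./scripts/cluster-node.sh list)
#   [ "$(printf '%s\n' "$out" | count_leaders)" = "1" ] || echo "expected exactly one Leader"
#   printf '%s\n' "$out" | has_member db3 && echo "db3 is still listed"
```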

### etcd

```bash
./scripts/cluster-node.sh etcd-members
```

You want to see only the current members.
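
A scripted version of this check could compare the member names against the set you expect. The sketch assumes `cluster-node.sh etcd-members` prints `etcdctl member list` in its default comma-separated format (`ID, status, name, peer URLs, client URLs, ...`); the helper names are hypothetical.

```bash
# Print the sorted member names from an `etcdctl member list` on stdin.
# The name is the third comma-separated field; it is empty for a member
# that was added but has not started yet.
member_names() {
  awk -F', ' 'NF >= 3 { print $3 }' | sort
}

# Succeeds only if the names on stdin are exactly the expected ones.
# Usage: members_are NAME [NAME...]
members_are() {
  local expected actual
  expected=$(printf '%s\n' "$@" | sort)
  actual=$(member_names)
  [ "$actual" = "$expected" ]
}

# Usage after replacing db3 with db4:
#   ./scripts/cluster-node.sh etcd-members | members_are db1 db2 db4 \
#     || echo "unexpected etcd membership"
```

Because an unstarted member has an empty name, the comparison also fails while the new member is registered but not yet running.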

### Application path

- connect to PostgreSQL through HAProxy
- verify reads and writes still work
- verify pgAdmin and PostgREST still work through Caddy
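
For the first two checks, a quick probe is to ask the server behind HAProxy whether it is in recovery: the primary answers `f`, a replica answers `t`. The sketch below only shows that the write endpoint lands on the primary; it does not perform an actual write. The host name, port, database, and user in the usage comment are placeholders, since this playbook does not state them, and `routes_to_primary` is a hypothetical helper.

```bash
# Succeeds if the PostgreSQL endpoint at HOST:PORT is served by the primary.
# Credentials are expected via PGPASSWORD or ~/.pgpass.
routes_to_primary() {
  local host="$1" port="$2" out
  # -A: unaligned output, -t: rows only, -c: run a single command.
  out=$(psql "host=$host port=$port dbname=postgres user=postgres connect_timeout=5" \
    -Atc 'SELECT pg_is_in_recovery()') || return 1
  [ "$out" = "f" ]
}

# Usage (placeholder host and port for the HAProxy write listener):
#   routes_to_primary proxy.example.com 5000 || echo "write endpoint is not on the primary"
```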

## Practical warning

Because etcd is colocated with PostgreSQL in this design, replacing a host changes two clusters at once.
That is workable, but more delicate than keeping etcd on three tiny separate VPS instances.

If you expect frequent provider churn, separating etcd from the DB nodes later will make migrations easier.
