lbriner | 1 year ago

> How do I check replica lagging? I use the Prometheus exporter for Postgres.
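For anyone wondering what a lag number is made of, here is a minimal sketch of deriving lag in bytes from two WAL positions (LSNs). It assumes you have already fetched the primary's position (e.g. from `pg_current_wal_lsn()`) and the replica's replay position (e.g. from `pg_last_wal_replay_lsn()`). The function names and the 16 MiB threshold are hypothetical, and this is not necessarily how the exporter computes its metric.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    """Bytes of WAL the replica has not replayed yet (clamped at zero)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_replay_lsn))


def is_lagging(primary_lsn: str, replica_replay_lsn: str,
               threshold_bytes: int = 16 * 1024 * 1024) -> bool:
    """True when the lag exceeds the threshold (hypothetical default: 16 MiB)."""
    return replication_lag_bytes(primary_lsn, replica_replay_lsn) > threshold_bytes
```

For example, `replication_lag_bytes("1/0", "0/FFFFFFF0")` gives 16, since the two positions are 16 bytes apart across the 32-bit boundary.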

> How would I monitor the replica? Same. You can also use something like HAProxy calling a Postgres CLI command to connect to the instance.
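As a rough illustration of the "call a CLI command" idea, the sketch below wraps `pg_isready`, which ships with Postgres and reports status through its exit code. The comment above does not say which command is used, so choosing `pg_isready` is an assumption, and the helper names are hypothetical. An external check script would do something similar and exit non-zero when the instance is down.

```python
import subprocess

# Exit codes documented for pg_isready.
PG_ISREADY_STATUS = {
    0: "accepting connections",
    1: "rejecting connections",
    2: "no response",
    3: "no attempt made",
}


def describe_exit_code(code: int) -> str:
    """Translate a pg_isready exit code into a readable status."""
    return PG_ISREADY_STATUS.get(code, "unknown")


def instance_is_up(host: str, port: int = 5432, timeout_s: int = 3) -> bool:
    """Return True only if pg_isready says the server accepts connections."""
    try:
        result = subprocess.run(
            ["pg_isready", "-h", host, "-p", str(port), "-t", str(timeout_s)],
            capture_output=True,
            timeout=timeout_s + 2,  # guard in case the command itself hangs
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # pg_isready is not installed or did not return in time: treat as down.
        return False
    return result.returncode == 0
```

Note that this only tells you the instance is reachable. It says nothing about whether it is a primary or a replica, or how far behind it is.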

> How do I failover? Mostly, you probably want to do this manually because there can be data loss and you want to make sure the risk is worth it. I simply use repmgr for this.

> Do I need 2 replicas? It's usually good to have at least 3 nodes (1 master and 2 slaves), mostly so that if one fails you still have 2 remaining, i.e. time to get a 3rd back online.

> How do I failback? Again, very easy with repmgr: you just tell the original primary to be the primary again. The failed-over primary gets stopped, the original primary gets fast-forwarded and promoted to primary, and everything else gets told to follow.

I do agree that this space for Postgres is very fragmented and some tools appear abandoned, but it's pretty straightforward with just Postgres + barman + repmgr. I have a series of videos on YouTube if you are interested, but I am not a Postgres expert, so please no hating :-) https://youtu.be/YM41mLZQxzE

cheald | 1 year ago

+1 to all of this. The thing I'd add is that we use barman for our additional replicas; WAL streaming is very easy to do with Barman, and we stream to two backups (one onsite, one offsite). The only real costs are bandwidth and disk space, both of which are cheap. Compared to running a full replica (with its RAM costs), it's a very economical way to have a robust disaster recovery plan.

If you're doing manual failover, you don't need an odd number of nodes in the cluster (since you aren't looking for quorum to automatically resolve split-brain like you would be with tools like Elasticsearch or redis-sentinel). So for us it's just a question of "how long does it take to get back online if we lose the primary" (answer: as long as it takes to determine that we need to do a switch and invoke repmgr switchover), and "how robust are we against catastrophic failure" (answer: we can recover our DB from a very-close-to-live barman backup from the same DC, or from an offsite DC if the primary DC got hit by an airplane or something).
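To spell out the quorum point: the sketch below is just the generic majority-quorum arithmetic, not the logic of any particular tool, and the function names are made up for illustration. It shows why automatic failover setups favor odd node counts: going from 3 to 4 nodes raises the quorum size without letting you lose any more nodes.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for n in range(2, 6):
        print(n, quorum_size(n), tolerated_failures(n))
```

With manual failover a human makes the call instead of a vote, so none of this constrains how many nodes you run.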