egnehots | 1 year ago
Use PostgreSQL administrative functions, specifically: pg_last_xact_replay_timestamp. (https://www.postgresql.org/docs/current/functions-admin.html...)
> How would I monitor the replica? A simple cron task that pings a health check if everything is OK (lag is < x) would be a good start.
There are many solutions, highly dependent on your context and the scale of your business. Options range from simple cron jobs with email alerts to more sophisticated setups like ELK/EFK, or managed services such as Datadog.
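As a rough illustration of the cron approach, here is a minimal Python sketch that reads the lag from pg_last_xact_replay_timestamp through the psql CLI and pings a health check only when the lag is under a threshold. The connection string, threshold, and ping URL are placeholders I made up, not anything from the question. One caveat: on a primary with no writes, this value keeps growing even though the replica is fully caught up, so treat it as a starting point.

```python
import subprocess
import sys
import urllib.request

# Hypothetical values for illustration only; replace with your own.
REPLICA_DSN = "host=replica.example.internal dbname=postgres user=monitor"
MAX_LAG_SECONDS = 60.0
HEALTHCHECK_URL = "https://healthcheck.example.invalid/ping/replica-lag"

# Seconds since the last replayed transaction was committed on the primary.
# Returns NULL when nothing has been replayed (for example, on a primary).
LAG_QUERY = "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"


def parse_lag(psql_output):
    """Return the lag in seconds, or None if psql printed an empty (NULL) result."""
    text = psql_output.strip()
    return float(text) if text else None


def lag_is_ok(lag_seconds, max_lag=MAX_LAG_SECONDS):
    """Unknown lag (None) counts as unhealthy so a broken replica never looks fine."""
    return lag_seconds is not None and lag_seconds < max_lag


def fetch_lag(dsn=REPLICA_DSN):
    """Run the lag query on the replica with psql (-X: no psqlrc, -A -t: bare value)."""
    result = subprocess.run(
        ["psql", "-X", "-A", "-t", "-d", dsn, "-c", LAG_QUERY],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_lag(result.stdout)


def main():
    """Return 0 and ping the health check if lag is fine, 1 otherwise."""
    try:
        lag = fetch_lag()
    except (subprocess.SubprocessError, OSError, ValueError) as exc:
        print(f"lag check failed: {exc}", file=sys.stderr)
        return 1
    if not lag_is_ok(lag):
        print(f"replica lag unhealthy: {lag}", file=sys.stderr)
        return 1
    # Only a healthy run pings; a missing ping is what triggers the alert.
    urllib.request.urlopen(HEALTHCHECK_URL, timeout=10).close()
    return 0

# In a cron script, finish with: raise SystemExit(main())
```

The idea is that the alert fires on the absence of a ping, so a dead cron host or an unreachable replica is noticed the same way as excessive lag.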
> How do I failover to the replica if the primary goes down?
> Should I handle failover automatically or manually?
> Do I need two replicas to avoid a split-brain scenario? My head hurts already.
While it may be tempting to automate failover with a tool, I strongly recommend manual failover if your business can tolerate some downtime.
This approach lets you understand why the primary went down before you promote anything, so the same issue doesn't take out the replica too. It's often not trivial to restore the primary or convert it to a replica. YOU become the consensus algorithm, the observer, deciding which instance becomes the primary.
Two scenarios to avoid:
* Falling back to a replica only for it to fail (e.g., due to a full disk).
* Successfully switching over so transparently that you will not notice that you're now running without a replica.
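Both scenarios can be turned into checks that you run by hand before and after a failover. The sketch below is only an illustration: the 20% free-space threshold and the function names are my own assumptions, and the standby count is expected to come from running the query in STANDBY_COUNT_QUERY on the current primary.

```python
import shutil

# Hypothetical threshold: refuse to fail over onto a disk that is over 80% full.
MIN_FREE_FRACTION = 0.20

# Run on the current primary; counts replicas that are actively streaming.
STANDBY_COUNT_QUERY = (
    "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming'"
)


def free_fraction(total_bytes, free_bytes):
    """Fraction of the disk that is free; an unknown size counts as zero headroom."""
    return free_bytes / total_bytes if total_bytes else 0.0


def preflight_problems(total_bytes, free_bytes, streaming_standbys,
                       min_free=MIN_FREE_FRACTION):
    """Return a list of human-readable problems; an empty list means no red flags."""
    problems = []
    if free_fraction(total_bytes, free_bytes) < min_free:
        problems.append("low disk space on the failover target")
    if streaming_standbys < 1:
        problems.append("no streaming standby attached to the primary")
    return problems


def check_data_dir(path, streaming_standbys):
    """Check the filesystem holding `path` (run this on the replica host itself)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return preflight_problems(usage.total, usage.free, streaming_standbys)
```

Running the same check again after the switch covers the second scenario: a standby count of zero means you are now operating without a replica.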
> After a failover (whether automatic or manual), how do I reconfigure the primary to be the primary again, and the replica to be the replica?
It's easier to switch roles and configure the former primary as the new replica. It will then automatically synchronize with the current primary.
You might also want to use the replica for:
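* Some read-only queries. However, for long-running queries, you will need to configure how long the replica may delay replay (for example, max_standby_streaming_delay); otherwise queries that conflict with replay get cancelled.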
* Backups or point-in-time recovery.
If you manage a database yourself, I strongly recommend first gaining confidence in your backups and your ability to restore them quickly. Then you can play with replication; there are tons of little settings to configure (async for perf, large enough WAL size to restore quickly, ...).
It's not that hard, but you want to have the confidence and the procedure written down before you have to do it in a production incident.