feike | 5 years ago

One of the authors of Patroni here.

Automatic failover for PostgreSQL works great and can be done safely if combined with synchronous replication.

Multiple tools implement this correctly:

https://patroni.readthedocs.io/en/latest/replication_modes.h...
https://github.com/sorintlab/stolon/blob/master/doc/syncrepl...

Quoting a former colleague here, but "if it hurts, do it more often". That is what you should do with your PostgreSQL failovers.

I have clusters whose timelines number in the hundreds without a byte of data loss, thanks to synchronous replication, tools that help with leader election, and simply doing failovers often.

takeda|5 years ago

Can Patroni tell whether the master node is unresponsive because it is busy or because it is dead? GitHub (I believe) had a few outages that caused data loss because their auto failover mechanism kicked in when it shouldn't have.

I would actually be interested in aphyr's analysis of Patroni and other distributed add-ons to PostgreSQL.

pas|5 years ago

There is no real difference between dead and too busy.

The only question is how soon you are going to page humans: after the automated mechanism has flipped your master 2-3 times but the cluster still hasn't made progress (nothing coming out of the master, or it locks up again after a few minutes), or right after some other automated mechanism detects that there's a problem.
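To make the first option concrete, here is a minimal sketch of an escalation rule that pages a human once too many automatic failovers happen within a short window. The class name, the threshold of 3 and the 10-minute window are all made up for illustration; this is not how Patroni or any other tool decides to alert.

```python
from collections import deque


class FailoverEscalation:
    """Hypothetical policy: page humans after repeated automatic failovers."""

    def __init__(self, max_failovers: int = 3, window_seconds: float = 600.0) -> None:
        self.max_failovers = max_failovers    # assumed threshold
        self.window_seconds = window_seconds  # assumed sliding window
        self._events: deque = deque()         # timestamps of recent failovers

    def record_failover(self, now: float) -> bool:
        """Record a failover at time `now`; return True if humans should be paged."""
        self._events.append(now)
        # Forget failovers that fell out of the sliding window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.max_failovers
```

The second option from the comment (page as soon as any monitor sees a problem) would correspond to `max_failovers=1`.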

Whatever automation you have in place, it has advantages and disadvantages. In the GitHub case - I suppose - they determined post-mortem that it would have been better to just let the master chug through the incoming onslaught of queries instead of failing over, and over, and over. (But of course this seems like a trivial problem in any auto failover setup, so I suspect there's more to the story.)

feike|5 years ago

> Can Patroni tell if master node is not responsive because it is busy vs dead

No. But the contract Patroni has is this:

I only serve as master (primary) if I hold the lock. If I do not hold the lock, I demote.

As a result, there can be only one active primary at any given point in time, even if the network is partitioned.

This in and of itself does not rule out split-brain: it can still occur if writes were made on the former primary but had not yet reached the future primary. This, however, can be mitigated with synchronous replication.
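The lock contract above can be sketched roughly as follows. This is a simplified illustration, not Patroni's actual code: `LockStore` is a hypothetical in-memory stand-in for a consensus store, the TTL value is arbitrary, and a real implementation would also check replication state before promoting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LockStore:
    """Hypothetical in-memory stand-in for a consensus store holding the leader lock."""

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None

    def acquire_or_renew(self, node: str, ttl: float, now: float) -> bool:
        lease = self._lease
        # The lock is granted if it is free, expired, or already held by this node.
        if lease is None or lease.expires_at <= now or lease.holder == node:
            self._lease = Lease(node, now + ttl)
            return True
        return False


class Node:
    def __init__(self, name: str, store: LockStore, ttl: float) -> None:
        self.name, self.store, self.ttl = name, store, ttl
        self.role = "replica"

    def tick(self, now: float, store_reachable: bool = True) -> str:
        # Contract: serve as primary only while holding the lock, otherwise demote.
        # A node that cannot reach the store cannot renew, so it must demote too.
        has_lock = store_reachable and self.store.acquire_or_renew(self.name, self.ttl, now)
        self.role = "primary" if has_lock else "replica"
        return self.role
```

Note that a busy primary and a dead primary look the same here: either way the lease is not renewed, it expires, and another node may take over.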

zozbot234|5 years ago

> tell if master node is not responsive because it is busy vs dead?

The postgres documentation will tell you that you'll need to set up your own mechanisms for this, and that they will need to integrate with OS facilities as appropriate. One-size-fits-all does not cut it. Not wrt. replication, not wrt. HA/failover.