luhn | 10 months ago
Multi-AZ instances is a long-standing feature of RDS where the primary DB is synchronously replicated to a secondary DB in another AZ. On failure of the primary, RDS fails over to the secondary.
Multi-AZ clusters has two secondaries, and transactions are synchronously replicated to at least one of them. This is more robust than multi-AZ instances if a secondary fails or is degraded. It also allows read-only access to the secondaries.
Multi-AZ clusters no doubt have more "magic" under the hood, as it's not a vanilla Postgres feature as far as I'm aware. I imagine this is why it's failing the Jepsen test.
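To make the difference between the two acknowledgement rules concrete, here is a minimal sketch in Python. It only models the description above (one secondary that must confirm, versus two secondaries of which at least one must confirm). The function name, the AZ labels, and the `required` parameter are made up for illustration and say nothing about how RDS actually implements this.

```python
def commit_acknowledged(acks, required=1):
    """Return True once at least `required` secondaries have confirmed the write.

    `acks` maps a (hypothetical) secondary name to whether it has confirmed.
    """
    confirmed = sum(1 for ok in acks.values() if ok)
    return confirmed >= required


# Multi-AZ instance: a single secondary, so that one secondary must confirm.
instance_acks = {"az-b": False}

# Multi-AZ cluster: two secondaries, and a confirmation from either is enough,
# so one degraded secondary does not block the commit.
cluster_acks = {"az-b": False, "az-c": True}

print(commit_acknowledged(instance_acks))  # False
print(commit_acknowledged(cluster_acks))   # True
```

In this toy model the cluster keeps accepting commits while one secondary is slow or down, which is the robustness benefit described above.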
ants_a|10 months ago
There still is a Postgres deficiency that makes something similar to this pattern possible. Non-replicated transactions where the client goes away mid-commit become visible immediately. So in the example, if T1 happens on a partitioned leader, disconnects during commit, T2 also happens on a partitioned node, and T3 and T4 happen later on a new leader, you would also see the same result. However, this does not jibe with the statement that fault injection was not done in this test.
Edit: did not notice the post pointing out that this pattern can be explained by inconsistent commit order on replica and primary. Kind of embarrassing given I've done a talk proposing how to fix that.
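A toy illustration of why an inconsistent commit order matters, as a minimal Python sketch. The two orders below are hypothetical and the helper names are made up; this is not a model of Postgres internals or of RDS. It only shows that if the primary and the replica make the same two commits visible in different orders, a reader on each can take snapshots that no single commit order could produce.

```python
def visible(order, upto):
    """Transactions a snapshot sees after the first `upto` commits in `order` took effect."""
    return frozenset(order[:upto])


def compatible(a, b):
    """Snapshots drawn from one total commit order always nest: one contains the other."""
    return a <= b or b <= a


# Hypothetical: the same two commits become visible in opposite orders.
primary_order = ["T2", "T1"]
replica_order = ["T1", "T2"]

primary_snapshot = visible(primary_order, 1)  # sees only T2
replica_snapshot = visible(replica_order, 1)  # sees only T1

# One reader saw T2 without T1, the other saw T1 without T2.
print(compatible(primary_snapshot, replica_snapshot))  # False
```

Two snapshots taken on the same node always nest in this model, so the anomaly only shows up when reads are spread across nodes that disagree on the order.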
sontek|10 months ago
ashu1461|10 months ago
So if the snapshot violation is happening inside Multi-AZ instances, can it also happen in a single-region setup with multiple read replicas? And is it just more easily observable in Multi-AZ setups because the lag is higher?
luhn|10 months ago
Two replicas in a “semi synchronous” configuration, as AWS calls it, is to my knowledge not available in base Postgres. AWS must be using some bespoke replication strategy, which would have different bugs than synchronous replication and is less battle-tested.
But as nobody except AWS knows the implementation details of RDS, this is all idle speculation that doesn’t mean much.
x0x0|10 months ago
> We show that Amazon RDS for PostgreSQL multi-AZ clusters violate Snapshot Isolation
you kind of have to expect people to read
evil-olive|10 months ago
however, "multi-AZ" has been made ambiguous, because there are now multi-AZ instances and multi-AZ clusters.
...and your multi-AZ "instance", despite being not a multi-AZ "cluster" from AWS's perspective, is still two nodes that are "clustered" together and treated as one logical database from the client connection perspective.
see [0] and scroll down to the "availability and durability" screenshot for an example.
0: https://aws.amazon.com/blogs/aws/amazon-rds-multi-az-db-clus...