
A race condition in Aurora RDS

241 points | theanomaly | 3 months ago | hightouch.com

82 comments


gtowey|3 months ago

This article seems to indicate that manually triggered failovers will always fail if your application tries to maintain its normal write traffic during that process.

Not that I'm discounting the author's experience, but something doesn't quite add up:

- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

- If they do know, how is this not an urgent P0 issue for AWS? It would mean one of the most basic usability features is 100% broken.

- Is there something more nuanced to the failure case here, such as a dependence on in-progress transactions? I can see how the failover might wait for in-flight transactions to close, hit a timeout, and then proceed with the other part of the failover by accident. That could explain why the issue doesn't seem more widespread.

twisteriffic|3 months ago

> How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through"

theanomaly|3 months ago

I'm surprised this hasn't come up more often too. When we worked with AWS on this, they confirmed there was nothing unique about our traffic pattern that would trigger this issue. We also didn't run into this race condition in any of our other regions running similar workloads. What's particularly concerning is that this seems to be a fundamental flaw in Aurora's failover mechanism that could theoretically affect anyone doing manual failover.

kobalsky|3 months ago

> - How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

I know that there is no comparison in the user base, but a few years ago I ran into a massive Python + MySQL bug that:

1. made SELECT ... FOR UPDATE fail silently
2. aborted the transaction and set the connection into autocommit mode

This is basically a worst-case scenario in a transactional system.
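A minimal sketch of why (mysql-connector-python style calls; the table and connection details are made up):

    import mysql.connector

    # Hypothetical connection; the point is the transaction handling below.
    conn = mysql.connector.connect(host="db", user="app", database="bank")
    conn.autocommit = False
    cur = conn.cursor()

    conn.start_transaction()
    # The application believes this takes a row lock. With the bug, the statement
    # fails silently, the transaction is aborted, and the connection is flipped
    # into autocommit -- so no lock is held from here on.
    cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (42,))
    row = cur.fetchone()
    balance = row[0] if row else 0

    # In autocommit mode this UPDATE commits immediately and without the lock, so
    # a concurrent writer can interleave between the read and the write: a silent
    # lost update.
    cur.execute("UPDATE accounts SET balance = %s WHERE id = %s", (balance - 100, 42))

    # The application's commit is now meaningless; it still thinks the whole
    # read-modify-write ran atomically.
    conn.commit()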

I was basically screaming like a mad man in the corner but no one seemed to care.

Someone contacted me months later telling me that they experienced the same problem with "interesting" consequences in their system.

The bug was eventually fixed, but by that point I wasn't tracking it anymore; I had provided a patch when I created the issue and moved on.

https://stackoverflow.com/questions/945482/why-doesnt-anyone...

aetherson|3 months ago

My experience with AWS is that they are extremely, extremely parsimonious about any information they give out. It is near-impossible to get them to give you any details about what is happening beyond the level of their API. So my gut hunch is that they think that there's something very rare about this happening, but they refuse to give the article writer the information that might or might not help them avoid the bug.

maherbeg|3 months ago

Yeah I agree, this seems like a pretty critical feature of the Aurora product itself. We saw similar behavior recently with a connection pooler in between, which suggests something is wrong with how they propagate DNS changes during the failover. wtf aws

Hovertruck|3 months ago

Agreed, we've been running multiple aurora clusters in production for years now and have not encountered this issue with failovers.

belter|3 months ago

The article is low quality. It does not mention which Aurora PostgreSQL version was involved, and it provides no real detail about how the staging environment differed from production, only saying that staging “didn’t reproduce the exact conditions,” which is not actionable.

This AWS documentation section: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQ...

“Amazon Aurora PostgreSQL updates”: under “Aurora PostgreSQL 17.5.3, September 16, 2025 – Critical stability enhancements” there is a potential match:

“...Fixed a race condition where an old writer instance may not step down after a new writer instance is promoted and continues to write…”

If that is the underlying issue, it would be serious, but without more specifics we can’t draw conclusions.

For context: I do not work for AWS, but I do run several production systems on Aurora PostgreSQL. I will try to reproduce this using the latest versions over the next few hours. If I do not post an update within 24 hours, assume my tests did not surface anything.

That would not rule out a real issue in certain edge cases, configurations, or version combinations but it would at least suggest it is not broadly reproducible.

nijave|3 months ago

fwiw we haven't seen issues doing manual failovers for maintenance using the same/similar procedure described in the article. I imagine there is something more nuanced here, and it's hard to draw too many conclusions without a lot more details being provided by AWS.

grogers|3 months ago

It sounds like part of the problem was how the application reacted to the reverted failover. They had to restart their service to get writes accepted, implying some sort of broken caching behavior where it kept trying to send queries to the wrong primary.

It's at least possible that this sort of aborted failover happens a fair amount, but if there's no downtime then users just try again and it succeeds, so they never bother complaining to AWS. Unless AWS is specifically monitoring for it, they might be blind to it happening.
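A rough sketch of the pattern that avoids the restart (my guess, not their actual code): treat a read-only error on a write as a signal to drop the cached connection and re-resolve the cluster endpoint. With psycopg2, a write sent to a demoted writer typically surfaces as SQLSTATE 25006:

    import psycopg2
    from psycopg2 import errors

    # Hypothetical cluster writer endpoint.
    WRITER_DSN = "host=mycluster.cluster-xyz.us-east-1.rds.amazonaws.com dbname=app"

    _conn = None

    def get_conn():
        global _conn
        if _conn is None or _conn.closed:
            _conn = psycopg2.connect(WRITER_DSN)  # fresh DNS lookup happens here
        return _conn

    def execute_write(sql, params):
        global _conn
        try:
            conn = get_conn()
            with conn, conn.cursor() as cur:  # commits on success, rolls back on error
                cur.execute(sql, params)
        except errors.ReadOnlySqlTransaction:
            # Still pointed at the old (now read-only) writer: drop the cached
            # connection so the next attempt re-resolves the endpoint, instead of
            # requiring a full service restart.
            if _conn is not None:
                _conn.close()
            _conn = None
            raise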

benmmurphy|3 months ago

It could be that most people pause writes, because executing a write against an instance that refuses to accept writes is going to create errors, and for some people those errors might not be recoverable. So they just have some option in their application that puts it into maintenance mode, where it hard-rejects writes at the application layer.

nrhrjrjrjtntbt|3 months ago

P0 if it happens to everyone, right? Like the USE1 outage recently. If it's 0.001% of customers (enough to get a HN story) it may not be that high. Maybe this customer is on a migration or upgrade path under the hood. Or just on a bad unit in the rack.

dboreham|3 months ago

Although the article has an SEO-optimized vibe, I think it's reasonable to take it as true until refuted. My rule of thumb is that any rarely executed, very tricky operation (e.g. database writer fail over) is likely to not work because there are too many variables in play and way too few opportunities to find and fix bugs. So the overall story sounds very plausible to me. It has a feel of: it doesn't work under continuous heavy write load, in combination with some set of hardware performance parameters that plays badly with some arbitrary time out. Note that the system didn't actually fail. It just didn't process the fail over operation. It reverted to the original configuration and afaics preserved data.

biggoodwolf|3 months ago

I recall seeing this also happen in CosmosDB, with both auto and manual failovers.

time0ut|3 months ago

Wow. This is alarming.

We have done a similar operation routinely on databases under pretty write intensive workloads (like 10s of thousands of inserts per second). It is so routine we have automation to adjust to planned changes in volume and do so a dozen times a month or so. It has been very robust for us. Our apps are designed for it and use AWS’s JDBC wrapper.

Just one more thing to worry about I guess…

dangoodmanUT|3 months ago

Not really: their storage layer worked perfectly and prevented any ACID violations.

grhmc|3 months ago

Yikes! This is exactly the kind of invariant I'd expect Aurora to maintain on my behalf. It is why I pay them so much...

dangoodmanUT|3 months ago

It did, the storage layer did not allow for concurrent writes.

halifaxbeard|3 months ago

I think OP is wrong in their hypothesis based on the logs they share and the root cause AWS support provided them.

I think the promotion fails to happen and then an external watchdog notices that it didn’t, and kills everything ASAP as it’s a cluster state mismatch.

The message about the storage subsystem going away is after the other Postgres process was kill -9’d.

jansommer|3 months ago

People who have experience with Aurora and RDS Postgres: what's your experience in terms of performance? If you don't need multi-AZ and quick failover, can you achieve better performance with RDS and e.g. gp3 with 64,000 IOPS and 3,125 MB/s throughput (assuming everything else can deliver that and CPU/mem isn't the bottleneck)? Aurora seems to be especially slow for inserts and also quite expensive compared to what I get with RDS when I estimate things in the calculator. And what's the story on read performance for Aurora vs RDS? There's an abundance of benchmarks showing Aurora is better in terms of performance, but they leave out so much about their RDS config that I'm having a hard time believing them.

nijave|3 months ago

We've seen better results and lower costs in a 1 writer, 1-2 reader setup on Aurora PG 14. The main advantages are 1) you don't re-pay for storage for each instance -- you pay for cluster storage instead of per-instance storage, and 2) you no longer need to provision IOPS, and it provides ~80k IOPS.

If you have a PG cluster with 1 writer, 2 readers, 10Ti of storage and 16k provisioned IOPS (io1/2 has better latency than gp3), you pay for 30Ti and 48k PIOPS without redundancy, or 60Ti and 96k PIOPS with multi-AZ.

With the same Aurora setup you pay for 10Ti and get multi-AZ for free (assuming the same cluster setup and that you've put the instances in different AZs).

I don't want to figure the exact numbers but iirc if you have enough storage--especially io1/2--you can end up saving money and getting better performance. For smaller amounts of storage, the numbers don't necessarily work out.
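Roughly, with placeholder prices (purely illustrative, not real AWS pricing -- the point is only the structure: per-instance storage and PIOPS on RDS vs one shared cluster volume on Aurora; compute and per-request IO charges excluded on both sides):

    # Hypothetical prices for illustration only.
    IO1_PER_GIB = 0.125      # $/GiB-month
    IO1_PER_IOPS = 0.10      # $/provisioned-IOPS-month
    AURORA_PER_GIB = 0.10    # $/GiB-month

    instances = 3            # 1 writer + 2 readers
    storage_gib = 10 * 1024  # 10 TiB
    piops = 16_000

    # RDS: every instance carries its own copy of the storage and PIOPS.
    rds_single_az = instances * (storage_gib * IO1_PER_GIB + piops * IO1_PER_IOPS)
    rds_multi_az = 2 * rds_single_az  # standbys double storage and PIOPS again

    # Aurora: one shared cluster volume, no provisioned IOPS, and spreading the
    # instances across AZs doesn't change the storage bill.
    aurora = storage_gib * AURORA_PER_GIB

    print(f"RDS single-AZ storage+PIOPS: ${rds_single_az:,.0f}/mo")
    print(f"RDS multi-AZ storage+PIOPS:  ${rds_multi_az:,.0f}/mo")
    print(f"Aurora cluster storage:      ${aurora:,.0f}/mo")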

There are also two IO billing modes to be aware of. The default is pay-per-IO, which is really only helpful for extreme spikes and generally low IO usage. The other mode is "provisioned" or "storage optimized" or something, where you pay a flat 30% of the instance cost (in addition to the instance cost) for unlimited IO -- you can get a lot more IO and end up cheaper in this mode if you had an IO-heavy workload before.

I'd also say Serverless is almost never worth it. Iirc provisioning instances was ~17% of the cost of serverless. Serverless only works out if you have ~ <4 hours of heavy usage followed by almost all idle. You can add instances fairly quickly and failover for minimal downtime (of course barring running into the bug the article describes...) to handle workload spikes using fixed instance sizes without serverless

Scubabear68|3 months ago

For me, the big miss with Postgres Aurora RDS was costs. We had some queries that did a fair amount of I/O in a way that would not normally be a problem, but in the Aurora Postgres RDS world that I/O was crazy expensive. A couple of fuzzy queries blew costs up to over $3,000/month for a database that should have cost maybe $50-$100/month. And this was for a dataset of only about 15 million rows without anything crazy in them.

Exoristos|3 months ago

We were burned by Aurora. Costs, performance, latency, all were poor and affected our product. Having good systems admins on staff, we ended up moving PostgreSQL on-prem.

belter|3 months ago

> There's an abundance of benchmarks showing Aurora is better in terms of performance but they leave out so much about their RDS config that I'm having a hard time believing them.

Do you have a problem believing these claims on equivalent hardware?: https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...

Or do your own performance assessments, following the published document and templates available so you can find the facts on your own?

For Aurora MySQL:

"Amazon Aurora Performance Assessment Technical Guide" - https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...

For Aurora Postgres:

"...Steps to benchmark the performance of the PostgreSQL-compatible edition of Amazon Aurora using the pgbench and sysbench benchmarking tools..." - https://d1.awsstatic.com/product-marketing/Aurora/RDS_Aurora...

"Automate benchmark tests for Amazon Aurora PostgreSQL" - https://aws.amazon.com/blogs/database/automate-benchmark-tes...

"Benchmarking Amazon Aurora Limitless with pgbench" - https://aws.amazon.com/blogs/database/benchmarking-amazon-au...

paranoidrobot|3 months ago

My experience is with Aurora MySQL, not postgres. But my understanding is that the way the storage layer works is much the same.

We have some clusters with very high write IOPS on Aurora.

When looking at costs we modelled running MySQL and regular RDS MySQL.

We found that for the IOPS capacity of Aurora, we wouldn't be able to match it on AWS without paying a stupid amount more.

everfrustrated|3 months ago

Aurora doesn't use EBS under the hood. It has no option to choose storage type or io latency. Only a billing choice between pay per io or fixed price io.

jaggederest|3 months ago

I've had better results with managing my own clusters on metal instances. You get much better performance with e.g. NVMe drives in a 0+1 raid (~million iops in a pure raid 0 with 7 drives) and I am comfortable running my own instances and clusters. I don't care for the way RDS limits your options on extensions and configuration, and I haven't had a good time with the high availability failovers internally, I'd rather run my own 3 instances in a cluster, 3 clusters in different AZs.

Blatant plug time:

I'm actually working for a company right now ( https://pgdog.dev/ ) that is working on proper sharding and failovers from a connection pooler standpoint. We handle failovers like this by pausing write traffic for up to 60 seconds by default at the connection pooler and swapping which backend instance is getting traffic.
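The shape of it, as a generic sketch (not our actual implementation; names are made up): a gate in front of the write path closes when a failover starts, writers queue at the gate instead of erroring, and once the new primary is swapped in the gate opens and the queued writes drain to it.

    import asyncio

    class WriteGate:
        """Toy pause-and-swap gate for a connection pooler's write path."""

        def __init__(self, pause_timeout=60.0):
            self._open = asyncio.Event()
            self._open.set()
            self._timeout = pause_timeout
            self.primary = "old-writer.internal"  # hypothetical backend

        async def execute_write(self, query):
            # Writers wait here (instead of failing) while a failover is running.
            await asyncio.wait_for(self._open.wait(), timeout=self._timeout)
            return await self._send(self.primary, query)

        async def failover(self, new_primary):
            self._open.clear()              # start holding writers at the gate
            try:
                self.primary = new_primary  # swap which backend gets traffic
            finally:
                self._open.set()            # release everything that queued up

        async def _send(self, host, query):
            # Stand-in for the real backend round trip.
            return f"{query} -> {host}"

A real pooler also has to drain in-flight transactions on the old primary before the swap; this only shows the pause/swap skeleton.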

shawabawa3|3 months ago

> 3125 throughput

Max throughput on gp3 was recently increased to 2GB/s, is there some way I don't know about of getting 3.125?

shayonj|3 months ago

Sadly, it's not the first time I have noticed unexpected and odd behaviors from the Aurora PostgreSQL offering.

I noticed another interesting (and still unconfirmed) bug with Aurora PostgreSQL around their Zero Downtime Patching.

During an Aurora minor version upgrade, Aurora preserves sessions across the engine restart, but it appears to also preserve stale per-session execution state (including the internal statement timer). After ZDP, I’ve seen very simple queries (e.g. a single-row lookup via Rails/ActiveRecord) fail with `PG::QueryCanceled: ERROR: canceling statement due to statement timeout` in far less than the configured statement_timeout (GUC), and only in the brief window right after ZDP completes.

My working theory is that when the client reconnects (e.g. via PG::Connection#reset), Aurora routes the new TCP connection back to a preserved session whose “statement start time” wasn’t properly reset, so the new query inherits an old timer and gets canceled almost immediately even though it’s not long-running at all.

d1egoaz|3 months ago

> AWS has indicated a fix is on their roadmap, but as of now, the recommended mitigation aligns with our solution: use Aurora’s Failover feature on an as-needed basis and ensure that no writes are executed against the DB during the failover.

Is there a case number where we can reach out to AWS regarding this recommendation?

paranoidrobot|3 months ago

Yeah. I'd like this too.

We use Aurora MySQL but I would like to be able to point to that and ask if it applies to us.

dangoodmanUT|3 months ago

This confirms a lot of what their engineers preach: The lego brick model.

They made the storage layer in total isolation, and they made sure that it guaranteed correctness for exclusive writer access. When the upstream service failed to also make its own guarantees, the data layer was still protected.

Good job AWS engineering!

robinduckett|3 months ago

Glad to know I’m not crazy.

theanomaly|3 months ago

AWS Support initially pushed back and suggested it's because of high replication lag but they were looking at metrics that were more than 24 hours old. What kind of failure did you encounter? I really want to understand what edge case we triggered in their failover process - especially since we could not reproduce it in other regions.

bob1029|3 months ago

> Aurora's architecture differs from traditional PostgreSQL in a crucial way: it separates compute from storage.

I find this approach very compelling. MSSQL has a similar thing with their hyperscale offering. It's probably the only service in Azure that I would actually use.

redwood|3 months ago

A good reminder of how the mental model of adding read replicas as a way to scale is a slippery slope. At the end of the day you're scaling only one specific part of your system, with certain consistency dynamics that are difficult to reason about.

terminalshort|3 months ago

Works fine for workloads like:

1. I need to grab some rows from a table

2. Eventual consistency is good enough

And that's a lot of workloads.

nijave|3 months ago

You can hit the same problems horizontally scaling compute. One instance reads from the DB, a request hits a different instance which updates the DB. The original instance writes to the DB and overwrites the changes or makes decisions based on stale data.

More broadly a distributed system problem
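The classic shape of it, as a small sketch (table and columns made up): two app instances do a read-modify-write on the same row, and without a guard the later write silently clobbers the earlier one. A version column turns the write into a compare-and-set so the stale writer fails instead:

    import psycopg2

    def rename_profile(conn, user_id, new_name):
        with conn, conn.cursor() as cur:  # commits on success, rolls back on error
            # Instance A reads the row (possibly via a lagging read replica).
            cur.execute("SELECT name, version FROM profiles WHERE id = %s", (user_id,))
            _, version = cur.fetchone()

            # ...meanwhile instance B may update the same row and bump version...

            # The version check makes this a compare-and-set: if B got there first,
            # rowcount is 0 and we know our read was stale, rather than silently
            # overwriting B's change.
            cur.execute(
                "UPDATE profiles SET name = %s, version = version + 1 "
                "WHERE id = %s AND version = %s",
                (new_name, user_id, version),
            )
            if cur.rowcount == 0:
                raise RuntimeError("stale read: row changed since it was loaded")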

almosthere|3 months ago

probably should have added postgres to end of title

evanelias|3 months ago

Absolutely this. The differences between Aurora Postgres and Aurora MySQL are quite significant. A failover bug affecting one doesn't imply the same bug exists in the other.

A lot of people seem to have the misconception that "Aurora" is its own unique database system, with different front-ends "pretending" to be Postgres or MySQL, but that isn't the case at all.

ldkge|3 months ago

Am I the only one who misread that as “AI race condition”?