The 'single primary with read replicas' pattern scaling to 800M users is the real insight here. Most startups reach for sharding or distributed databases way too early, adding complexity for scale they don't have. If OpenAI can serve hundreds of millions from one Postgres primary by offloading reads and pushing new write-heavy features elsewhere, that's a strong argument for simplicity.
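The read-offloading pattern the post describes can be sketched as a tiny statement router: reads fan out across replicas, while anything that writes goes to the one primary. This is a minimal illustration, not OpenAI's actual setup — the DSNs and the prefix-based heuristic are made up for the example.

```python
import itertools

# Hypothetical connection strings for illustration only.
PRIMARY_DSN = "postgresql://primary.internal/app"
REPLICA_DSNS = [
    "postgresql://replica-1.internal/app",
    "postgresql://replica-2.internal/app",
]

_replica_cycle = itertools.cycle(REPLICA_DSNS)

def route(sql: str) -> str:
    """Return the DSN a statement should run against.

    Reads can be offloaded to replicas (round-robin here);
    writes must all go to the single primary.
    """
    # Crude heuristic: real systems must also catch writable CTEs,
    # SELECT ... FOR UPDATE, functions with side effects, etc.
    if sql.lstrip().lower().startswith("select"):
        return next(_replica_cycle)
    return PRIMARY_DSN
```

The interesting part is everything this sketch omits: replica lag, retry logic, and connection pooling are where "making this work in practice isn't simple" comes in.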
fbotelho|1 month ago
> It may sound surprising that a single-primary architecture can meet the demands of OpenAI’s scale; however, making this work in practice isn’t simple.
And it also says that this approach has cornered them into a solution that isn't trivial to change. They now run multiple database deployments: the single-primary one that is the focus of the post, plus *multiple* other systems, such as Azure Cosmos DB, to which some of the write traffic is directed.
> To mitigate these limitations and reduce write pressure, we’ve migrated, and continue to migrate, shardable (i.e. workloads that can be horizontally partitioned), write-heavy workloads to sharded systems such as Azure Cosmos DB, optimising application logic to minimise unnecessary writes. We also no longer allow adding new tables to the current PostgreSQL deployment. New workloads default to the sharded systems.
I wonder how easy it is for developers to maintain and evolve this mix of miscellaneous database systems.
So yes, you can go far with a single primary, but you can also potentially never easily get away from it.
oofbey|1 month ago
That said, one big beefy database is so simple to start with. And this story is a strong example that YAGNI is a practical reality for almost everyone when it comes to "distributed everything".
kevincox|1 month ago
Quite possibly they would have been better off staying purely on Postgres, but with sharding. Impossible to know, though.