The 'single primary with read replicas' pattern scaling to 800M users is the real insight here. Most startups reach for sharding or distributed databases way too early, adding complexity for scale they don't have. If OpenAI can serve hundreds of millions from one Postgres primary by offloading reads and pushing new write-heavy features elsewhere, that's a strong argument for simplicity.
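The read-offloading pattern the post describes can be sketched as a tiny statement router: reads fan out across replicas, while anything that writes goes to the one primary. This is a minimal illustration, not OpenAI's actual setup — the DSNs and the prefix-based heuristic are made up for the example.

```python
import itertools

# Hypothetical connection strings for illustration only.
PRIMARY_DSN = "postgresql://primary.internal/app"
REPLICA_DSNS = [
    "postgresql://replica-1.internal/app",
    "postgresql://replica-2.internal/app",
]

_replica_cycle = itertools.cycle(REPLICA_DSNS)

def route(sql: str) -> str:
    """Return the DSN a statement should run against.

    Reads can be offloaded to replicas (round-robin here);
    writes must all go to the single primary.
    """
    # Crude heuristic: real systems must also catch writable CTEs,
    # SELECT ... FOR UPDATE, functions with side effects, etc.
    if sql.lstrip().lower().startswith("select"):
        return next(_replica_cycle)
    return PRIMARY_DSN
```

The interesting part is everything this sketch omits: replica lag, retry logic, and connection pooling are where "making this work in practice isn't simple" comes in.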
fbotelho|1 month ago
> It may sound surprising that a single-primary architecture can meet the demands of OpenAI’s scale; however, making this work in practice isn’t simple.
And it also says that this approach has cornered them into a solution that isn't trivial to change. They now run multiple database deployments: the single-primary one that is the focus of the post, plus *multiple* other systems, such as Azure Cosmos DB, to which some of the write traffic is directed.
> To mitigate these limitations and reduce write pressure, we’ve migrated, and continue to migrate, shardable (i.e. workloads that can be horizontally partitioned), write-heavy workloads to sharded systems such as Azure Cosmos DB, optimising application logic to minimise unnecessary writes. We also no longer allow adding new tables to the current PostgreSQL deployment. New workloads default to the sharded systems.
I wonder how easy it is for developers to maintain and evolve this mix of miscellaneous database systems.
So yes, you can go far with a single primary, but you can also potentially never easily get away from it.
oofbey|1 month ago
That said, one big beefy database is so simple to start with. And this story is a strong example that YAGNI is a practical reality for almost everyone when it comes to "distributed everything".
kevincox|1 month ago
Quite possibly they would have been better off staying purely on Postgres, but with sharding. Impossible to know, though.