top | item 39710346


menthe | 1 year ago

This, seriously. The long-term maintenance, tribal knowledge & risks associated with this giant hack will be greater than anything they'd ever have expected. Inb4 global outage post-mortem & key-man dependency salaries.

There's virtually no excuse not to spin up a pg pod (or two) for each tenant - heck, even a namespace with the whole stack.

Embed your 4-phase migrations directly in your releases / deployments, slap a py script on top to manage progressive rollouts, and you're done.
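A minimal sketch of what that rollout script could look like - all names and the wave-based strategy here are hypothetical, not anything Figma actually runs:

```python
# Hypothetical sketch: order per-tenant migration rollouts into waves,
# least-critical tenants first, so a bad release is caught early.
from dataclasses import dataclass

@dataclass
class Tenant:
    name: str
    criticality: int  # 0 = internal/test, higher = more sensitive

def rollout_waves(tenants, wave_size=2):
    """Group tenants into deployment waves, least critical first."""
    ordered = sorted(tenants, key=lambda t: t.criticality)
    return [ordered[i:i + wave_size] for i in range(0, len(ordered), wave_size)]

tenants = [
    Tenant("acme", 3), Tenant("internal", 0),
    Tenant("bigcorp", 5), Tenant("startup", 1),
]
for wave in rollout_waves(tenants):
    print([t.name for t in wave])
# Each wave would be deployed, health-checked, then the next one released.
```

Between waves you'd run your health checks and halt on regression; that's the whole "progressive rollout" logic.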

Discovery is automated, the blast / loss radius is reduced to a single tenant, you can now monitor / pin / adjust the stack for each customer individually as necessary, order the release schedule by client criticality / sensitivity, easily geolocate each deployment to the tenant's location, charge by resource usage, and much more.

And you can still query & roll up all of your databases at once for analytics with Trino/DBT, with nothing more than a YAML inventory.
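Concretely, something like this - a hypothetical sketch where the inventory entries, hostnames, and table names are made up, and the inventory is shown already parsed (as `yaml.safe_load` would return it):

```python
# Hypothetical sketch: turn a per-tenant inventory (as loaded from YAML)
# into Trino catalog properties, then build one roll-up query that
# UNION ALLs the same table across every tenant catalog.
inventory = [
    {"tenant": "acme", "host": "pg-acme.svc", "db": "app"},
    {"tenant": "bigcorp", "host": "pg-bigcorp.svc", "db": "app"},
]

def trino_catalog(entry):
    """Contents of a Trino postgresql-connector catalog file for one tenant."""
    return "\n".join([
        "connector.name=postgresql",
        f"connection-url=jdbc:postgresql://{entry['host']}:5432/{entry['db']}",
    ])

def union_all_query(entries, table):
    """One query spanning every tenant, tagged with the tenant name."""
    parts = [
        f"SELECT '{e['tenant']}' AS tenant, * FROM {e['tenant']}.public.{table}"
        for e in entries
    ]
    return "\nUNION ALL\n".join(parts)

print(trino_catalog(inventory[0]))
print(union_all_query(inventory, "projects"))
```

Regenerate the catalogs whenever the inventory changes and your analytics layer never needs to know how many databases exist.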

No magic, no proprietary garbage.


nightpool | 1 year ago

Figma has millions of customers. The idea of having a Postgres pod for each one would be nearly impossible without completely overhauling their DB choice.

menthe | 1 year ago

You are making a major conflation here. While they do have millions of users, they were last reported to have only ~60k tenants.

Decently sized EKS nodes can easily hold nearly 800 pods each (as documented), which would make it 75 nodes. Each EKS cluster supports up to 13,500 nodes. Spread those across a couple of regions to improve your customer experience, and you're looking at ~20 EKS nodes per cluster. This is a nothingburger.

Besides, it's far from rocket science to co-locate tenant schemas on medium-sized pg instances, monitor tenant growth, and re-balance schemas as necessary. Tenants' contracts do not evolve overnight, and certainly do not grow by orders of magnitude week over week - a company using Figma has either 10 seats, 100 seats, 1,000, or 10,000 seats. It's easy to plan ahead for. And I would MUCH rather think about re-balancing a heavy-hitter customer's schema onto another instance every now and then (which can be 100% automated too) than face a business-wide SPOF and have to hire L07+ DBAs to maintain a proprietary query parser / planner / router.
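That rebalancing loop really is simple. A hypothetical greedy sketch - instance names, sizes, and the capacity threshold are all made up:

```python
# Hypothetical sketch of the rebalancing described above: watch per-schema
# size on each shared Postgres instance, and when one runs hot, move its
# heaviest tenant to the least-loaded instance that can absorb it.
def rebalance_plan(instances, capacity_gb=500):
    """instances: {instance: {tenant: size_gb}} -> list of (tenant, src, dst) moves."""
    moves = []
    for name, schemas in instances.items():
        while sum(schemas.values()) > capacity_gb:
            heavy = max(schemas, key=schemas.get)
            dst = min((i for i in instances if i != name),
                      key=lambda i: sum(instances[i].values()), default=None)
            # Only move if the destination stays under capacity; otherwise
            # the fleet needs a new instance, which is a provisioning step.
            if dst is None or sum(instances[dst].values()) + schemas[heavy] > capacity_gb:
                break
            instances[dst][heavy] = schemas.pop(heavy)
            moves.append((heavy, name, dst))
    return moves

fleet = {
    "pg-1": {"acme": 300, "bigcorp": 400},
    "pg-2": {"startup": 50},
}
print(rebalance_plan(fleet))  # proposes moving bigcorp from pg-1 to pg-2
```

Each proposed move is just a schema dump/restore plus a connection-string update in the inventory - exactly the kind of thing a cron'd script handles.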

Hell, OVH does tenant-based deployments of Ceph clusters, with co-located/co-scheduled SSD/HDD hardware, and does hot-spot resolution. And running Ceph is significantly more demanding and admin+monitoring heavy.

asguy | 1 year ago

Thousands of directories over thousands of installations? It’s not that far-fetched.