I'm at a company that is weighing a very similar decision (we are on RDS Postgres with a rapidly growing database that will require some horizontal partitioning). There really isn't an easy solution. We spoke to people who have done sharding in-house (Figma, Robinhood) as well as others who migrated to natively distributed systems like Cockroach (Doordash).
If you decide to move off of RDS but stay on Postgres, you can run your own Postgres but now lose all the benefits of a managed service. You could move off of AWS (eg to Azure), but moving to a different cloud is a huge lift. That, btw, would also be required if you want to try something like Spanner (move to GCP). Moving off of Postgres to another database is also risky. The migration will obviously take some effort, but you're also probably talking about lots of code changes, schema changes, etc, as well as unknown operational risks that you'll discover on the new system. This applies if you're talking about things like Cockroach or even moving to MySQL.
That said, rolling your own sharding is a MASSIVE undertaking. Limitless looks promising, since it takes care of a lot of the issues that Figma ended up spending time on (basically, you shouldn't need something like Figma's DBProxy, as shard routing, shard splitting, etc will be taken care of). It's still in preview though, and like the article mentioned, the costs may be high.
Overall, no easy decisions on this one, unfortunately.
> That said, rolling your own sharding is a MASSIVE undertaking.
It's a large challenge, but it's absolutely doable. A ton of companies did this 10-15 years ago, basically every successful social network, user generated content site, many e-commerce sites, massively multiplayer games, etc. Today's pre-baked solutions didn't exist then, so we all just rolled our own, typically on MySQL back then.
With DIY, the key thing is to sidestep any need for cross-shard joins. This is easier if you only use your relational DB for OLTP, and already have OLAP use-cases elsewhere.
Storing "association" type relation tables on 2 shards helps tremendously too: for example, if user A follows user B, you want to record this on both user A and user B's shards. This way you can do "list all IDs of everyone user A follows" as well as "list all IDs of users following user B" without crossing shard boundaries. Once you have the IDs, you have to do a multi-shard query to get the actual user rows, but that's a simple scatter-gather by ID and easy to parallelize.
Implementing shard splitting is hard, but again definitely doable. Or avoid it entirely by putting many smaller shards on each physical DB server -- then instead of splitting a big server, you can just move an entire shard to another server, which avoids any row-by-row operations.
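A sketch of the many-logical-shards trick, assuming a hypothetical fixed count of 4096 logical shards mapped onto a handful of physical servers; "moving" a shard is just a data copy plus a map update:

```python
NUM_LOGICAL_SHARDS = 4096  # fixed up front and never changed

# Mutable mapping of logical shard -> physical server. Rebalancing means
# copying one logical shard's tables elsewhere and updating this map;
# no row-by-row split of a big shard is ever needed.
shard_to_server = {s: f"db-{s % 4}" for s in range(NUM_LOGICAL_SHARDS)}

def route(user_id: int) -> str:
    logical = user_id % NUM_LOGICAL_SHARDS  # stable for the key's lifetime
    return shard_to_server[logical]

print(route(123456))  # db-0
shard_to_server[123456 % NUM_LOGICAL_SHARDS] = "db-9"  # "move" that shard
print(route(123456))  # db-9
```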
Many other tricks like this. It's a lot of tribal knowledge scattered across database conference talks from a decade ago :)
What I do not understand is why they say "we explored CockroachDB, TiDB, Spanner, and Vitess". Those are not compatible with PostgreSQL beyond the wire protocol, and migration would require massive rewrites and testing to get the same behavior. YugabyteDB, by contrast, uses PostgreSQL itself for SQL processing, to provide the same features and behavior, and distributes data with a Spanner-like architecture.
I'm not saying that there's no risk and no effort, but both are limited. And testing is easy because you don't have to change the application code.
I don't understand why they didn't spend a few days on a proof of concept with YugabyteDB, and instead explored only solutions where the application cannot work as-is.
We use Citus. Very similar performance properties to DIY sharding but much more polished. Currently at 7 TB, self-hosted, growing roughly 100% per year, write-heavy. Works fine for us.
I’m definitely of the opinion that what Figma[0] (and earlier, Notion[1]) did is what I’d call “actual engineering”.
Both of these companies are very specific about their circumstances and requirements:
- Time is ticking, and downtime is guaranteed if they don’t do anything
- They are not interested in giving up the massive amount of features AWS supports via RDS, especially around data recovery (anyone involved with Business Continuity planning would understand the importance of this)
- They need to be able to do it safely, incrementally, and without closing the door on reverting their implementation/rollout
- The impact on Developers should be minimal
“Engineering” at its core is about navigating the solution space given your requirements, and they did it well!
Both Figma and Notion meticulously focused on the minimal feature set they needed, in the most important order, to prevent disastrous downtime (e.g. Figma didn’t need to support all of SQL for sharding, just their most used subset).
Both companies call out (rightfully so) that they have extensive experience operating RDS at this point, and that existing solutions either didn’t give them the control they needed (Notion) or required a significant rewrite of their data structures (Figma), which was not acceptable.
I think many people also completely underestimate how important operational experience with your solution is at this scale. Switch to Citus/Vitess? You’ll now find out the hard way all the “gotchas” of running those solutions at scale, and it would almost certainly have resulted in significant downtime while they acquired this knowledge.
They’d also have to spend a ton of time catching up to RDS features they were suddenly lacking, which I would wager would take much more time than the time it took implementing their solutions.
The right way to look at it, IMHO, is to interpret "lots of RDS experience" as a complete lack of run-your-own-PostgreSQL experience. Given that, it's not surprising that their cost-benefit math gave them the answer "invest in custom middleware" instead of "move to running our own PostgreSQL plus some sharding thing on top".
Of course it's not cheap, but they are probably deep in the AWS money pit anyway (so running Citus or whatever would have a similar TCO).
And that's okay: AWS is expensive for a lot of bootstrapped startups with huge infra requirements for each additional user, but Figma and Notion are virtually at the opposite end of that spectrum.
It also shows that there's no trivial solution in this space; sharding SQL databases is very hard in general, and the existing solutions have sharp edges.
Good points. Although, having worked on many high scale architectures before, I always err on the side of thinking that any technical solution of this magnitude would have far too many nuances for a blog post to capture. And I believe Figma’s post also talks mainly about the common denominator that’s easier to communicate to the external world.
For me to understand their choices, I’ll first have to understand their system, which can be a task in itself. Without that, I’d not become aware of the nuances, only general patterns.
Aren't there any good managed Postgres solutions supporting Citus? The decision here seems to have been to invent a whole new sharding solution instead of building enough in-house DBA capacity to self-host Postgres (if you want to stay on Amazon, you can use any extension you want on EC2). Speaks to the state of engineering right now.
3rd-party solutions can also add complexity you don't want. You need to keep up with their release schedules to get bug and security fixes, even though, feature-wise, you'd be happy with the older version.
They can also add unnecessary complexity by having features you don't need, or they might be missing features you do need. Contributing upstream can be difficult, and there may be conflicts of interest, especially for projects that have a separate paid version.
Now seems a good time to point out, the wheel has literally been reinvented over and over again. The wheels of yesterday were terrible. Each version gets better. It's fine, reinvent away folks :)
Seriously, a naive database sharding algorithm could be implemented in a week or so by a competent dev.
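For illustration, the naive version really is a one-liner, though it hides the hard part: changing the shard count remaps most keys, which is exactly why splitting and rebalancing dominate the real effort. A quick sketch of both the algorithm and its catch:

```python
def naive_shard(key: int, n_shards: int) -> int:
    # The "week of work" version: route by modulo.
    return key % n_shards

# The catch: bumping the shard count remaps most keys, forcing a
# mass data migration. Count how many of 10,000 keys move when
# going from 4 to 5 shards.
moved = sum(1 for k in range(10_000)
            if naive_shard(k, 4) != naive_shard(k, 5))
print(f"{moved / 10_000:.0%} of keys move")  # 80% of keys move
```

Consistent hashing and fixed logical-shard counts exist precisely to dodge this remapping problem.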
A company like Figma (billions in revenue) put a small team on implementing a database sharding solution for an unsupported use case (RDS, not just Postgres), AND open sourced it, creating value for the community. That's a net good for the industry.
I still can't understand why they decided to use a single database for all their customers. If each customer needs access to its own data, why not a dedicated database for every customer?
Multi-tenant design is a huge win in terms of reducing developer toil and expense.
Many customers will have a tiny amount of data. For those customers a dedicated database is a huge amount of overhead. There may not be any single customer for whom it makes sense to allocate dedicated "hardware".
Sure you have to deal with a one-time pain to shard your thingy, but you don't need to pay for tens-of-thousands of individual database servers, write interesting tools to keep their schemas in sync, wrangle their backups, etc.
Well, there are trade-offs with this too: needing aggregate data across shards for features and reporting, shared data between customers and users, API access, and so on.
Even Notion has a similar approach to sharding Postgres, but both of them could benefit from having shard IDs prefixed with YY/MM/DD (as needed); otherwise it's back to the shard navigator once they max out on org IDs for each shard's capacity.
The answer is obvious: they invented their own sharding solution because it's a really really cool problem to work on and they have more engineers than they really need to develop their actual product. A more resource-constrained team would have found a solution that sharded their backend using one of the existing solutions out there.
I have seen this several times before and it's always a symptom of having too many engineers working below the waterline. Rather than work on the actual customer-facing problem, let's port the backend to do event-sourcing/cqrs, move all our infrastructure to k8s, change language from x to y etc.
These are all what I would call "internal goals" (ie they may or may not be necessary or even essential to progress but are not directly customer-visible in their outcomes even if they may enable customer features to be built or indirectly improve the customer experience later) and need to be held to an extremely high level of scrutiny.
If you're amazon/google/meta and you need to do this because of extreme user scale I might believe you. If you're CERN or someone and you need to do this because of absolutely ridiculous data scale I might believe you. The idea that it's better for figma to write their own sharding solution than it is to port to one of the existing ones just doesn't pass even the most basic sniff test.
I can buy your comment as an interesting and even credible hypothesis, but the absolutes which you deal in (“doesn’t pass even the most basic sniff test”) are damning. You are clearly lacking huge amounts of information and context and are passing your own assumptions as hard facts.
Also, I’m assuming Amazon or Google will sometimes roll their own solutions on problems of a scale in the same ballpark as Figma’s.
But anyhow, what’s the scale at which this becomes acceptable, exactly? Is there a magical number which serves as a universal threshold? Or is there - like in all engineering decisions - a very concrete economic case for which you and I both lack a lot of the requisite context and inputs?
We went through something similar at Notion a few years ago and also chose to stick with RDS Postgres and build sharding logic in our application’s database client.
In both our case and Figma’s, sharding Postgres ASAP was of critical importance because of transaction ID wraparound threat or other capacity issues that promise hard days-long downtime. The kind of downtime that costs 10s of millions of dollars of brand damage alone. Possibly even company ending.
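For a sense of why wraparound puts you on a clock: Postgres transaction IDs are 32-bit, so a back-of-envelope deadline follows directly from your write rate. The numbers below are hypothetical; in practice you would watch `age(datfrozenxid)` per database:

```python
# Postgres transaction IDs are 32-bit; if the oldest unfrozen rows reach
# an age of ~2^31 transactions, the server stops accepting writes.
XID_HORIZON = 2**31

def days_until_wraparound(current_xid_age: int, xids_per_second: float) -> float:
    remaining = XID_HORIZON - current_xid_age
    return remaining / xids_per_second / 86_400

# Hypothetical numbers: oldest XID age is 1.5B, burning 5k XIDs/sec.
# (In production you'd read the age via `SELECT age(datfrozenxid)`.)
print(f"{days_until_wraparound(1_500_000_000, 5_000):.1f} days left")  # 1.5 days left
```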
In such a situation, failure is not an option, and you must pick the least risky solution. Moving to an unmanaged cluster system and figuring out your own point-in-time backup/restore, access control provisioning, etc. has a lot more unknown unknowns than sticking with the managed database vendor you know.
The potential failure scenarios of Citus have scary worst cases: we get backup and restore wrong but it seems to work fine in test, then we move to Citus, then something breaks and we can't restore from backup after all. It's equally bad to mis-estimate the amount of time needed to bring up the new system. Say you estimate 6 months to reach parity with the RDS built-in features needed to survive disaster and start moving data over, but instead it takes 10 months. Is there enough time left to finish before going hard down? The clock is ticking. Staying with RDS keeps a whole class of new risk out of the picture.
At least here at Notion, NO ONE wanted to build something complicated for fun. We really wanted the company we’d spent years working for and on-call for to survive.
Our story: https://www.notion.so/blog/sharding-postgres-at-notion
The article suggests a different reason. What would be your approach if you wanted to stay on RDS?
> So, now, let me speculate. The real reason why Figma reinvented the wheel by creating their own custom solution for sharding might be as straightforward as this — Figma wanted to stay on RDS, and since Amazon had decided not to support the CitusData extension in the past, the Figma team had no choice but to develop their own sharding solution from scratch.
I suppose Figma might just be beyond the "let's find the fastest/cheapest way to get this working" point. I believe it makes sense for a company in that stage to mess about a bit, find different (maaaybe even better) ways of doing things, keep the engineering work interesting to attract/retain talent, be OK with the inevitable waste involved in that game. If you're chasing the global maximum, you shouldn't get too obsessed with local maxima.
That said, I've seen plenty of unprofitable startups with high burn rate play this game. That seems a bit suicidal to me.
I don't read the story this way personally (not saying that these scenarios do not occur, but I feel the narrative detailed in the original article makes sense even without "chasing cool problems").
The article should be titled "Why Figma HAD TO reinvent the wheel with PostgreSQL".
When you have a legacy system and not enough time or will to move off of it, the only option is to get inventive and build with what you have.
There is always a price. In this case the database team did something quick, cheap, and easy, but the application teams now have to deal with all the nuances of the system. Maybe Figma has more people on those app teams with time on their hands to handle it.
ksec|1 year ago
Yes. It may not fit your needs, but take a look at PlanetScale. (Based on MySQL and Vitess, but I have seen quite a few people moving from Postgres.)
Great job to both teams!
[0] https://www.figma.com/blog/how-figmas-databases-team-lived-t...
[1] https://www.notion.so/blog/sharding-postgres-at-notion
willsmith72|1 year ago
Obviously it could interfere with demand for Aurora Limitless.
plq|1 year ago
I don't think so because Microsoft bought Citus.
jitl|1 year ago
Our shard key - Workspace ID - is a UUIDv4 so there’s a pretty high number of orgs per shard without conflict.
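A sketch of that routing, assuming a hypothetical shard count (Notion's blog post mentions 480 logical shards, but the number here is just illustrative):

```python
import uuid

NUM_SHARDS = 480  # illustrative shard count

def shard_for_workspace(workspace_id: str) -> int:
    # A UUIDv4 is a uniformly random 128-bit value, so reducing its
    # integer form modulo the shard count spreads workspaces evenly.
    return uuid.UUID(workspace_id).int % NUM_SHARDS

s = shard_for_workspace("123e4567-e89b-42d3-a456-426614174000")
print(s)
```

Because the routing is a pure function of the ID, no central lookup table is needed and any application server can compute a workspace's shard locally.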
marwis|1 year ago
It follows the same approach but is far more sophisticated and mature.
adityapatadia|1 year ago
I know it's an oversimplified approach, but the majority of the problem would be solved.