item 28776786

Lessons learned from sharding Postgres at Notion

471 points | craigkerstiens | 4 years ago | notion.so

121 comments

[+] jordanthoms|4 years ago|reply
At Kami (kamiapp.com) we had to make a similar transition last year - we'd seen usage grow over 20x in a 6-month period on our classroom product due to the pandemic, and we were storing all our content in a single-primary Postgres database. We had plenty of headroom before but we were suddenly running out of write capacity, even after scaling vertically up to ~150 vcpu. Our workload is super write heavy so scaling out replicas wouldn't have helped much at all, and I was spending a lot of time manually vacuuming in our daily off-peak as we were burning transaction ids at an astonishing rate, ~800 million per day.

I looked at a bunch of options, but ultimately managing a sharded database seemed like too much overhead for our small (but expanding!) dev team - so we decided to move our heavily loaded tables to CockroachDB instead. Since it's mostly Postgres compatible, the transition would be easier, and it would be much easier to manage as it automatically balances load and heals. We're running Cockroach on GKE and it's a really nice fit for Kubernetes.

Ended up working well for us - we still have our smaller tables on PG, but we want to move it all to Cockroach over time. There were some teething issues as we learnt what to monitor (read amplification is the big one here, though Pebble has helped a lot vs the previous RocksDB), and it does need a lot of hardware at high scale (we're running 50 16-vcpu nodes for it), but overall I'm happy with it.

[+] craigkerstiens|4 years ago|reply
This is a great read and primer for anyone looking to shard who knows they need scale. Having helped a ton of people shard their data, my advice remains: don't do it until you know you have to. The old Citus Data blog is a great resource for guidance around best practices and how-tos, even if you're not using Citus.

That said, some of the things you can do from day 1 make things way easier when the time comes later. Grouping your data by Workspace ID/Tenant ID/Customer ID makes it very shardable, and saves you a lot of work later on compared to the case where you didn't have that in your data model.
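The workspace-keyed grouping above is the crux: if every row carries a workspace/tenant ID, routing becomes a pure function of that ID. A minimal sketch of the idea (the hash function and shard count here are illustrative assumptions, not anyone's actual scheme):

```python
import uuid
import zlib

NUM_LOGICAL_SHARDS = 480  # many logical shards, later mapped onto fewer physical hosts

def logical_shard_for(workspace_id: str) -> int:
    """Hash the workspace UUID to a stable logical shard number."""
    return zlib.crc32(uuid.UUID(workspace_id).bytes) % NUM_LOGICAL_SHARDS

# Every table keyed by workspace_id can be routed the same way, so all
# of one tenant's rows land on one shard and tenant-local joins stay local.
ws = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"
shard = logical_shard_for(ws)
```

The point is that this function exists at all: without a tenant key on every row, there is no single column to hash, and retrofitting one means backfilling it across the whole dataset.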

I'm not sure I buy that Citus/Vitess are magic, both are reasonably clear how they work and you can dig into it. At the same time I'd weigh the downsides of Citus (can't speak to Vitess) in that the online rebalancer isn't open source so at that point it's a proprietary product.

[+] dkhenry|4 years ago|reply
Vitess is pretty close to magic for a lot of use cases. The downside of its rebalancer is that the re-sharding process requires a full expansion of the shards being re-balanced, which is to say: if you have two shards, [-80] and [80-], and you want to split out part of the first shard, you need to provision two new shards, [-40] and [40-80], let the data copy into them, and then mark them as active and delete the old [-80] shard. In this brave new cloud world, that's not as much of an issue as it would have been in the old days of running your own datacenter.
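The split described above can be sketched with one-byte keyspace-ID ranges (a simplification of Vitess's variable-length hex ranges; the granularity and helper names are assumptions):

```python
# Shards are half-open ranges over a one-byte keyspace-ID prefix:
# "-80" covers [0x00, 0x80), "80-" covers [0x80, 0x100).
def make_shards(bounds):
    """bounds like [0x80] -> ranges [(0, 0x80), (0x80, 0x100)]"""
    edges = [0] + sorted(bounds) + [0x100]
    return list(zip(edges, edges[1:]))

def shard_for(keyspace_byte, shards):
    for lo, hi in shards:
        if lo <= keyspace_byte < hi:
            return (lo, hi)

before = make_shards([0x80])         # [-80], [80-]
after = make_shards([0x40, 0x80])    # [-40], [40-80], [80-]

# Splitting [-80] provisions [-40] and [40-80]; a row whose keyspace
# byte is 0x30 moves from the old [-80] shard into the new [-40] shard.
assert shard_for(0x30, before) == (0x00, 0x80)
assert shard_for(0x30, after) == (0x00, 0x40)
```

Because shards are ranges rather than a fixed modulus, a split only touches the shard being divided; the [80-] shard is untouched.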

The big problem everyone I have talked to about sharding runs into is managing the shards as you expand. In this case it looks like Notion over-sharded so they can spin out up to 480 physical nodes, but when they need the 481st it is going to be a nightmare. That's what Vitess gets you for free: expand to any number of shards and never worry about it again.

[+] Johnny555|4 years ago|reply
> don't do it until you know you have to

If you wait until you have to, then isn't it too late to plan and test your solution? So you suffer from degraded performance and/or outages while you rush out a solution.

[+] alberth|4 years ago|reply
>>” Workspace ID/Tenant ID/Customer ID, having your data grouped that way makes it very shardable”

Dumb question: if someone starts off by grouping their data only by Customer ID and then later needs to shard, couldn't all of the sharding problems go away if they simply created a new Customer_Tenant table to map customer_id to tenant_group_id?

[+] mbell|4 years ago|reply
Notion seems like an interesting data storage problem. The vast majority of the data is in the `block` entity; blocks are organized into a sort of nested set, written/updated individually (a user edits one at a time) but read in ranged chunks (a doc).

Offhand this seems like almost a worst case for PG. Since updates to blocks can contain large data (causing rows to be moved often) and there is one big table, it seems likely that the blocks for a single Notion document will end up non-contiguous on disk, and thus require a lot of IO/memory thrashing to read them back out. PG doesn't have a way to tell it how to organize data on disk, so there is no good way around this (CLUSTER doesn't count; it's unusable in most use cases).

Armchair engineering of course - but my first thought would be to find another storage system for blocks that better fits the use case and leave the rest in PG. That does introduce other problems, but storing data like this in PG just feels like a bad fit. Maybe storing an entire doc's worth of block entities in a jsonb column would avoid a lot of this?

[+] alberth|4 years ago|reply
>> “Why 480 shards? … 480 is divisible by a lot of numbers”.

This is an important point, doing things divisible by 12 gives you a lot of flexibility. It’s not a coincidence both time (clocks) and degrees (360) are multiples of 12.

[+] infogulch|4 years ago|reply
Numberphile made a video about "highly composite numbers" / "anti-primes", which are numbers that have more factors than all smaller numbers. https://youtu.be/2JM2oImb9Qg

There is an OEIS sequence of them that starts "1, 2, 4, 6, 12, 24, 36, 48, 60, 120, 180, 240, 360, 720, 840...", which notably does not include 480. https://oeis.org/A002182
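A quick brute-force check of the claim (fine at this size):

```python
def num_divisors(n: int) -> int:
    """Count of divisors of n, by trial division."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def highly_composite_upto(limit: int) -> list[int]:
    """Numbers with strictly more divisors than every smaller number."""
    best, out = 0, []
    for n in range(1, limit + 1):
        d = num_divisors(n)
        if d > best:
            best, out = d, out + [n]
    return out

# 480 = 2^5 * 3 * 5 has 24 divisors -- flexible, but not highly composite:
# 360 is smaller and already has 24 divisors, so 480 sets no new record.
assert num_divisors(480) == 24
assert num_divisors(360) == 24
assert 480 not in highly_composite_upto(500)
```

So 480 is a perfectly serviceable shard count; it just doesn't beat all smaller numbers at divisibility the way 360 does.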

[+] Misdicorl|4 years ago|reply
My rule of thumb is one factor of three and then as many factors of two as you need. Adding five is not much different from two twos or a two and a three. Ditto for seven.

The only downside imo is not having ten(s) be a balanced deploy number.
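The rule of thumb above yields shard counts of the form 3 · 2^k:

```python
# Candidate shard counts under the "one factor of three, then twos" rule.
candidates = [3 * 2**k for k in range(8)]
# [3, 6, 12, 24, 48, 96, 192, 384]

# Each count divides the next, so every doubling splits each shard
# exactly in two. Note 480 = 3 * 160 is NOT on this ladder, because
# 160 isn't a power of two.
assert all(b % a == 0 for a, b in zip(candidates, candidates[1:]))
assert 480 not in candidates
```

(Assumed reading of the comment: the appeal is that growth is always a clean 2x split, at the cost of round-number counts like 10 or 100 never being reachable.)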

[+] garrettf|4 years ago|reply
Howdy all, author here! Sharding our Postgres monolith was a huge undertaking for a small team. We got a huge boost from folks that joined within weeks before the switch-over date. If you’re interested in thinking about problems like this, I’d love to chat. Plus we’re also hiring across all roles—check my profile for details.

I’m happy to answer questions about the project here, feel free to reply below.

[+] gunnarmorling|4 years ago|reply
Nice write-up! Two questions:

- Can you share details on the routing? I.e. how does the app know which database + schema it needs to go to for given workspace?

- Did you consider using several databases on the same Postgres host (instead of schemas within a single database)? Not sure what's better really, curious whether you have any thoughts about it.

Thanks!
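One plausible shape of the database + schema routing being asked about (layout and names are hypothetical, not necessarily what Notion does): compute a logical shard from the workspace ID, then look up which host holds it and which schema on that host it lives in.

```python
# Hypothetical layout: 480 logical shards spread as 15 schemas on each
# of 32 Postgres hosts. The app connects to the host and sets
# search_path to the shard's schema before querying.
SHARDS_PER_HOST = 15
HOSTS = [f"pg-host-{i}" for i in range(32)]  # 32 * 15 = 480 logical shards

def locate(logical_shard: int) -> tuple[str, str]:
    host = HOSTS[logical_shard // SHARDS_PER_HOST]
    schema = f"shard_{logical_shard:03d}"
    return host, schema

assert locate(0) == ("pg-host-0", "shard_000")
assert locate(479) == ("pg-host-31", "shard_479")
```

A static table like this (rather than pure arithmetic) is what makes later moves possible: relocating a schema to a new host only means updating the lookup, not rehashing keys.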

[+] ClassAndBurn|4 years ago|reply
What were the limitations that required you to move all customers to a shared system at once?

Could you have selected some workspaces with lower traffic to migrate first? That would have decreased the load on the primary, potentially speeding up the replication, which is a flywheel to enable more customers to migrate to shards.

[+] varyherb|4 years ago|reply
Are you at all concerned about the selection of the workspace ID as the partition key? With every workspace on a single partition, couldn't a high-throughput workspace create a "hot" partition that will negatively impact other workspaces on that partition?

At the cost of potentially introducing more cross-partition queries, you might benefit from splitting up high-throughput workspaces. See strategy in https://d0.awsstatic.com/whitepapers/Multi_Tenant_SaaS_Stora..., pages 17-20.

[+] silentsea90|4 years ago|reply
> [2] In addition to packaged solutions, we considered a number of alternatives: switching to another database system such as DynamoDB (deemed too risky for our use case)..

1) Can you please talk more about the risks of using DynamoDB or a similar NoSQL solution?

2) Did you consider Spanner, which is the SQL DB of choice within Google and is available on Google Cloud?

Thanks for the wonderful engineering blog posts!

[+] Ozzie_osman|4 years ago|reply
Curious if you had any tables that were "universal" and weren't shardable, global stuff that isn't workspace-specific. Would these be replicated across all shards? Or just stored in some special db (in which case joining to them might be difficult).
[+] sgarrity|4 years ago|reply
The Notion blog has had some quality posts, but it seems to be missing an RSS feed.

You can subscribe via email (no thanks), and I realize there are some good ways to turn an email subscription into a feed (Feedbin.com handles this nicely). Still, I thought some public shaming here might encourage them to build this basic blog feature (that would help people follow their company!).

[+] flowerlad|4 years ago|reply
Anyone starting a tech startup will face this decision: Should I start with a database that has built-in support for sharding? Or should I just start with MySQL or PostgreSQL and defer the sharding question for later?

Notion chose to do manual sharding (aka application-level sharding). That's what you end up doing if you didn't choose a database that has built-in sharding from the get go, because it is extremely hard to switch to a different database technology. (Larry Ellison compares it to a catholic marriage -- there is no divorce!)

I skimmed through the article to find the critical piece of info I was looking for: rationale for doing manual sharding. The rationale supplied in this article is "we wanted control over the distribution of our data." That's a weak explanation. It's the kind of thing you say to justify the bad choice made earlier on: failure to choose a database that supports automatic sharding from the get go.

[+] bsder|4 years ago|reply
From the article:

> By mid-2020, it was clear that product usage would surpass the abilities of our trusty Postgres monolith, which had served us dutifully through five years and four orders of magnitude of growth.

This should answer your question. Just use a standard database and get on with coding features instead.

If you are successful enough to get to the point that sharding PostgreSQL becomes your bottleneck, you've won.

In addition, you really can't solve scaling problems up front. Where your bottleneck actually occurs will differ from where you think it will.

[+] nosefrog|4 years ago|reply
What open source databases support automatic sharding? I think their only other option would have been something like cockroachdb, which probably wasn't very mature when notion started.
[+] endisneigh|4 years ago|reply
On a related note, how far can a single Postgres instance get you?

If you had some behemoth with 32TB ram and 1PB storage could all of Notion fit?

I'm ignoring the obvious single point of failure here, but sometimes the simplicity could be worth it if you're willing to handle that.

I’d be curious to hear about a site that’s in the Alexa 1000 architected how I describe.

[+] aeyes|4 years ago|reply
> If you had some behemoth with 32TB ram and 1PB storage could all of Notion fit?

Postgres has a bunch of low level locks and buffers (protected by locks) which are essentially single threaded. So even if you had a 500 CPU instance at some point you'd not be able to get more throughput out of it.

Of course it also depends on what you are doing. Large tables with high update rates are harder to handle than large tables which are insert-only. My personal opinion is that Postgres tables with >100GB of data (without indexes) are starting to be a pain to work with, no matter how much RAM, CPU or IOPS you have.

[+] jandrewrogers|4 years ago|reply
Postgres starts to struggle when you push it past 10TB, the details of which will depend on your specific data model and workload. I've seen it pushed to 50TB but I would not recommend it. The architecture simply isn't designed for that kind of scale.

There are fast database engines specifically designed for single servers with 1PB of storage, and these servers commonly have hundreds of cores. This is much more efficient for some use cases. It is the kind of thing you find on edge platforms designed to manage and operate on sensor data models. You build database engine internals very differently when working at this storage density and number of cores; many good architectural ideas in databases at much smaller scales become severe bottlenecks even on a single server.

[+] capableweb|4 years ago|reply
Many companies I worked with stopped using Notion in the past because of performance problems; the application was simply too slow. Has this changed with the new sharded architecture? I haven't used it since probably last year.
[+] hinkley|4 years ago|reply
From personal experience (not with Notion) once you fall behind it’s hard to catch up, and nobody declares an emergency until it’s already happening.

Doing big surgery takes time, meanwhile your workload continues to grow and grow. It’s not uncommon for a major win to only reset the clock by 6-12 months, and then you either have to run to the next one, or you have two teams working on separate angles at the same time. All the while people are learning as they go because the business settled on catch-up instead of capacity planning.

Being a little successful can be tough.

[+] tempest_|4 years ago|reply
The app is a lot more snappy, not super fast, but usable.

Search is still pretty slow.

[+] methyl|4 years ago|reply
I can notice the performance no longer being the problem for me. The only gripe I have with Notion is too much flexibility, it’s easy to completely mess something up even by accident.
[+] alberth|4 years ago|reply
Curious, what app have you seen people migrate to?
[+] leetrout|4 years ago|reply
What size data do y'all start worrying about vertical scaling?

I'm surprised at the 500gb table / 10tb db sizes listed.

I start to worry when my db hits 1tb and any single "hot" table is over 100gb.

I know there isn't a one-size-fits-all answer, but I'm genuinely curious if I'm just behind, living in lessons learned ~5 years ago, and we are in a better place now.

[+] notJim|4 years ago|reply
Overall this sounds very similar to Etsy's sharding setup, which I think descends from Flickr's sharding setup. Interesting that this is still the way to do it.
[+] bamazizi|4 years ago|reply
Curious whether something totally different, like a column-oriented DB such as ClickHouse, would've suited a block-oriented app like Notion better?

Does anyone with experience have any thoughts in favour or against such implementation?

[+] dilyevsky|4 years ago|reply
Totally different use case with ClickHouse - point queries there are still slower than Citus, but it kicks its ass on multi-row aggregations. For an auto-sharded OLTP DB, CockroachDB or TiDB seem like way better options.
[+] akarki15|4 years ago|reply
+1. First thought I had when I saw this was: why is Notion using Postgres in the first place? I can see it being used for customer data (which probably doesn't require sharding), but for the content itself a NoSQL or column-oriented DB is a better fit. The question I have is: what is Notion storing in these sharded Postgres DBs?
[+] sillysaurusx|4 years ago|reply
Slightly unrelated, but notion's unofficial python API is cool AF! https://github.com/jamalex/notion-py

It's way more powerful than their official API. The official API can only interact with databases, which (as anyone who uses Notion knows) is a tiny subset of the overall things you want to use Notion for.

The unofficial Python API lets you have complete programmatic access to all notion pages.

The tradeoff is that certain things break. However, after reading the code, I found it quite easy to fix the problems I ran into (https://github.com/jamalex/notion-py/pull/345) and I suspect you'll be able to do the same. (I empathize with maintainers not prioritizing their open source work ahead of family life, business, etc.)

Was surprised that it was even possible to have a nice Python class for every possible Notion object, let alone control them and update them on the fly. I wish their official API would be as flexible. Maybe one day.

[+] caffeinecrazy|4 years ago|reply
I'm curious about whether using Postgres native partitioning was considered? If so, I'd like to hear the rationale for not doing it.

It seems like you made the database sharding decision as a result of vacuum taking too long. Partitioning the problematic table (assuming you're storing "blocks" in 1 giant table) would have enabled per-partition vacuuming, which will avoid long running vacuum processes.
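For reference, a sketch of what native hash partitioning of a hypothetical `blocks` table could look like (Python is used only to build the DDL strings; the table and column names are assumptions). Each partition is a separate table to autovacuum, so no single vacuum pass has to walk the whole dataset:

```python
# Generate DDL for a hash-partitioned table (PostgreSQL 11+ syntax).
NUM_PARTITIONS = 16

ddl = ["""CREATE TABLE blocks (
    id uuid NOT NULL,
    workspace_id uuid NOT NULL,
    data jsonb,
    PRIMARY KEY (workspace_id, id)  -- PK must include the partition key
) PARTITION BY HASH (workspace_id);"""]

for i in range(NUM_PARTITIONS):
    ddl.append(
        f"CREATE TABLE blocks_p{i:02d} PARTITION OF blocks "
        f"FOR VALUES WITH (MODULUS {NUM_PARTITIONS}, REMAINDER {i});"
    )

assert len(ddl) == 1 + NUM_PARTITIONS
```

Partitioning shares one property with the workspace-keyed sharding discussed in the article: both require a partition key on every row, which is why the choice of key matters from day one.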

[+] louwrentius|4 years ago|reply
I still wonder so hard what a single bare metal server would have done in terms of performance. [0]

May have saved months of planning that may never have been required.

[0]: https://yourdatafitsinram.net/

[+] sa46|4 years ago|reply
You'll hit all sorts of other fun problems by vertically scaling.

- Many concurrent connections to a single database which Postgres has traditionally not handled well (though improved in recent versions).

- You're now on the hook for writing a database control plane.

- Backup and restore is much flakier since the data volume is so much larger. Lots of weird shit starts happening when you download tens of TB.

- In general, everything is harder once you start nearing machine limits.

[+] 0xbadcafebee|4 years ago|reply
This can quickly become untenable depending on the availability and contractual requirements you have. You're literally building on a single point of failure. And if it's really big and you design wrong, recovery could take weeks.
[+] dikei|4 years ago|reply
I have 2 questions:

* How do you setup hot stand-by for each database ?

* Do you have coordinated backup for all the databases, or are they backed up individually ?

[+] DarthNebo|4 years ago|reply
This was so insightful, gave me a clearer picture of what all needs to be done during migration!
[+] TruthWillHurt|4 years ago|reply
What is this, 2008? "Our team spent months architecting this migration". Maybe you should have gone with a cloud offering like Dynamo/RDS, or a serverless database like BigQuery/Aurora.