crux|1 year ago
Notion does not sell its users' data.
Instead, I want to expand on one of the first use-cases for the Notion data lake, which came from my team. This is an elaboration of the description in TFA under the heading "Use case support".
As is described there, Notion's block permissions are highly normalized at the source of truth. This is usually quite efficient and generally brings along all the benefits of normalization in application databases. However, we need to _denormalize_ all the permissions that relate to a specific document when we index it into our search index.
When we transactionally reindex a document "online", this is no problem. However, when we need to reindex an entire search cluster from scratch, loading every ancestor of each page in order to collect all of its permissions is far too expensive.
Thus, one of the primary needs that my team had from the new data lake was "tree traversal and permission data construction for each block". We rewrote our "offline" reindexer to read from the data lake instead of reading from RDS instances serving database snapshots. This allowed us to dramatically reduce the impact of iterating through every page when spinning up a new cluster (not to mention save a boatload on spinning up those ad-hoc RDS instances).
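To make the "tree traversal and permission data construction" idea concrete, here is a minimal sketch in Python. It is not our actual reindexer: the row shape `(block_id, parent_id, grants)` and the function name `denormalize_permissions` are made up for illustration, and it assumes permissions only accumulate down the tree (a real model also has to handle restrictions). The point is that one top-down pass over the whole tree computes every block's effective permissions, instead of loading every ancestor separately for each page.

```python
from collections import defaultdict


def denormalize_permissions(blocks):
    """Return {block_id: frozenset of principals with access}.

    `blocks` is an iterable of hypothetical (block_id, parent_id, grants)
    rows, where `grants` holds the principals granted access directly on
    that block and `parent_id` is None for a root (workspace) block.
    Assumption: grants are purely additive down the tree.
    """
    children = defaultdict(list)
    own_grants = {}
    roots = []
    for block_id, parent_id, grants in blocks:
        own_grants[block_id] = frozenset(grants)
        if parent_id is None:
            roots.append(block_id)
        else:
            children[parent_id].append(block_id)

    effective = {}
    # Iterative depth-first walk; each block is visited exactly once and
    # receives the permissions already accumulated by its ancestors.
    stack = [(root, frozenset()) for root in roots]
    while stack:
        block_id, inherited = stack.pop()
        perms = inherited | own_grants[block_id]
        effective[block_id] = perms
        for child in children[block_id]:
            stack.append((child, perms))
    return effective


blocks = [
    ("workspace", None, {"alice"}),
    ("page", "workspace", {"bob"}),
    ("paragraph", "page", set()),
    ("other_page", "workspace", set()),
]
effective = denormalize_permissions(blocks)
print(sorted(effective["paragraph"]))
```

Each block is touched once, so the cost grows with the number of blocks rather than with the number of blocks times tree depth, which is what makes a full-cluster rebuild tractable.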
I hope this miniature deep dive gives a little bit more color on the uses of this data store—as it is emphatically _not_ to sell our users' data!
jzelinskie|1 year ago
Full disclosure: I'm a founder of authzed (W21), the company building SpiceDB, an open source project inspired by Google's internal scalable authorization system. We offer a product that streams changes to fully denormalized permissions for search engines to consume, but I'm not trying to pitch; you just don't often hear about other solutions built in this space!
atak1|1 year ago
Have you explored a pattern like https://runtrellis.com or https://unstructured.io/ for unnesting?
mritchie712|1 year ago
> Iceberg and Delta Lake, on the other hand, weren’t optimized for our update-heavy workload when we considered them in 2022
Curious about your thoughts here. Have you followed Iceberg's progress? Do you think it'd be a tougher decision in 2024 between Hudi and Iceberg?
infogulch|1 year ago
jitl|1 year ago
The blocks (pages are a block) in Notion are a big tree, with your workspace at the root. Some attributes of blocks affect the search index of their recursive children, like permissions: granting access to a page grants access to its recursive child blocks.
When you change permissions, we kick off an online recursive reindex job for that page and its recursive subpages. While the job is running, the index has stale entries with outdated permissions.
When you search, we query the index for pages matching your query that you have access to. Because the index permissions can be stale, we also reload the result set from Postgres and apply our normal online server-side permission checks to filter out pages you lost access to but that have stale permissions in the index.
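A rough sketch of that two-step search in Python, for illustration only. The in-memory `index` dict, the `search` function, and the `can_read` callback are all hypothetical stand-ins (the real index is a search cluster and the real check runs against Postgres); the sketch only shows the shape of "filter in the index, then recheck against the source of truth".

```python
def search(index, query, user_id, can_read):
    """Return page ids matching `query` that `user_id` can currently read.

    `index` maps page_id -> {"text": str, "allowed": set of user ids} and
    may hold stale permissions. `can_read(user_id, page_id)` stands in for
    the authoritative online permission check.
    """
    # Step 1: query the (possibly stale) index, filtering on the
    # denormalized permissions stored alongside each document.
    candidates = [
        page_id
        for page_id, doc in index.items()
        if query.lower() in doc["text"].lower() and user_id in doc["allowed"]
    ]
    # Step 2: recheck every candidate against the source of truth, so a
    # page whose access was revoked is dropped even if the index lags.
    return [page_id for page_id in candidates if can_read(user_id, page_id)]


# The index still says bob can read p1, but the source of truth disagrees.
index = {
    "p1": {"text": "Roadmap 2024", "allowed": {"alice", "bob"}},
    "p2": {"text": "Roadmap draft", "allowed": {"alice"}},
}
current_access = {"p1": {"alice"}, "p2": {"alice"}}


def can_read(user_id, page_id):
    return user_id in current_access[page_id]


print(search(index, "roadmap", "bob", can_read))
print(search(index, "roadmap", "alice", can_read))
```

Note that the recheck can only remove results. A page you were just granted access to will not show up until the reindex job has updated its index entry.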