
Shipping Multi-Tenant SaaS Using Postgres Row-Level Security

254 points | capiki | 3 years ago | thenile.dev

117 comments


mkurz|3 years ago

Be aware when using RLS with views: by default, the RLS policy is executed with the permissions of the owner of the view instead of with the permissions of the user executing the current query. This way it can easily happen that the RLS policy is bypassed, because the owner of the view is an admin account or the same account that owns the underlying table (see the gotchas section of the original post).

However, the upcoming PostgreSQL 15 adds support for security invoker views: https://github.com/postgres/postgres/commit/7faa5fc84bf46ea6... That means you can then set the security_invoker attribute when creating a view, which "... causes the underlying base relations to be checked against the privileges of the user of the view rather than the view owner" (see https://www.postgresql.org/docs/15/sql-createview.html). PG15 beta 1 release notes: https://www.postgresql.org/about/news/postgresql-15-beta-1-r...
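A minimal sketch of the PG15 syntax (table and view names hypothetical):

```sql
-- PG15+: make the view check privileges (and RLS policies) as the
-- querying user rather than as the view owner
CREATE VIEW tenant_orders
  WITH (security_invoker = true)
  AS SELECT * FROM orders;
```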

alberth|3 years ago

That seems like a bug to me, and a significant one as well.

The underlying promise of RLS (sometimes even referred to as "virtual private database") in an RDBMS is that data should never leak, because it's handled transparently by the db.

This seems like a significant leakage point that the user has to personally manage.

bearjaws|3 years ago

This is such a killer feature in PG, my new job uses it and it makes audits of our tenancy model dead simple.

Coming from a SaaS company that used MySQL, we would get asked by some customers how we guarantee we segmented their data, and it always ended at the app layer. One customer (a Fortune 10 company) asked if we could switch to SQL Server to get this feature...

Our largest customers ask how we do database multi-tenancy and we point to our SDLC + PG docs and they go 'K'.

e1g|3 years ago

Every B2B client who asked us how we handle multi-tenancy also asked how we ensure their data is erased at the end of the contract. Using a shared database with RLS means you have to go through all DB backups, delete individual rows for that tenant, then re-generate the backup. That’s a non-starter, so we opted for having one DB per tenant which also makes sharding, scaling, balancing, and handling data-residency challenges easier.

eastbound|3 years ago

I honestly don’t understand how Oracle is still alive. Postgres has so many of these killer features.

Also, I wonder how others do tenant separation, what other solutions there are.

gz5|3 years ago

And PG supports layer 3 shut-down of link listeners and inbound firewall ports. So you can combine the L7 tenancy with a secure networking architecture that eliminates the problems of managing firewalls and ACLs. One of the open source examples: https://youtu.be/s-skpw7bUfI

ksec|3 years ago

Fortune Top 10

1 Walmart

2 Amazon

3 Apple

4 CVS Health

5 UnitedHealth Group

6 Exxon Mobil

7 Berkshire Hathaway

8 Alphabet

9 McKesson

10 AmerisourceBergen

We can rule out 2,3,7,8 …

simonw|3 years ago

I don't fully understand the performance implications here.

Say I was using this for a blog engine, and I wanted to run this SQL query:

    select * from entries;
But I actually only want to get back entries that my current user is allowed to view - where author_id = 57 for example.

Would PostgreSQL automatically turn the above query into the equivalent of this:

    select * from entries where author_id = 57;
And hence run quickly (assuming there's an index on that author_id column)?

Or would it need to run an additional SQL query check for every single row returned by my query to check row permissions, adding up to a lot of extra overhead?

ossopite|3 years ago

Yes, Postgres will add such a condition to the query, and in simple cases like this it is able to use a corresponding index.

Unfortunately this can break down in more complex cases. Roughly, Postgres trusts a limited set of functions and operators not to leak information about rows (e.g. via error messages) that the RLS policy says a query should not be able to see. That set includes basic comparisons but not more esoteric operations like JSON lookups. At some point Postgres will insist on checking the RLS policy result for a row before doing any further work, which can preclude the use of indexes.
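For the simple case in the question above, a policy like the following lets the planner push the tenant check into an index scan (table, column, and setting names hypothetical; plain integer equality is one of the "leakproof" operators Postgres trusts):

```sql
ALTER TABLE entries ENABLE ROW LEVEL SECURITY;

-- policy comparing an indexed column with a session setting
CREATE POLICY author_isolation ON entries
  USING (author_id = current_setting('app.current_author')::int);

CREATE INDEX ON entries (author_id);

-- EXPLAIN SELECT * FROM entries;
-- with the policy in place, this can plan as an index scan on author_id
```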

lmeyerov|3 years ago

We were looking at RLS, various ABAC integrated frameworks (casbin, ..), and zanzibar clones late last year --

* RLS is super appealing. Long-term, the architecture just makes so much more sense than bringing in additional maintenance/security/perf/etc. burdens. So over time, I expect it to hollow out how much the others need to do, reducing them to developer experience & tools (policy analysis, db log auditing, ...). Short-term, I'd only use it for simple internal projects, because cross-tenant sharing is so useful in so many domains (esp. if growing a business), and for now, RLS seems full of perf/expressivity/etc. footguns. So I wouldn't use it for a SaaS unless the tenants are severely distinct, as in something like payroll, and even then, I'd have a lot of operational questions before jumping in.

* For the needed flexibility and app-layer controls, we took the middle road of casbin, though other tools are emerging too. Unlike the zanzibar-style tools that bring another DB + runtime + ..., casbin's system of record is our existing system of record. Using it is more like a regular library call than growing the dumpster fire that is most distributed systems. Database backups, maintenance, migrations, etc. are business as usual; no need to introduce more PITAs here, and especially not a vendor-in-the-middle with proprietary API protocols that we're stuck with ~forever as a dependency.

* A separate managed service might make zanzibar-style OK in some cases. One aspect is ensuring the use case won't suffer the view problem. From there, it just comes down to governance & risk. Auth0 being bought by Okta means we kind of know what it'll look like for a while, and big cloud providers have growing identity services, which may be fine for folks. Startup-of-the-month owning parts of your control plane is scarier to me: if they get hacked, go out of business, get acquired by EvilCorp, or raise $100M in VC and jack up prices, etc.

There's a lot of innovation to do here. A super-RLS postgres startup is on my list of easily growable ideas :)

On a related note: We're doing a bunch of analytics work on how to look at internal+customer auth logs -- viz, anomaly detection, and supervised behavioral AI -- so if folks are into things like looking into account take overs & privilege escalations / access abuse / fraud in their own logs, would love to chat!

jzelinskie|3 years ago

As the developer of an external authorization system (full disclosure)[0], I feel obligated to chime in on the critiques of external authorization systems in this article. I don't think they're far off base, as we do recommend RLS for use cases like what the article covers, but anyway, here's my two cents:

1+2: Cost + Unnecessary complexity: this argument can be used against anything that doesn't fit the given use case. There's no silver bullet for any choice of solution. You should only adopt the solution that makes the most sense for you and vendors should be candid about when they wouldn't recommend adopting their solution -- it'd be bad for both the users and reputation of the solution.

3: External dependencies: That depends on the toolchain. Integration testing against SpiceDB is easier than Postgres, IMO [1]. SpiceDB integration tests can run fully parallelized and can also model check your schema so that you're certain there are no flaws in your design. In practice, I haven't seen folks write tests to assert that their assumptions about RLS are maintained over time. The last place you want invariants to drift is authorization code.

4: Multi-tenancy is core to our product: I'm not sure I'm steel-manning this point, but I'll do my best. Most companies do not employ authorization experts and solutions worth their salt should support modeling multi-tenant use cases in a safe way. SpiceDB has a schema language with idioms and recommendations to implement functionality like multi-tenancy, but still leaves it in the hands of developers to construct the abstraction that matches their domain[2].

[0]: https://github.com/authzed/spicedb

[1]: https://github.com/authzed/examples/tree/main/integration-te...

[2]: https://docs.authzed.com/guides/schema

gwen-shapira|3 years ago

The blog explicitly said that if the requirements involve actual authorization models (beyond simple tenancy) then RLS is not the best fit (see: https://thenile.dev/blog/multi-tenant-rls#if-you-have-sophis...).

I think this covers both the complexity aspect and the difference between what you get from RLS and what external authz brings to the table (schema, for example).

I do think that RLS is a great way for a company without authz experts to build a multi-tenant MVP safely. I've yet to see a single pre-PMF company that worries about authorization beyond that; this is a series-B concern in my experience.

shaicoleman|3 years ago

We're currently using the schema-per-tenant approach, and it's working very well for us:

* No extra operational overhead, it's just one database

* Allows deleting a single schema, useful for GDPR compliance

* Allows easily backing up/restoring a single schema

* Easier to view and reason about the data from an admin point of view

* An issue in a single tenant doesn't affect other tenants

* Downtime for maintenance is shorter (e.g. database migration, non-concurrent REINDEX, VACUUM FULL, etc.)

* Less chance of deadlocks, locking for updates, etc.

* Allows easier testing and development by subsetting tenant data

* Smaller indexes, more efficient joins, faster table scans, more optimal query plans, etc. With row level security, every index needs to be a compound index

* Easy path to sharding per tenant if needed. Just move some schemas to a different DB

* Allows having shared data and per-tenant data in the same database. That doesn't work with the tenant-per-database approach

There are a few cons, but they are pretty minor compared to the alternative approaches:

* A bit more code to deal with tenancy, migrations, etc. We opted to write our own code rather than use an existing solution

* A bit more hassle when dealing with PostgreSQL extensions. It's best to install extensions into a separate extensions schema

* Possible caching bugs so you need to namespace the cache, and clear the query cache when switching tenant

* The security guarantees of the schema-per-tenant solution aren't perfect, so you need to ensure you have no SQL injection vulnerabilities
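The basic mechanics behind several of the points above are just schema creation plus a search_path switch per connection (a sketch; schema and table names hypothetical):

```sql
-- one schema per tenant
CREATE SCHEMA tenant_acme;
CREATE TABLE tenant_acme.orders (id bigint PRIMARY KEY, total numeric);

-- per connection/request: resolve unqualified names in the tenant's
-- schema first, with a shared extensions schema as fallback
SET search_path TO tenant_acme, extensions;

SELECT * FROM orders;  -- resolves to tenant_acme.orders

-- GDPR-style erasure of one tenant is a single statement
DROP SCHEMA tenant_acme CASCADE;
```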

bvirb|3 years ago

We ran a multi-tenant SaaS product for years w/ a schema-per-tenant approach. For the most part it all worked pretty great.

We ran into issues here and there but always found a way to work around them:

* Incremental backups were a pain because of needing to lock so many objects (# of schemas X # of tables per schema).

* The extra code to deal w/ migrations was kinda messy (as you mentioned).

* Globally unique IDs become the combination of the row ID + the tenant ID, etc...

For us though the real deal-breaker turned out to be that we wanted to have real foreign keys pointing to individual rows in tenant schemas from outside of the tenant schema and we couldn't. No way to fix that one since with multi-schema the "tenant" relies on DB metadata (the schema name).

We ended up migrating the whole app to RLS (which itself was a pretty interesting journey). We were afraid of performance issues since the multi-schema approach kinda gives you partitioning for free, but with the index usage on the RLS constraints we've had great performance (at least for our use case!).

After quite a bit of time working with both multi-schema & RLS I probably wouldn't go back to multi-schema unless I had a real compelling reason to do so due to the added complexity. I really liked the multi-schema approach, and I think most of the critiques of it I found were relatively easy to work around, but RLS has been a lot simpler for us.

uhoh-itsmaciek|3 years ago

There are some other cons:

Memory usage and I/O can be less efficient. Postgres handles table data in 8kb pages, so even if you're just reading a single row, that reads 8kb from disk and puts 8kb in the Postgres buffer cache, with that row and whatever happens to be next to it in the physical layout of the underlying table. Postgres does this because of locality of reference: it's cheaper to bulk-load data from disk, and, statistically speaking, you may need the adjacent data soon. If each user is touching separate tables, you're loading a page per row for each user, and you're missing out on some of the locality benefits.

Another problem is monitoring (disclosure: I work for pganalyze, which offers a Postgres monitoring service). The pg_stat_statements extension can track execution stats of all normalized queries in your database, and that's a very useful tool to find and address performance problems. But whereas queries like "SELECT * FROM posts WHERE user_id = 123" and "SELECT * FROM posts WHERE user_id = 345" normalize to the same thing, schema-qualified queries like "SELECT * FROM user_123.posts" and "SELECT * FROM user_345.posts" normalize to different things, so you cannot easily consider their performance in aggregate (not to mention bloating pg_stat_statements by tracking so many distinct query stats). This is the case even when you're using search_path so that your schema is not explicitly in your query text.

Also, performance of tools like pg_dump is not great with a ton of database objects (tables and schemas) and, e.g., you can run into max_locks_per_transaction [1] limits, and changing that requires a server restart.

I wouldn't say you should never do schema-based multi-tenancy (you point out some good advantages above), but I'd be extremely skeptical of using it in situations where you expect to have a lot of users.

[1]: https://www.postgresql.org/docs/current/runtime-config-locks...

andy_ppp|3 years ago

I find adding loads of stuff to Postgres exciting and fun, but I want all of my logic in the code in GitHub, rather than floating around in my global data store. Has anyone thought about a data layer that allows you to define this stuff programmatically rather than in SQL, but then configures your data layer to work like this? Not necessarily an ORM, but more a business logic layer that compiles everything down to use features like this. Or maybe even a data layer that is a set of programmatic building blocks that works as described?

macNchz|3 years ago

I’ve set something like that up a handful of times in a kind of ad-hoc manner, by subclassing/extending the autogeneration tools from existing db migration frameworks to just detect changes in a directory of .sql files. Has worked pretty well to keep stored procedures/triggers/materialized views up to date with the repo.

mixmastamyk|3 years ago

You're describing an ORM, or perhaps SQLAlchemy, which has a lower-level interface. .sql files work fine in version control as well. The "create or replace …" pattern can make them idempotent.

simlevesque|3 years ago

Supabase kinda does this.

uhoh-itsmaciek|3 years ago

>Another issue we caught during testing was that some requests were being authorized with a previous request’s user id.

This is the terrifying part about RLS to me: having to rely on managing the user id as part of the database connection session seems like an easy way to shoot yourself in the foot (especially when combined with connection pooling). Adding WHERE clauses everywhere isn't great, but at least it's explicit.

That said, I've never used RLS, and I am pretty curious: it does seem like a great solution other than that one gotcha.

koolba|3 years ago

If you do this right, it ends up in just one place and the code that handles checking in / out connections from your pool will handle wrapping your connection usage with the appropriate session context. And of course subsequently clearing it with a DISCARD ALL.
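In SQL terms, the checkout/check-in wrapper boils down to something like this (the setting name is hypothetical):

```sql
-- on checkout: bind the pooled connection to the current app user,
-- so the RLS policies see the right identity
SET app.current_app_user = 'usr_123';

-- ... application queries run here, filtered by the RLS policies ...

-- on check-in: wipe all session state (settings, temp tables, plans)
-- so the next request starts from a clean connection
DISCARD ALL;
```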

sgarman|3 years ago

Am I right in my understanding that EVERY request that comes into their API creates a new connection to the database? What about reusing connections with connection pools, or one level up using pgbouncer or similar? Can you actually use RLS while reusing connections?

lemax|3 years ago

It's possible to implement this without creating new connections to the database for each request by using SET LOCAL and wrapping every query in a transaction. Instead of applying RLS based on the current user, you apply RLS based on the parameter value you set at the beginning of the transaction. You can set this parameter value based on the user session in your application.

Your RLS policy looks as follows:

    CREATE POLICY tenant_${tableName}_isolation_policy ON "${tableName}"
      USING ("tenant_id" = current_setting('app.current_tenant'));

Your queries look something like this:

    BEGIN;
    SET LOCAL app.current_tenant = '${tenant}';
    SELECT * FROM some_table;
    COMMIT;

You can even initialize your writes with a `tenant_id` column defaulted to your `current_setting('app.current_tenant')`
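A sketch of that default-column trick (table and setting names hypothetical):

```sql
-- tenant_id is stamped automatically from the transaction's setting
CREATE TABLE some_table (
  id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  tenant_id text NOT NULL DEFAULT current_setting('app.current_tenant'),
  payload   jsonb
);

-- inside a transaction that has run SET LOCAL app.current_tenant = '...',
-- this insert picks up the right tenant without naming it:
INSERT INTO some_table (payload) VALUES ('{}');
```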

rst|3 years ago

Nope. Quoting the article itself:

"In the traditional use case of direct db access, RLS works by defining policies on tables that filter rows based on the current db user. For a SaaS application, however, defining a new db user for each app user is clunky. For an application use case you can dynamically set and retrieve users using Postgres’ current_settings() function ( i.e:

  SET app.current_app_user = 'usr_123'
and

  SELECT current_settings('app.current_app_user')
)."

The policies that they define reference these settings, so they can do a "set" at the start of processing every web request, on a pre-existing db connection.

cstejerean|3 years ago

You can reuse the connection with a connection pool and use SET ROLE when you check it out.
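A sketch of that pattern, assuming one database role per tenant already exists (role name hypothetical):

```sql
-- on checkout from the pool: drop to the tenant's role, so RLS
-- policies and ordinary grants apply to everything that follows
SET ROLE tenant_42;

-- ... application queries ...

-- on check-in: restore the connection's original role
RESET ROLE;
```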

ishanr|3 years ago

I use RLS quite heavily for my app Sudopad (https://sudopad.com) and it has been working quite well so far.

One gotcha specific to Supabase (where I run the backend): because there is no anonymous login in Supabase, turning on RLS and using database functions marked as security definer is the way to go. Otherwise there is no easy way of stopping a 'select * from x', since some rows might not have a user_id if they are anonymous, and I still want people to access a row if they know its specific primary key uuid.

fswd|3 years ago

I use this for a startup in a re-write of their solution. It simplifies my queries and mutations, and my security concerns. It also dramatically reduces the complexity of my code. There are also ROLES (Guest/Public user, Authenticated, Admin), and combining the roles with Row Level Security.

I like it so much I don't want to go back!

ei8ths|3 years ago

I needed this two years ago. I was looking at this but couldn't figure out how to do it with an existing db connection pool to reuse connections. I might be migrating to this soon so that things will be more isolated between the tenants.

a-dub|3 years ago

this is cool. next up, row level encryption with private information retrieval methods for enabling queries and searches homomorphically (on data encrypted by the client that the service provider never has a key for).

POPOSYS|3 years ago

This is interesting - do you have any hints / docs on how to implement that?

xsreality|3 years ago

Curious how advanced authorization (like ABAC) can be implemented with RLS. For example returning resources that are accessible to the team I belong to within the tenant.

spacemanmatt|3 years ago

If I were leaning into RLS today I would do it through PostgREST

pikdum|3 years ago

PostGraphile is also a really neat tool, using GraphQL instead of REST.

andrewstuart|3 years ago

I once implemented RLS/Postgres for Django.

It worked pretty well.

The basic mechanism was to intercept all outbound SQL queries and wrap them in Postgres session settings that set up the RLS context.

santa_boy|3 years ago

This is awesome. Are there any similar features that can be implemented with Mariadb? One of my favorite products is integrated with Mariadb.

kache_|3 years ago

Context aware data access is really cool. And hard :)

jtwebman|3 years ago

What about not using ORM abstractions, as an option? Then it is much harder to forget needed filters.

nbevans|3 years ago

Using RLS to implement multi-tenancy is a terrible idea. Just deploy a database per tenant. It's not hard. Why overcomplicate it?

bicijay|3 years ago

Deploying a database per tenant is not that easy. You have a lot of new overhead, migrations become a pain in the ass (already are) and a lot of other little problems...

I would say a database per tenant is overcomplicating it.

pg_bot|3 years ago

Deploying multiple databases is typically costly in the infrastructure as a service space. Plus you have more operational overhead in ensuring backups work and keeping things secure. It's much easier to use Postgres' schemas to segment the data within one single database. Frankly schemas are much easier to reason about, maintain, scale, and keep compliant than row level security.

icedchai|3 years ago

I worked for a SaaS that did that. We had thousands of clients. Migrations would take all afternoon. We had custom connection pools, custom migration runners, and other weird stuff (this was about 10 years ago). It was way too complicated, especially since most of the tenants had very little data.

jasongi|3 years ago

Only really worth it if each tenant is creating enough value to justify this. Hard to see how a SaaS product with a cheap/free tier would pull this off.

paxys|3 years ago

Is having to write "SELECT [...] WHERE user_id=<123>" really considered a security hole? Isn't that how like every service in existence operates? Coming up with complicated auth systems and patterns just because you are scared you will accidentally skip that WHERE clause seems bizarre to me.

nordsieck|3 years ago

> Is having to write "SELECT [...] WHERE user_id=<123>" really considered a security hole? Isn't that how like every service in existence operates? Coming up with complicated auth systems and patterns just because you are scared you will accidentally skip that WHERE clause seems bizarre to me.

Is having to avoid use after free really considered a security hole? Isn't that how like every program in existence operates? Coming up with complicated languages and frameworks just because you're scared you will accidentally use a variable after it's been freed seems bizarre to me.

As it turns out, humans are bad at being consistent, whereas computers are much better. Maybe this particular solution isn't "the right thing", but it's at least an attempt at modifying the environment so that mistakes no longer happen. And at a meta level, that is precisely the right thing to do.

bvirb|3 years ago

It's pretty nice using RLS that the entire query will follow the rules applied in the database. So for complex queries with say joins and/or subqueries they will all automatically follow the RLS policies as well. In our case we also have some global lookup tables that don't have RLS policies which can also be joined.

We've found it pretty nice to cut out a whole class of possible bugs by being able to defer it to the database level. At the application level we end up with a wrapper that sets (and guarantees unsetting) multi-tenant access to the correct tenant, and then we never have to add "tenant_id = ..." anywhere, regardless of the query. Regardless of whether we forget in some query (which we almost surely would), it cuts out quite a bit of extra code.

You can also do some cool stuff like add RLS policies for read-only multi-tenant access. Then you can query data across multiple tenants while enforcing that nothing accidentally gets written.
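One way to sketch that read-only cross-tenant idea (role, table, and setting names hypothetical):

```sql
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

-- normal app role: read and write only its own tenant's rows
CREATE POLICY tenant_isolation ON orders
  TO app_role
  USING (tenant_id = current_setting('app.current_tenant'));

-- reporting role: may SELECT across all tenants; since no write
-- policy applies to it, inserts/updates/deletes are rejected by RLS
CREATE POLICY reporting_read_all ON orders
  FOR SELECT TO reporting_role
  USING (true);
```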

mbreese|3 years ago

From my perspective, it isn’t the security aspects that are limiting, but the usability.

If you want to have any access controls that isn’t a simple user_id==123, SQL WHERE clauses can get complicated.

Users, groups, or any kind of fine grained access control can make simple queries non-trivial. It’s even worse if a user can be authorized to view data across different accounts.

lemax|3 years ago

In my experience we've looked toward this kind of solution in a large legacy single-tenant application that wants to go multi tenant with more safety guarantees.

thrownaway561|3 years ago

I think this is mainly an issue when you're using raw SQL statements. If you're using an ORM, there are many ways to add a WHERE clause to the statements automatically, without having to update your code everywhere.

galaxyLogic|3 years ago

When you say 'user_id' do you mean each end-user of the system or each customer?

I assume you have a few customers and then very many users belonging to each customer.

w-j-w|3 years ago

[deleted]