Designing a SaaS Database for Scale with Postgres

241 points | mixmastamyk | 9 years ago | citusdata.com

54 comments

[+] e1g|9 years ago|reply
We serve enterprise customers, and use the 'separate database' strategy for these reasons -

1. Stronger data isolation. Technically schemas provide the same level of isolation, but they are much harder to explain or defend during compliance audits.

2. More secure backups. Client data can be encrypted with their own key and stored as per their needs.

3. More useful backups. The frequency and retention of backups can vary to meet the SLA requirements (and costs). And if something goes wrong, we can recover that particular customer's data in isolation, without impacting the rest of the application/customers or worrying about merging it back in.

4. Secure data deletion. Many European customers demand that all their data is securely removed upon termination. This creates a massive problem with purging backup information if everything is in the same dump.

5. Independent load & performance. If one customer is particularly high-load or chatty, we can move them onto a separate server where they don't impact well behaved folks.

6. Easier horizontal scalability. Just move heavy loads to their own servers + read replicas.

7. Direct access to data. Specifically, we can use BI tools that do not understand schemas and even give direct access to clients' analysts.

8. Independent migration paths. Sometimes the customer's timetable does not allow them to upgrade to the newest version right now (e.g. they are in the middle of using the app for a key process this month). We can leave their account routed to the previous version in both the codebase and the data store.

Out of those, the key 3 reasons are: stronger data isolation, better backup strategy, and a more predictable compliance story. But that's enterprise: even if we're wildly successful, we'll have "thousands" of customers - never millions. And we can manage thousands of databases, so this architecture path is preferable to us within those boundaries.
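The database-per-tenant strategy above hinges on one routing decision: resolving a tenant to its own database. A minimal sketch (tenant names and DSN formats are purely illustrative, not from the comment):

```python
# Hypothetical sketch of database-per-tenant routing: each tenant maps to
# its own connection DSN, so isolation, backups, and migrations can differ
# per customer. Tenant slugs and hostnames here are made up.

TENANT_DATABASES = {
    "acme": "postgresql://db-7.internal/acme_prod",
    "globex": "postgresql://db-2.internal/globex_prod",
}

def dsn_for_tenant(tenant_slug: str) -> str:
    """Resolve a tenant to its dedicated database DSN."""
    try:
        return TENANT_DATABASES[tenant_slug]
    except KeyError:
        raise LookupError(f"unknown tenant: {tenant_slug}")
```

Moving a heavy customer to their own server (reason 5) then becomes a one-line change in this mapping, with no data model changes.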

[+] noxee|9 years ago|reply
How do you handle lookups that would require cross-tenant queries? E.g. each customer has multiple devices connected to our system and we want to know their health (is it online, is it responding, etc.).

One way I've thought of so far is having an aggregate table in a "central" database that would be used to collect these kinds of statistics. It wouldn't be truly real-time, only near real-time depending on the frequency of updates. The downside is the overhead of maintaining a separate data source.

The other option was to just have the software query each tenant one at a time and take the performance/time hit. That's not really the best experience and probably violates the idea of data isolation.
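The second option (fan out and aggregate) can be sketched as below; the dicts stand in for separate tenant databases, and in practice each lookup would be a query against that tenant's own connection:

```python
# Sketch of the "query each tenant in turn" option: fan out a health check
# across per-tenant stores and aggregate the results centrally. The dicts
# are stand-ins for real per-tenant database connections.

def collect_device_health(tenant_stores):
    """Aggregate device status across all tenants into one report."""
    report = {}
    for tenant, devices in tenant_stores.items():
        online = sum(1 for d in devices if d["online"])
        report[tenant] = {"total": len(devices), "online": online}
    return report
```

Writing the result into the "central" aggregate table from the first option would give the near-real-time view at the cost of the extra data source.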

[+] andreygrehov|9 years ago|reply
> 8. Independent migration paths.

In this case, how is the business logic (codewise) handled across different states of databases?

[+] mbesto|9 years ago|reply
> But that's enterprise: even if we're wildly successful, we'll have "thousands" of customers - never millions. And we can manage thousands of databases, so this architecture path is preferable to us within those boundaries.

Exactly. Unless your ACV is $100k+, this strategy isn't really sustainable and might materially impact your gross margins.

EDIT: Further, read the more technical limitations of this approach: https://news.ycombinator.com/item?id=12652021

[+] leemac|9 years ago|reply
I second this strategy. It does cause a few more headaches, but overall clients prefer it to a single DB. We're in enterprise finance, so this comes up a lot and the answer is typically well received.

Almost all the points above are the precise reason we went database-per-tenant. We also chose this method because our system was originally self-hosted (single DB, non-multi-tenant) for 4 years or so. Adding tenant columns to each table to separate the data was not ideal, and we didn't want to run the risk of 'forgetting a where clause' at any point.

For those wondering, all of our databases have the same schema. We use Visual Studio Database Projects to handle updates to keep it all in sync.
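Keeping thousands of identical schemas in sync boils down to applying the same migration to every tenant database. A generic sketch of that loop (the poster uses Visual Studio Database Projects; sqlite3 here is only a stand-in so the loop is runnable):

```python
import sqlite3

# Generic sketch of keeping identical schemas in sync across tenant
# databases: apply the same migration script to every database in turn.
# The migration DDL and table name are illustrative.

MIGRATION = "ALTER TABLE accounts ADD COLUMN plan TEXT DEFAULT 'basic'"

def migrate_all(connections):
    """Run one migration against every tenant database connection."""
    for conn in connections:
        conn.execute(MIGRATION)
        conn.commit()
```

Real tooling would also track which migrations each database has already received, so a tenant pinned to an older version (reason 8 above) can simply be skipped.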

[+] Roboprog|9 years ago|reply
Having spent almost 5 years at a financial company, I get you. To others out there, make sure you charge clients accordingly if you have to do this.

"Captain Obvious" out!

[+] koolba|9 years ago|reply
Separate clusters (not just databases) also give you the ability to create read-only replicas/hot standbys on a per-customer/application basis.
[+] icebraining|9 years ago|reply
Same here (European company, hosting a business SaaS platform for companies of all sizes).

I frankly prefer this approach, since it completely bypasses the problems that come with sharding and distributed deployments: every customer is completely isolated. I just wish there was more tooling available that supported our architecture.

[+] pedrocarvalho|9 years ago|reply
Same here. Also working with enterprise customers, sometimes with competitors (or at least tangentially so). For business reasons alone, there is absolutely no way they would accept "sharing" a database. In everyday ops, it's reason 8 - independent upgrades at different points in time.
[+] taude|9 years ago|reply
Ditto here. On top of that, our large enterprise customers have their own data isolation/security requirements (each with their own unique requirements).
[+] memracom|9 years ago|reply
Salesforce has successfully deployed multitenancy SaaS at a huge scale, and they have all tenants sharing the same tables. Every row in every table has an Organization-ID column and all queries include the Org Id to keep the data separated. They built their SaaS on top of Oracle RDBMS but are now using PostgreSQL in some not yet publicized way.

Google for Salesforce multitenant architecture and you will find several presentations/videos where they explain much of this architecture. IMHO this is required reading for anyone designing any sort of multitenant SaaS because you are less likely to make disastrous mistakes when you can compare your solution to a known successful solution, and explain where your solution differs and why.
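The shared-tables model described above (an Org-ID column on every row, included in every query) can be sketched with a small helper that constrains every tenant query to one organization, guarding against the "forgotten WHERE clause" risk mentioned elsewhere in the thread. sqlite3 stands in for the real database so the sketch is runnable; table and column names are illustrative:

```python
import sqlite3

# Sketch of shared-table multitenancy: every row carries an org_id, and
# a helper wraps each query so the org filter can never be forgotten.

def query_for_org(conn, org_id, sql, params=()):
    """Run a query, always scoped to a single organization's rows."""
    scoped = f"SELECT * FROM ({sql}) WHERE org_id = ?"
    return conn.execute(scoped, (*params, org_id)).fetchall()

# Demo data: two organizations sharing one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (org_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO contacts VALUES (?, ?)",
                 [(1, "Ada"), (1, "Grace"), (2, "Linus")])
rows = query_for_org(conn, 1, "SELECT * FROM contacts")
```

Centralizing the filter in one code path, rather than trusting every query site, is the essence of making this model defensible.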

[+] ngrilly|9 years ago|reply
> They built their SaaS on top of Oracle RDBMS but are now using PostgreSQL in some not yet publicized way.

How do you know that?

[+] davidroetzel|9 years ago|reply
The article mentions the approach of having one schema per tenant but sadly does not mention it again afterwards.

Are there serious issues with this approach I am not aware of?

[+] craigkerstiens|9 years ago|reply
Craig from Citus here. There are a number of issues with one schema per tenant, the biggest one being that at larger scale Postgres will mostly just not work anymore. Things like pg_dump start to fall over, and while this has been improved some, there is still an upper limit somewhere between 1,000 and 10,000 tenants. Further, having to then run schema migrations against all of them can be quite painful. There are tools that help on the schema migration front, but what we've seen is that, again, at scale things start to break. If you'll only ever have 100 customers, then schema-per-tenant can work for you.
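The migration pain is easy to see: the same DDL must run once for every tenant schema, typically by switching `search_path` first. A sketch that generates the statements as strings (no live Postgres assumed; the DDL is illustrative):

```python
# Sketch of the per-schema migration loop: one SET search_path plus the
# migration DDL, repeated for every tenant schema. With thousands of
# schemas this multiplies both runtime and failure modes.

def migration_statements(schemas, ddl):
    """Yield the SQL needed to apply one migration to every tenant schema."""
    for schema in schemas:
        yield f'SET search_path TO "{schema}";'
        yield ddl
```

At 10,000 tenants, a migration that fails halfway leaves schemas in two different versions, which is part of why this breaks down at scale.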
[+] matheusd|9 years ago|reply
One issue we had: pg_dump performance was abysmal for dumping a single schema out of thousands.

We had to write a custom backup script to handle individual backups, taking advantage of the knowledge of our own database architecture (basically: we don't have to read the schema list and figure out relationships between them because we already know that).

If you read pg_dump's source code, you'll see that it uses the Postgres COPY command for the actual backup, so it was easy enough to write our own custom exporter.
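In the spirit of that custom exporter: since the schema and its tables are already known, the exporter can skip pg_dump's catalog introspection and just emit one COPY per table. A sketch that generates the statements (table names are illustrative; with psycopg2 each statement could be fed to `cursor.copy_expert()`):

```python
# Sketch of a per-schema exporter: emit one COPY statement per known table,
# avoiding pg_dump's need to scan the catalog across thousands of schemas.

def copy_statements(schema, tables):
    """Build COPY ... TO STDOUT statements for every table in one schema."""
    return [
        f'COPY "{schema}"."{table}" TO STDOUT WITH (FORMAT csv, HEADER)'
        for table in tables
    ]
```

This is exactly the "we already know our own architecture" shortcut the comment describes: the table list is hard-won application knowledge, not discovered at backup time.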

[+] aidos|9 years ago|reply
Also interested in this. I architected my system with a single multi-tenant db. My thinking was that it simplified things initially (less admin / backups / pooling etc) but it would be far easier to split into individual dbs later (than to combine).

We are a b2b product and we've picked up customers around the world. Now it seems like it might make sense to shard geographically. Is that common?

[+] rahkiin|9 years ago|reply
I also have a database with a schema per tenant. In the tenant code I set the schema. I like it and would like to see more information as well. One issue, I think, is that it does not offer full isolation or easy sharding. As said by another commenter, I plan to distribute my schemas across multiple servers when necessary, similar to the sharding approach.
[+] inopinatus|9 years ago|reply
My personal, recent experience is that my current startup's business model would be poorly served as a result.

You see, I originally thought I was building a SaaS service, but it turned out that my customers needed a two-sided network. A per-tenant schema would've been a painful impediment to making that paradigm shift.

[+] prabhatjha|9 years ago|reply
Multi-tenancy at scale with an RDBMS is just hard to do. Once you get to a certain volume, you have to have different data stores for different needs. Only a subset of businesses require ACID properties, so you would use an RDBMS for those and for the rest choose a NoSQL stack that offers the least resistance to ramp up WRT your team's skill set.

You can get away without using NoSQL, but you'd probably end up spending time and money maintaining several DB clusters, migrations, handling replication lag, etc.

[+] dragonwriter|9 years ago|reply
> Only a subset of businesses require ACID properties, so you would use an RDBMS for those and for the rest choose a NoSQL stack

While ACID is an important and nice set of properties of RDBMSs, it's not the only reason to choose them; they're a lot more adaptable to changing query patterns and needs than less-structured document-oriented datastores. There are sometimes performance and other reasons to choose a non-relational datastore for a particular load, but "don't need ACID guarantees" is, by itself, an insufficient reason for abandoning the RDBMS model for NoSQL (not needing ACID guarantees is, of course, a necessary condition for selecting a non-ACID store, but not a sufficient one).

[+] gkop|9 years ago|reply
What are best practices for gracefully allowing end users to use a single login across multiple of your SaaS tenants when the tenants are in separate databases?
[+] icebraining|9 years ago|reply
I'd say OAuth, like StackExchange does. Then you can run your own authentication and/or accept others like Google. In my experience (we run a db-per-tenant service, with OAuth for internal support users), it works fine.
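One way to back such a setup, under the assumption (not stated in the comment) that identity lives in a small central store while tenant data stays in separate databases, is a mapping from each user to their tenant memberships. A minimal sketch with made-up data:

```python
# Hypothetical central identity store for db-per-tenant deployments:
# the OAuth provider authenticates the user once, then this mapping says
# which tenant databases that user may be routed to. All names are made up.

USER_TENANTS = {
    "alice@example.com": ["acme", "globex"],
    "bob@example.com": ["acme"],
}

def tenants_for_user(email):
    """Return the tenants a logged-in user belongs to (empty if unknown)."""
    return USER_TENANTS.get(email, [])
```

After login, the app presents the tenant list (or auto-selects the only one) and opens a connection to that tenant's database; the identity store never holds tenant data itself.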
[+] lifeisstillgood|9 years ago|reply
tl;dr: denormalise a bit so that primary keys represent the full ancestor path to the "usual" primary key, making sharding and colocation of customers much simpler.

And use Postgres:-)

Seems a nice article, but I kind of assume that to be basic knowledge among DBAs?
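The tl;dr above can be made concrete: child tables repeat the full ancestor path (here `customer_id`) in their keys, so all of a customer's rows hash to the same shard and can be joined locally. A sketch using sqlite3 as a stand-in, with the ad-campaign table names the thread discusses below:

```python
import sqlite3

# Sketch of the denormalised, hierarchical key layout: adgroups carries
# customer_id even though its "natural" parent is campaigns, so every
# table can be sharded by customer_id and joins stay shard-local.

SCHEMA = """
CREATE TABLE campaigns (
    customer_id INTEGER,
    campaign_id INTEGER,
    name        TEXT,
    PRIMARY KEY (customer_id, campaign_id)
);
CREATE TABLE adgroups (
    customer_id INTEGER,
    campaign_id INTEGER,
    adgroup_id  INTEGER,
    PRIMARY KEY (customer_id, campaign_id, adgroup_id),
    FOREIGN KEY (customer_id, campaign_id)
        REFERENCES campaigns (customer_id, campaign_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Because both keys start with `customer_id`, a shard-by-customer layout keeps a campaign and all its adgroups on the same node.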

[+] lucaspiller|9 years ago|reply
How does this compare to having a separate Postgres schema per tenant? You'd need to make your app a bit more complicated (to set the right schema before every query), but you won't need to denormalise.
[+] billmalarky|9 years ago|reply
Is the primary key in the hierarchical database a composite key of the full "traditional pk" lineage?

Or do you keep the actual PK on the adgroup table as the traditional key (i.e. "adgroupid"), and just also throw customerid on adgroup (even though you will never join adgroup directly to customer) so you can shard each table by customerid?

What is the advantage of using a composite PK instead of the traditional PK, so long as all ancestral ids are also stored on the table (so sharding is colocated correctly)?

[+] codnee|9 years ago|reply
It might be among DBAs, but not so much among others who might be interested. You have got to see the "clever" tricks some people come up with to solve some basic db problems.
[+] mamcx|9 years ago|reply
It's not clear to me how they do it. If I understood correctly, the idea is not to use a column to differentiate tenants (because of costly joins) but to use a "hierarchical database model". But how does that work? What is the structure of the tables?
[+] harel|9 years ago|reply
I use uuid as primary key when I plan for scale and future sharding. It's a negligible performance hit, if any, and I can move entire datasets from one db to another without worrying about references.
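The portability benefit comes from the ids being globally unique rather than per-database sequences: two databases can mint rows independently and the datasets can later be merged without key collisions or foreign-key rewrites. A minimal sketch:

```python
import uuid

# Sketch of UUID primary keys: each row gets a globally unique id at
# creation time, so moving a dataset between databases never collides
# with existing rows or invalidates references.

def new_row(**fields):
    """Mint a row with a globally unique id, safe to copy between databases."""
    return {"id": str(uuid.uuid4()), **fields}
```

The trade-off versus sequential ids is index locality: random UUIDs scatter inserts across the index, which is the "negligible" hit the comment weighs against the portability win.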