(no title)
mbyio
|
5 years ago
This is a very difficult thing to do. Very impressive. I have so many questions but my number one is: were you able to evaluate alternatives to your existing vertical scaling based setup? For example, cockroachdb, multi-master postgres, using sharding instead of a single DB, etc. At that database size, you are well past the point in which a more advanced DB technology would theoretically help you scale and simplify your architecture, so I'm curious why you didn't go that route.
paulryanrogers|5 years ago
Once solved though horizontal is nice, if more involved to maintain.
namibj|5 years ago
You'll have to accept that transactions can fail due to conflicts, so if they are interactive, you'll have to retry manually.
Edit: I'd like hear criticism, instead of just seeing disapproval.
tommyzli|5 years ago
We thought about migrating to Citus, but I don't have a good idea of how to shard our dataset efficiently.
If we were to shard by user id, then creating a match between two people would require cross-shard transactions and joins. Sharding by geography is also tough because people move around pretty frequently.
skinkestek|5 years ago
My best guesses are
- either it is SAAS in which case shard it should make sense to shard by customer
- or it is something-to-consumer (social networking?) on which case I guess you'll have to take a step back and see if you can sacrifice one of your current assumptions
... but I feel I'm missing something since what I am saying feels a bit trivial.
mslot|5 years ago
That does seem like a fun exercise :).
The general rules of the game are: You can only scale up throughput of queries/transactions that only access 1 shard (some percentage going to 2 shards can be ok). You can only scale down response time of large operations that span across shard since they are parallelized. You should only join distributed tables on their distribution column. You can join reference tables on any column.
The thing that comes to mind is to use a reference table for any user data that is used to find/score matches. Reference tables are replicated to every node and can be joined with distributed tables and each other using arbitrary join clauses, so joining by score or distance is not a problem, but you need to store the data multiple times.
One of the immediate benefits of reference tables is that reads can be load-balanced across the nodes, either by using a setting (citus.task_assignment_policy = 'round-robin') or using a distributed table as a routing/parallelization scheme.
The advantage of having the distributed users table in the join is mainly that you divide the work in a way that keeps each worker node's cache relatively hot for a specific subset of users, though you'll still be scanning most of the data to find matches.Where it gets a bit more interesting is if your dating site is opinionated / does not let you search, since you can then generate matches upfront in batches in parallel.
For interests/matches, it might make sense to have some redundancy in order to achieve reads that go to 1 shard as much possible. For data related to a specific match, you can perhaps use the smallest user ID as the distribution column to avoid the redundancy. This exercise goes on for a while. You still get the benefits of PostgreSQL and ability to scale up throughput of common operations or scale down response time of batch operations, but it does require careful data model choices.(Citus engineer who enjoys distributed systems puzzles)