(no title)
zjaffee | 1 month ago
I can however say that when I had a job at a major cloud provider optimizing spark core for our customers, one of the key areas where we saw rapid improvement was simply through fewer machines with vertically scaled hardware almost always outperformed any sort of distributed system (abet not always from a price performance perspective).
The real value often comes from the ability to do retries, and leverage left over underutilized hardware (i.e. spot instances, or in your own data center at times when scale is lower), handle hardware failures, ect, all with the ability for the full above suite of tools to work.
dapperdrake|1 month ago
sgarland|1 month ago
A simple equijoin with high cardinality and indexed columns will usually be extremely fast. The same join in a 1:M might be fast, or it might result in a massive fanout. In the case of the latter, if your RDBMS uses a clustering index, and if you’ve designed your schemata to exploit this fact (e.g. a table called UserPurchase that has a PK of (user_id, purchase_id)) can still be quite fast.
Aggregations often imply large amounts of data being retrieved, though this is not necessarily true.
zjaffee|1 month ago