top | item 21064194

(no title)

imchairmanm | 6 years ago

Hello, author here. That's a good question and something I had a hard time sorting out as I worked on this.

I think those fall into a different category confusingly sometimes called column-oriented databases. They're primarily used for analytic-focused tasks and get their name from storing data by column instead of by row (all data in a single column is stored to disk together).

I didn't include those as a separate category here because they're basically relational databases with a different underlying storage strategy to allow for easier column-based aggregation and so forth.

My colleague shared this article [1] with me, which definitely helped inform how I distinguished between the two in my head.

[1] http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...

discuss

order

minitoar|6 years ago

That makes sense, they really are just relational databases optimized for certain tasks, with corresponding limitations e.g. they don't support arbitrary joins.

barrkel|6 years ago

There's nothing intrinsic about not supporting joins, in a columnar store; it's just that you lose a huge amount of the linear scanning performance if you have to do joins for each value. Most columnar stores I've used (primarily Impala, SparkSQL and Clickhouse) all support joins, but they materialize one side of the join as an in-memory hash table, which limits the allowable size of the join, and is a cost multiplier for a distributed query. I believe per the docs that MemSQL can mix and match row-based with columnar more easily, but joins are always going to be really slow compared to the speed you can get from scanning the minimum number of columns to answer your question.