Whenever I look at large aggregation benchmarks like this, I try to estimate cycles/value or, better, cycles/byte.
Take this query:
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
This is just counting occurrences of distinct values from a bag of 1.1B total values.
He's got 8 cores @ 2.7GHz, which presumably can clock up for short bursts at least a bit even when they're all running flat out. Let's say 3B cycles/core/second. So in 0.134 seconds (the best measured time) he's burning ~3.2B cycles to aggregate 1.1B values, or about 3 cycles/value.
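As a quick sanity check on that arithmetic (the 3B cycles/core/second figure is the assumption above, not a measured clock):

```python
# Back-of-envelope cycles/value estimate for the GROUP BY query above.
cores = 8
cycles_per_second = 3e9      # assumed sustained all-core boost
elapsed_seconds = 0.134      # best measured query time
values = 1.1e9               # rows aggregated

total_cycles = cores * cycles_per_second * elapsed_seconds
cycles_per_value = total_cycles / values
print(round(cycles_per_value, 2))  # roughly 2.9 cycles/value
```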
While that's ridiculously efficient for a traditional row-oriented database, for a columnar scheme as I'm sure OmniSciDB is using, it's less efficient than I might have expected.
Presumably the # of distinct cab types is relatively small, and you could dictionary-encode all possible values in a byte at worst. I'd expect opportunities both for computationally friendly compact encoding ("yellow" is presumably the dominant value, which could make RLE quite profitable) and for SIMD data-parallel approaches that should let you roll through 4, 8, or 16 values in a cycle or two.
Even adding LZ4 should only cost you about a cycle a byte.
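A rough sketch of what dictionary-encoded, data-parallel counting looks like (toy data; numpy's bincount stands in for hand-written SIMD, and the cab-type dictionary here is made up):

```python
import numpy as np

# Dictionary-encode cab types: each value becomes a small integer code,
# so the whole column fits in one byte per row.
dictionary = {"yellow": 0, "green": 1, "fhv": 2}
codes = np.array([0, 0, 1, 0, 2, 0, 1, 0], dtype=np.uint8)  # toy column

# GROUP BY cab_type / COUNT(*) reduces to a histogram over the codes,
# which vectorizes well: the inner loop is just tallying bytes.
counts = np.bincount(codes, minlength=len(dictionary))
result = {name: int(counts[code]) for name, code in dictionary.items()}
print(result)  # {'yellow': 5, 'green': 2, 'fhv': 1}
```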
That's not to denigrate OmniSciDB: They're already several orders of magnitude better than traditional database solutions, and plumbing all the way down from high-level SQL to bit twiddling SIMD is no small feat. More that there's substantial headroom to make systems like this even faster, at least until you hit the memory bandwidth wall.
I think this is a good point. On GPUs, SIMT is effectively automatic vectorization, so our focus has been on the memory bandwidth wall (we make use of CUDA shared memory in NVIDIA GPU mode for aggregates like the above query). Non-random-access compression on GPUs also has been a nonstarter, at least historically. With more recent GPUs and more recent versions of CUDA, perhaps this is changing. But on CPUs, we have started looking into vectorization. There is a tradeoff, though -- the vectorization LLVM passes do add time to the compilation phase, and at subsecond query speeds that time isn't always worth it.
There are also a few other tricks to get closer to roofline performance. If you sort the input data on the key you're grouping by you can see small performance improvements, mostly from better cache locality. But, part of the "magic" of OmniSciDB is that you can group on any key and get good performance without ingesting, reindexing, etc.
Speaking as someone who has built a columnstore database and tested it on that query shape (group-by over a column with only a few distinct values): it's possible to go faster than 1 cycle per row via SIMD and by operating directly on compressed data (0.87 cycles per row). This blog post gives some details on how it's done (https://www.memsql.com/blog/how-to-process-trillion-rows-per...)
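One way to get under a cycle per row is to aggregate the compressed representation itself rather than decompressing it first. A minimal run-length-encoding sketch of that idea (illustrative only, with made-up runs -- not the actual implementation described in the linked post):

```python
# Run-length-encoded column: (code, run_length) pairs. With a dominant
# value like "yellow", one pair can stand for millions of rows, so
# COUNT(*) per group touches far fewer elements than there are rows.
runs = [(0, 700), (1, 120), (0, 150), (2, 30)]  # toy compressed column

counts = {}
for code, length in runs:
    # One addition per run covers an entire stretch of identical rows.
    counts[code] = counts.get(code, 0) + length

print(counts)  # {0: 850, 1: 120, 2: 30}
```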
But for the inferences you've made (e.g. the # of distinct cab types is relatively small) you need knowledge of the whole data. What if someone uses the wrong column name? Getting the correct summary of data quickly is easy -- if you already know the answer.
Something like `count(*)` needs to work well where you have no idea at all about the data.
What would it take to get it to 1 cycle/value? Does 1 cycle here mean 1 instruction? If not, how many instructions does it allow? It's just a matter of copying memory around (in user space, I'm guessing), right?
For those wanting to try it for themselves, we recently released a preview of our full stack for Mac (containing both OmniSciDB as well as our Immerse frontend for interactive visual analytics), available for free here: https://www.omnisci.com/mac-preview. This is a bit of an experiment for us, so we'd love your feedback! Note that the Mac preview doesn't yet have the scalable rendering capabilities our platform is known for, but stay tuned.
You can also install the open source version of OmniSciDB, either via tar/deb/rpm/Docker for Linux (https://www.omnisci.com/platform/downloads/open-source) or by following the build instructions for Mac in our git repo: https://github.com/omnisci/omniscidb (hopefully will have standalone builds for Mac up soon). You can also run a Dockerized version on your Mac, but as a disclaimer the performance, particularly around storage access, lags a bare metal install.
I have zero experience authoring brew packages, but as a consumer, a `brew install omniscidb` would be really nice. Especially with something like a database that apps you build might depend on, it's convenient to have a tool manage the installed versions. It's also very nice to have a common interface for managing these things. I use brew for mysql, postgres, redis, qt, rbenv, node, and more. Fitting into that ecosystem makes it easier for me to bring this in as a dependency, since it's just another brew dev dependency.
Agreed. I'm also a bit frustrated that he never tuned Vertica in his test of it: he just loaded the data into the default superprojection and queried it like that. Nearly all of Vertica's power comes from its concept of projections, much as normal DBs benefit from indexes. I'd really like to see how it performs against these other systems once it's been tuned properly, because in my experience it's always been the fastest DB I've ever worked with.
You may find the recent RAPIDS open source community GPU stack benchmark on TPCx-BB relevant, where they beat the rest of the tools on something like 2x cheaper hardware while running 44x faster. It's an interesting industry benchmark because it requires handling diverse use cases and out-of-core optimizations (e.g., 1TB going through a 16GB GPU, and a combination of matrices, tables, etc.), so BlazingSQL or cuML alone wouldn't cover it, but together they did.
(To make this link work, press the "100 mln." button when you arrive. There seems to be an encoding issue with this link that causes it to not work correctly when copied to some sites.)
I'm almost entirely sure Litwintschik is mistaken about the GPUs in his laptop.
Yes, he does have the Intel GPU he mentioned, but if he paid $200 to upgrade the GPU as he claims, he would also have a dedicated AMD Radeon Pro 5500M 8GB.
Actually, I don't think it's possible to configure a 16" MBP without a discrete GPU - they all come with AMDs. Only the 13" MBPs can be configured without one, but he says he's using a 16".
He may have been going by "About this Mac", which will show the Intel GPU if you're not plugged into an external monitor, or using a program that activates the dedicated GPU.
I once contacted the author to check out my open source package and benchmark it and he mentioned that he actually charges for the benchmarking exercise. So yeah.
>The GPU won't be used by OmniSciDB in this benchmark but for the record it's an Intel UHD Graphics 630 with 1,536 MB of GPU RAM. This GPU was a $200 upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it won't have a material impact on this benchmark.
He lost me here... I get that it doesn't matter, but come on: if you don't know that your computer has a GPU other than the integrated graphics (which you admit you paid more to upgrade), then what are you really doing?
Another comment suggested that he got this information from the "about my mac" popup, which only shows the dedicated GPU when connected to an external display or when using an application that uses the dedicated GPU.
If he ran his benchmarks and then checked his hardware while writing this article then the author might've gotten confused by that.
Mark did a benchmark of SQLite using its internal file format a few years ago (https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-pa...), clocking the import at 5.5 hours. It looks like this was done on a spinning disk, though, so given a proper SSD and a newer version of SQLite, it might be much faster.
Caveat: these benchmarks only test the simplest of operations like aggregation (GROUP BY, COUNT, AVG) and sorts (ORDER BY). No JOINs or window operations are performed. Even basic filtering (WHERE) doesn't seem to have been tested. YMMV.
The data does not fit in RAM, so I guess in the end it's about file formats and minimizing disk access; that's why some of the competitors' benchmarks are terrible, no?
twoodfin | 5 years ago
alexbaden | 5 years ago
AdamProut | 5 years ago
hugh-avherald | 5 years ago
MuffinFlavored | 5 years ago
tmostak | 5 years ago
rileytg | 5 years ago
ra | 5 years ago
Is this on your roadmap?
hodgesrm | 5 years ago
The ClickHouse team was (obviously!) very interested in Mark's result and tried out OmniSciDB on the standard analytics benchmark that CH uses to check performance. Results are here: https://presentations.clickhouse.tech/original_website/bench...
Anyway, really intriguing results from Mark. Looking forward to learning more about the source of the differences.
Disclaimer: I work at Altinity, which supports ClickHouse.
Edit: Fixed bad link
chicagobuss | 5 years ago
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Gl...
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Ad...
lmeyerov | 5 years ago
hodgesrm | 5 years ago
skavi | 5 years ago
moogleii | 5 years ago
bluedino | 5 years ago
marklit | 5 years ago
xiaodai | 5 years ago
wenc | 5 years ago
Consultants have a rule: "Never say No. Always say: this is how much it will cost".
It's a "no" but with a threshold.
evanriley | 5 years ago
aj7 | 5 years ago
angryyellowman | 5 years ago
jeroenhd | 5 years ago
danso | 5 years ago
I’m really curious how a simple import (no indexes or data typing) into SQLite would compare. But I don’t have 700GB of free SSD space to spare.
tmostak | 5 years ago
momonga | 5 years ago
[1] https://tech.marksblogg.com/benchmarks.html
[2] https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-pa...
wenc | 5 years ago
https://tech.marksblogg.com/benchmarks.html
x87678r | 5 years ago
rexreed | 5 years ago
pvtmert | 5 years ago
zX41ZdbW | 5 years ago
coolgeek | 5 years ago
Is this what a keyword-stuffed URL looks like? This is terrible! It does nothing at all to communicate semantic meaning about the post's content.
popotamonga | 5 years ago
antb123 | 5 years ago
innocenat | 5 years ago