Whenever I look at large aggregation benchmarks like this, I try to estimate cycles/value or, better, cycles/byte.
Take this query:
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
This is just counting occurrences of distinct values from a bag of 1.1B total values.
He's got 8 cores @ 2.7GHz, which presumably can clock up for short bursts at least a bit even when they're all running flat out. Let's say 3B cycles/core/second. So in 0.134 seconds (the best measured time) he's burning ~3.2B cycles to aggregate 1.1B values, or about 3 cycles/value.
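As a quick sanity check on that arithmetic (the 3B cycles/core/second figure is the assumption above, not a measured clock):

```python
# Back-of-envelope cycles/value estimate for the GROUP BY query above.
cores = 8
cycles_per_second = 3e9      # assumed sustained all-core boost
elapsed_seconds = 0.134      # best measured query time
values = 1.1e9               # rows aggregated

total_cycles = cores * cycles_per_second * elapsed_seconds
cycles_per_value = total_cycles / values
print(round(cycles_per_value, 2))  # roughly 2.9 cycles/value
```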
While that's ridiculously efficient for a traditional row-oriented database, for a columnar scheme as I'm sure OmniSciDB is using, it's less efficient than I might have expected.
Presumably the # of distinct cab types is relatively small, and you could dictionary-encode all possible values in a byte at worst. I'd expect opportunities both for computationally friendly compact encoding ("yellow" is presumably the dominant value, which could make RLE quite profitable) and for SIMD data-parallel approaches that should let you roll through 4, 8, or 16 values in a cycle or two.
Even adding LZ4 should only cost you about a cycle a byte.
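A rough sketch of what dictionary-encoded, data-parallel counting looks like (toy data; numpy's bincount stands in for hand-written SIMD, and the cab-type dictionary here is made up):

```python
import numpy as np

# Dictionary-encode cab types: each value becomes a small integer code,
# so the whole column fits in one byte per row.
dictionary = {"yellow": 0, "green": 1, "fhv": 2}
codes = np.array([0, 0, 1, 0, 2, 0, 1, 0], dtype=np.uint8)  # toy column

# GROUP BY cab_type / COUNT(*) reduces to a histogram over the codes,
# which vectorizes well: the inner loop is just tallying bytes.
counts = np.bincount(codes, minlength=len(dictionary))
result = {name: int(counts[code]) for name, code in dictionary.items()}
print(result)  # {'yellow': 5, 'green': 2, 'fhv': 1}
```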
That's not to denigrate OmniSciDB: They're already several orders of magnitude better than traditional database solutions, and plumbing all the way down from high-level SQL to bit twiddling SIMD is no small feat. More that there's substantial headroom to make systems like this even faster, at least until you hit the memory bandwidth wall.
I think this is a good point. On GPUs, SIMT is effectively automatic vectorization, so our focus has been on the memory bandwidth wall (we make use of CUDA shared memory in NVIDIA GPU mode for aggregates like the above query). Non-random-access compression on GPUs also has been a nonstarter, at least historically. With more recent GPUs and more recent versions of CUDA, perhaps this is changing. But on CPUs, we have started looking into vectorization. There is a tradeoff, though -- the vectorization LLVM passes do add time to the compilation phase, and at subsecond query speeds that time isn't always worth it.
There are also a few other tricks to get closer to roofline performance. If you sort the input data on the key you're grouping by you can see small performance improvements, mostly from better cache locality. But, part of the "magic" of OmniSciDB is that you can group on any key and get good performance without ingesting, reindexing, etc.
Speaking as someone who has built a columnstore database and tested it on that query shape (group-by over a column with only a few distinct values): it's possible to go faster than 1 cycle per row via SIMD and by operating directly on compressed data (0.87 cycles per row). This blog post gives some details on how it's done (https://www.memsql.com/blog/how-to-process-trillion-rows-per...)
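One way to get under a cycle per row is to aggregate the compressed representation itself rather than decompressing it first. A minimal run-length-encoding sketch of that idea (illustrative only, with made-up runs -- not the actual implementation described in the linked post):

```python
# Run-length-encoded column: (code, run_length) pairs. With a dominant
# value like "yellow", one pair can stand for millions of rows, so
# COUNT(*) per group touches far fewer elements than there are rows.
runs = [(0, 700), (1, 120), (0, 150), (2, 30)]  # toy compressed column

counts = {}
for code, length in runs:
    # One addition per run covers an entire stretch of identical rows.
    counts[code] = counts.get(code, 0) + length

print(counts)  # {0: 850, 1: 120, 2: 30}
```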
But for the inferences you've made (e.g. the # of distinct cab types is relatively small) you need knowledge of the whole data. What if someone uses the wrong column name? Getting the correct summary of data quickly is easy -- if you already know the answer.
Something like `count(*)` needs to work well where you have no idea at all about the data.
What would it take to get it to 1 cycle/value? Does 1 cycle here mean 1 instruction? If not, how many instructions does it allow? It's just a matter of copying memory around (in user space, I'm guessing), right?
For those wanting to try it for themselves, we recently released a preview of our full stack for Mac (containing both OmniSciDB as well as our Immerse frontend for interactive visual analytics), available for free here: https://www.omnisci.com/mac-preview. This is a bit of an experiment for us, so we'd love your feedback! Note that the Mac preview doesn't yet have the scalable rendering capabilities our platform is known for, but stay tuned.
You can also install the open source version of OmniSciDB, either via tar/deb/rpm/Docker for Linux (https://www.omnisci.com/platform/downloads/open-source) or by following the build instructions for Mac in our git repo: https://github.com/omnisci/omniscidb (hopefully will have standalone builds for Mac up soon). You can also run a Dockerized version on your Mac, but as a disclaimer the performance, particularly around storage access, lags a bare metal install.
I have zero experience authoring brew packages, but as a consumer, a `brew install omniscidb` would be really nice. Especially with something like a database that apps you build might depend on, it's convenient to have a tool manage the installed versions. It's also very nice to have a common interface for managing these things. I use brew for mysql, postgres, redis, qt, rbenv, node, and more. Fitting into that ecosystem makes it easier for me to bring this in as a dependency, since it's just another brew dev dependency.
Agreed. I'm also a bit frustrated that he never tuned Vertica in his test of it: he just loaded the data into the default superprojection and queried it like that. Nearly all of Vertica's power comes from its concept of projections, much as normal DBs benefit from indexes. I'd really like to see how it performs against these other systems once it's been tuned properly, because in my experience it's always been the fastest DB I've ever worked with.
You may find the recent RAPIDS open source community GPU stack benchmark on TPCx-BB relevant, where they beat the rest of the tools on something like 2x cheaper hardware while running 44x faster. It's an interesting industry benchmark because it requires handling diverse use cases and out-of-core optimizations (e.g., 1TB going through a 16GB GPU, and a combination of matrices, tables, etc.), so BlazingSQL or cuML alone wouldn't cover it, but together they did.
(To make this link work, press the "100 mln." button when you arrive. There seems to be an encoding issue with this link that causes it to not work correctly when copied to some sites.)
I'm almost entirely sure Litwintschik is mistaken about the GPUs in his laptop.
Yes, he does have the Intel GPU he mentioned, but if he paid $200 to upgrade the GPU as he claims, he would also have a dedicated AMD Radeon Pro 5500M 8GB.
Actually, I don't think it's possible to configure a 16" MBP without a discrete GPU - they all come with AMDs. Only the 13" MBPs can be configured without one, but he says he's using a 16".
He may have been going by "About this Mac", which will show the Intel GPU if you're not plugged into an external monitor, or using a program that activates the dedicated GPU.
I once contacted the author to check out my open source package and benchmark it and he mentioned that he actually charges for the benchmarking exercise. So yeah.
>The GPU won't be used by OmniSciDB in this benchmark but for the record it's an Intel UHD Graphics 630 with 1,536 MB of GPU RAM. This GPU was a $200 upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it won't have a material impact on this benchmark.
He lost me here... I get that it doesn't matter, but come on: if you don't know that your computer has a GPU other than the integrated graphics (which you admit you paid more to upgrade), then what are you really doing?
Another comment suggested that he got this information from the "about my mac" popup, which only shows the dedicated GPU when connected to an external display or when using an application that uses the dedicated GPU.
If he ran his benchmarks and then checked his hardware while writing this article then the author might've gotten confused by that.
Mark did a benchmark of SQLite using its internal file format a few years ago (https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-pa...), clocking the import at 5.5 hours. It looks like this was done on a spinning disk, though, so given a proper SSD and a newer version of SQLite, it might be much faster.
Caveat: these benchmarks only test the simplest of operations like aggregation (GROUP BY, COUNT, AVG) and sorts (ORDER BY). No JOINs or window operations are performed. Even basic filtering (WHERE) doesn't seem to have been tested. YMMV.
The data does not fit in RAM, so I guess in the end it's about file formats and minimizing disk access; that's why some of the competitors' benchmarks are terrible, no?
twoodfin | 5 years ago
alexbaden | 5 years ago
AdamProut | 5 years ago
hugh-avherald | 5 years ago
MuffinFlavored | 5 years ago
tmostak | 5 years ago
rileytg | 5 years ago
ra | 5 years ago
Is this on your roadmap?
hodgesrm | 5 years ago
The ClickHouse team was (obviously!) very interested in Mark's result and tried out OmniSciDB on the standard analytics benchmark that CH uses to check performance. Results are here: https://presentations.clickhouse.tech/original_website/bench...
Anyway, really intriguing results from Mark. Looking forward to learning more about the source of the differences.
Disclaimer: I work at Altinity, which supports ClickHouse.
Edit: Fixed bad link
chicagobuss | 5 years ago
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Gl...
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Ad...
lmeyerov | 5 years ago
hodgesrm | 5 years ago
skavi | 5 years ago
moogleii | 5 years ago
bluedino | 5 years ago
marklit | 5 years ago
xiaodai | 5 years ago
wenc | 5 years ago
Consultants have a rule: "Never say No. Always say: this is how much it will cost".
It's a "no" but with a threshold.
evanriley | 5 years ago
aj7 | 5 years ago
angryyellowman | 5 years ago
jeroenhd | 5 years ago
danso | 5 years ago
I’m really curious how a simple import (no indexes or data typing) into SQLite would compare. But I don’t have 700GB of free SSD space to spare.
tmostak | 5 years ago
momonga | 5 years ago
[1] https://tech.marksblogg.com/benchmarks.html
[2] https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-pa...
wenc | 5 years ago
https://tech.marksblogg.com/benchmarks.html
x87678r | 5 years ago
rexreed | 5 years ago
pvtmert | 5 years ago
zX41ZdbW | 5 years ago
coolgeek | 5 years ago
Is this what a keyword-stuffed URL looks like? This is terrible! It does nothing at all to communicate semantic meaning about the post's content.
popotamonga | 5 years ago
antb123 | 5 years ago
innocenat | 5 years ago