top | item 13481824

1.1B Taxi Rides on Kdb+/q and 4 Xeon Phi CPUs

316 points | hhyndman | 9 years ago | tech.marksblogg.com | reply

102 comments

[+] buckie|9 years ago|reply
In my experience, kdb+'s k and q (q being, broadly speaking, a legibility wrapper around k, which is itself, broadly speaking, APL without Unicode) are phenomenally fast for dense time-series datasets. They can struggle (relatively; still pretty fast) with sparse data, but that's not really what they were built for. They were built for high-performance trading systems, and trading data is dense.

If you like writing dense, clever regexes (which I do) then you'll love k & q. The amount you can get done with just a few characters is unparalleled.

Which leads to, IMHO, their main drawback: k/q (like clever regexes) are often write-only code. Picking up another's codebase, or even your own after some time has passed, can be very hard or even impossible because of how mindbendingly dense with logic the code is. Even if they were the best choice for a given domain, I'd try to steer clear of using them for anything other than exploratory work that doesn't need to be maintained.

[+] RodgerTheGreat|9 years ago|reply
I program in K professionally and help maintain a fairly large codebase. It's an unusual-looking language and it takes practice to learn to "skim" K code in the way that most programmers are used to for their favorite curly-bracket language, but it can be learned.

The biggest messes I have to clean up come less from "clever" code than they do from people who try to program in K as if it were some other language. For example, somebody fond of for loops might write the following to apply a function 'f' to pairings of values and their index in a list:

    result:()
    i:0
    do[#v;
      result:result,,f[v[i];i];
      i:i+1]
Ugly, complicated, but "close at hand". There is of course a much nicer and more idiomatic way to do the same thing:

    result: f'[v;!#v]
Most of the time conciseness isn't the goal, but you get it as a side effect of writing good code and working "with the grain" of the language.
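For readers coming from conventional languages, the value-with-index mapping that `f'[v;!#v]` expresses can be sketched in Python (`f` and `v` here are illustrative stand-ins, not kdb+ names):

```python
# Illustrative stand-in for f: any function of (value, index).
def f(value, index):
    return (index, value)

v = [10, 20, 30]

# Equivalent of the K expression f'[v;!#v]: apply f to each
# value of v paired with its index (!#v generates 0 1 2 ...).
result = [f(x, i) for i, x in enumerate(v)]
print(result)  # [(0, 10), (1, 20), (2, 30)]
```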
[+] jxy|9 years ago|reply

   > their main drawback: k/q (like clever regexes) are often write-only code
This depends on the reader's mentality. One line of k (or q or J or APL) does what 10 lines of a verbose language do. For verbose languages, your expectation is to spend 1 minute fully understanding 10 lines of code; for terse languages, you need to change that expectation to 1 minute for a single line. You are not going to understand anything if you still insist on spending 6 seconds per line.

On the other hand, proficiency is important. You read English articles at a slow pace in first grade; you are not going to be a speed reader without practice, even if English is the only language you speak. No one would expect you to speed-read Japanese right after you can compose simple Japanese sentences.

[+] gwu78|9 years ago|reply
"If you like writing..."

For me it is the opposite. I like not having to type any more than necessary. I like writing as little code as possible.

I also like being able to read through a program listing without having to wade through page upon page of long, camel-case function names and trying to follow five levels of indentation.

Not because I think that is the "wrong" approach for everyone but because I personally struggle with overcoming language and documentation verbosity in order to make any progress. It is the wrong approach for me. I wonder if perhaps this is how it feels for the typical programmer, in the OP's comment, who might struggle with a lack of verbosity.

Sometimes I streamedit (search and replace) blocks of code to remove indentation and shorten long function names to two letter labels, just so I can read without distraction.

What are the chances of finding someone who has the same preference for terseness and who is as skilled as Arthur Whitney?

q.k is about 240 lines in the version I am using. Most of the functions in .q, .Q and the other namespaces fit on a single line. This is a thing of beauty, IMO. For me, this makes studying the language manageable.

Someone in this thread posted a fizzbuzz solution last week that I thought illustrated how one can "work outward", incrementally adding primitives to the left or right to keep modifying output until the desired result is reached.

I use kdb+ daily but I'm a slow learner and still not very good with composing programs from scratch in k.

What little I have written was also done by "working outward", so the fizzbuzz example gave me some hope that maybe I'm not too far off track. Thank you for that example. I hope we see some more.

I am thankful k exists, even if the interpreter is not open source and there's no BSD port.

As someone else commented somewhere, perhaps the most awkward aspect of k is getting it to interface with anything else. I think this is what keeps me from using it more frequently for more daily tasks.

But this may be more of a problem with everything else, and not necessarily with k.

I wish a more competent C programmer than I would write a lightweight replacement using linenoise or libedit to substitute for "rlwrap".

[+] de_Selby|9 years ago|reply
That's more on you than on the language, though. You can write obtuse, write-only code in any language.

I will concede that there is a culture of trying to be a bit too clever on the k4 mailing list, but it's perfectly possible to write maintainable code in kdb+.

[+] branchless|9 years ago|reply
The main problem with Q is the users. Most users are math grads first, programmers second. They don't use version control. They don't comment their code.

Yes, Q is dense, but it can be annotated with comments.

I also have to maintain some java code written by this lot and it's nearly as unintelligible.

[+] mtanski|9 years ago|reply
I've built columnar OLAP databases and database engines in C++ for work. Now I'm doing it in my free time. Based on my experience, the Phi and its architecture are very exciting for OLAP database workloads.

Reasons:

- Even in an OLAP database you end up with quite a few places that have very branchy code. Research on GPU-friendly algorithms for things like (complex) JOINs and GROUP BYs is pretty new. Additionally, complex queries will involve functions and operations that you might not have a good GPU implementation for (like regex matching).

- Compression. You can use input data compressed in any way for which there is an x86_64 library, so you can use LZ4, ZHUFF, GZIP, XZ. You can have 70+ independent threads decompressing input data (it's OLAP, so it's pre-partitioned anyway). (Technically branching, again.)

- Indexing techniques that cannot be efficiently implemented on the GPU can be used again. (Again, branching.)

- If you handle your own processing scheduling well, you will end up with a near-optimal IO/memory pattern (make sure to schedule the work on the core with local memory) and you are not bound by the PCIe speed of the GPU. With enough PCIe lanes and lots of SSD drives you can process at near-memory speeds (esp. when we have XPoint memory).

So the bottom line is: if you can intelligently farm out work in correctly sized chunks (it's OLAP, so it's probably partitioned anyway), then the Phi is a fantastic processor.

I'm primarily talking about the bootable package with Omni-Path interconnect (for multiple).
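The "many independent threads decompressing pre-partitioned input" point above can be sketched minimally; this uses Python's gzip and a thread pool purely as stand-ins for whatever codec and runtime a real engine would use:

```python
# Sketch: decompress independent, pre-partitioned chunks on a
# worker pool. gzip and ThreadPoolExecutor are illustrative
# stand-ins, not any particular engine's API.
import gzip
from concurrent.futures import ThreadPoolExecutor

# Fake pre-partitioned input: 8 independently compressed chunks.
partitions = [gzip.compress(bytes([i]) * 1000) for i in range(8)]

# Each partition decompresses independently, so they parallelize
# trivially across workers (or 70+ hardware threads on a Phi).
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(gzip.decompress, partitions))

total = sum(len(c) for c in chunks)
print(total)  # 8000
```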

[+] anonu|9 years ago|reply
I've been using kdb+/q for a long time (7 years now) and can attest to its speed. Objects are placed in memory with the intention that you will run vectorized operations over them, so the memory model and the language are designed to work together.
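As a rough analogy (plain Python, not the kdb+ memory model itself): a columnar layout keeps each column in one contiguous buffer, so a whole-column aggregate is a single tight scan rather than a walk over row objects. Column names below are taken from the taxi benchmark; the values are made up:

```python
# Columnar sketch: one contiguous typed buffer per column
# (values are invented; column names echo the taxi dataset).
from array import array

passenger_count = array('i', [1, 2, 1, 4, 1])          # 32-bit ints
total_amount = array('f', [8.5, 12.0, 7.25, 30.0, 9.75])  # 32-bit floats

# A whole-column aggregate is one linear, cache/SIMD-friendly pass.
avg = sum(total_amount) / len(total_amount)
riders = sum(passenger_count)
print(avg, riders)  # 13.5 9
```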

Lots of people complain about the conciseness of the language and that it is "write-only" code. I tend to disagree. While it might take a while to understand code you didn't write (or even code you wrote a while ago), focusing on writing in q rather than the terser k can improve readability tremendously.

My only wish is that someone would write a free/open-source 64-bit interpreter for q, with performance and speed similar to the closed version. Kona (for k) gets close: https://github.com/kevinlawler/kona

[+] nnx|9 years ago|reply
I absolutely love this blog series. Can't wait to read what's next :)

First time I noticed the recap mentioned at http://tech.marksblogg.com/benchmarks.html

[+] qume|9 years ago|reply
If he made the layout a bit uglier, and made the language more esoteric and generally difficult to understand, this would make a fantastic academic paper.

But seriously, what a wonderful world it would be if all papers were this well written.

[+] chiph|9 years ago|reply
Under a second to do an avg across 1.1 billion rows spread over four machines. That's pretty amazing.
[+] jxy|9 years ago|reply
For a columnar database, that's a contiguous chunk of memory. Assuming q defaults to 32-bit ints, 1.1 billion integers across four machines means each 64-core (4 threads/core) KNL chip is averaging over 275M elements of an int array, i.e. about 1.1M 32-bit int operations per thread. Now think again about whether that's amazing or not.
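The per-thread arithmetic above, spelled out (row count and machine count are from the article; 64 cores x 4 hardware threads is the KNL configuration the comment assumes):

```python
# Back-of-envelope check of the per-thread workload.
rows = 1_100_000_000          # 1.1B taxi rides
machines = 4                  # 4 KNL nodes
threads_per_chip = 64 * 4     # 64 cores x 4 hw threads each

per_chip = rows // machines                  # ints per chip
per_thread = per_chip // threads_per_chip    # ints per hw thread
print(per_chip, per_thread)  # 275000000 1074218
```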
[+] obl|9 years ago|reply
Is it? Those operations are trivial enough to be entirely bandwidth limited. total_amount is 4 bytes, passenger_count is 1, and those are tightly packed in a column layout. Streaming through that in 150 ms is almost within the reach of a single normal chip with dual-channel DDR3 RAM.

Of course, not quite, and that's discounting the (small) sync overhead, but still: no need to shell out for 4 big servers, overpriced Phi chips and fancy wide-bus memory.
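The bandwidth claim above can be checked in one line of arithmetic (150 ms and the 4+1 byte column widths are taken from the comment):

```python
# Aggregate bandwidth needed to scan both columns in ~150 ms.
rows = 1.1e9
bytes_per_row = 4 + 1     # total_amount (4B) + passenger_count (1B)
seconds = 0.150

gb_per_s = rows * bytes_per_row / seconds / 1e9
print(round(gb_per_s, 1))  # ~36.7 GB/s aggregate, ~9.2 GB/s per node
```

At roughly 37 GB/s aggregate, a single dual-channel DDR3 system (low-to-mid 20s of GB/s peak) is indeed close but not quite there, which is the comment's point.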

[+] mmcclellan|9 years ago|reply
This was a good idea for a test. I'll definitely check out the author's other stuff. Commenting briefly on cost: while the article mentions the free 32-bit version early on, the actual benchmarks were done using the commercial version. I've had the impression the commercial version was cost-prohibitive for us poor folks. For those interested in experimenting with Xeon Phi, though, it looks like you can get started for ~$5k: http://dap.xeonphi.com/
[+] WhitneyLand|9 years ago|reply
I don't see how these results provide much useful information in terms of being able to say X is faster than Y.

The hardware doesn't seem consistent across the different benchmarks. He says it's fast for a "CPU system", but for practical purposes the Phi competes more with GPGPUs.

Would one Redis system with 512 GB of RAM be just as fast? I don't know; too many apples-to-oranges comparisons here.

[+] lorenzhs|9 years ago|reply
A single machine doesn't have the required memory bandwidth to do this in the same time.
[+] mattnewton|9 years ago|reply
This is the comparison to the Titan X / MapD article I was looking for. The GPU still looks very competitive.

Sort of meta, but Mark's job seems awesome: he gets all these toys and writes about configuring them. (The actual configuring is probably a pain, but still.)

[+] 1024core|9 years ago|reply

    % cat startmaster.q

    k).Q.p:{$[~#.Q.D;.Q.p2[x;`:.]':y;(,/(,/.Q.p2[x]'/':)':(#.z.pd;0N)#.Q.P[i](;)'y)@<,/
Looks like line noise... :D
[+] gravypod|9 years ago|reply
At what point will CPUs out-parallelize GPUs, letting us move video rendering back onto the CPU?

I see that as being something I'd very much like.

[+] dunkelheit|9 years ago|reply
Pretty wide array of technologies covered in these benchmarks. I wonder how ClickHouse will fare, should be very competitive.
[+] wyldfire|9 years ago|reply
How does Phi's MCDRAM compare to GDDR5 (wrt throughput)?
[+] loeg|9 years ago|reply
According to Wikipedia, the fastest GDDR5 can do 256 Gbit/s per chip [0]. I don't know how many chips are typically used. The MCDRAM in the article does 400 GB/s, or 3,200 Gbit/s. That would require 12.5 of those GDDR5 chips, assuming they scale linearly.

[0]: https://en.wikipedia.org/wiki/GDDR5_SDRAM#Commercial_impleme...
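The unit conversion above, made explicit (the 256 Gbit/s and 400 GB/s figures are from the comment and the article):

```python
# GDDR5-chip equivalent of the Phi's MCDRAM bandwidth.
gddr5_gbit_s = 256            # fastest GDDR5, per chip
mcdram_gb_s = 400             # MCDRAM figure from the article

mcdram_gbit_s = mcdram_gb_s * 8      # 3200 Gbit/s
chips_needed = mcdram_gbit_s / gddr5_gbit_s
print(chips_needed)  # 12.5
```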

[+] pvitz|9 years ago|reply
Does somebody here have experience with Jd and could comment on its status or performance? Thanks!
[+] eggy|9 years ago|reply
The interpreted Jd is fast, but you need (or used to need) the compiled, commercially licensed Jd for the speed test.

I love J compared with K, but that is because I found it first; the differences between J and K are minimal, but different enough to keep me using J.

[+] gbrown_|9 years ago|reply
Nice to see some KNL usage outside the traditional large HPC centers :D
[+] nextos|9 years ago|reply
Yes, I remained a bit skeptical, but it seems to be taking off after the latest iteration.
[+] stuntprogrammer|9 years ago|reply
If the combination of such languages, high-performance hardware, and large scale compute problems is interesting.. the startup I work for in Mountain View is hiring...
[+] hpcjoe|9 years ago|reply
:D Hope things are well by you!
[+] tmostak|9 years ago|reply
It looks like year is extracted from pickup_datetime at ETL time, hence it's not a fair comparison against the other databases, which do this at runtime in Q3 and Q4. In something like MapD, Q3 would be nearly as fast as Q1 (~20 ms) without the extract function, which involves relatively complicated math.
[+] thedarkknight0|9 years ago|reply
Wow, those are some incredibly impressive numbers. Great write-up. HT to Mark.