Sorting data in parallel CPU vs GPU

[+] defrost|13 years ago|reply

I do enjoy these 21st Century partial rediscoveries of the good old trusty Fibonacci Polyphase merge sort. It's like the late 70's / early 80's all over again, only with commodity rather than custom hardware.

There's some online discussion here: http://flylib.com/books/en/3.55.1.115/1/

The book references are: External Sorting, Section 5.4 in The Art of Computer Programming, Vol. 3: Sorting and Searching by Donald Knuth, Addison-Wesley, 1973.

and External Sorting, Chapter 13 in Algorithms by Robert Sedgewick, Addison-Wesley, 1983.

Sorting and merging blocks is the beginning of the journey, learning the various tricks to doing it optimally is where it gets interesting, although perhaps no longer relevant; 30 years ago the OPs approach might take nearly a day with polyphase tweaks reducing the grind down to under half an hour, these days it's all over in a fraction of a second and more or less bound by I/O throughput speed - does optimising sort times still matter?

[+] tolos|13 years ago|reply

> does optimising sort times still matter?

Yes, not all new computers are fast (calculators, cars, µC, etc).

[+] jey|13 years ago|reply

Tangential factoid: GNU libstdc++ has a parallel mode that includes parallel std::sort, http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode...

[+] jbert|13 years ago|reply

Does anyone have an idea what might cause the small bump/spike in performance is on the CPU-8 graph at approx 1e6 elements? It looks to be worth around 20-25% of the performance.

If it's related to a sizing parameter of the parallelisation, that suggests that choosing that parameter well could be important.

[+] supergirl|13 years ago|reply

It looks like a sudden degradation of performance. Might be the point when the caches become full. That CPU has 6MB L3 + 256KB L2 x 8 which means ~8MB. The bump is at 1043786 elements which require 8154KB of memory.

[+] aseidl|13 years ago|reply

I've rerun the tests on some slightly more recent hardware: a Tesla K20c and Xeon E5-2630. See the 'new benchmarks' section on: https://github.com/aseidl/Sort_data_parallel

[+] xedarius|13 years ago|reply

Interesting, yet I'd have liked a little more of a break down of the results. For example the GPU will probably need to transfer the sorted data to and from the card. This will have a dramatic effect on performance.

[+] exDM69|13 years ago|reply

> This will have a dramatic effect on performance.

Agreed, but the effect is bigger on latency than throughput. If the problem set is bigger than available GPU memory, the latency is at least partially hidden by streaming as the GPU can be kept busy while DMA is still transferring. The same thing applies if you will be doing more computations on the same data in the GPU so it won't have to be transferred back and forth.

[+] moreaccounts|13 years ago|reply

Take a look at CUDA's pinned memory.

[+] coolj|13 years ago|reply

Please post the numbers for that box with vanilla (non-parallel) std::sort, on the whole array, as well! I'm curious!

[+] onedognight|13 years ago|reply

It looks like the OP should be using std::inplace_merge instead of std::merge/std::copy.

[+] pavanky|13 years ago|reply

Using 445M to compare against an i7 is ridiculous (Even if the i7 is 3 generations old).

[+] jholman|13 years ago|reply

Yeah. To contextualize that a little...

The i7-740QM was introduced almost 3 years ago, June 2010, at release time costing $378 (if you bought in lots of 1000).

I didn't find historical pricing data for the 445M video card, but here's a relevant data point: I bought my GeForce 470 at almost exactly that time, for $420 retail (so the bulk price would be around $400, I think?). Measured in raw FLOPS, the GF470 is over three times as fast as the 445M (the video card tested).

So dollar for dollar, TFA undervalues the GPU by about a factor of three?

Sources: http://en.wikipedia.org/wiki/Comparison_of_Intel_processors http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29 http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_p...

[+] aseidl|13 years ago|reply

I'm working on running the tests on 2x Xeon E5-2630 and a Tesla K20c. Will post the results here in a couple hours.

[+] octopus|13 years ago|reply

You think the average programmer (user) has a Tesla computing device on his computer ?

[+] allanmac|13 years ago|reply

A 445M is an sm_21 Fermi device with 3 SMs (144 cores). This is a low-end GPU.

22 comments