I do enjoy these 21st Century partial rediscoveries of the good old trusty Fibonacci Polyphase merge sort. It's like the late 70's / early 80's all over again, only with commodity rather than custom hardware.
The book references are:
External Sorting, Section 5.4 in The Art of Computer Programming, Vol. 3: Sorting and Searching by Donald Knuth, Addison-Wesley, 1973.
and
External Sorting, Chapter 13 in Algorithms by Robert Sedgewick, Addison-Wesley, 1983.
Sorting and merging blocks is the beginning of the journey, learning the various tricks to doing it optimally is where it gets interesting, although perhaps no longer relevant; 30 years ago the OPs approach might take nearly a day with polyphase tweaks reducing the grind down to under half an hour, these days it's all over in a fraction of a second and more or less bound by I/O throughput speed - does optimising sort times still matter?
Does anyone have an idea what might cause the small bump/spike in performance is on the CPU-8 graph at approx 1e6 elements? It looks to be worth around 20-25% of the performance.
If it's related to a sizing parameter of the parallelisation, that suggests that choosing that parameter well could be important.
It looks like a sudden degradation of performance. Might be the point when the caches become full. That CPU has 6MB L3 + 256KB L2 x 8 which means ~8MB. The bump is at 1043786 elements which require 8154KB of memory.
Interesting, yet I'd have liked a little more of a break down of the results. For example the GPU will probably need to transfer the sorted data to and from the card. This will have a dramatic effect on performance.
> This will have a dramatic effect on performance.
Agreed, but the effect is bigger on latency than throughput. If the problem set is bigger than available GPU memory, the latency is at least partially hidden by streaming as the GPU can be kept busy while DMA is still transferring. The same thing applies if you will be doing more computations on the same data in the GPU so it won't have to be transferred back and forth.
The i7-740QM was introduced almost 3 years ago, June 2010, at release time costing $378 (if you bought in lots of 1000).
I didn't find historical pricing data for the 445M video card, but here's a relevant data point: I bought my GeForce 470 at almost exactly that time, for $420 retail (so the bulk price would be around $400, I think?). Measured in raw FLOPS, the GF470 is over three times as fast as the 445M (the video card tested).
So dollar for dollar, TFA undervalues the GPU by about a factor of three?
[+] [-] defrost|13 years ago|reply
There's some online discussion here: http://flylib.com/books/en/3.55.1.115/1/
The book references are: External Sorting, Section 5.4 in The Art of Computer Programming, Vol. 3: Sorting and Searching by Donald Knuth, Addison-Wesley, 1973.
and External Sorting, Chapter 13 in Algorithms by Robert Sedgewick, Addison-Wesley, 1983.
Sorting and merging blocks is the beginning of the journey, learning the various tricks to doing it optimally is where it gets interesting, although perhaps no longer relevant; 30 years ago the OPs approach might take nearly a day with polyphase tweaks reducing the grind down to under half an hour, these days it's all over in a fraction of a second and more or less bound by I/O throughput speed - does optimising sort times still matter?
[+] [-] tolos|13 years ago|reply
Yes, not all new computers are fast (calculators, cars, µC, etc).
[+] [-] jey|13 years ago|reply
[+] [-] jbert|13 years ago|reply
If it's related to a sizing parameter of the parallelisation, that suggests that choosing that parameter well could be important.
[+] [-] supergirl|13 years ago|reply
[+] [-] aseidl|13 years ago|reply
[+] [-] xedarius|13 years ago|reply
[+] [-] exDM69|13 years ago|reply
Agreed, but the effect is bigger on latency than throughput. If the problem set is bigger than available GPU memory, the latency is at least partially hidden by streaming as the GPU can be kept busy while DMA is still transferring. The same thing applies if you will be doing more computations on the same data in the GPU so it won't have to be transferred back and forth.
[+] [-] moreaccounts|13 years ago|reply
[+] [-] coolj|13 years ago|reply
[+] [-] onedognight|13 years ago|reply
[+] [-] pavanky|13 years ago|reply
[+] [-] jholman|13 years ago|reply
The i7-740QM was introduced almost 3 years ago, June 2010, at release time costing $378 (if you bought in lots of 1000).
I didn't find historical pricing data for the 445M video card, but here's a relevant data point: I bought my GeForce 470 at almost exactly that time, for $420 retail (so the bulk price would be around $400, I think?). Measured in raw FLOPS, the GF470 is over three times as fast as the 445M (the video card tested).
So dollar for dollar, TFA undervalues the GPU by about a factor of three?
Sources: http://en.wikipedia.org/wiki/Comparison_of_Intel_processors http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29 http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_p...
[+] [-] aseidl|13 years ago|reply
[+] [-] octopus|13 years ago|reply
[+] [-] allanmac|13 years ago|reply