top | item 8754351

tmurray | 11 years ago

disclaimer: I work in this space and have done so for a while, including previously on CUDA and on Titan.

GPUs for general-purpose computation were never 100x faster than CPUs, as people claimed around 2008. They're just not. That was basically NV marketing mixed with a lot of people publishing some pretty bad early work on GPUs.

Lots of early papers that fanned GPU hype followed the same basic form: "We have this standard algorithm; we tested it on a single CPU core with minimal optimizations and no SIMD (or maybe some terrible MATLAB code with zero optimization); we tested a heavily optimized GPU version; and look, the GPU version is faster! By the way, we didn't port any of those optimizations back to the CPU version or measure PCIe transfer time to/from the GPU." It was utterly trivial to get a paper into a conference by porting anything to the GPU and reporting a speedup. Most of the GPU-related papers from this time were awful. I remember one in particular that claimed a 1000x speedup by timing just the kernel launch (which is asynchronous and returns immediately) instead of the actual kernel runtime, and somehow nobody (neither the authors nor the reviewers) realized that this was utterly impossible.
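The weak-baseline trick above doesn't even need a GPU to demonstrate. A minimal CPU-only sketch (workload and sizes are made up for illustration): same machine, same arithmetic, and the entire "speedup" comes from comparing an unoptimized loop against a vectorized version that actually uses SIMD.

```python
import time
import numpy as np

def naive_saxpy(a, x, y):
    # The "CPU baseline" of a bad paper: pure-Python loop, no SIMD, one core.
    return [a * xi + yi for xi, yi in zip(x, y)]

def vectorized_saxpy(a, x, y):
    # A fair CPU version: vectorized NumPy, which uses SIMD under the hood.
    return a * x + y

n = 1_000_000
x = np.random.rand(n)
y = np.random.rand(n)

t0 = time.perf_counter()
naive = naive_saxpy(2.0, x, y)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
vec = vectorized_saxpy(2.0, x, y)
t_vec = time.perf_counter() - t0

# Both compute the same result; only the baseline quality differs.
print(f"'speedup' from fixing the baseline alone: {t_naive / t_vec:.0f}x")
```

Any "GPU vs. CPU" number built on the naive baseline is really measuring this gap, not the hardware.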

GPUs have more FLOPs and more memory bandwidth in exchange for requiring PCIe transfers and lots of parallel work. If your algorithm needs those more than anything else (like cache), can minimize PCIe transfer time, and handles the whole massive-parallelism thing well, then GPUs are a pretty good bet. If you can't, then they're not going to work particularly well.
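That tradeoff can be sketched as a back-of-envelope offload model. All the constants below are illustrative assumptions (roughly PCIe-2-era hardware), not measurements: the GPU only wins when the math per byte moved is high enough to hide the transfer.

```python
# Illustrative, assumed numbers -- not measurements of any real device.
PCIE_BYTES_PER_S = 6e9    # effective host<->device bandwidth
GPU_FLOPS        = 1e12   # peak GPU throughput
CPU_FLOPS        = 1e11   # peak CPU throughput (10x slower than GPU)

def offload_pays_off(bytes_moved, flops):
    """Crude model: GPU time = transfer time + compute time."""
    t_cpu = flops / CPU_FLOPS
    t_gpu = bytes_moved / PCIE_BYTES_PER_S + flops / GPU_FLOPS
    return t_gpu < t_cpu

# High arithmetic intensity: lots of math per byte moved -> offload wins.
compute_bound_wins = offload_pays_off(bytes_moved=1e8, flops=1e12)

# Low arithmetic intensity: transfer dominates -> the 10x FLOPs never matter.
transfer_bound_wins = offload_pays_off(bytes_moved=1e9, flops=1e9)
```

Plugging in the numbers: the compute-bound case spends ~1 s on the GPU vs. ~10 s on the CPU, while the transfer-bound case spends longer on the PCIe bus alone than the CPU needs for the whole job.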

(now, if you need to do 2D interpolation and can use the texture fetch hardware on the GPU to do it instead of a bunch of arbitrary math... yeah, that's a _huge_ performance increase because you get that interpolation for free from special-purpose hardware. but that's incredibly rare in practice)
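For concreteness, here is the "bunch of arbitrary math" a texture unit replaces: one bilinear fetch is four memory loads plus roughly eight flops of blending, done in software below (a generic textbook formulation, not any particular GPU's implementation).

```python
import numpy as np

def bilinear(img, x, y):
    # What GPU texture hardware does in fixed-function silicon per fetch:
    # read the 4 surrounding texels and blend them by the fractional offsets.
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    top = img[y0, x0]     * (1 - fx) + img[y0, x0 + 1]     * fx
    bot = img[y0 + 1, x0] * (1 - fx) + img[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bot * fy

img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
# Sampling the center blends all four texels equally: (0 + 1 + 2 + 3) / 4
center = bilinear(img, 0.5, 0.5)
```

When the texture unit does this for free on every fetch, an interpolation-heavy kernel sheds most of its arithmetic, which is where those rare genuinely huge speedups come from.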

fat0wl | 11 years ago

ah, yes. :) Very nice detailed summary of some of the issues in this sector of "academia" (I put that in quotes only because all the research seems to be co-written by corps).

I'm into audio DSP and am planning to port a couple of audio algorithms (lots of FFT and linear algebra) to run on the GPU, but I haven't even gotten to it because I've considered it a premature optimization up to this point. I'm sure it would improve performance, but nowhere near what GPU advocates would claim.

My biggest reason? "PCIe transfer time to/from GPU", plus it would be unoptimized GPU code. Once you read a few of these papers it becomes painfully obvious that a lot of tuning goes into any GPU algorithm that offers more than a low single-digit factor of speedup. That can still be very significant (cutting a 3-hour algorithm down to 1 would be huge), but if you're in an early stage of research it may be a toss-up whether it's better to just tune the algorithm itself / run computations overnight rather than go through the trouble of writing a GPU-based POC. Maybe if you have 1 or 2 under your belt it's not such a big deal, but for most of the researchers I know, GPU algorithm rewrites would not be trivial. (I've been doing enterprise Java coding for about 2 years now so the idea isn't so intimidating anymore, but in a past life of mucking around with Matlab scripts I'm sure it would have been daunting.)
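For the batch-audio case, the PCIe worry is worth a quick arithmetic check. A rough feasibility sketch (the bandwidth figure is an assumption, not a measurement): shipping even an hour of stereo float32 audio across the bus takes well under a second, so for long offline jobs the transfer amortizes — it's the many small real-time round trips where per-call overhead, not bandwidth, eats the gains.

```python
# Rough batched-audio transfer estimate. PCIE_BYTES_PER_S is an assumed
# effective bandwidth (PCIe-2-era ballpark), not a measured figure.
SAMPLE_RATE      = 48_000   # Hz
CHANNELS         = 2        # stereo
BYTES_PER_SAMPLE = 4        # float32
PCIE_BYTES_PER_S = 6e9

def transfer_seconds(audio_seconds):
    """Time just to move the samples to the GPU and back."""
    payload = audio_seconds * SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE
    return 2 * payload / PCIE_BYTES_PER_S  # there and back

hour_batch = transfer_seconds(3600)            # one hour of audio, one trip
rt_block   = transfer_seconds(512 / SAMPLE_RATE)  # a single real-time block
```

With these numbers the hour-long batch costs under half a second on the bus, which is negligible next to a 1-3 hour compute job; the caveat is that a real-time pipeline making thousands of tiny round trips per second is dominated by launch/transfer latency instead.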