top | item 15870214

(no title)

panosv | 8 years ago

How about if you did the same on 8 or 16 core CPU that can have much more than 16 GB of memory and is not as expensive to move data around its own memory?

discuss

bitL|8 years ago

Roughly 1000x slower? GPUs nowadays have 5000+ "cores" inside.

panosv|8 years ago

That's the point. On the GPU side they use all the 5000+ cores to parallelize the algorithm (they use the hardware to its full potential). On the CPU side they use just one core (at least there is no mention around the cores used on the CPU). It's like saying a Camry beat a Ferrari in maximum speed, but you don't mention that the Ferrari was only in the first gear for that specific race.

dragontamer|8 years ago

> Roughly 1000x slower?

Not really. A modern Coffee Lake i7 has several distinct advantages over GPUs. (AMD Ryzen also has similar advantages, but I'm gonna focus on Coffee Lake)

1. AVX2 (256-bit SIMD), for 32-bit ints / floats that's 8 operations per cycle. AVX512 exists (16 operations per cycle) but it its only on Server architectures. Also, AVX512 has... issues... with the superscaling point#2 below. So I'm assuming AVX2 / 256-bit SIMD.

2. Superscalar execution: Every Skylake i7 (and Coffee Lake by extension) has THREE AVX ports (Port0, Port1, and Port5). We're now up to 24-operations per cycle in fully optimized code... although Skylake AVX2 can only do 16 Fused-multiply-adds at a time per core.

3. Intel machines run at 4GHz or so, maybe 3GHz for some of the really high core-count models. GPUs only run at 1.6GHz or so. This effectively gives a 2x to 2.5x multiplier.

So realistically, an Intel Coffee Lake core at full speed is roughly equivalent to 32 GPU "cores". (8x from AVX2 SIMD, x2 or x3 from Superscalar, and x2 from clock speed). If we compare like-with-like, a $1000 Nvidia Titan X (Pascal) has 3584 cores. While a $1000 Intel i9-7900 Skylake has 10 CPU cores (each of which can perform as well as 32-NVidia cores in Fused MultiplyAdd FLOPs).

i9-7900 Skylake is maybe 10x slower than an Nvidia Titan X when both are pushed to their limits. At least, on paper.

And remember: CPUs can "act" like a GPU by using SIMD instructions such as AVX2. GPUs cannot act like a CPU with regards to latency-bound tasks. So the CPU / GPU split is way closer than what most people would expect.

-------------

A major advantage GPUs have is their "Shared" memory (in CUDA) or "LDS" memory (in OpenCL). CPUs have a rough equivalent in L1 Cache, but GPUs also have L1 cache to work with. Based on what I've seen, GPU "cores" can all access Shared / LDS memory every clock (if optimized perfectly: perfectly coalesced accesses across memory-channels and whatever. Not easy to do, but its possible).

But Intel Cores can only do ~2 accesses per clock to their L1 cache.

GPUs can execute atomic operations on the Shared / LDS memory extremely efficiently. So coordination and synchronization of "threads", as well as memory-movements to-and-from this shared region is significantly faster than anything the CPU can hope to accomplish.

A second major advantage is that GPUs often use GDDR5 or GDDR5x (or even HBM), which is superior main-memory. The Titan X has 480 GB/s (that's "big" B, bytes) of main memory bandwidth.

A quad-channel i9-7900 Skylake will only get ~82 GB/second when equipped with 4x DDR4-3200MHz ram.

GPUs have a memory-advantage that CPUs cannot hope to beat. And IMO, that's where their major practicality lies. The GPU architecture has a way harder memory model to program for, but its way more efficient to execute.

visarga|8 years ago

That's wrong by orders of magnitude. The actual speedup of GPU's is about 8x. Those GPU cores are much weaker than CPU cores.

The paper in discussion here reports 10x speedup for GPU vs CPU.