waltpad|5 years ago
It is true that there are tasks where threading matters but still require a CPU rather than a GPU. I wonder, however, whether these tasks really need full SSE/AVX etc. Couldn't these extensions be removed from the CPU cores, with the necessary work performed by the GPU instead?
It would be interesting to gather statistics on how much these extensions are actually used in these scenarios. Imagine how much space and complexity could be saved on a CPU die by making stripped-down versions. That space could in turn be used for more cores!
I read a little about the Xeon Phi CPUs, which iirc are multicore CPUs with a very small ISA, but I wonder why x86 makers aren't trying to go in that direction: aren't there plenty of dedicated workloads that would happily run on these (e.g., web servers), or is this just a (too) simplistic view?
dragontamer|5 years ago
SSE/AVX shares an L1 cache that's damn near instantaneous to access for the CPU core. Total L1 bandwidth is on the scale of TB/s.
PCIe -> GPU takes 1 to 10 microseconds per access, and operates at only 50GB/s (roughly 1/20th of L1 bandwidth).
------------
Case in point: memset is very commonly AVX'd to clear out L1 cache and initialize ~1 KB to 32 KB of data to 0 as quickly as possible.
There's no way for "memset" to move from CPU to GPU unless you feel like obliterating the entire point of L1, L2, and L3 cache. If you moved a "memset" to GPU, it'd operate only at 15GB/s (the speed of PCIe 3.0 x16 lanes), far, far slower than L1 cache AVX-loads/stores.
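A minimal sketch of what such a vectorized memset looks like, using SSE2 intrinsics (illustrative only: real libc memsets select wider AVX stores, alignment paths, and non-temporal variants at runtime):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Zero `n` bytes at `dst` with 128-bit SSE2 stores, the way a
   vectorized memset might: 16 bytes per store into L1 cache,
   with a scalar loop for the tail. */
static void memset_zero_sse2(void *dst, size_t n) {
    __m128i zero = _mm_setzero_si128();
    char *p = (char *)dst;
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm_storeu_si128((__m128i *)(p + i), zero); /* unaligned-safe */
    for (; i < n; i++)                              /* tail bytes */
        p[i] = 0;
}
```

All of those stores hit L1; that locality is exactly what a round-trip to the GPU would throw away.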
SIMD units, like SSE and AVX, are highly "local" and have huge advantages.
TinkersW|5 years ago
waltpad|5 years ago
The main problem is, afaik, that there is not enough control over where the code will run in these languages. At some point, one will want to describe all the algorithms in a single language, and somehow describe how the workload should be distributed across all the processors, or at least that's what I've been thinking about for a while. Once you have that level of control, the need for a versatile CPU is less clear. Note that nowadays people seem happy with hybrid solutions where the code is scattered across several languages (e.g., one for the main program and one for the shaders, or for the client-side UI), so my position is maybe not very strong.
HW-wise, is it possible that integrated GPUs are the first step toward an architecture where CPU and GPU have better interconnections (i.e., larger communication bandwidth and smaller latency), to the point where SIMD becomes moot? There is also the SWAR approach, where one doesn't rely on intrinsic SIMD instructions but instead emulates them in ordinary registers (though it's probably not very realistic for floating-point computation).
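The SWAR idea can be illustrated with the classic branch-free population count, which treats a single 64-bit register as a vector of packed lanes and sums them in parallel without any SIMD instructions:

```c
#include <stdint.h>

/* SWAR popcount: one 64-bit register acts as 32 x 2-bit, then
   16 x 4-bit, then 8 x 8-bit accumulators, summed in parallel. */
static unsigned popcount64_swar(uint64_t x) {
    x -= (x >> 1) & 0x5555555555555555ULL;             /* 2-bit sums  */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);            /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;        /* 8-bit sums  */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56); /* fold lanes */
}
```

This works nicely for small-integer lane work; as noted above, floating point has no such trick, which is one reason hardware SIMD units stay relevant.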
Some other ideas:
- Apple has this neural engine in their latest chips, which is basically dedicated HW for neural networks
- In the wild, people are getting more and more interested in building their custom ASICs to cut software's middle-man cost: for them, the CPU solution is not good enough
- Intel recently introduced a new matrix-ops extension in their CPUs: maybe at some point they'll introduce full GPU capabilities baked directly into the CPU? I am a little worried about the resulting ISA.
Anyway, I am not a HW engineer, nor a very good software one. I only have a limited view of the difficulties of writing good, CPU- or GPU-efficient code. My first post was prompted by remembering the first "large-scale" multicore CPUs 15 years ago (specifically the UltraSPARC T1), which weren't SIMD-heavy. The direction naturally shifted as progress was made on SIMD to compete with GPUs, when it seems to me that originally CPUs and GPUs were complementary.
I tend to favor modular solutions, but I don't know how costly that would be in terms of efficiency at the HW level.
waltpad|5 years ago
Xeon Phi, on the other hand, was the first host of the AVX-512 instruction set. Sorry.