top | item 5167922


tmurray | 13 years ago

I worked on CUDA at NVIDIA for over four years and was the primary API designer for a large part of that time. I started on RS at Google in September.

Basically, he gives us too little credit for the execution model (it's young, it's improving very quickly and is not at all designed to emulate anything else that exists today) and assumes that GPU compute has the same tradeoffs on mobile as desktop (it doesn't at all). You'll see more from us soon.


compilercreator | 13 years ago

Hi. Author here. Your name is quite famous in the GPGPU community, and it is great to hear that you now work on RSC. My experience does not compare to yours and I do hope my post is seen in a positive light. Would love to discuss the issues in depth sometime.

Anyway, if you were to ignore everything in the post except one item, it would be this: please fix gather/scatter in RSC. A parallel computing API without proper gather/scatter support is simply not very useful, whether on desktop or mobile.

I will keep following RSC and look forward to the developments you are hinting at.

tmurray | 13 years ago

I implemented scatter back in October, but it just barely missed Android 4.2. It's in the next release.

shard | 13 years ago

Could you clarify the GPU compute tradeoffs on desktop versus mobile? Why do they differ?

tmurray | 13 years ago

Desktop: high-end consumer GPUs have about 10-15x the single-precision FLOPs and 4-6x the bandwidth of a single Intel CPU socket, and at this point they are usually connected via PCIe Gen 3. There are two real vendors (NVIDIA and AMD), and what comprises a system is generally the same (CPU + some number of GPUs).

Mobile: the GPU has 3-5x the FLOPs of the CPU and no bandwidth advantage, because the CPU and GPU share a memory pool. GPUs span a very wide range of functionality. Even the CPUs behave very differently (the Krait in the Nexus 4 sometimes chews through code that the A15 in the Nexus 10 chokes on, and vice versa). What comprises a system varies tremendously: CPU only, CPU + GPU, CPU + GPU + other processors, etc.

A developer shouldn't be expected to have to tune for 20 different processors and system architectures in order to ship an app on the Android market. That's the problem we're trying to solve, not simply exposing access to GPU compute.