It says that much of the reason TensorFlow initially lagged in performance is that many of those performance issues only manifested under NVCC, whereas they had been using gpucc internally.
Can anyone comment on the following quote:

> The list below shows some of the more important optimizations for GPUs... A few of them have not been upstreamed due to lack of a customizable target-independent optimization pipeline.
So the LLVM version of gpucc will be incomplete? Will there be a release of the original stand-alone gpucc?
I see Eli Bendersky's name on this; his site ( http://eli.thegreenplace.net/ ) has a number of interesting C++ articles, some of which I've even carefully printed out and taped into my notebook of really useful things. If you're a C++ programmer, there are a lot of useful reads on there.
I don't see anything specifically about this in the archives, but maybe that's something to look forward to.
When I look at CUDA code, it seems to be a big loop targeting GPU memory with standard C code, allocating memory with standard functions and specifying where code lives with simple qualifiers (`__global__`, `__device__`).

When I look at OpenCL, it is... I don't know what it is. I haven't figured it out after considerable scanning, and that has cemented my decision to avoid it, because I don't have infinite time to scan obscurity.

For example, here is a standard "first OpenCL program" ( https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBoo... ): ~200 lines of boilerplate and no simple example of our many cores working together to do something brutally simple and useful, like adding two vectors. Just "hello world" from the GPU.

As far as I can tell, as the product of a multitude of vendors, all of which have different stuff, OpenCL is a monstrosity where a wide variety of functionality is supported but none of it is guaranteed to be present - hence the 200 lines of boilerplate. It's kind of like the umpteen Unix flavors back in the day: "open standards" that bridge only semi-compatible hardware have generally been doomed efforts, discarded in favor of a single best approach that all vendors are forced to adopt.
So it seems like the best thing is jettisoning the monstrosity and cloning CUDA for other hardware.
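For reference, the "brutally simple" example the comment asks for - adding two vectors - takes only a couple dozen lines in CUDA. This is a minimal sketch, not code from the article, and it assumes the CUDA toolkit (the unified-memory API keeps the host side close to plain C):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds one element of the two vectors.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Unified memory: one pointer usable from both host and device.
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The equivalent OpenCL host code has to enumerate platforms and devices, create a context and command queue, and compile the kernel source at runtime before any of this can happen - which is where the ~200 lines of boilerplate come from.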
Not a compiler guy, but a GPU programmer. This is exciting! I attended a lecture by one of the authors a while ago. Although at this point I assume gpucc is super-optimized for deep learning (by which I mean dense matrix multiplication), this is very good for the community: people can work on versions that focus on better general performance, or on different feature sets for specific applications, in the future.
Just as a point of interest, is there any limitation to supporting CUDA on AMD hardware (were this to be compiled with the AMDGPU backend), aside from the obvious lack of libraries, etc.?
AMD's new Boltzmann initiative includes an LLVM-based compiler which has been posted online ( http://gpuopen.com/compute-product/hcc-heterogeneous-compute... , https://github.com/RadeonOpenCompute/hcc ). I'm not sure what the plans are around an OpenCL frontend, but the backend should be there, so OpenCL support in LLVM for AMD GPUs could be a realistic goal.
The TensorFlow code mentions "GCUDACC" in several places, and from the surrounding comments it seems to be targeted at OpenCL as well as CUDA. So this has at least been considered.
I suspect that this compiler is generating PTX and not true native binaries for NVIDIA's architectures. NVIDIA's proprietary compiler stack is still heavily involved in converting the PTX IR to native binaries. Essentially, this isn't a fully open-source stack.
> I suspect that this compiler is generating ptx and not true native binaries for nvidia's architectures
It would take all of getting to page 2 of the article to confirm this instead of speculating...
OTOH, there is an intriguing footnote that
> We are also experimenting compiling [virtual ISA] PTX to [Nvidia's proprietary Shader ASSembler] SASS before program execution and embedding the SASS directly into the resultant binary
but the paper mentions in the conclusion that a SASS spec is not publicly available. It would be interesting for someone involved to comment more on that - experiments in reverse engineering the compiled PTX output, perhaps?
If implementing a replacement for nvcc gave these gains, I would imagine being able to control an offline version of the (normally JIT) compilation to SASS would also yield large benefits. It would likely be incredibly architecture dependent, but for the big machine learning projects that still might be worth the expense.
You are right that gpucc still depends on NVIDIA's ptxas tool, which translates PTX to native binaries. NVIDIA does not publish the specification of their native binary format. Besides that, gpucc is fully open-source.
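Concretely, the split between the open and closed parts of the toolchain looks roughly like this. A sketch, assuming a clang build with CUDA support and the CUDA toolkit on PATH; the file name and `sm_50` architecture flag are illustrative:

```shell
# Open-source half: clang/LLVM compiles the CUDA device code down to PTX.
clang++ -x cuda --cuda-device-only --cuda-gpu-arch=sm_50 -S vec_add.cu -o vec_add.ptx

# Proprietary half: NVIDIA's ptxas assembles PTX into native SASS.
ptxas -arch=sm_50 vec_add.ptx -o vec_add.cubin

# NVIDIA's cuobjdump can disassemble the result for inspection.
cuobjdump -sass vec_add.cubin
```

Everything up to the `.ptx` file is open; the PTX-to-SASS step is where the unpublished native ISA comes in.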