If I were hiring at AMD or Intel, I'd shortlist this guy for a job; they need the help competing against the head start CUDA has in the GPU compute space.
What are the common applications for these sorts of GPU-accelerated FFTs? We mostly just solved problems analytically in undergrad, and the little bit of naive coding we did seemed pretty fast. I feel like this must be used for problems I would have learned about in grad school, if I had continued in electrical engineering.
I have used VkFFT to create a GPU version of the magnetic simulation software Spirit (https://github.com/DTolm/spirit).
Besides the FFT, it also has a lot of general linear algebra routines, like efficient GPU reduce/scan, and solvers such as CG, LBFGS, VP, Runge-Kutta and Depondt. This version of Spirit is faster than CUDA-based software that has been out and updated for ~6 years, because I have full control over all the code I use. You might want to check the Reddit discussions of this project: https://www.reddit.com/r/MachineLearning/comments/ilcw2f/p_v... and https://www.reddit.com/r/programming/comments/il9sar/vulkan_...
MD simulations rely on FFT but I'm not sure how much is typically (or can be) done on the GPU. For example, NAMD employs cuFFT on the GPU in some cases. (https://aip.scitation.org/doi/10.1063/5.0014475)
The same applications as any FFT, but accelerated, with the tradeoff that the cost of moving data to and from the GPU needs to be amortized.
It's also a good proof of concept for other kinds of GPU computations.
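The amortization tradeoff can be sketched with a toy cost model (all timings below are assumptions for illustration, not measurements of any real hardware):

```python
# Hypothetical timings (assumed, not measured): the GPU transform is 10x
# faster per call, but the PCIe round trip costs a fixed overhead.
transfer_s = 2e-3   # one-time host->device + device->host copy (assumed)
gpu_fft_s = 1e-4    # per-transform time on the GPU (assumed)
cpu_fft_s = 1e-3    # per-transform time on the CPU (assumed)

def total_gpu(n):
    # Pay the transfer once, then run n transforms on the device.
    return transfer_s + n * gpu_fft_s

def total_cpu(n):
    return n * cpu_fft_s

# Break-even: transfer + n*gpu < n*cpu  =>  n > transfer / (cpu - gpu)
break_even = transfer_s / (cpu_fft_s - gpu_fft_s)
print(break_even)  # ~2.2: from the 3rd back-to-back transform on, the GPU wins
```

With these made-up numbers, a single isolated FFT is a loss for the GPU; a chain of transforms that stays on the device wins easily.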
You can run OpenCL kernels on Vulkan, at least in theory: SPIR-V supports the OpenCL memory model. CUDA might be machine-translatable if you can compile it to an LLVM target (Clang seems to have experimental support, developed outside of Nvidia), which you then retarget to SPIR-V using a cross-compiler. The LLVM-to-SPIR-V cross-compiler, however, is limited in what it can translate for the time being.
In general, Vulkan is a thing that commands the GPU, but it is not opinionated about the language used to write the kernels, as long as it compiles to SPIR-V. SPIR-V itself is like a parallel-computing counterpart to LLVM IR. If you look at the project source, the shaders are written in GLSL and pre-compiled into SPIR-V with a cross-compiler. The C file in the project root serves as the loader program for the SPIR-V files.
"VkFFT aims to provide community with an open-source alternative to Nvidia's cuFFT library, while achieving better performance."
There are no error bars on the graphs, so it's very hard to judge whether the minor differences are significant. I work in research, so I'm probably particular about this point, but I'd expect better from anyone who's taken basic statistics. From a quick look, though, the performance seems pretty much just "on par".
It would also be nice to know how the performance looks on other hardware. I'm assuming it's tuned for Nvidia GPUs (or maybe even the specific GPU mentioned). But how does this perform on Intel or AMD hardware? How does it compare to `rocFFT` or Intel's own implementation?
The FFT and iFFT are performed consecutively up to 1000 times, and each run is then repeated 5 more times. The total result is averaged for both VkFFT and cuFFT and stays roughly the same between launches. The minor performance gains (5-20%) are noticeable. If you have a better testing technique, I am open to suggestions.
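For error bars specifically, one option is to keep the per-batch averages and report mean ± standard deviation across batches. A minimal sketch of such a harness (pure Python with a stand-in workload, since the real kernels are GPU code; `fft_stub` is a hypothetical placeholder, not the library's API):

```python
import statistics
import time

def fft_stub(n):
    # Stand-in for the kernel under test; any deterministic work will do here.
    s = 0.0
    for i in range(n):
        s += i * 0.5
    return s

def benchmark(fn, arg, batches=5, iters=1000):
    """Run `iters` calls per batch; return the per-call time of each batch."""
    times = []
    for _ in range(batches):
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(arg)
        times.append((time.perf_counter() - t0) / iters)
    return times

samples = benchmark(fft_stub, 100)
mean = statistics.mean(samples)
std = statistics.stdev(samples)
print(f"{mean * 1e6:.2f} us/call +/- {std * 1e6:.2f} us")
```

Plotting that standard deviation as an error bar would settle the "is 5-20% significant?" question at a glance.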
I have tested VkFFT on an Intel UHD620 GPU and the performance scaled at the same rate as in most benchmarks. There are a couple of parameters that can be modified for different GPUs (like the amount of memory coalesced, which is 32 bits on Nvidia GPUs after Pascal and 64 bits on Intel). I have no access to an AMD machine, otherwise I would have refined the launch configuration parameters for it too. I have not yet tested against libraries other than cuFFT.
I think this guy will have no problem getting hired. Being conscientious enough to push code online works so much better than CV preparation courses. You know you're on the right path when you're asked to tone down your CV rather than play it up.
Personally, I would have a hard time hiring anyone without a GitHub account, and an even harder time working in a place where nobody has one.
So you can pad your input array with zeros, but the algorithm doesn't know it's padded and will just compute with those zeros like any other values. If it could be told they were zeros, it could exploit x*0=0 and x+0=x to significantly reduce the computation. That's what I think it is.
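A pruned DFT illustrates the idea: if the transform knows the tail of the input is zero, it can simply stop the summation early and still get the identical result. A naive pure-Python sketch (quadratic DFT, just to show the principle, not how VkFFT implements it):

```python
import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

x = [1.0, 2.0, -1.0, 0.5]              # actual data
padded = x + [0.0] * len(x)            # zero-padded to 2N

full = dft(padded)                     # 2N terms per bin, half of them x*0

# Pruned: the summation stops at N because the tail is known to be zero,
# halving the multiply-adds without changing the result.
M = len(padded)
pruned = [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / M)
              for n in range(len(x)))
          for k in range(M)]

assert all(abs(a - b) < 1e-9 for a, b in zip(full, pruned))
```

Fast FFT algorithms can exploit the same knowledge (skipping butterflies whose inputs are all zero), which is what a "zero-padding aware" option buys.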
I have some of these routines, like reduce and scan, in my other project: https://github.com/DTolm/spirit. It also has implementations of solvers like CG, VP, Runge-Kutta and some others. These routines have to be inlined into users' shaders in some way to get good performance. Releasing them as a standalone library will require some thought, because some routines consist of multiple shader dispatches.
Currently, I hit the limit on the maximum number of workgroups per dispatch (this is why the y and z axes are lower than the x axis for now). It can be removed by adding multiple dispatches to the code, which I will do in one of the next updates. To go past 2^24 I need to polish the four-step FFT algorithm to allow for >2 data transfers, which I have implemented but not yet tested. There will also be a single-precision limit in this range, as the twiddle factor values will be close to 1e-8, which is close to machine error.
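The single-precision concern is easy to demonstrate: for N = 2^29 the smallest twiddle angle is about 1.17e-8 rad, and its cosine sits closer to 1 than float32 can represent. A quick sketch (pure Python, rounding values through IEEE-754 single precision to stand in for GPU floats):

```python
import math
import struct

def to_f32(v):
    # Round a Python double to the nearest IEEE-754 single-precision value.
    return struct.unpack('f', struct.pack('f', v))[0]

N = 2 ** 29
angle = 2 * math.pi / N          # smallest twiddle angle, ~1.17e-8 rad
re, im = math.cos(angle), math.sin(angle)

# cos(angle) differs from 1.0 by ~angle^2/2 ~ 6.8e-17, far below float32's
# machine epsilon (~1.2e-7), so in single precision the real part collapses
# to exactly 1.0; the rotation survives only in the tiny imaginary part.
print(to_f32(re) == 1.0)   # True: the cosine increment is lost entirely
print(to_f32(im))          # ~1.17e-8, itself kept to only ~7 decimal digits
```

This is why very large single-precision transforms need tricks like computing twiddle factors in double, or accumulating angles incrementally per stage.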
zdw|5 years ago
slavik81|5 years ago
[1]: https://jobs.amd.com/job/Calgary-GPU-Libraries-Software-Deve... [2]: https://github.com/ROCmSoftwarePlatform/rocFFT
jjeaff|5 years ago
TomVDB|5 years ago
slavik81|5 years ago
DTolm|5 years ago
Reelin|5 years ago
Fluid flow, heat transfer, and other such physical phenomena that you might want to simulate.
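The reason those simulations lean on the FFT: with periodic boundaries, differentiation becomes multiplication by ik in Fourier space, so e.g. a Poisson solve u'' = f reduces to one forward transform, a pointwise divide, and one inverse transform. A tiny pure-Python sketch (naive O(N^2) DFT standing in for the fast transform):

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

N = 16
xs = [2 * math.pi * n / N for n in range(N)]
f = [-math.sin(x) for x in xs]           # right-hand side of u'' = f

F = dft(f)
U = [0j] * N
for k in range(1, N):
    k_eff = k if k <= N // 2 else k - N  # map bins to signed frequencies
    U[k] = F[k] / (-(k_eff ** 2))        # d^2/dx^2  ->  multiply by (i*k)^2
# k = 0 (the mean) stays zero: it is undetermined for the periodic problem.

u = [v.real for v in idft(U)]
exact = [math.sin(x) for x in xs]
print(max(abs(a - b) for a, b in zip(u, exact)))  # round-off-level error
```

In a real fluid or heat solver this divide-in-frequency-space step runs every timestep over large 3D grids, which is exactly where a fast GPU FFT pays off.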
Phase correlation in image processing is another example. (https://en.wikipedia.org/wiki/Phase_correlation)
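Phase correlation itself is only a few lines on top of a DFT: normalize the cross-power spectrum to unit magnitude, inverse-transform, and read the displacement off the peak. A 1-D pure-Python sketch (image registration does the same thing in 2D):

```python
import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

a = [0, 1, 3, 7, 2, 1, 0, 0]
shift = 3
b = [a[(n - shift) % len(a)] for n in range(len(a))]  # a, circularly shifted

A, B = dft(a), dft(b)
# Cross-power spectrum, normalized bin-by-bin to unit magnitude.
R = [(x * y.conjugate()) / max(abs(x * y.conjugate()), 1e-12)
     for x, y in zip(A, B)]
corr = [v.real for v in idft(R)]

# The peak of the inverse transform sits at the negated displacement.
est = corr.index(max(corr))
print((len(a) - est) % len(a))  # recovers the shift: 3
```

Because only the phase is kept, the peak is sharp and robust to uniform brightness changes, which is why it's popular for registering images.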
hadeson|5 years ago
[0] https://arxiv.org/abs/1312.5851
enriquto|5 years ago
gorkish|5 years ago
looping__lui|5 years ago
HelloNurse|5 years ago
p1mrx|5 years ago
Jhsto|5 years ago
The Futhark project did some initial benchmarks on translating OpenCL to Vulkan. The results were mainly slowdowns. You can read about it here: https://futhark-lang.org/student-projects/steffen-msc-projec...
pjmlp|5 years ago
https://home.otoy.com/octane2020-rndr-released/
"OTOY | GTC 2020: Real-Time Raytracing, Holographic Displays, Light Field Media and RNDR Network"
https://www.youtube.com/watch?v=Qfy6CTaSHcc
querez|5 years ago
DTolm|5 years ago
Jhsto|5 years ago
ncmncm|5 years ago
oxxoxoxooo|5 years ago
gct|5 years ago
unknown|5 years ago
[deleted]
Lichtso|5 years ago
Seems a bit more feature complete than my take on the problem: https://github.com/Lichtso/VulkanFFT
Still, a lot is missing if Vulkan is to beat CUDA: Scan, Reduce, Sort, Aggregate, Partition, Select, Binning, etc.
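For reference, the core of a GPU Scan is the classic Hillis-Steele pattern; here is a sequential Python simulation of the per-round data movement (an illustration of the algorithm, not Vulkan code):

```python
def inclusive_scan(data):
    """Hillis-Steele inclusive prefix sum, simulated round by round.

    Each round, lane i adds the value sitting `offset` lanes to its left;
    on a GPU all lanes of a round run in parallel, so a workgroup finishes
    in ceil(log2(n)) rounds instead of n-1 sequential additions.
    """
    x = list(data)
    n = len(x)
    offset = 1
    while offset < n:
        # One parallel round: read all old values, then write the new ones.
        prev = list(x)
        for i in range(offset, n):
            x[i] = prev[i] + prev[i - offset]
        offset *= 2
    return x

print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```

The double-buffering (`prev = list(x)`) mirrors the barrier a compute shader needs between rounds; getting that right across workgroups is exactly why these primitives are non-trivial to ship as a library.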
DTolm|5 years ago
meisel|5 years ago
ncmncm|5 years ago
phkahler|5 years ago
microcolonel|5 years ago
It is a library.
fluffything|5 years ago
What about bigger than big, say > 2^29 or so? And are these sizes for double precision?
DTolm|5 years ago
bobowzki|5 years ago
Mizza|5 years ago
A free GPU FFT implementation will certainly help! Great work.
mmis1000|5 years ago
adamnemecek|5 years ago
rektide|5 years ago
person_of_color|5 years ago