VkFFT – Vulkan Fast Fourier Transform Library

220 points | ah- | 5 years ago | github.com

127 comments

zdw|5 years ago

If I were a hiring person at AMD or Intel, I'd shortlist this guy for a job, as they need help competing against the head start CUDA has in the GPU-based compute space.

jjeaff|5 years ago

Ya, but the important question is can they invert a binary tree on a whiteboard?

TomVDB|5 years ago

One should hope that the non-CUDA GPU compute library ecosystem has already advanced beyond being able to calculate FFTs!

slavik81|5 years ago

What are the common applications for these sorts of GPU-accelerated FFTs? We mostly just solved problems analytically in undergrad, and the little bit of naive coding we did seemed pretty fast. I feel like this must be used for problems I would have learned about in grad school, if I had continued in electrical engineering.

DTolm|5 years ago

I have used VkFFT to create a GPU version of the magnetic simulation software Spirit (https://github.com/DTolm/spirit). Besides FFT, it also has a lot of general linear algebra routines, like efficient GPU reduce/scan, and system solvers, like CG, LBFGS, VP, Runge-Kutta and Depondt. This version of Spirit is faster than CUDA-based software that has been out and updated for ~6 years, because I have full control over all the code I use. You might want to check the discussions on reddit for this project: https://www.reddit.com/r/MachineLearning/comments/ilcw2f/p_v... and https://www.reddit.com/r/programming/comments/il9sar/vulkan_...

Reelin|5 years ago

Likely any HPC application that has an FFT somewhere in its pipeline and is otherwise amenable to being run on a GPU.

Fluid flow, heat transfer, and other such physical phenomena that you might want to simulate.

Phase correlation in image processing is another example. (https://en.wikipedia.org/wiki/Phase_correlation)

MD simulations rely on FFT but I'm not sure how much is typically (or can be) done on the GPU. For example, NAMD employs cuFFT on the GPU in some cases. (https://aip.scitation.org/doi/10.1063/5.0014475)

enriquto|5 years ago

If you could filter and focus raw radar data in real time, it would be really cool!

gorkish|5 years ago

Software defined radio / RF DSP is another area where FFT and IFFT performance and accuracy are critical.

looping__lui|5 years ago

Imaging. E.g., large convolutions.

HelloNurse|5 years ago

The same as any FFT, but accelerated; with the tradeoff that the cost of moving data from and to the GPU needs to be amortized. It's also a good proof of concept for other kinds of GPU computations.
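(For illustration, here is a toy break-even model of that amortization tradeoff. All the numbers are made up, not measurements, and the linear cost model is a deliberate simplification:)

```python
# Toy model: when does offloading an FFT to the GPU pay off?
# The constants below are purely illustrative, not measured.

def cpu_time(n, per_elem=1.0e-8):
    """Hypothetical CPU FFT cost, modeled as linear in n."""
    return per_elem * n

def gpu_time(n, per_elem=1.0e-8, speedup=20.0, transfer=1.0e-3):
    """Same work on a GPU: faster per element, but a fixed transfer cost."""
    return transfer + per_elem * n / speedup

# Break-even size: transfer / (per_elem * (1 - 1/speedup))
break_even = 1.0e-3 / (1.0e-8 * (1 - 1 / 20.0))
print(round(break_even))  # ~105263 elements for these toy numbers

# Below the break-even point the CPU wins; above it, the GPU wins.
assert cpu_time(10_000) < gpu_time(10_000)
assert cpu_time(1_000_000) > gpu_time(1_000_000)
```

The point is only that the fixed transfer cost sets a minimum problem size below which the GPU cannot win, no matter how fast its compute is.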

p1mrx|5 years ago

How does using Vulkan for computation fit into the OpenCL/CUDA landscape? Is CUDA's proprietary nature doing meaningful harm, and does Vulkan help?

Jhsto|5 years ago

You can run OpenCL kernels on Vulkan, at least in theory: SPIR-V supports the OpenCL memory model. CUDA might be machine-translatable if you can compile to an LLVM target (clang seems to have experimental support, developed outside of Nvidia), which you can then retarget to SPIR-V using a cross-compiler. The LLVM-to-SPIR-V cross-compiler, however, is limited in what it can translate for the time being.

In general, Vulkan is the thing that commands the GPU, but it is not opinionated about the language used to write the kernel, as long as it compiles to SPIR-V. SPIR-V itself is like a parallel LLVM IR. If you look at the project source, the shaders are written in GLSL and pre-compiled into SPIR-V with a cross-compiler. The C file in the project root serves as the loader program for the SPIR-V files.

The Futhark project did some initial benchmarks on translating OpenCL to Vulkan. The results were mainly slowdowns. You can read about it here: https://futhark-lang.org/student-projects/steffen-msc-projec...

querez|5 years ago

"VkFFT aims to provide community with an open-source alternative to Nvidia's cuFFT library, while achieving better performance."

There are no error bars on the graphs, so it's very hard to judge whether the minor differences are significant. I work in research, so I'm probably particular about this point, but I'd expect better from anyone who's taken basic statistics. From a quick look, it seems like the performance is pretty much just "on par".

It would also be nice to know how it performs on other hardware. I'm assuming it's tuned for Nvidia GPUs (or maybe even the specific GPU mentioned). But how does this perform on Intel or AMD hardware? How does it compare to `rocFFT` or Intel's own implementation?

DTolm|5 years ago

The FFT and iFFT are performed consecutively up to 1000 times, and each such run is repeated 5 more times. The total result is averaged, both for VkFFT and cuFFT, and stays roughly the same between launches. The minor performance gains (5-20%) are noticeable. If you have a better testing technique, I am open to suggestions.

I have tested VkFFT on an Intel UHD620 GPU and the performance scaled at the same rate as in most benchmarks. There are a couple of parameters that can be modified for different GPUs (like the amount of memory coalesced, which is 32 bits on Nvidia GPUs after Pascal and 64 bits on Intel). I have no access to an AMD machine, otherwise I would have refined the launch configuration parameters for it too. I have not tested libraries other than cuFFT yet.
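(A minimal sketch of that kind of averaged measurement, using NumPy's CPU FFT as a stand-in, since the point here is the looping and averaging methodology, not the device:)

```python
import time
import numpy as np

def bench_fft_pair(n, inner_loops=50, repeats=5):
    """Time FFT+iFFT pairs: run them back-to-back `inner_loops` times,
    repeat that whole measurement `repeats` times, then average."""
    x = (np.random.rand(n) + 1j * np.random.rand(n)).astype(np.complex64)
    per_pair = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        for _ in range(inner_loops):
            x = np.fft.ifft(np.fft.fft(x))
        per_pair.append((time.perf_counter() - t0) / inner_loops)
    # Report a spread estimate too, so small differences between two
    # libraries can be judged against run-to-run noise.
    return float(np.mean(per_pair)), float(np.std(per_pair))

mean, std = bench_fft_pair(1 << 12)
print(f"avg {mean * 1e6:.1f} us/pair, std {std * 1e6:.1f} us")
```

Reporting the standard deviation alongside the mean is one cheap way to address the "no error bars" complaint above.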

Jhsto|5 years ago

I think this guy will have no problem getting hired. Being conscientious enough to push code online works much better than CV preparation courses. You know you're on the right path when you're asked to tone down your CV rather than play it up.

Personally, I would have a hard time hiring anyone without a Github account and less so working in a place where nobody has one.

ncmncm|5 years ago

To me a Gitlab account, instead, would signify superior judgment.

oxxoxoxooo|5 years ago

What is "Native zero padding to model open systems"? And how come it is "up to 2x faster than simply padding input array with zeros"?

gct|5 years ago

So you can pad your input array with zeros, but the algorithm doesn't know that it's padded and will just compute with those zeros like any other values. If you could tell it they are zeros, it could take advantage of x*0=0 and x+0=x to significantly reduce computation. That's what I think this is.
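(That reading checks out on paper: if the padded half of the input is known to be zero, even a naive DFT can skip those terms entirely and do half the multiplies, while still matching the FFT of the explicitly padded array. A real library like VkFFT would save the work inside the butterfly stages instead, but the sketch shows the idea:)

```python
import numpy as np

def dft_with_known_zeros(x, padded_len):
    """Naive DFT of x zero-padded to padded_len, where the zeros are
    never touched: the inner sum runs only over the nonzero prefix."""
    k = np.arange(padded_len)[:, None]   # output frequency bins
    m = np.arange(len(x))[None, :]       # only the nonzero input samples
    return (np.exp(-2j * np.pi * k * m / padded_len) * x).sum(axis=1)

x = np.random.rand(8)
full = np.fft.fft(np.pad(x, (0, 8)))     # explicit zero padding to 16 points
smart = dft_with_known_zeros(x, 16)      # same result, half the products
assert np.allclose(full, smart)
```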

Lichtso|5 years ago

Very cool!

Seems a bit more feature complete than my take on the problem: https://github.com/Lichtso/VulkanFFT

Still, to beat CUDA with Vulkan a lot is still missing: Scan, Reduce, Sort, Aggregate, Partition, Select, Binning, etc.

DTolm|5 years ago

I have some of these routines, like Reduce and Scan, in my other project https://github.com/DTolm/spirit. It also has implementations of linear algebra solvers like CG, VP, Runge-Kutta and some others. These routines have to be inlined in users' shaders in some way to get good performance. Releasing them as a standalone library will require some thinking, because some routines consist of multiple shader dispatches.

meisel|5 years ago

Warning: LGPL license

ncmncm|5 years ago

... which, being a header-only library, happens to place no restrictions or requirements of any kind on the calling program.

phkahler|5 years ago

Isn't LGPL 2.1 an odd license for something like this? Does it produce a library?

microcolonel|5 years ago

> Does it produce a library?

It is a library.

fluffything|5 years ago

> Support for big FFT dimension sizes. Current limits: C2C - (2^24, 2^15, 2^15),

What about bigger than big, > 2^29 or so? Are these sizes for double precision?

DTolm|5 years ago

Currently, I hit the limit on the maximum number of workgroups for one submitted dispatch (this is why the y and z axes are lower than the x axis for now). It can be removed by adding multiple dispatches to the code, which I will do in one of the next updates. To go past 2^24 I need to polish the four-step FFT algorithm to allow for >2 data transfers, which I have implemented but not yet tested. There will also be a single-precision limit in this range, as the differences between twiddle factor values will be close to 1e-8, which is near machine error.

bobowzki|5 years ago

I wonder if this works on the raspberry pi with the new Vulkan drivers.

Mizza|5 years ago

I'm very eager to see GPU acceleration make its way into audio production, which is all still heavily CPU bound.

A Free GPU FFT implementation will certainly help! Great work.

adamnemecek|5 years ago

It's not gonna happen, audio is much less throughput intensive but a lot more latency sensitive.

rektide|5 years ago

may someday please someone help dethrone the underlord of AI & rise us up

person_of_color|5 years ago

This guy will get a foot in but still have to do a gotcha interview loop