Ask HN: What are good projects to understand CUDA and GPU Programming?

[+] ThePhysicist|7 years ago|reply

I wrote a CUDA/GPU based 2D electromagnetic simulation a while ago. The code is open-source and (hopefully) not too complicated:

https://github.com/adewes/fdtd-ml

Here's an example video of what the results look like (it shows how EM waves are reflected by a parabolic mirror):

https://www.youtube.com/watch?v=ZPSzAaxkg5c

I used mostly the PyCUDA documentation and examples as well as the official CUDA documentation (https://docs.nvidia.com/cuda/) to learn. I think what's most important is to first understand what blocks, grids and threads are and how they work (see e.g. here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....). With that knowledge you can start thinking about how you can structure your problem to solve it efficiently on the GPU. For the simulation, I have basically two 2D blocks of memory for each variable of interest (e.g. electric and magnetic fields in X,Y direction, current density, material properties) that I transfer to the GPU. There, I use the discretized differential equation for the electromagnetic field to update the field values using the values from the first buffer (and the material properties + currents). I store the result in the second buffer. I then swap the buffer references (without copying any memory) for the next step of the simulation. I perform this step e.g. n times and then transfer some of the buffers back to the main memory to e.g. plot them. That's mostly it! Of course there are many intricacies and ways to optimize code, but getting a basic program running on your GPU isn't that hard actually.
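For readers who want to see the two-buffer pattern in isolation, here is a minimal CPU-side sketch in plain Python. It is not the code from the repository above: the names `step` and `simulate` are made up for illustration, and a simple four-neighbour averaging stencil stands in for the discretized field equations. On a GPU, the body of the double loop would be the per-thread work of one kernel launch.

```python
def step(src, dst):
    """Read only from src, write only to dst (interior cells).

    Stand-in stencil: average of the four neighbours. A real FDTD
    update would use the discretized Maxwell equations here instead.
    """
    h, w = len(src), len(src[0])
    for y in range(1, h - 1):          # on a GPU: one thread per (y, x)
        for x in range(1, w - 1):
            dst[y][x] = 0.25 * (src[y - 1][x] + src[y + 1][x] +
                                src[y][x - 1] + src[y][x + 1])


def simulate(field, n_steps):
    """Run n_steps updates with two buffers that trade places each step."""
    src = [row[:] for row in field]    # "first buffer"
    dst = [row[:] for row in field]    # "second buffer" (boundary stays fixed)
    for _ in range(n_steps):
        step(src, dst)
        src, dst = dst, src            # swap references, no data is copied
    return src                         # the most recently written buffer


if __name__ == "__main__":
    grid = [[0.0] * 5 for _ in range(5)]
    grid[2][2] = 4.0
    for row in simulate(grid, 2):
        print(row)
```

Because every cell reads only from `src`, the order in which cells are processed does not matter, which is what makes the update safe to run in parallel.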

BTW, here's some really cool research work by Microsoft on GPU-based FDTD (finite difference time domain) simulations of wind instruments in two dimensions:

https://www.youtube.com/watch?v=7Kf-rlUZAaU

[+] peterbecich|7 years ago|reply

https://wiki.tiker.net/PyCuda/Examples I agree about the PyCUDA examples

[+] jamesg|7 years ago|reply

Since you mentioned image processing in particular, I’d recommend looking into Halide instead of (or as well as) CUDA. Few reasons:

1. It allows for easy experimentation with the order in which work is done (which turns out to be a major factor in performance). IMO, this is one of the trickier parts of programming (GPU or not), so tools to accelerate experimentation accelerate learning also.

2. It allows you to write your algorithm once and emit code to run on OpenGL, OpenCL, CUDA, Metal, various SIMD flavors, and a bunch more exotic targets. CUDA effectively limits you to desktop/laptop computers, and at this point I’d rather bet on needing a mobile version at some point than not.

3. It eliminates a ton of boilerplate code, so you can get started quickly.

4. It’s what the pros use. Much of Adobe’s image processing code is in Halide now, for instance (source: pretty much any presentation extolling the virtues of Halide). The Halide authors cite a particular algorithm, the Local Laplacian Filter, where an intern wrote a Halide implementation in one afternoon that beat a hand-optimized C++ implementation that had taken months to develop. I don’t know if the specifics of that have been exaggerated, but directionally I believe it. It was pretty transformational in the codepath I used it for.

I feel like developing an intuition for the “shape” of algorithms that will perform well before diving into the specifics of low-level tools like CUDA will serve you well.

http://halide-lang.org/

[+] Assossa|7 years ago|reply

Would Halide be a good option for writing a path tracing engine?

[+] hevi_jos|7 years ago|reply

You should step up the difficulty gradually. GPU programming is a bottomless topic and will always take more of your time than you plan. By starting small you get positive feedback fast. 2D image filtering is extremely complex for a beginner.

Start drawing a graph of thousands of numbers with the CPU. Easy enough, but harder than it looks.

Accelerate the graph with the GPU so it has smooth animation, moving, zooming... Easy enough but way harder than it looks.

Take a sound sample, uncompress it and visualize in your graph system. Easy?

Take the sample and filter it with CUDA. Way easier filtering a 1D sample than 2D samples.

Play the filtered sound to "feel it".

Then you can filter 2D images if you want.

I recommend the graphic interface Dear Imgui: https://github.com/ocornut/imgui https://github.com/ocornut/imgui/issues/1902

Way faster development time than any other (stateful) interface, with the disadvantage that it continuously redraws the screen, consuming energy. Well worth it for rapid prototyping.

[+] AnthonBerg|7 years ago|reply

An image blur is a good place to start! Read horizontally from many pixels in parallel, sum them up as parallel as you can, normalize, write back. Repeat for vertical blur - and here it might be best to rotate the image by a quarter of a turn so vertical is horizontal!, because memory access is usually faster that way.

[+] Const-me|7 years ago|reply

Did it a couple of times.

One trick is to write the output pixels transposed. This way both passes are identical, and they both read the image linearly. The two transposes cancel each other.

Another is to use local memory.

Finally, the best place for the kernel values is in the code itself, compiled in as immediate values. Everything else is slower.
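To make the transposed-write trick concrete, here is a small CPU reference sketch in plain Python. The function names are hypothetical, and a box blur with edge-normalized weights is assumed only to keep the example short. The same pass is applied twice: each pass reads rows linearly and writes its result transposed, so after two passes the image is back in its original orientation and blurred in both directions.

```python
def blur_rows_transposed(img, radius):
    """Box-blur every row of img and store the result transposed."""
    h, w = len(img), len(img[0])
    out = [[0.0] * h for _ in range(w)]        # note: w rows, h columns
    for y in range(h):                         # on a GPU: one thread per pixel
        for x in range(w):
            total = 0.0
            count = 0
            for dx in range(-radius, radius + 1):
                xx = x + dx
                if 0 <= xx < w:                # ignore samples outside the image
                    total += img[y][xx]
                    count += 1
            out[x][y] = total / count          # transposed write
    return out


def box_blur(img, radius):
    """2D box blur as two identical passes; the two transposes cancel."""
    return blur_rows_transposed(blur_rows_transposed(img, radius), radius)


if __name__ == "__main__":
    image = [[0.0, 0.0, 0.0],
             [0.0, 9.0, 0.0],
             [0.0, 0.0, 0.0]]
    for row in box_blur(image, 1):
        print(row)
```

On real hardware the transposed write is the part that needs care (typically staged through on-chip memory so that writes are also contiguous), which this sketch does not model.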

[+] tehsauce|7 years ago|reply

Learning to write raymarchers on shadertoy.com! It's a wonderful platform for gpu programming because you'll spend 100% of your time writing gpu code and 0% on installation, drivers, and build systems.

[+] achalpandeyy|7 years ago|reply

Do you have any resources for the same? I learnt the basics from the Book of Shaders and other tutorials but can't seem to find any advanced lessons.

[+] dahart|7 years ago|reply

Try to write Floyd-Steinberg dithering on the GPU. ;)

I've learned a lot by trying to port branchy code & recursive algorithms to the GPU, then trying to figure out why the performance is terrible and how to fix it. Specifically, I learned a bunch trying to write custom primitive intersection shaders for OptiX (https://developer.nvidia.com/optix).

I like ShaderToy as well, but since it's so easy and hides the abstractions from you, you have to really look around for the amazing tricks other people there have used. Also write code that slows down your GPU and then improve it. Just getting pretty things to render and/or writing shaders that start out at 60fps won't help you learn as fast.

[+] pjc50|7 years ago|reply

Yeah, an algorithm that makes every pixel dependent on the previous one is pretty much the worst-case for GPU computation :)
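A plain sequential reference in Python shows where the dependency comes from. This is a sketch of the textbook Floyd-Steinberg weights (7/16, 3/16, 5/16, 1/16) with an assumed threshold of 128. Each pixel is quantized only after it has received the error pushed forward from its left and upper neighbours, so pixel (y, x) cannot be finished before pixel (y, x-1).

```python
def floyd_steinberg(img):
    """Dither a grayscale image (values 0..255) to pure 0/255."""
    h, w = len(img), len(img[0])
    work = [[float(v) for v in row] for row in img]
    for y in range(h):
        for x in range(w):                     # strictly left to right
            old = work[y][x]                   # already includes earlier errors
            new = 255.0 if old >= 128.0 else 0.0
            work[y][x] = new
            err = old - new
            # Push the quantization error to pixels not yet visited.
            if x + 1 < w:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    work[y + 1][x + 1] += err * 1 / 16
    return [[int(v) for v in row] for row in work]


if __name__ == "__main__":
    gray = [[100] * 8 for _ in range(4)]
    for row in floyd_steinberg(gray):
        print(row)
```

The exercise on a GPU is finding which pixels are independent of each other (for example along skewed diagonals) or choosing a different dithering scheme altogether.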

[+] kaoD|7 years ago|reply

Conway's Game of Life.

Easily parallelizable, visually attractive, fun to code and use, has a narrow scope but can be improved with goodies (e.g. color for cell age). If you implement realtime interactivity, it forces you to bridge the gap between GPU and CPU world, which is a skill on its own.

[+] llukas|7 years ago|reply

There is a caching CPU implementation that is very fast and not trivial to implement on a GPU: https://en.wikipedia.org/wiki/Hashlife

A naive implementation should be a good problem to start with.
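As a starting point, a naive version can be sketched in a few lines of Python. The names `next_cell` and `life_step` are illustrative, and wrap-around edges are an assumption made here to avoid special cases. The per-cell function reads only the old grid, so it maps directly onto one GPU thread per cell.

```python
def next_cell(grid, y, x):
    """New state of one cell, computed only from the old grid."""
    h, w = len(grid), len(grid[0])
    neighbours = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy or dx:
                # Wrap around the edges (toroidal world).
                neighbours += grid[(y + dy) % h][(x + dx) % w]
    alive = grid[y][x]
    return 1 if neighbours == 3 or (alive and neighbours == 2) else 0


def life_step(grid):
    """One generation; every cell could be computed in parallel."""
    return [[next_cell(grid, y, x) for x in range(len(grid[0]))]
            for y in range(len(grid))]


if __name__ == "__main__":
    blinker = [[0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 1, 1, 1, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0]]
    for row in life_step(blinker):
        print(row)
```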

[+] Moribund|7 years ago|reply

Super-million particle engine. Draw more than a million squares to the screen and move them according to some logic.

I've implemented one on almost every hardware I've touched (including the browser).

It looks amazing, relatively easy to do but has some great bits of learning along the way.

Good luck.
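For anyone unsure what "move them according to some logic" might look like, here is a minimal Python sketch of one update step. Everything in it is an assumption chosen for illustration: the struct-of-arrays layout, the explicit Euler step, and the bounce-off-the-walls rule. The loop body is independent per particle, so on a GPU it becomes one thread per index.

```python
def step_particles(xs, ys, vxs, vys, dt, width, height):
    """Advance all particles by dt and bounce them off the box walls.

    Positions and velocities live in separate flat lists
    (struct-of-arrays), the layout that usually suits GPUs best.
    """
    for i in range(len(xs)):                   # on a GPU: i = global thread index
        x = xs[i] + vxs[i] * dt
        y = ys[i] + vys[i] * dt
        if x < 0.0 or x > width:               # hit a vertical wall
            vxs[i] = -vxs[i]
            x = min(max(x, 0.0), width)
        if y < 0.0 or y > height:              # hit a horizontal wall
            vys[i] = -vys[i]
            y = min(max(y, 0.0), height)
        xs[i], ys[i] = x, y


if __name__ == "__main__":
    xs, ys = [0.5, 0.2], [0.5, 0.5]
    vxs, vys = [1.0, 0.1], [0.0, -1.0]
    step_particles(xs, ys, vxs, vys, 1.0, 1.0, 1.0)
    print(xs, ys, vxs, vys)
```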

[+] mailslot|7 years ago|reply

An unconventional suggestion: If you are fluent in C & C++, the TensorFlow codebase is decent. The kernels are implemented in both CPU & CUDA, so you can have a side-by-side comparison in code and performance without writing the boilerplate. You can fork your own branch to implement your own extensions and play around. TensorFlow isn’t only for deep learning.

[+] ArtWomb|7 years ago|reply

Lots of great suggestions here. The real prize may be something like getting real-time gpu-accelerated image and video filters on mobile devices. As an alternative to OpenCL / CUDA you can try Vulkan GLSL compute shaders targeting an Android GPU. See NDK docs.

Another possibility arises with active development from the JuliaLang GPU team. As far as I know there is no GPU-accelerated image processing library for the language yet, but all the popular image format encoders exist.

Intro to JuliaGPU Ecosystem

https://www.youtube.com/watch?v=6ntJ_al4oXA

Best of luck ;)

[+] nobody271|7 years ago|reply

I learned a little OpenCL a few years ago just because I wanted to see how GPUs were programmed. I tried several books:

- Heterogeneous Computing with OpenCL (http://www.hds.bme.hu/~fhegedus/C++/Heterogeneous_computing_...) - informative but not many examples

- The Art of Multiprocessor Programming (https://www.amazon.com/Art-Multiprocessor-Programming-Revise...) - I had a hard time getting traction with this book.

- OpenCL Parallel Programming Development Cookbook (https://www.amazon.com/dp/B00ESX1AH2/ref=dp-kindle-redirect?...) - Not a great reference but it had some easy to follow examples.

- Actually a few other books you might find when searching for parallel processing or parallel algorithms which just turn out to be entirely abstract math books.

People would ask me why I wanted to learn to program on a GPU and I didn't have an answer. Surely I would find an answer in one of those books. I saved a few of the projects:

- Edge detection (https://mega.nz/#!LJUwmLSa!dRijnB1xVhI9RAC1Xac_xRhT2IsfDG2sJ...) - fun!

- GPU template (https://mega.nz/#!yAsxATzb!Y4-9zRMCTSYHX1pKxWPQPl8WNDgnWkSAU...) - write GPU code with JavaScript

I have another one for bitonic sort somewhere (a parallel sort that sadly isn't even as good as quick sort).
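For reference, bitonic sort can be sketched on the CPU as below. This is not the saved project mentioned above, just the standard iterative formulation written in Python, and it assumes the input length is a power of two. For a fixed pair `(k, j)` every index `i` touches a disjoint pair of elements, so the inner `for` loop is what would become one parallel kernel launch. The network performs on the order of n log² n comparisons, which is part of why it loses to quicksort on a single core.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two (ascending)."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                      # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:                   # distance between compared elements
            for i in range(n):         # on a GPU: one thread per i
                partner = i ^ j
                if partner > i:        # each pair is handled once
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


if __name__ == "__main__":
    print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```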

The projects I enjoyed most were image filters (like edge detection). You could do a project that implements various image filters. If you did that you would not only get experience writing CUDA but you would learn how a lot of different filters are done.

[+] jason_slack|7 years ago|reply

I've been delving into OpenCL to parallelize some algorithmic trading stuff I have been doing. One thing I did a few days ago was to thread the way I load data into a MySQL Server. What took over an hour normally is down to under 10 mins.

I'd be interested in knowing what GPUs you were/are using. I'm using a Sonnet eGFX Breakaway Box 550 (with a Radeon RX Vega 56 card).

[+] lamchob|7 years ago|reply

Solvers for linear and differential equations are very interesting. Look at geometric multigrid or asymmetric grid solvers.

Graph processing, sorting and histograms can also be very interesting.

Ray Tracing is also a classic application.

[+] hiesenbug|7 years ago|reply

You can create a program that calculates the histogram of an image or write a Sobel filter for an image. Sobel is fairly simple as it's just a convolution with two small 3x3 kernels. Get familiar with how to manage memory between the CPU and GPU first, and with the different types of memory you have available on a GPU.
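Both exercises are easy to prototype on the CPU first. The sketch below, in plain Python with made-up function names, computes a 256-bin histogram and a Sobel gradient magnitude using the standard 3x3 Sobel kernels; border pixels are simply left at zero. On a GPU the Sobel loop is one independent thread per pixel, while the histogram is the harder one because many threads increment the same bin and need atomics or per-block partial histograms.

```python
import math

GX = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))      # horizontal gradient kernel
GY = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))      # vertical gradient kernel


def histogram(img, bins=256):
    """Count how often each integer intensity 0..bins-1 occurs."""
    counts = [0] * bins
    for row in img:
        for v in row:
            counts[v] += 1                     # on a GPU: an atomic add
    return counts


def sobel(img):
    """Gradient magnitude per interior pixel; the border stays 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for ky in range(3):
                for kx in range(3):
                    p = img[y + ky - 1][x + kx - 1]
                    gx += GX[ky][kx] * p
                    gy += GY[ky][kx] * p
            out[y][x] = math.hypot(gx, gy)
    return out


if __name__ == "__main__":
    edge = [[0, 0, 10, 10] for _ in range(3)]
    print(sobel(edge)[1])
    print(histogram(edge)[:11])
```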

[+] VLM|7 years ago|reply

Abstract answer: Find FPGA class notes and lab notes online

Very specific answer: I always wanted to do the FPGA lab where you simulate the vibrating parts of a percussion instrument like a drum head live in real time with a push button to hit the drum and an audio out. I suppose a CUDA GPU version would output a .wav file. I suppose if you can get Game Of Life working this is a logical next step where you're simulating a 2-D structure under load rather than an abstract 2-D automaton. I wonder what a drum head looks like as a topo-map slowed down by a factor of 10K or so, probably interesting looking with nodes and reflections all over.
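A rough CPU prototype of the drum head idea, written in plain Python, might look like the following. It is only a sketch under several assumptions: a square membrane clamped at the edges, the standard leapfrog discretization of the 2D wave equation, a made-up damping factor, and a single "pickup" cell whose displacement is recorded as the audio signal. The parameter `c2` stands for (c·dt/dx)² and should stay at or below 0.5 for this scheme to remain stable.

```python
def simulate_drum(size=16, steps=200, c2=0.25, damping=0.999):
    """Return the displacement at a pickup point for each time step."""
    prev = [[0.0] * size for _ in range(size)]     # u at time t-1
    cur = [[0.0] * size for _ in range(size)]      # u at time t
    nxt = [[0.0] * size for _ in range(size)]      # u at time t+1
    cur[size // 2][size // 2] = 1.0                # "hit" the drum in the middle
    py, px = size // 2, size // 4                  # where we listen
    samples = []
    for _ in range(steps):
        for y in range(1, size - 1):               # edges stay clamped at 0
            for x in range(1, size - 1):
                lap = (cur[y - 1][x] + cur[y + 1][x] +
                       cur[y][x - 1] + cur[y][x + 1] - 4.0 * cur[y][x])
                nxt[y][x] = damping * (2.0 * cur[y][x] - prev[y][x] + c2 * lap)
        # Rotate the three buffers by reference, without copying.
        prev, cur, nxt = cur, nxt, prev
        samples.append(cur[py][px])
    return samples


if __name__ == "__main__":
    signal = simulate_drum()
    print(len(signal), max(signal), min(signal))
```

Scaling the samples to 16-bit integers and writing them with Python's `wave` module would give the .wav file; dumping `cur` every few steps gives the slowed-down topo-map view.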

[+] grusel|7 years ago|reply

You may want to look into Cupy. It is replicating the Numpy API on GPUs. It is not too complicated but fun to add missing functions. Picking easier parts first guides you closer and closer to becoming fluent in GPU programming.

[+] hyperpallium|7 years ago|reply

My experience: just make sure you're already confident implementing it on a CPU first. GPU coding is hard enough without the problem itself being difficult.

If you want to cover "all the details that [you] have learned", it'll help to state what you learned. Also, what exercises you've done already - what level are you at?

1. A shadertoy (web, desktop or app) for GPU programming, with conveniences for image processing. That deals with all the BS boilerplate you learnt, from a general point of view, and will be useful.

2. Fluid simulation (but, there's a lot of non-GPU maths to understand).

[+] gmiller123456|7 years ago|reply

I just started working through "Professional CUDA C Programming": Ty McKercher , Max Grossman , John Cheng, and am finding it very interesting. It's from 2014, so I'd think there'd be something newer, but it's the best thing I could find on Safari Books. There's a lot of newer stuff that focuses on specific topics (like AI), but it's the best thing I could find that was general purpose.

[+] AndrewStephens|7 years ago|reply

WebGL is surprisingly easy once you have some boiler-plate code to load the textures and shaders, and has the advantage that you can shove the result on a web site to show others.

Here is a very simple project I did last year:

https://sheep.horse/2017/9/crossfading_photos_with_webgl_-_b...

[+] sbhn|7 years ago|reply

Cuda sha256

https://github.com/Sean-Bradley/CUDALookupSHA256

Cuda ripemd160

https://github.com/Sean-Bradley/CUDALookupRipeMD60

[+] singularity2001|7 years ago|reply

Don't waste your time on CUDA. Wait until an open architecture replaces this proprietary clusterf*ck.

[+] brutus1213|7 years ago|reply

I disagree partly. CUDA is the de facto standard in DL at the moment, and is very well designed. It has academic roots and the neat ideas shine through.

Why did I say partly? So many optimization layers exist already (e.g. BLAS/cuBLAS). One may not really need to get down to the CUDA level.

[+] twtw|7 years ago|reply

Given that AMD's approach has shifted to implementing CUDA (under the name "HIP") and providing tools to automatically find/replace CUDA with HIP, I don't think the CUDA API is going anywhere.

37 comments