top | item 39977832

mjfisher | 1 year ago

> One time I reduced something from a 15 minute execution time to running hundreds of times per second

That's too good a story not to have just a little more detail. Are you willing to share more?

moshegramovsky | 1 year ago

Sure. It was a fairly complicated image processing algorithm, but not necessarily something that you would want to go through a lot of trouble to implement on the GPU. At least not until you're desperate. And I should add, the results are pretty boring. It doesn't even generate anything interesting.

I read the paper that described the algorithm and implemented it on the CPU, thinking, quite stupidly, that it would be fast enough. Not fast, but fast enough. Nope. Performance was utterly horrible even on my tiny 128x128 pixel test case. The hoped-for use cases, data sets of 4096x4096 or even 10000x10000, were hopeless.

Performance was bad for a few key reasons: the original data was floating point, and it went through several complicated transformations before being quantized to RGBA. The transforms meant that the loop headers were like two lines total, with an ~800-line inner loop body, plus quantization of course (which could not be done until you had the final results). In GLSL there are built-in functions for all the transformations, and most of them are hyper-optimized, or even backed by dedicated silicon in many cases. FMA, for example.

So I wrote some infra to make it possible to use a compute shader to do it. And I use the term 'infra' quite loosely. I configured our application to link to OpenGL and then added support for compute shaders. After a few days of pure hell, I was able to upload a texture, modify the memory with a compute shader, and then download the result. The whole notion of configuring workgroups and local groups was like having my pants set on fire. Especially for someone who had never worked on a GPU before. But OpenGL, it's just a simple C API, right? What could go wrong? There's all these helpful enumerations so the functions will be easy to call. And pixel formats, I know what those are. Color formats? Oh this won't be hard.
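The workgroup arithmetic that causes so much pain boils down to one ceiling division: the shader declares a local size, and the host has to dispatch enough groups to cover every pixel. A minimal sketch, assuming a 16x16 local size (the actual local size he used isn't stated):

```c
/* If the shader declares layout(local_size_x = 16, local_size_y = 16),
 * then for a WxH image the host rounds the group counts UP so that
 * every pixel is covered by some invocation:
 *
 *     glDispatchCompute(groups_for(W, 16), groups_for(H, 16), 1);
 *
 * Edge groups have invocations that fall outside the image, so the
 * shader must bounds-check gl_GlobalInvocationID before writing. */
static unsigned groups_for(unsigned pixels, unsigned local_size)
{
    return (pixels + local_size - 1) / local_size; /* ceiling division */
}
```

For the 4096x4096 case with a 16x16 local size, that comes out to a 256x256 grid of workgroups, each running 256 invocations.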

But once all that plumbing was working, it only took a few more days to get the compute shader itself right. The hardest part was reconfiguring my brain to stop thinking about the algorithm in terms of traversing the image in a double nested for loop - which is what you would do on the CPU. Actually, the first time I wrote it, that's exactly what I did, in the shader. Yes, I actually did that. And it wasn't fast at all. Oh man, it felt like I was fucked.
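That mental-model shift can be sketched in plain C. On the CPU you walk the image yourself; on the GPU there is no loop in the shader at all - each invocation gets its own coordinates from gl_GlobalInvocationID and touches exactly one pixel. Here `process_pixel` is a hypothetical stand-in for the real algorithm:

```c
#include <stddef.h>

/* Hypothetical stand-in for the real per-pixel work. */
static float process_pixel(const float *img, int w, int x, int y)
{
    return img[(size_t)y * w + x] * 2.0f;
}

/* CPU view: you drive the traversal with a double nested loop. */
static void run_cpu(const float *in, float *out, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            out[(size_t)y * w + x] = process_pixel(in, w, x, y);
}

/* GPU view: this models ONE shader invocation. The (x, y) pair comes
 * from gl_GlobalInvocationID, and the driver launches enough invocations
 * to cover the whole image - the loop lives in the hardware, not the code.
 * Writing the double loop INSIDE the shader serializes all that work
 * onto single invocations, which is the slow version described above. */
static void invocation(const float *in, float *out, int w, int x, int y)
{
    out[(size_t)y * w + x] = process_pixel(in, w, x, y);
}
```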

But in the end, it could process the 4096x4096 use case at 75 FPS, and even better, when I learned about array textures, I found that it could do even more work in parallel. That's how I got it from 15 minutes to hundreds of frames per second.