There are only 4 optimizations in computer science: inlining, partial evaluation, dead code elimination, & caching. It looks like AI researchers just discovered inlining & they already knew about caching so eventually they'll get to partial evaluation & dead code elimination.
Your list is so short it doesn't even include the basics such as reordering operations.
It also feels incredibly snarky to say "they knew about caching" and that they will get to partial evaluation and dead code elimination, when those seem to be particularly useless (beyond what the CUDA compiler itself does) when it comes to writing GPU kernels or doing machine learning in general.
You can't do any partial evaluation of a neural network, because the activation functions interrupt the chain of tensor multiplications. If you remove the activation function, you end up with two linear layers that are equivalent to a single linear layer, which defeats the point: you could have trained a single-layer network instead and achieved the same accuracy with correspondingly shorter training/inference time.
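A quick numpy sketch of why the nonlinearity matters (shapes and values arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(8)          # input vector
    W1 = rng.standard_normal((16, 8))   # first linear layer
    W2 = rng.standard_normal((4, 16))   # second linear layer

    # Without an activation, the two layers "partially evaluate" into
    # a single matrix: the composition of linear maps is still linear.
    fused = W2 @ W1
    assert np.allclose(W2 @ (W1 @ x), fused @ x)

    # With a nonlinearity in between, no single constant matrix can
    # reproduce the computation, so the layers cannot be collapsed.
    relu = lambda v: np.maximum(v, 0)
    y = W2 @ relu(W1 @ x)   # not equal to fused @ x in general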
Dead code elimination is even more useless, since most kernels are special-purpose to begin with and you can't remove tensors without altering the architecture. Instead of adding useless tensors only to remove them later, you could have simply used a better architecture.
AI actually has some optimizations unique to the field. You can, in fact, optimize a model itself to make it work at all; not many other disciplines put as much emphasis on that as AI does.
That's a bit trite tbh. We all know of these techniques, but actually implementing them on GPUs in a low-overhead manner that maintains the model's fidelity is challenging. It's much more than just breaking out the old CS book and picking the next idea from there.
So if I'm understanding correctly, you decompose kernels into their per_sm_workload, then figure out the per_sm_data_dependency, and then you can schedule sm_workloads from the next kernel to start running as soon as their data dependencies are satisfied, without needing to wait for the other SMs from the previous kernel to finish.
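i.e. conceptually something like this toy loop (all names and structure made up, just restating my mental model):

    # Toy model: block B_i of the next kernel only reads the output of
    # block A_i of the previous kernel, so it can launch as soon as A_i
    # finishes instead of waiting for a global barrier between kernels.
    a_blocks = [f"A{i}" for i in range(4)]
    b_blocks = {f"B{i}": {f"A{i}"} for i in range(4)}

    ready = list(a_blocks)       # A's blocks have no dependencies
    pending = dict(b_blocks)     # B's blocks wait on their inputs
    finished = set()

    while ready:
        block = ready.pop()                  # an SM picks up a ready block
        print("run", block)
        finished.add(block)
        for blk, deps in list(pending.items()):
            if deps <= finished:             # data dependency satisfied
                ready.append(blk)            # B_i overlaps remaining A blocks
                del pending[blk]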
In this case, are you strictly fusing pre-defined kernels, or are you also optimizing them? Is this complementary to your earlier work on search-based compilers?
That's reasonably accurate. We're fusing both pre-defined operations as well as codegenned operations. Block-level operations live inside the search space, as do kernel-, warp-, and thread-level operations. Since it's a unified search space, we can look through tons of combinations of kernel, block, warp, and thread level ops. When we go to compile them to runnable code, thread ops get compiled to warp ops, warp ops get compiled to block ops, and block ops get compiled to kernel ops (megakernels live here!), so at the end of the day everything that gets run is a kernel.
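Very roughly, the lowering goes in one direction, something like this (not our actual IR, just the shape of it):

    # Toy sketch of the lowering direction: each level wraps the one
    # below it until everything that remains is a launchable kernel.
    def lower_threads_to_warp(thread_ops):
        return {"warp": thread_ops}      # thread ops become one warp op

    def lower_warps_to_block(warp_ops):
        return {"block": warp_ops}       # warp ops become one block op

    def lower_blocks_to_kernel(block_ops):
        return {"kernel": block_ops}     # block ops become one kernel op;
                                         # fusing many blocks here is where
                                         # megakernels come from

    thread_ops = ["mul", "add"]          # e.g. a per-thread multiply-add
    kernel = lower_blocks_to_kernel(
        [lower_warps_to_block([lower_threads_to_warp(thread_ops)])])
    print(kernel)  # {'kernel': [{'block': [{'warp': ['mul', 'add']}]}]}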
In other words, very complementary to our search-based approach.
johndough|1 month ago
Strassen algorithm for matrix multiplication (see the sketch after this list) https://en.wikipedia.org/wiki/Strassen_algorithm
FFT convolution https://dsp.stackexchange.com/a/63211
Winograd convolution https://www.cv-foundation.org/openaccess/content_cvpr_2016/p...
And of course optimization algorithms themselves.
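To make the first one concrete, here's one level of Strassen in numpy (a toy sketch only; real implementations recurse and fall back to plain matmul below a cutoff size):

    import numpy as np

    def strassen_one_level(A, B):
        # One level of Strassen: 7 block products instead of 8.
        # Assumes square matrices with even dimensions.
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)

        C = np.empty_like(A)
        C[:n, :n] = M1 + M4 - M5 + M7
        C[:n, n:] = M3 + M5
        C[n:, :n] = M2 + M4
        C[n:, n:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(4, 4); B = np.random.rand(4, 4)
    assert np.allclose(strassen_one_level(A, B), A @ B)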