top | item 38494478 (no title)

TheGeminon|2 years ago
There is a more detailed explanation at https://unsloth.ai/introducing

apsec112|2 years ago
That... doesn't really explain how they can get such a high number? Standard FLOP efficiency on fine-tuning big models is like 30-40%. How can you get 750%?

danielhanchen|2 years ago
Hey! Great question! That's what I'm confused about as well!
So in GPUs the goal is to saturate the GPU with matrix multiplies instead of data movement. I'll write a more detailed blog but approximately:
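A rough way to see why you want the GPU busy with matrix multiplies rather than data movement is arithmetic intensity (FLOPs per byte of memory traffic). A minimal back-of-envelope sketch, not Unsloth's actual code, assuming ideal caching (each operand read once):

```python
# Arithmetic intensity (FLOPs per byte moved) for an N x N fp32 matmul
# vs. an N x N elementwise add. High intensity -> compute-bound (good);
# low intensity -> memory-bound (the GPU stalls on data movement).

def matmul_intensity(n: int, dtype_bytes: int = 4) -> float:
    flops = 2 * n**3                      # n^2 outputs, each a length-n dot product
    bytes_moved = 3 * n**2 * dtype_bytes  # read A, read B, write C
    return flops / bytes_moved

def elementwise_intensity(n: int, dtype_bytes: int = 4) -> float:
    flops = n**2                          # one add per element
    bytes_moved = 3 * n**2 * dtype_bytes  # read two inputs, write one output
    return flops / bytes_moved

print(matmul_intensity(4096))       # grows like n/6, ~682.7 here: compute-bound
print(elementwise_intensity(4096))  # constant 1/12, ~0.083: memory-bound
```

This is why fusing elementwise ops into custom Triton kernels pays off: standalone they are hopelessly memory-bound, so every fused kernel removes a round trip to GPU memory.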
1. Flash Attention v2 reduces the time taken by 17% or so
2. RoPE Triton kernels: -7.1%
3. RMS Layernorm in Triton: -3.1%
4. Cross Entropy in Triton: -1%
5. Manual autograd for MLP: -4%
6. Manual QKV autograd: -2%
7. Manual O autograd: -2%
8. Smart cache evictions and reduced data duplications etc: -30%
9. And other tricks in the Max and Pro versions make it 30x faster
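The percentages above can be combined to see how small per-step wins compound. A minimal sketch, assuming each reduction applies to the time remaining after the previous steps (my reading of the list, not necessarily how the author measured):

```python
# Compound the per-step time reductions from items 1-8 above (the 30x
# Max/Pro figure is excluded since no per-step breakdown is given).
reductions = [0.17, 0.071, 0.031, 0.01, 0.04, 0.02, 0.02, 0.30]

remaining = 1.0
for r in reductions:
    remaining *= 1.0 - r  # each step shaves r off the time that is left

speedup = 1.0 / remaining
print(f"remaining time: {remaining:.2%}, overall speedup: {speedup:.2f}x")
```

Under this multiplicative reading, the eight listed steps alone give roughly a 2.1x end-to-end speedup, with the remaining gap to 30x coming from the unspecified Max/Pro tricks.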
You can see it's just tricks in each step, which accumulate together to make it go faster.
I'll write up a blog post to detail it all in the future!!!