top | item 38494478 (no title)

TheGeminon|2 years ago
There is a more detailed explanation at https://unsloth.ai/introducing

apsec112|2 years ago
That... doesn't really explain how they can get such a high number? Standard FLOP efficiency on fine-tuning big models is like 30-40%. How can you get 750%?

danielhanchen|2 years ago
Hey! Great question! That's what I'm confused about as well!
So in GPUs the goal is to saturate the GPU with matrix multiplies instead of data movement. I'll write a more detailed blog but approximately:
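A rough way to see why you want the GPU busy with matrix multiplies rather than data movement is arithmetic intensity (FLOPs per byte of memory traffic). A minimal back-of-envelope sketch, not Unsloth's actual code, assuming ideal caching (each operand read once):

```python
# Arithmetic intensity (FLOPs per byte moved) for an N x N fp32 matmul
# vs. an N x N elementwise add. High intensity -> compute-bound (good);
# low intensity -> memory-bound (the GPU stalls on data movement).

def matmul_intensity(n: int, dtype_bytes: int = 4) -> float:
    flops = 2 * n**3                      # n^2 outputs, each a length-n dot product
    bytes_moved = 3 * n**2 * dtype_bytes  # read A, read B, write C
    return flops / bytes_moved

def elementwise_intensity(n: int, dtype_bytes: int = 4) -> float:
    flops = n**2                          # one add per element
    bytes_moved = 3 * n**2 * dtype_bytes  # read two inputs, write one output
    return flops / bytes_moved

print(matmul_intensity(4096))       # grows like n/6, ~682.7 here: compute-bound
print(elementwise_intensity(4096))  # constant 1/12, ~0.083: memory-bound
```

This is why fusing elementwise ops into custom Triton kernels pays off: standalone they are hopelessly memory-bound, so every fused kernel removes a round trip to GPU memory.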
1. Flash Attention v2 reduces the time taken by 17% or so
2. RoPE Triton kernels: -7.1%
3. RMS Layernorm in Triton: -3.1%
4. Cross Entropy in Triton: -1%
5. Manual autograd for MLP: -4%
6. Manual QKV autograd: -2%
7. Manual O autograd: -2%
8. Smart cache evictions and reduced data duplications etc: -30%
9. And other tricks in the Max and Pro versions make it 30x faster
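The percentages above can be combined to see how small per-step wins compound. A minimal sketch, assuming each reduction applies to the time remaining after the previous steps (my reading of the list, not necessarily how the author measured):

```python
# Compound the per-step time reductions from items 1-8 above (the 30x
# Max/Pro figure is excluded since no per-step breakdown is given).
reductions = [0.17, 0.071, 0.031, 0.01, 0.04, 0.02, 0.02, 0.30]

remaining = 1.0
for r in reductions:
    remaining *= 1.0 - r  # each step shaves r off the time that is left

speedup = 1.0 / remaining
print(f"remaining time: {remaining:.2%}, overall speedup: {speedup:.2f}x")
```

Under this multiplicative reading, the eight listed steps alone give roughly a 2.1x end-to-end speedup, with the remaining gap to 30x coming from the unspecified Max/Pro tricks.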
You can see it's just tricks in each step, which accumulate together to make it go faster.
I'll write up a blog post to detail it all in the future!!!