estreeper | 1 year ago
VRAM (GB) = 1.2 * number of parameters (in billions) * bits per parameter / 8
The 1.2 is just an estimation factor to account for the VRAM needed by things other than the model parameters [0]. Because quantization is often nearly free in terms of output quality, you should usually look for quantized versions. For example, Llama 3.2 uses 16-bit parameters but has a 4-bit quantized version, and from the formula above you can see that this lets you run a model 4x larger in the same VRAM.
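As a quick sketch, the estimate above is easy to compute yourself (the function name and 1.2 overhead default are just from the formula in this comment, not a standard API):

```python
def vram_gb(params_billions: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: overhead * params (billions) * bits per param / 8."""
    return overhead * params_billions * bits_per_param / 8

# A hypothetical 70B model, full 16-bit vs. 4-bit quantized:
print(vram_gb(70, 16))  # 1.2 * 70 * 16 / 8 = 168 GB
print(vram_gb(70, 4))   # 1.2 * 70 * 4 / 8  =  42 GB
```

The 4-bit version needs a quarter of the VRAM, which is why quantized models are usually the practical choice on consumer GPUs.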
Having enough VRAM will allow you to run a model, but performance depends on many other factors. For a much deeper dive into how all of this works, along with performance-per-dollar recommendations (though from last year!), Tim Dettmers wrote this excellent article: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
Worth mentioning for the benefit of those who don't want to buy a GPU: there are also models which have been converted to run on CPU.
[0] https://blog.runpod.io/understanding-vram-and-how-much-your-...
morcus | 1 year ago