estreeper | 1 year ago
VRAM (GB) = 1.2 * number of parameters (in billions) * bits per parameter / 8
The 1.2 is just an estimation factor to account for the VRAM needed by things other than the model parameters [0]. Because quantization is often nearly free in terms of output quality, you should usually look for quantized versions. For example, Llama 3.2 uses 16-bit parameters but has a 4-bit quantized version, and from the formula above you can see that this lets you run a model 4x larger in the same VRAM.
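As a quick sketch, the estimate above is easy to compute yourself (the function name and 1.2 overhead default are just from the formula in this comment, not a standard API):

```python
def vram_gb(params_billions: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: overhead * params (billions) * bits per param / 8."""
    return overhead * params_billions * bits_per_param / 8

# A hypothetical 70B model, full 16-bit vs. 4-bit quantized:
print(vram_gb(70, 16))  # 1.2 * 70 * 16 / 8 = 168 GB
print(vram_gb(70, 4))   # 1.2 * 70 * 4 / 8  =  42 GB
```

The 4-bit version needs a quarter of the VRAM, which is why quantized models are usually the practical choice on consumer GPUs.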
Having enough VRAM will allow you to run a model, but performance depends on many other factors. For a much deeper dive into how all of this works, along with performance-per-dollar recommendations (though from last year!), Tim Dettmers wrote this excellent article: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
Worth mentioning for the benefit of those who don't want to buy a GPU: there are also models which have been converted to run on CPU.
[0] https://blog.runpod.io/understanding-vram-and-how-much-your-...
morcus | 1 year ago