
extheat | 1 year ago

A simple equation to approximate it is `memory_in_gb = parameters_in_billions * (bits/8)`

So at 32 bit full precision, 70 * (32 / 8) ~= 280GB

fp16, 70 * (16 / 8) ~= 140GB

8 bit, 70 * (8 / 8) ~= 70GB

4 bit, 70 * (4 / 8) ~= 35GB
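The rule of thumb above can be sketched in a few lines (a minimal sketch; the function name is my own, and it ignores KV cache and runtime overhead, which the comments below get into):

```python
def model_memory_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB for a model of the given size.

    memory_in_gb = parameters_in_billions * (bits / 8).
    Ignores KV cache, activations, and runtime overhead.
    """
    return params_billions * bits / 8

# The four cases above, for a 70B model:
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{model_memory_gb(70, bits):.0f} GB")
# 32-bit: ~280 GB, 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB
```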

However, in things like llama.cpp quants the precision is sometimes mixed, so some of the weights are Q5, some Q4, etc. In that case you usually want to assume the higher number.
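The mixed case works out to a weighted average over bit widths. A sketch under made-up fractions (real llama.cpp quant mixes vary per tensor type, so treat the 30/70 split below as purely illustrative):

```python
def mixed_quant_memory_gb(params_billions, bit_fractions):
    """Approximate memory when weight groups use different bit widths.

    bit_fractions maps bits-per-weight -> fraction of all weights
    stored at that width (fractions should sum to 1).
    """
    avg_bits = sum(bits * frac for bits, frac in bit_fractions.items())
    return params_billions * avg_bits / 8

# Hypothetical mix: 30% of weights at 5 bits, 70% at 4 bits
print(mixed_quant_memory_gb(70, {5: 0.3, 4: 0.7}))
# ~37.6 GB -- between the pure 4-bit (35 GB) and 5-bit figures
```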


moffkalast | 1 year ago

Well, that, and you also need a fair bit more space for the KV cache, which can be a bit unpredictable. Models without GQA, flash attention, or 4-bit cache support are really terrible in that regard, and it also scales with context length. Haven't found a good rule of thumb for that yet.
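For what it's worth, the KV cache size at a given context length follows directly from the architecture: one K and one V vector per layer, per KV head, per token. A sketch (batch size 1; the Llama-2-70B-style numbers below, 80 layers / 8 KV heads via GQA / head dim 128, match its published config, but double-check for other models):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size in GB.

    Factor of 2 is for storing both K and V; bytes_per_elem=2 is fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Llama-2-70B-style config (GQA: 8 KV heads instead of 64) at 4096 context:
print(kv_cache_gb(80, 8, 128, 4096))  # ~1.3 GB
# The same model without GQA (64 KV heads) would need 8x that, ~10.7 GB,
# which is why non-GQA models are so much worse here.
```

This is why GQA and a 4-bit cache matter so much: KV head count and bytes per element both enter the product linearly.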