I don't see any explanation for why they trained 8B instead of 7B.
I thought that if you have a 16GB GPU you can fit a 14GB (7B × 16 bits) model into it, but how does it fit if the model is exactly 16GB?
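As a rough illustration of that arithmetic (a minimal sketch of weight memory only; the weight_gb helper is made up here, and activations, KV cache, and runtime overhead are ignored):

    def weight_gb(params_billions: float, bits_per_weight: float) -> float:
        # Weight memory only: parameters * bits per weight / 8 bytes, with GB = 1e9 bytes.
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    print(weight_gb(7, 16))   # 14.0 -> a 7B model in fp16 is ~14 GB of weights
    print(weight_gb(8, 16))   # 16.0 -> an 8B model in fp16 is ~16 GB of weights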
The bigger size probably comes from the larger vocabulary in the tokenizer. But most people run this model quantized, to 8 bits at the least, and it still holds up reasonably well down to 3-4 bpw.
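The same back-of-the-envelope math at the quantization levels mentioned above (again just a sketch; real quantized checkpoints are a bit larger because of quantization scales and metadata):

    # Approximate weight size of an 8B-parameter model at common bits per weight.
    params = 8e9
    for bpw in (16, 8, 4, 3):
        gb = params * bpw / 8 / 1e9
        print(f"{bpw:>2} bpw: ~{gb:.0f} GB")   # 16 bpw ~16 GB, 8 ~8 GB, 4 ~4 GB, 3 ~3 GB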
How does that affect anything? It still uses 16-bit floats in the model, doesn't it?