item 40531940

kir-gadjello | 1 year ago

While llama3-8b might be slightly more brittle under quantization, llama3-70b really surprised me and others[1] with how well it performs even in the 2-3 bits per parameter regime. It requires one of the most advanced quantization methods (IQ2_XS specifically) to work, but the reward is a SoTA LLM that fits on one 4090 GPU with 8K context (KV-cache uncompressed btw) and allows for advanced use cases such as powering the agent engine I'm working on: https://github.com/kir-gadjello/picoagent-rnd
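Some back-of-the-envelope arithmetic shows why this fits on a single 24 GB card (assuming an effective rate of roughly 2.4 bits per weight for an IQ2_XS-class quant; the exact figure varies by scheme and model):

```python
# Rough memory math for a 70B-parameter model.
# ~2.4 bits/weight is an assumed effective rate for an IQ2_XS-class
# quant (scales and metadata push it above the nominal 2 bits).
params = 70e9
bits_per_weight = 2.4

quant_gib = params * bits_per_weight / 8 / 2**30
print(f"quantized weights: {quant_gib:.1f} GiB")   # ~19.6 GiB, fits in 24 GB

fp16_gib = params * 16 / 8 / 2**30
print(f"same model at fp16: {fp16_gib:.1f} GiB")   # ~130 GiB, needs multi-GPU
```

The leftover ~4 GiB of a 4090 is what holds the uncompressed KV-cache for the 8K context.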

For me it completely replaced strong models such as Mixtral-8x7B and DeepSeek-Coder-Instruct-33B.

1. https://www.reddit.com/r/LocalLLaMA/comments/1cst400/result_...


d13|1 year ago

How does it compare against unquantized Llama 3 8B at fp16? I’ve been using that locally and it’s almost replaced GPT4 for me. Runs in about 14GB of VRAM.

iwontberude|1 year ago

llama3 is nowhere near gpt4, though it is cool

LordDragonfang|1 year ago

What is your use case where you find it comparable to gpt4?

endofreach|1 year ago

> surprised myself and others[1] in how well it performs even in the 2..3 bits per parameter regime

I am too dumb for all of this ML stuff. Can you explain what exactly that means & why it's surprising?

m1el|1 year ago

Artificial neural networks work the following way: you have a bunch of “neurons”, each with inputs and an output. A neuron’s inputs have weights associated with them; the larger the weight, the more influence that input has on the neuron.

These weights need to be represented in our computers somehow, and usually people use IEEE 754 floating point numbers. But those take a lot of space (32 or 16 bits each), so one approach people have invented is to use a more compact representation of the weights (10, 8, down to 2 bits). This process is called quantisation.

A smaller representation also makes running the model faster, because inference is currently limited by memory bandwidth (how long it takes to read the weights from memory): going from 32 bits to 2 bits potentially gives a 16x speedup. The surprising part is that the models still produce decent results even when a lot of the information in the weights was “thrown away”.
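A toy version of this idea is round-to-nearest quantization with a single scale per tensor (purely illustrative; production schemes like IQ2_XS use grouped scales and learned codebooks, so their 2-bit error is far lower than this naive version's):

```python
import numpy as np

def quantize(w, bits):
    """Naive symmetric round-to-nearest quantization (toy sketch)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit, 1 for 2-bit
    scale = np.abs(w).max() / qmax             # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)  # fake layer weights

for bits in (8, 4, 2):
    q, s = quantize(w, bits)
    err = np.abs(dequantize(q, s) - w).mean()
    print(f"{bits}-bit mean abs reconstruction error: {err:.5f}")
```

Fewer bits means a coarser grid of representable values, so the reconstruction error grows as the bit width shrinks; the art of methods like IQ2_XS is keeping that error from hurting the model's outputs.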

renewiltord|1 year ago

Holy wow. Thank you for this. Very cool. I’ve been using 8b for things it might be worth using 70b for.