I propose a theory: LLMs do not actually require a lot of computing power. What they require is a small computer, but the models are blown out of proportion with unnecessary information to artificially increase demand for high-power hardware. So the reason models are the way they are, especially local models, is that they are deliberately badly optimized. As I said, the complexity is added on purpose to increase sales of hardware. What are your pros and what are your cons? And do not refer to secondhand information, as in "well I was told" or "well I read a paper".
tim-tday|2 days ago
I’ve built my own LLM containers, I’ve built orchestration systems for fine-tuning and model management, I’ve tried quantized models, and I’ve tested a dozen or so models of different sizes.
You can’t really get around the fact that inference on CPU is slow, and inference on GPU is gated by VRAM (you need about 1 GB of VRAM per billion parameters at 8-bit precision; quantizing below that reduces quality and increases operational toil). If you know of a consumer-level GPU with 80-128 GB of VRAM that I can buy for less than $10k, do please let me know.
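The "1 GB per billion parameters" rule of thumb above can be sketched as simple arithmetic. This is a rough weights-only estimate, assuming bytes-per-parameter by precision; the function name and precision labels are illustrative, and real deployments need additional memory for the KV cache and activations on top of this:

```python
# Approximate bytes per parameter at common precisions (assumption:
# weights only; KV cache and activation memory are not counted).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_vram_gb(params_billions: float, precision: str = "int8") -> float:
    """Rough GB of VRAM needed just to hold the model weights."""
    return params_billions * BYTES_PER_PARAM[precision]

# A 70B model at 8-bit needs ~70 GB for weights alone, which is why
# it cannot fit on a typical 24 GB consumer GPU without heavy
# quantization or offloading.
print(weights_vram_gb(70, "int8"))   # 70.0
print(weights_vram_gb(70, "fp16"))   # 140.0
print(weights_vram_gb(70, "int4"))   # 35.0
```

At 4-bit a 70B model squeezes toward 35 GB, which is still beyond single consumer cards and illustrates the quality/VRAM trade-off mentioned above.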
Short of a specific proposal, I’m going to classify your suggestion as not knowing enough about what you’re talking about for it to make any sense.