NathanielK | 4 months ago
I was not familiar with the hardware, so I was disappointed there wasn't a picture of the device. I tried to skim the article and it's a mess: inconsistent formatting and emoji, but not a single graph to visualize the benchmarks.
furyofantares | 4 months ago
I bet the input to the LLM would have been more interesting.
furyofantares | 4 months ago
It looks like it worked? Then why does it say this?
> Verdict: Inference speed scales proportionally with model size.
The author only tried one model size, and it was faster than NVIDIA's reported speed for a larger model. That's not really a "Verdict".
> Verdict: 4-bit quantization is production-viable.
That's not really something you can conclude from messing around with it and saying you like the outputs.
> GPU Inference is Fundamentally Broken
Probably not? It probably just doesn't work in llama.cpp right now? It takes a while reading this to work out that they tried Ollama and then later llama.cpp, which I'd guess is basically testing llama.cpp twice, since Ollama is built on llama.cpp. Actually I don't even believe that; I'm sure the author ran into errors that might be a pain to figure out, but there's no evidence it's anything worse than that.
But then it says this is the "root cause":
Am I to believe GPU inference is really fundamentally broken? I'm not seeing the case made here, just claims. At this point the LLM seems to have gotten confused about whether it's talking about the memory fragmentation issue or the GPU inference issue. But it's hard to believe anything from this point on in the post.