NathanielK | 4 months ago
I was not familiar with the hardware, so I was disappointed there wasn't a picture of the device. I tried to skim the article and it's a mess: inconsistent formatting and emoji, but not a single graph to visualize the benchmarks.
furyofantares | 4 months ago
I bet the input to the LLM would have been more interesting.
furyofantares | 4 months ago
It looks like it worked? Then why does it say this?
> Verdict: Inference speed scales proportionally with model size.
The author only tried one model size, and it was faster than NVIDIA's reported speed for a larger model. That's not really a "Verdict".
> Verdict: 4-bit quantization is production-viable.
That's not really something you can conclude from messing around with it and saying you like the outputs.
> GPU Inference is Fundamentally Broken
Probably not? It probably just doesn't work in llama.cpp right now? It takes a while reading this to work out that they tried Ollama and then later llama.cpp, which I'd guess is basically testing llama.cpp twice, since Ollama is built on llama.cpp. Actually I don't even believe that; I'm sure the author ran into errors that might be a pain to figure out, but there's no evidence it's anything worse than that.
But then it says this is the "root cause":
Am I to believe GPU inference is really fundamentally broken? I'm not seeing the case made here, just claims. At this point the LLM seems to have gotten confused about whether it's talking about the memory fragmentation issue or the GPU inference issue. But it's hard to believe anything from this point on in the post.