wolfgangK | 6 months ago

Only those who don't know or don't care about prompt processing speed are buying Macs for LLM inference.

esseph|6 months ago

I could well be someone who doesn't know or doesn't care, but buying a Mac also makes sense for people who want to keep their lookups private.

com2kid|6 months ago

Even 40 tokens per second is plenty for real-time use. The average person reads at ~4 words per second, and 40 tokens per second works out to 15-20 words per second.

Even useful models like gemma3 27b are hitting 22 t/s on 4-bit quants.

You aren't going to be reformatting gigabytes of PDFs or anything, but for a lot of common use cases, those speeds are fine.
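
For anyone who wants to check the arithmetic, here is a rough sketch of the tokens-to-words conversion. The words-per-token ratios are assumptions, not measurements: 0.5 roughly matches the 15-20 words per second estimate above, and 0.75 is a commonly quoted rule of thumb for English text. The function names are made up for illustration.

```python
# Back-of-the-envelope check: generation speed vs. reading speed.
READING_WPS = 4.0  # ~4 words per second, the reading-speed figure from the comment

# Assumed words-per-token ratios (not measured; they vary by tokenizer and text):
#   0.5  roughly matches the 15-20 words/s estimate for 40 t/s
#   0.75 is a commonly quoted rule of thumb for English
ASSUMED_WORDS_PER_TOKEN = (0.5, 0.75)


def words_per_second(tokens_per_second: float, words_per_token: float) -> float:
    """Convert a generation rate in tokens/s to words/s."""
    return tokens_per_second * words_per_token


def reading_headroom(tokens_per_second: float, words_per_token: float,
                     reading_wps: float = READING_WPS) -> float:
    """How many times faster than the reader the model generates."""
    return words_per_second(tokens_per_second, words_per_token) / reading_wps


if __name__ == "__main__":
    for tps in (40, 22):  # the two generation speeds mentioned above
        for ratio in ASSUMED_WORDS_PER_TOKEN:
            wps = words_per_second(tps, ratio)
            factor = reading_headroom(tps, ratio)
            print(f"{tps} t/s at {ratio} words/token: "
                  f"{wps:.1f} words/s, {factor:.1f}x reading speed")
```

Under either assumed ratio, both 40 t/s and 22 t/s come out faster than a 4 words per second reader, which is the point being made about real-time use.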