sanxiyn | 1 month ago
I think the Llama 3 focus mostly reflects demand. It may be hard to believe, but many people aren't even aware gpt-oss exists.
reactordev | 1 month ago
The 8B models are easy to run on an RTX card, which makes them a natural baseline for comparing against local inference. What Llama does on an RTX 5080 at 40 t/s, Furiosa should do at 40,000 t/s or whatever… it's an easy way to get a flat comparison across all the different hardware llama.cpp runs on.
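As a rough sketch of what that flat comparison looks like in practice, here is one way to pull a tokens/second number out of llama.cpp's bundled llama-bench on whatever machine you're on (this assumes a recent build that supports "-o json"; the model path is a placeholder, and the JSON field names may differ slightly between versions):

    import json, subprocess

    MODEL = "llama-3-8b-instruct.Q4_K_M.gguf"  # placeholder model path

    # Run llama.cpp's benchmark tool; the same invocation works on any
    # backend llama.cpp supports (CUDA, Metal, CPU, ...), which is what
    # makes the comparison "flat" across hardware.
    result = subprocess.run(
        ["llama-bench", "-m", MODEL, "-p", "512", "-n", "128", "-o", "json"],
        capture_output=True, text=True, check=True,
    )

    for run in json.loads(result.stdout):
        # Rows with n_gen > 0 are token-generation tests; avg_ts is tokens/second.
        if run.get("n_gen", 0) > 0:
            print(f"{run.get('model_filename')}: {run.get('avg_ts'):.1f} t/s")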
nl | 1 month ago
That's 86 tokens/second/chip.
By comparison, an H100 will do 2,390 tokens/second/GPU [1].
Am I comparing the wrong things somehow?
[1] https://inferencemax.semianalysis.com/
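For reference, the per-chip number is just system throughput divided by chip count; a quick sketch of that normalization (the totals and chip counts below are placeholders chosen only to reproduce the two figures above, not measurements from the linked page):

    def tokens_per_second_per_chip(total_tokens_per_s: float, num_chips: int) -> float:
        # Normalize a whole-system throughput figure to a per-chip figure.
        return total_tokens_per_s / num_chips

    # Hypothetical system totals, picked so the per-chip values come out to the
    # quoted 86 and 2,390; not data from InferenceMAX.
    systems = {
        "8-chip accelerator rack (hypothetical)": (688.0, 8),
        "8x H100 node (hypothetical)": (19_120.0, 8),
    }

    for name, (total, chips) in systems.items():
        print(f"{name}: {tokens_per_second_per_chip(total, chips):.0f} tokens/s/chip")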
zmmmmm | 1 month ago
It still kind of makes the point that you're stuck with a very limited range of models that they are hand-implementing. But at least it's a model I would actually use. Give me that in a box I can put in a standard data center with a normal power supply and I'm definitely interested.
But I want to know the cost :-)