"Semi analysis did some cost estimates, and I did some but you’re likely paying somewhere in the 12 million dollar range for the equipment to serve a single query using llama-70b. Compare that to a couple of gpus, and it’s easy to see why they are struggling to sell hardware, they can’t scale down.
Since they didn’t use hbm, you need to stich enough cards together to get the memory to hold your model. It takes a lot of 256mb cards to get to 64gb, and there isn’t a good way to try the tech out since a single rack really can’t serve an LLM."
latchkey|1 year ago
"Semi analysis did some cost estimates, and I did some but you’re likely paying somewhere in the 12 million dollar range for the equipment to serve a single query using llama-70b. Compare that to a couple of gpus, and it’s easy to see why they are struggling to sell hardware, they can’t scale down.
Since they didn’t use hbm, you need to stich enough cards together to get the memory to hold your model. It takes a lot of 256mb cards to get to 64gb, and there isn’t a good way to try the tech out since a single rack really can’t serve an LLM."
https://news.ycombinator.com/item?id=39966620
faangguyindia|1 year ago
Fastest commercial model is Google's Gemini Flash (predictable speed)
qwertox|1 year ago