I guess it depends what you mean by "perf". If you optimize everything for the absolutely lowest latency given your power budget, your throughput is going to suck - and vice versa. Throughput is ultimately what matters when everything about AI is so clearly power-constrained, latency is a distraction. So TPU-like custom chips are likely the better choice.
I disagree. Yes it does matter, but because the popular interface is via chat, streaming the results of inference feels better to the squishy messy gross human operating the chat, even if it ends up taking longer. You can give all the benchmark results you want, humans aren't robots. They aren't data driven, they have feelings, and they're going to go with what feels better. That isn't true for all uses, but time to first byte is ridiculously important for human-computer interaction.
zozbot234|18 days ago
fragmede|18 days ago
I disagree. Yes it does matter, but because the popular interface is via chat, streaming the results of inference feels better to the squishy messy gross human operating the chat, even if it ends up taking longer. You can give all the benchmark results you want, humans aren't robots. They aren't data driven, they have feelings, and they're going to go with what feels better. That isn't true for all uses, but time to first byte is ridiculously important for human-computer interaction.
p1esk|18 days ago
energy123|18 days ago
Many coding usecases care about tokens/second, not tokens/dollar.
latchkey|18 days ago
Let's not forget that the CEO is an SEC felon who got caught trying to pull a fast one.
xnx|18 days ago
latchkey|18 days ago
Compare the photos of a Cerebras deployment to a TPU deployment.
https://www.nextplatform.com/wp-content/uploads/2023/07/cere...
https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iOLs2FEQxQv...
The difference is striking.