top | item 46993951

p1esk | 18 days ago

The real question is what's their perf/dollar vs. Nvidia?

I guess it depends on what you mean by "perf". If you optimize everything for the absolute lowest latency given your power budget, your throughput is going to suck, and vice versa. Throughput is ultimately what matters when everything about AI is so clearly power-constrained; latency is a distraction. So TPU-like custom chips are likely the better choice.

fragmede|18 days ago

> Throughput is ultimately what matters

I disagree. Yes, it does matter, but because the popular interface is chat, streaming the results of inference feels better to the squishy messy gross human operating the chat, even if it ends up taking longer. You can give all the benchmark results you want, but humans aren't robots. They aren't data-driven, they have feelings, and they're going to go with what feels better. That isn't true for all uses, but time to first byte is ridiculously important for human-computer interaction.

p1esk|18 days ago

By perf I mean how much it costs to serve a 1T-parameter model to 1M users at 50 tokens/sec each.
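To make that metric concrete, here is a back-of-the-envelope sketch in Python. The per-system throughput and hourly cost are made-up placeholders, not figures for any real vendor, and the throughput is assumed to be measured at a batch size where every stream still gets its 50 tokens/sec.

```python
import math

def serving_cost(users, tokens_per_sec_per_user,
                 system_tokens_per_sec, system_cost_per_hour):
    """Estimate fleet size and hourly cost for a target aggregate token rate.

    system_tokens_per_sec and system_cost_per_hour are hypothetical inputs;
    plug in measured numbers for whatever hardware is being compared.
    """
    aggregate = users * tokens_per_sec_per_user           # total tokens/sec
    systems = math.ceil(aggregate / system_tokens_per_sec)
    cost_per_hour = systems * system_cost_per_hour
    # Dollars per million tokens actually served in that hour.
    cost_per_mtok = cost_per_hour / (aggregate * 3600 / 1e6)
    return aggregate, systems, cost_per_hour, cost_per_mtok

# Placeholder hardware: 20,000 tokens/sec per system at $100/hour (assumed).
aggregate, systems, cost_per_hour, cost_per_mtok = serving_cost(
    users=1_000_000,
    tokens_per_sec_per_user=50,
    system_tokens_per_sec=20_000,
    system_cost_per_hour=100,
)
print(aggregate, systems, cost_per_hour, round(cost_per_mtok, 2))
```

Running the same function with each vendor's real throughput and price gives the comparison I'm asking about.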

energy123|18 days ago

That's coupling two different use cases.

Many coding use cases care about tokens/second, not tokens/dollar.

latchkey|18 days ago

Exactly. They won't ever tell you. It is never published.

Let's not forget that the CEO is an SEC felon who got caught trying to pull a fast one.

xnx|18 days ago

Or Google TPUs.

latchkey|18 days ago

TPUs don't have enough memory either, but they have really great interconnects, so they can build a nice high-density cluster.

Compare the photos of a Cerebras deployment to a TPU deployment.

https://www.nextplatform.com/wp-content/uploads/2023/07/cere...

https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iOLs2FEQxQv...

The difference is striking.