top | item 38740141

benchess | 2 years ago

This isn't running on one chip. It's running on 128, or two racks worth of their kit. https://news.ycombinator.com/item?id=38739106

This doesn't mean much without a comparison against the dollar cost or power draw of equivalent GPUs.

razorguymania | 2 years ago

GPUs can't scale single-user performance beyond a certain limit. You can throw hundreds of GPUs at it, but the latency will never be as good.
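The intuition behind this can be sketched with a toy model (my own illustration, not from the comment, with made-up numbers): autoregressive decoding is serial per user, so each token must wait for the previous one, while batching across users scales aggregate throughput with device count.

```python
# Toy latency/throughput model for autoregressive decoding.
# Assumption (mine, for illustration): per-token time has a serial
# floor that adding GPUs cannot remove, while batching more users
# across more devices still raises aggregate throughput.

def decode_time_s(n_tokens: int, per_token_ms: float) -> float:
    """Wall-clock time for one user's generation: strictly serial,
    independent of how many GPUs are available."""
    return n_tokens * per_token_ms / 1000.0

def throughput_tok_per_s(n_gpus: int, batch_per_gpu: int,
                         per_token_ms: float) -> float:
    """Aggregate tokens/s across all users: scales with device count,
    unlike single-user latency."""
    return n_gpus * batch_per_gpu * (1000.0 / per_token_ms)

# Hypothetical numbers: 500 tokens at a 20 ms/token serial floor takes
# 10 s for one user whether you have 8 GPUs or 100; only the aggregate
# throughput changes.
print(decode_time_s(500, 20))              # same for any GPU count
print(throughput_tok_per_s(8, 8, 20))      # 8 GPUs
print(throughput_tok_per_s(100, 8, 20))    # 100 GPUs
```

The `20 ms` floor is an arbitrary stand-in for the memory-bandwidth-bound per-token step; the point is only that the first function has no `n_gpus` parameter at all.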

tome | 2 years ago

Thanks, I need to correct my earlier guess: I believe this demo is running on 9 GroqRacks (576 chips), and I think we may also have an 8-rack version in progress. I can't remember off the top of my head whether this deployment has pipelining of inferences or whether that's work in progress. We've tried a variety of configurations to improve both latency and throughput, which is possible because of the flexibility and configurability of our architecture and compiler toolchain.
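For readers unfamiliar with the term, pipelining of inferences can be sketched with a toy model (my own illustration, not Groq's actual scheduler): split the model into stages, and while one request occupies stage k, the next request can occupy stage k-1, so single-request latency stays at the sum of stage times but a finished request emerges every stage interval.

```python
# Toy model of pipeline parallelism across chips/racks.
# Assumptions (mine, for illustration): equal stage times, no bubbles,
# no communication overhead between stages.

def pipeline_latency_s(stages: int, stage_time_s: float) -> float:
    """End-to-end latency of one request: it must traverse every stage."""
    return stages * stage_time_s

def pipeline_completion_s(n_requests: int, stages: int,
                          stage_time_s: float) -> float:
    """Time until the last of n_requests drains from the pipeline:
    fill time for the first request, then one result per stage interval."""
    return (stages + n_requests - 1) * stage_time_s

# Hypothetical numbers: 9 stages (say, one per rack) at 10 ms each.
# One request takes 90 ms, but 100 requests finish in ~1.08 s rather
# than the 9 s a fully serial execution would need.
print(pipeline_latency_s(9, 0.010))
print(pipeline_completion_s(100, 9, 0.010))
```

The "one stage per rack" mapping is an assumption for the example; the comment above doesn't say how stages are actually assigned to hardware.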

You're right that it's important to compare cost per token as well, not just raw speed. Unfortunately I don't have those figures to hand, but I think our customer offerings are price-competitive with OpenAI's. The biggest takeaway, though, is that we just don't believe GPU architectures can ever scale to the performance we can get, at any cost.