item 39432384

trsohmers|2 years ago

Groq states in this article [0] that they used 576 chips to achieve these results. Continuing with your analysis, you also need to factor in that each additional concurrent user requires a separate KV cache, which can add multiple gigabytes of memory per user.
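To put "multiple gigabytes per user" in perspective, here is a rough back-of-the-envelope sketch. The model parameters are illustrative assumptions (a hypothetical 70B-class model with 80 layers, 64 KV heads, head dimension 128, fp16 activations, and a 4096-token context); real deployments vary, and techniques like grouped-query attention shrink this considerably:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Memory for one user's KV cache: a K and a V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config without grouped-query attention:
# 80 layers, 64 KV heads, head_dim 128, fp16 (2 bytes), 4096-token context.
per_user = kv_cache_bytes(80, 64, 128, 4096)
print(f"{per_user / 2**30:.1f} GiB per concurrent user")  # -> 10.0 GiB
```

Under these assumptions each concurrent user pins roughly 10 GiB of cache, which is why serving many users multiplies the memory footprint well beyond the model weights themselves.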

My professional independent-observer opinion (not based on my 2 years of working at Groq) is that their COGS to achieve these performance numbers would exceed several million dollars. Depreciating that over expected usage at the theoretical prices they have posted seems impractical, so from an actual performance-per-dollar standpoint they don't seem viable. That said, it is a very cool demo of an insane level of performance if you throw cost concerns out the window.

[0]: https://www.nextplatform.com/2023/11/27/groq-says-it-can-dep...

tome|2 years ago

Thomas, I think for full disclosure you should also state that you left Groq to start a competitor (one which has neither the world's lowest-latency LLM engine nor a guarantee to match the cheapest per-token prices, as Groq does).

Anyone with a serious interest in the total cost of ownership of Groq's system is welcome to email contact@groq.com.

trsohmers|2 years ago

I thought that was clear from my profile, but yes: Positron AI is focused on providing the best performance per dollar, along with the best quality of service and capabilities, rather than focusing on the single metric of speed.

A guarantee to match the cheapest per-token prices is surely a great way to lose a race to the bottom, but I do wish Groq (and everyone else trying to compete against NVIDIA) the greatest luck and success. I really do think Groq's single-batch/single-user performance makes for a great demo, but it is not the best solution for a wide variety of applications; I hope it can find its niche.

Aeolun|2 years ago

I think that just means it’s for people that really want it?

John Doe and his friends will never need their fart jokes generated at this speed, and are more interested in low costs.

But we'd recently been doing call center operations, and being able to quickly figure out what someone said was a major issue. You really don't want your system to wait for a second before responding every time. I can imagine it making sense there as well if it reduces the latency to 10ms. Though you might still run up against the 'good enough' factor.

I guess few people want to spend millions to go from 1000ms to 10ms, but when they do they really want it.

nickpsecurity|2 years ago

What happened to Rex? Did it hit production or get abandoned?

It was also on my list of things to consider modifying for an AI accelerator. :)

trsohmers|2 years ago

Long story, but technically REX is still around; it just hasn't been able to continue development due to lack of funding and my cofounder and I needing to pay the bills. We produced initial test silicon, but because we had very little money left after silicon bring-up, most of our conversations turned into acquihire discussions.

There should be a podcast release (https://microarch.club/) in the near future that covers REX's history and a lot of lessons learned.