top | item 39428880

Groq runs Mixtral 8x7B-32k with 500 T/s

847 points | tin7in | 2 years ago | groq.com | reply

472 comments

[+] eigenvalue|2 years ago|reply
I just want to say that this is one of the most impressive tech demos I’ve ever seen in my life, and I love that it’s a truly open demo that anyone can try without even signing up for an account. It’s surreal to see the thing spitting out tokens at such a crazy rate when you’re used to watching them generate at less than one fifth that speed. I’m surprised you guys haven’t been swallowed up by Microsoft, Apple, or Google already for a huge premium.
[+] tome|2 years ago|reply
Really glad you like it! We've been working hard on it.
[+] RecycledEle|2 years ago|reply
If I understand correctly, each chip has 200 MB of RAM, so it takes racks to run a single LLM.

That does not sound like progress to me.

We need single PCIe boards with dozens or hundreds of GB of RAM and processors that handle it well.

[+] RockyMcNuts|2 years ago|reply
ok... why tho? genuinely ignorant and extremely curious.

what's the TFLOPS/$ and TFLOPS/W and how does it compare with Nvidia, AMD, TPU?

from quick Googling I feel like Groq has been making these sorts of claims since 2020 and yet people pay a huge premium for Nvidia and Groq doesn't seem to be giving them much of a run for their money.

of course if you run a much smaller model than ChatGPT on similar or more powerful hardware it might run much faster but that doesn't mean it's a breakthrough on most models or use cases where latency isn't the critical metric?

[+] larodi|2 years ago|reply
why sell? it would be much more delightful to beat them on their own game?
[+] brcmthrowaway|2 years ago|reply
I have it on good authority Apple was very close to acquiring Groq
[+] timomaxgalvin|2 years ago|reply
Sure, but the responses are very poor compared to MS tools.
[+] treesciencebot|2 years ago|reply
The main problem with the Groq LPUs is that they don't have any HBM on them at all, just a minuscule (230 MiB) [0] amount of ultra-fast SRAM (20x faster than HBM3, to be clear). This means you need ~256 LPUs (4 full server racks of compute; each unit on a rack contains 8x LPUs and there are 8x of those units per rack) just to serve a single model [1], whereas a single H200 (1/256 of that rack density) can serve these models reasonably well.

It might work well if you have a single model with lots of customers, but as soon as you need more than a single model and a lot of finetunes/high rank LoRAs etc., these won't be usable. Or for any on-prem deployment since the main advantage is consolidating people to use the same model, together.

[0]: https://wow.groq.com/groqcard-accelerator/

[1]: https://twitter.com/tomjaguarpaw/status/1759615563586744334
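A back-of-envelope check of that chip count (my assumptions, not Groq's published math: ~46.7B parameters for Mixtral 8x7B, 8-bit weights, 230 MiB of SRAM per chip):

```python
# Rough estimate of how many 230 MiB-SRAM chips it takes just to hold
# Mixtral 8x7B's weights. All numbers are assumptions for illustration.
params = 46.7e9               # ~46.7B parameters (Mixtral 8x7B)
bytes_per_weight = 1          # assuming 8-bit quantized weights
sram_per_chip = 230 * 2**20   # 230 MiB per chip

chips_for_weights = params * bytes_per_weight / sram_per_chip
print(f"chips just for weights: {chips_for_weights:.0f}")
# ~194 chips for weights alone, which leaves headroom in a 256-chip
# deployment for activations, KV cache, and interconnect layout.
```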

[+] matanyall|2 years ago|reply
Groq Engineer here, I'm not seeing why being able to scale compute outside of a single card/node is somehow a problem. My preferred analogy is to a car factory: Yes, you could build a car with say only one or two drills, but a modern automated factory has hundreds of drills! With a single drill, you could probably build all sorts of cars, but a factory assembly line is only able to make specific cars in that configuration. Does that mean that factories are inefficient?

You also say that H200s work reasonably well, and that's reasonable (but debatable) for synchronous, human-interaction use cases. Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

[+] trsohmers|2 years ago|reply
Groq states in this article [0] that they used 576 chips to achieve these results. Continuing with your analysis, you also need to factor in that each additional user you want to serve requires a separate KV cache, which can add multiple gigabytes per user.

My professional independent-observer opinion (not based on my 2 years of working at Groq) is that their COGS to achieve these performance numbers exceeds several million dollars. Depreciating that over expected usage at the theoretical prices they have posted seems impractical, so from a performance-per-dollar standpoint they don't seem viable, but they do have a very cool demo of an insane level of performance if you throw cost concerns out the window.

[0]: https://www.nextplatform.com/2023/11/27/groq-says-it-can-dep...
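To put a number on "multiple gigabytes per user": a rough per-user KV-cache size for Mixtral 8x7B at full 32k context, assuming the commonly cited shape (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16):

```python
# Per-user KV-cache sizing sketch; the model shape below is an assumption
# based on public Mixtral 8x7B descriptions, not Groq's deployment.
layers, kv_heads, head_dim, seq_len, dtype_bytes = 32, 8, 128, 32768, 2
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes  # K and V
print(f"{kv_bytes / 2**30:.1f} GiB per concurrent user at full context")
# -> 4.0 GiB, i.e. each concurrent 32k-context user costs several
# gigabytes of fast memory on top of the weights themselves.
```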

[+] tome|2 years ago|reply
If you want low latency you have to be really careful with HBM, not only because of the delay involved, but also the non-determinacy. One of the huge benefits of our LPU architecture is that we can build systems of hundreds of chips with fast interconnect and we know the precise timing of the whole system to within a few parts per million. Once you start integrating non-deterministic components your latency guarantees disappear very quickly.
[+] pclmulqdq|2 years ago|reply
Groq devices are really well set up for small-batch-size inference because of the use of SRAM.

I'm not so convinced they have a tok/sec/$ advantage at all, though, especially at the medium to large batch sizes that matter to the customers who can afford this much silicon.

Given the architecture, I assume Groq doesn't actually get any faster at batch sizes >1, whereas Nvidia cards do get meaningfully higher throughput as batch size gets into the 100s.

[+] londons_explore|2 years ago|reply
> more than a single model and a lot of finetunes/high rank LoRAs

I can imagine a way might be found to host a base model and a bunch of LoRAs while using barely more RAM than the base model alone.

The fine-tuning could perhaps be done so that only ~0.1% of the weights change, and for each computation the difference is applied not to the weights but to the output activations.
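This is essentially what low-rank adapters already do at serve time: keep one shared base weight matrix and, per fine-tune, store only a small low-rank pair whose correction is added to the activations. A pure-Python toy sketch (toy sizes, illustrative names, not anyone's actual serving stack):

```python
# Serve many fine-tunes off one base matrix W by applying a low-rank
# delta B @ (A @ x) to the activations instead of materializing W + B @ A.
def matvec(M, x):
    # plain matrix-vector product over nested lists
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

d, r = 4, 1                       # hidden size and adapter rank (toy values)
W = [[1, 0, 0, 0], [0, 1, 0, 0],  # shared base weights, stored once
     [0, 0, 1, 0], [0, 0, 0, 1]]
A = [[1, 2, 3, 4]]                # per-adapter down-projection (r x d)
B = [[1], [1], [1], [1]]          # per-adapter up-projection (d x r)
x = [1.0, 2.0, 3.0, 4.0]

# adapted output: base result plus low-rank correction; the adapter only
# adds 2*d*r extra values of storage instead of a full d*d copy of W.
delta = matvec(B, matvec(A, x))
y = [wi + di for wi, di in zip(matvec(W, x), delta)]
print(y)  # -> [31.0, 32.0, 33.0, 34.0]
```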

[+] moralestapia|2 years ago|reply
>The main problem with the Groq LPUs is, they don't have any HBM on them at all. Just a miniscule (230 MiB) [0] amount of ultra-fast SRAM [...]

IDGAF about any of that, lol. I just want an API endpoint.

480 tokens/sec at $0.27 per million tokens? Sign me up. I don't care about their hardware at all.

[+] imtringued|2 years ago|reply
I honestly don't see the problem.

"just to serve a single model" could be easily fixed by adding a single LPDDR4 channel per LPU. Then you could reload the weights sixty times per second and serve 60 different models.

[+] karpathy|2 years ago|reply
Very impressive looking! Just wanted to caution it's worth being a bit skeptical without benchmarks as there are a number of ways to cut corners. One prominent example is heavy model quantization, which speeds up the model at a cost of model quality. Otherwise I'd love to see LLM tok/s progress exactly like CPU instructions/s did a few decades ago.
[+] tome|2 years ago|reply
Hi folks, I work for Groq. Feel free to ask me any questions.

(If you check my HN post history you'll see I post a lot about Haskell. That's right, part of Groq's compilation pipeline is written in Haskell!)

[+] michaelbuckbee|2 years ago|reply
Friendly FYI - I think this might just be a web interface bug, but I submitted a prompt with the Mixtral model and got a response (great!), then switched the dropdown to Llama, submitted the same prompt, and got the exact same response.

It may be caching or it didn't change the model being queried or something else.

[+] itishappy|2 years ago|reply
Alright, I'll bite. Haskell seems pretty unique in the ML space! Any unique benefits to this decision, and would you recommend it for others? What areas of your project do/don't use Haskell?
[+] jart|2 years ago|reply
If I understand correctly, you're using specialized hardware to improve token generation speed, which is latency-bound on the speed of computation. However, generating a token usually only requires matrix-vector products. If I enter a prompt with ~100 tokens, your service goes much slower, presumably because prompt processing requires matrix-matrix multiplies. What are you doing to improve the computation speed of prompt processing?
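The asymmetry being described can be put in numbers (hidden size is an illustrative assumption; the 100-token prompt is from the comment):

```python
# Decoding one new token is a matrix-vector product against each weight
# matrix; prefilling an n-token prompt multiplies the same weights by an
# n-column matrix of activations, i.e. roughly n times the arithmetic,
# which hardware tuned for low-latency GEMV won't automatically hide.
d = 4096                               # hidden size (illustrative)
n_prompt = 100                         # prompt length from the comment

flops_decode = 2 * d * d               # one GEMV per weight matrix
flops_prefill = 2 * d * d * n_prompt   # one GEMM over all prompt positions
print(flops_prefill / flops_decode)    # -> 100.0x the work per layer
```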
[+] mechagodzilla|2 years ago|reply
You all seem like one of the only companies targeting low-latency inference rather than focusing on throughput (and thus $/inference) - what do you see as your primary market?
[+] ppsreejith|2 years ago|reply
Thank you for doing this AMA

1. How many GroqCards are you using to run the Demo?

2. Is there a newer version you're using which has more SRAM (since the one I see online only has 230MB)? Since this seems to be the number that will drive down your cost (to take advantage of batch processing, CMIIW!)

3. Can TTS pipelines be integrated with your stack? If so, we can truly have very low latency calls!

*Assuming you're using this: https://www.bittware.com/products/groq/

[+] andy_xor_andrew|2 years ago|reply
are your accelerator chips designed in-house? Or are they specialized silicon or FPGAs that you wrote very optimized inference code for?

it's really amazing! the first time I tried the demo, I had to try a few prompts to believe it wasn't just an animation :)

[+] zawy|2 years ago|reply
When you start using Samsung 4 nm, are you switching from SRAM to HBM? If yes, how is that going to affect all the metrics, given that SRAM is so much faster? Someone said you'll eventually move to HBM because SRAM scaling has relatively stalled.

I read an article that indicated your bill of materials is 10x Nvidia's to get 1/10 the latency, and 8x the BOM for throughput if Nvidia optimizes for throughput. Does this seem accurate? Is CAPEX the primary drawback?

https://www.semianalysis.com/p/groq-inference-tokenomics-spe...

[+] UncleOxidant|2 years ago|reply
When will we be able to buy Groq accelerator cards that would be affordable for hobbyists?
[+] kkzz99|2 years ago|reply
How does the Groq PCIe card work exactly? Does it use system RAM to stream the model data to the card? How many T/s could one expect with e.g. 3600 MHz DDR4 RAM?
[+] karthityrion|2 years ago|reply
Hi. Are these ASICs only for LLMs, or could they accelerate other kinds of models (vision) as well?
[+] amirhirsch|2 years ago|reply
It seems like you are making general-purpose chips to run many models. Are we at a stage where we can consider taping out inference networks directly, propagating the weights as constants in the RTL design?

Are chips and models obsoleted on roughly the same timelines?

[+] imiric|2 years ago|reply
Impressive demo!

However, the hardware requirements and cost make this inaccessible for anyone but large companies. When do you envision that the price could be affordable for hobbyists?

Also, while the CNN Vapi demo was impressive as well, a few weeks ago here[1] someone shared https://smarterchild.chat/. That also has _very_ low audio latency, making natural conversation possible. From that discussion it seems that https://www.sindarin.tech/ is behind it. Do we know if they use Groq LPUs or something else?

I think that once you reach ~50 t/s, real-time interaction is possible. Anything higher than that is useful for generating large volumes of data quickly, but there are diminishing returns as it's far beyond what humans can process. Maybe such speeds would be useful for AI-AI communication, transferring knowledge/context, etc.

So an LPU product that's only focused on AI-human interaction could have much lower capabilities, and thus much lower cost, no?

[1]: https://news.ycombinator.com/item?id=39180237
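For what it's worth, a quick estimate of how ~50 t/s compares with human reading speed (assuming ~300 words/min for a fast reader and ~1.3 tokens per English word, both rough rules of thumb):

```python
# Sanity check on the "diminishing returns past ~50 t/s" claim for
# human-facing output; both constants below are rough assumptions.
words_per_min = 300          # fast reading speed
tokens_per_word = 1.3        # typical English tokenization ratio

human_tok_per_s = words_per_min * tokens_per_word / 60
print(f"~{human_tok_per_s:.1f} tok/s")
# -> ~6.5 tok/s, so 50 t/s already outruns reading speed by ~8x; the
# extra headroom mostly matters for chained/agentic calls, not display.
```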

[+] tome|2 years ago|reply
> However, the hardware requirements and cost make this inaccessible for anyone but large companies. When do you envision that the price could be affordable for hobbyists?

For API access to our tokens as a service we guarantee to beat any other provider on cost per token (see https://wow.groq.com). In terms of selling hardware, we're focused on selling whole systems, and they're only really suitable for corporations or research institutions.

[+] stormfather|2 years ago|reply
>>50 t/s is absolutely necessary for real-time interaction with AI systems. Most of the LLM's output will be internal monologue and planning, performing RAG and summarization, etc, with only the final output being communicated to you. Imagine a blazingly fast GPT-5 that goes through multiple cycles of planning out how to answer you, searching the web, writing book reports, debating itself, distilling what it finds, critiquing and rewriting its answer, all while you blink a few times.
[+] dmw_ng|2 years ago|reply
Given the size of the Sindarin team (3 AFAICT), that mostly looks like a clever combination of existing tech. There are some speech APIs that offer word-by-word realtime transcription (Google has one), assuming most of the special sauce is very well thought out pipelining between speech recognition->LLM->TTS

(not to denigrate their awesome achievement, I would not be interested if I were not curious about how to reproduce their result!)

[+] charlie123hufft|2 years ago|reply
It's only faster sometimes, but when you ask it a complicated question or give it any type of pre-prompt to speak in a different way, then it still takes a while to load. Interesting but ultimately probably going to be a flop
[+] neilv|2 years ago|reply
If the page can't access certain fonts, it will fail to work, while it keeps retrying requests:

    https://fonts.gstatic.com/s/notosansarabic/[...]
    https://fonts.gstatic.com/s/notosanshebrew/[...]
    https://fonts.gstatic.com/s/notosanssc/[...]
(I noticed this because my browser blocks these de facto trackers by default.)
[+] SeanAnderson|2 years ago|reply
Sorry, I'm a bit naïve about all of this.

Why is this impressive? Can this result not be achieved by throwing more compute at the problem to speed up responses? Isn't the fact that there is a queue when under load just indicative that there's a trade-off between "# of request to process per unit of time" and "amount of compute to put into a response to respond quicker"?

https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/rel/do...

This chart from NVIDIA implies their H100 runs llama v2 70B at >500 tok/s.

[+] sebzim4500|2 years ago|reply
So this has nothing to do with `Grok`, the model provided by x.ai?

EDIT: Tried using it, very impressed with the speed.

[+] eurekin|2 years ago|reply
Jaw dropping. Both groq and mixtral.

I used the following prompt:

Generate gitlab ci yaml file for a hybrid front-end/backend project. Fronted is under /frontend and is a node project, packaged with yarn, built with vite to the /backend/public folder. The backend is a python flask server

[+] charlie123hufft|2 years ago|reply
Never mind, I stand corrected. Blown tf away after trying the demo MYSELF. It's instantaneous; the last time I used an LLM that fast was a proprietary model with a small dataset, lightning fast but not smart enough. This is wild. But I don't understand why the demo was so bad and why the demo took so long to respond to his questions?
[+] CuriouslyC|2 years ago|reply
This is pretty sweet. The speed is nice but what I really care about is you bringing the per token cost down compared with models on the level of mistral medium/gpt4. GPT3.5 is pretty close in terms of cost/token but the quality isn't there and GPT4 is overpriced. Having GPT4 quality at sub-gpt3.5 prices will enable a lot of things though.
[+] deepsquirrelnet|2 years ago|reply
Incredible job. Feels dumb or obvious to say this, but this really changes the way I think of using it. The slow autoregression really sucks because it inhibits your ability to skim sections. For me, that creates an unnatural reading environment. This makes ChatGPT feel antiquated.
[+] anybodyz|2 years ago|reply
I have this hooked up experimentally to my universal Dungeon Master simulator DungeonGod and it seems to work quite well.

I had been using Together AI's Mixtral (which serves the Hermes Mixtrals) and it is pretty snappy, but nothing close to Groq. I think the next closest I've tested is Perplexity Labs' Mixtral.

A key blocker in just hanging out a shingle for an open source AI project is the fear that anything that might scale will bankrupt you (or just be offline if you get any significant traction). I think we're nearing the phase that we could potentially just turn these things "on" and eat the reasonable inference fees to see what people engage with - with a pretty decently cool free tier available.

I'd add that the simulator does multiple calls to the api for one response to do analysis and function selection in the underlying python game engine, which Groq makes less of a problem as it's close to instant. This adds a pretty significant pause in the OpenAI version. Also since this simulator runs on Discord with multiple users, I've had problems in the past with 'user response storms' where the AI couldn't keep up. Also less of a problem with Groq.

[+] supercharger9|2 years ago|reply
Do they make money from LLM service or by selling hardware? Homepage is confusing without any reference to other products.
[+] ppsreejith|2 years ago|reply
Relevant thread from 5 months ago: https://news.ycombinator.com/item?id=37469434

I'm achieving consistent 450+ tokens/sec for Mixtral 8x7b 32k and ~200 tps for Llama 2 70B-4k.

As an aside, seeing that this is built with flutter Web, perhaps a mobile app is coming soon?

[+] tandr|2 years ago|reply
@tome Cannot sign up with sneakemail.com, snkml.com, snkmail, liamekaens.com, etc. I pay for these services so my email is a bit more protected. Why do you insist on well-known email providers instead? Data mining, or something else?
[+] codedokode|2 years ago|reply
Is it normal that I have asked two networks (llama/mixtral) the same question ("tell me about most popular audio pitch detection algorithms") and they gave almost the same answer? Both answers start with "Sure, here are some of the most popular pitch detection algorithms used in audio signal processing" and end with "Each of these algorithms has its own strengths and weaknesses, and the choice of algorithm depends on the specific application and the characteristics of the input signal.". And the content is 95% the same. How can it be?