
Qwen3 30B A3B Hits 13 token/s on 4xRaspberry Pi 5

347 points | b4rtazz | 6 months ago | github.com

161 comments

[+] dingdingdang|6 months ago|reply
Very impressive numbers.. wonder how this would scale on 4 relatively modern desktop PCs, say something akin to an i5 8th-gen Lenovo ThinkCentre; these can be had very cheap. But as @geerlingguy indicates, we need model compatibility to go up up up! As an example, it would be amazing to see something like fastsdcpu run distributed, to democratize the accessibility and practicality of image-gen models for people with limited budgets but large PC fleets ;)
[+] rthnbgrredf|6 months ago|reply
I think it is all well and good, but the most affordable option is probably still to buy a used MacBook with 16, 32, or 64 GB of unified memory (depending on the budget) and install Asahi Linux for tinkering.

Graphics cards with a decent amount of memory are still massively overpriced (even used), big, noisy, and draw a lot of energy.

[+] trebligdivad|6 months ago|reply
On my (single) AMD 3950X, running entirely on CPU (llama -t32 -dev none), I was getting 14 tokens/s with Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf last night. That's the best I've had out of a model that doesn't feel stupid.
[+] j45|6 months ago|reply
Connect a GPU to it with an eGPU chassis and you're running one way or the other.
[+] rao-v|6 months ago|reply
Nice! Cheap RK3588 boards come with 15GB of LPDDR5 RAM these days and have significantly better performance than the Pi 5 (and are often cheaper).

I get 8.2 tokens per second on a random orange pi board with Qwen3-Coder-30B-A3B at Q3_K_XL (~12.9GB). I need to try two of them in parallel ... should be significantly faster than this even at Q6.

[+] jerrysievert|6 months ago|reply
> a random orange pi board with Qwen3-Coder-30B-A3B at Q3_K_XL (~12.9GB)

fantastic! what are you using to run it, llama.cpp? I have a few extra opi5's sitting around that would love some extra usage

[+] ThatPlayer|6 months ago|reply
Is that using the NPU on that board? I know it's possible to use those too.
[+] echelon|6 months ago|reply
This is really impressive.

If we can get this down to a single Raspberry Pi, then we have crazy embedded toys and tools. Locally, at the edge, with no internet connection.

Kids will be growing up with toys that talk to them and remember their stories.

We're living in the sci-fi future. This was unthinkable ten years ago.

[+] striking|6 months ago|reply
I think it's worth remembering that there's room for thoughtful design in the way kids play. Are LLMs a useful tool for encouraging children to develop their imaginations or their visual or spatial reasoning skills? Or would these tools shape their thinking patterns to exactly mirror those encoded into the LLM?

I think there's something beautiful and important about the fact that parents shape their kids, leaving with them some of the best (and worst) aspects of themselves. Likewise with their interactions with other people.

The tech is cool. But I think we should aim to be thoughtful about how we use it.

[+] manmal|6 months ago|reply
An LLM in my kids' toys only over my cold, dead body. This can and will go very, very wrong.
[+] fragmede|6 months ago|reply
If a raspberry pi can do all that, imagine the toys Bill Gates' grandkids have access to!

We're at the precipice of having a real "A Young Lady's Illustrated Primer" from The Diamond Age.

[+] 1gn15|6 months ago|reply
This is indeed incredibly sci fi. I still remember my ChatGPT moment, when I realized I could actually talk to a computer. And now it can run fully on an RPi, just as if the RPi itself has become intelligent and articulate! Very cool.
[+] bigyabai|6 months ago|reply
> Kids will be growing up with toys that talk to them and remember their stories.

What a radical departure from the social norms of childhood. Next you'll tell me that they've got an AI toy that can change their diaper and cook Chef Boyardee.

[+] behnamoh|6 months ago|reply
Everything runs on a π if you quantize it enough!

I'm curious about the applications though. Do people randomly buy 4xRPi5s that they can now dedicate to running LLMs?

[+] ryukoposting|6 months ago|reply
I'd love to hook my development tools into a fully-local LLM. The question is context window and cost. If the context window isn't big enough, it won't be helpful for me. I'm not gonna drop $500 on RPis unless I know it'll be worth the money. I could try getting my employer to pay for it, but I'll probably have a much easier time convincing them to pay for Claude or whatever.
[+] giancarlostoro|6 months ago|reply
Sometimes you buy a Pi for one project, start on it, buy another for a different project, and before you know it none are complete and you have ten Raspberry Pis lying around across various generations. ;)
[+] hhh|6 months ago|reply
I have clusters of over a thousand Raspberry Pis where generally 75% of the compute and 80% of the memory sit completely unused.
[+] Zenst|6 months ago|reply
Depends on the model: if you have a sparse MoE model, you can divide it up across smaller nodes. Dense 30B models I do not see flying anytime soon.

An Intel Arc Pro B50 in a dumpster PC would do much better on this model (not enough RAM for a dense 30B, alas), get close to 20 tokens per second, and be so much cheaper.

[+] ugh123|6 months ago|reply
I think it serves as a good test bed for methods and models. We'll see if someday they can reduce it to 3... 2... 1 Pi 5s that match this performance.
[+] blululu|6 months ago|reply
For $500 you may as well spend an extra $100 and get a Mac mini with an M4 chip and 256GB of RAM and avoid the headaches of coordinating 4 machines.
[+] piecerough|6 months ago|reply
"quantize enough"

though at what quality?

[+] 6r17|6 months ago|reply
I mean, at this point it's more of a proof-of-concept with a shared blueprint. I could definitely see some domotics hacker getting this running; hell, maybe I'll do it too if I have some spare time and want to make something like a customized Alexa. You'd still need text-to-speech and speech-to-text, but that's not really the topic of this setup. Even for pro use, if it's really usable, why not just spawn Qwen on ARM if that's cheaper? There are a lot of ways to read and leverage such a benchmark.
[+] tarruda|6 months ago|reply
I suspect you'd get similar numbers with a modern x86 mini PC that has 32GB of RAM.
[+] drbscl|6 months ago|reply
Distributed compute is cool, but $320 for 13 tokens/s, on a tiny input prompt, with 4-bit quantization and a 3B-active-parameter model, is very underwhelming.
[+] geerlingguy|6 months ago|reply
distributed-llama is great, I just wish it would work with more models. I've been happy with ease of setup and its ongoing maintenance compared to Exo, and performance vs llama.cpp RPC mode.
[+] alchemist1e9|6 months ago|reply
Any pointers to what is SOTA for a cluster of hosts with CUDA GPUs, none with enough VRAM for the full weights, but with 10Gbit low-latency interconnects?

If that problem gets solved, even if only via a batch approach that enables parallel batch inference (high total tokens/s but low per-session speed) and works for bigger models, then it would be a serious game changer for large-scale, low-cost AI automation without billions in capex. My intuition says it should be possible, so perhaps someone has done it or started on it already.
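The batch trade-off described above is just arithmetic; a minimal sketch with assumed illustrative numbers (not measurements from any real system):

```python
# Back-of-envelope: batched inference trades per-session speed for total
# throughput. All numbers below are assumptions for illustration.

batch_size = 32        # concurrent sessions served in one batch (assumed)
per_session_tps = 5.0  # tokens/s each session sees under batching (assumed)

# Aggregate throughput is high even though any single session feels slow.
total_tps = batch_size * per_session_tps
print(total_tps)  # 160.0 tokens/s across the whole batch
```

So a cluster too slow for interactive chat could still be attractive for offline automation, where total tokens/s is what sets the cost per task.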

[+] mmastrac|6 months ago|reply
Is the network the bottleneck here at all? That's impressive for a gigabit switch.
[+] kristianp|6 months ago|reply
Does the switch use more power than the 4 pis?
[+] poly2it|6 months ago|reply
Neat, but at this price point it's probably better to buy GPUs for scaling.
[+] bjt12345|6 months ago|reply
Does Distributed Llama use RDMA over Converged Ethernet or is this roadmapped? I've always wondered if RoCE and Ultra-Ethernet will trickle down into the consumer market.
[+] kosolam|6 months ago|reply
How is this technically done? How does it split the query and aggregate the results?
[+] varispeed|6 months ago|reply
So would 40x RPi 5 get 130 token/s?
[+] SillyUsername|6 months ago|reply
I imagine it might be limited by number of layers and you'll get diminishing returns as well at some point caused by network latency.
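The diminishing-returns intuition can be made concrete with a toy latency model (illustrative numbers only, not measurements): per-token time is compute divided across nodes plus a synchronization cost that grows with node count.

```python
# Toy model: per-token latency = compute_time / n_nodes + sync_cost * n_nodes.
# Throughput rises sub-linearly, peaks, then falls once sync dominates.
# compute_s and sync_s are made-up illustrative constants.

def tokens_per_sec(n_nodes, compute_s=0.30, sync_s=0.005):
    per_token = compute_s / n_nodes + sync_s * n_nodes
    return 1.0 / per_token

for n in (1, 2, 4, 8, 16, 32):
    print(n, round(tokens_per_sec(n), 1))
```

Under these assumed constants, throughput roughly quadruples from 1 to 8 nodes and then declines, which is why 40 Pis would not simply give 10x the tokens/s.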
[+] reilly3000|6 months ago|reply
It has to be 2^n nodes, and you're limited to one node per attention head that the model has.
[+] VHRanger|6 months ago|reply
Most likely not because of NUMA bottlenecks
[+] ab_testing|6 months ago|reply
Would it work better on a used GPU?
[+] ineedasername|6 months ago|reply
This is highly usable in an enterprise setting when the task benefits from near-human-level decision making, $acceptable_latency is under ~1s, and the decisions can be expressed in natural language at 13 tokens/s.

Meaning that if you can structure a range of situations and tasks clearly in natural language with a pseudo-code type of structure and fit it in model context then you can have an LLM perform a huge amount of work with Human-in-the-loop oversight & quality control for edge cases.

Think of office jobs, white-collar work, where business process documentation, employee guides, and job aids already fully describe 40% to 80% of the work. These are the tasks most easily structured with scaffolding prompts and more specialized RLHF-enriched data, so a model can then perform them more consistently.

This is what I describe when I'm asked "But how will they do $X when they can't answer $Y without hallucinating?"

I explain the above capability, then I ask the person to do a brief thought experiment: How often have you heard, or yourself thought, something like "That is mind-numbingly tedious" and/or "a trained monkey could do it"?

In the end, I don't know anyone who is aware of the core capabilities in the structured-natural-language sense above who doesn't see at a glance just how many jobs can easily go away.

I'm not smart enough to see where all the new jobs will be, or to be certain there will be as many of them; if I were, I'd start or invest in such businesses. But maybe not many new jobs get created, and then so what?

If the net productivity and output (essentially the wealth) of the global workforce remains the same or better with AI assistance and therefore fewer work hours, that means less work on average per capita, and more wealth per hour worked per capita than before.

Work hours used to be longer; they can shorten again. The problem is getting there: overcoming not just the "sure, but only the CEOs will get wealthy" objection but also the "full time means 40 hours a week minimum" attitude held by more than just managers and CEOs.

It will also mean that our concept of the "proper wage" for unskilled labor that can't be automated will have to change too. Wait staff at restaurants, retail workers, countless low-end service workers in food and hospitality? They'll be providing, and giving up, something much more valuable than outdated white-collar skills: their time. The term I've heard for this is jarring to my ears, but it fits: "embodied work". And I've long considered my time something I'll trade with a great deal more reluctance than my money, so I demand a lot of money for it when it's required, and use that money to buy back time (by not having to work) somewhere in the near future, even if it's just by covering the cost of grocery delivery instead of spending the time to go shopping myself.

Wow, this comment got away from me. But seeing Qwen3-30B-level quality at 13 tokens/s on dirt-cheap hardware struck a deep chord of "heck, the global workforce could be rocked to the core for cheap, quality 13 tok/s." And that alone isn't the sort of comment you can leave as a standalone drive-by on HN and have it be worth the seconds to write. I'm probably wrong on a little or a lot of this, and seeing some ideas on how I'm wrong will be fun and interesting.

[+] hidelooktropic|6 months ago|reply
13 tokens/s is not slow. Q4 is not bad. The models that run on phones are never 30B or anywhere close to that.
[+] misternintendo|6 months ago|reply
At this speed it is only suitable for time-insensitive applications.
[+] layer8|6 months ago|reply
I’d argue that chat is a time-sensitive application, and 13 tokens/s is significantly faster than I can read.
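The reading-speed comparison checks out with a common rule of thumb (roughly 0.75 English words per token, an assumption, not a property of this model):

```python
# Rough arithmetic: convert 13 tokens/s to words per minute and compare
# against a typical silent-reading speed. 0.75 words/token is a common
# rule of thumb for English text, used here as an assumption.

tokens_per_s = 13
words_per_token = 0.75          # assumed conversion factor
typical_reading_wpm = 250       # rough average for adult readers

wpm = tokens_per_s * words_per_token * 60
print(wpm)  # 585.0, i.e. more than twice typical reading speed
```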
[+] daveed|6 months ago|reply
I mean it's a raspberry pi...