Very impressive numbers. I wonder how this would scale on 4 relatively modern desktop PCs, say something akin to an 8th-gen i5 Lenovo ThinkCentre; these can be had very cheap. But as @geerlingguy indicates, we need model compatibility to go up, up, up! As an example, it would be amazing to see something like fastsdcpu run distributed, to democratize the accessibility and practicality of image-gen models for people with limited budgets but large PC fleets ;)
I think it is all well and good, but the most affordable option is probably still to buy a used MacBook with 16, 32, or 64 GB of unified memory (depending on the budget) and install Asahi Linux for tinkering.
Graphics cards with a decent amount of memory are still massively overpriced (even used), big, noisy, and draw a lot of power.
On my (single) AMD 3950X running entirely on CPU (llama.cpp with -t32 -dev none), I was getting 14 tokens/s running Qwen3-Coder-30B-A3B-Instruct-IQ4_NL.gguf last night. That's the best I've had out of a model that doesn't feel stupid.
Nice! Cheap RK3588 boards come with 16 GB of LPDDR5 RAM these days and have significantly better performance than the Pi 5 (and are often cheaper).
I get 8.2 tokens per second on a random orange pi board with Qwen3-Coder-30B-A3B at Q3_K_XL (~12.9GB). I need to try two of them in parallel ... should be significantly faster than this even at Q6.
I think it's worth remembering that there's room for thoughtful design in the way kids play. Are LLMs a useful tool for encouraging children to develop their imaginations or their visual or spatial reasoning skills? Or would these tools shape their thinking patterns to exactly mirror those encoded into the LLM?
I think there's something beautiful and important about the fact that parents shape their kids, leaving with them some of the best (and worst) aspects of themselves. Likewise with their interactions with other people.
The tech is cool. But I think we should aim to be thoughtful about how we use it.
This is indeed incredibly sci-fi. I still remember my ChatGPT moment, when I realized I could actually talk to a computer. And now it can run fully on an RPi, as if the RPi itself had become intelligent and articulate! Very cool.
> Kids will be growing up with toys that talk to them and remember their stories.
What a radical departure from the social norms of childhood. Next you'll tell me that they've got an AI toy that can change their diaper and cook Chef Boyardee.
I'd love to hook my development tools into a fully-local LLM. The question is context window and cost. If the context window isn't big enough, it won't be helpful for me. I'm not gonna drop $500 on RPis unless I know it'll be worth the money. I could try getting my employer to pay for it, but I'll probably have a much easier time convincing them to pay for Claude or whatever.
Sometimes you buy a Pi for one project, start on it, then buy another for a different project; before you know it, none are complete and you have ten Raspberry Pis lying around across various generations. ;)
Depends on the model. If you have a sparse MoE model, you can divide it up across smaller nodes; dense 30B models, I don't see flying anytime soon.
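A back-of-envelope sketch of why sparse MoE models are so much friendlier to cheap hardware: CPU decoding is roughly memory-bandwidth bound, so what matters is how many weight bytes each token touches. The bandwidth and quantization figures below are assumptions for illustration, not measurements:

```python
# Crude model: decode tokens/s ~ memory bandwidth / bytes read per token.
# All numbers are illustrative assumptions.

GB = 1e9

def est_tokens_per_s(active_params_b, bits_per_weight, mem_bw_gb_s):
    """Rough upper bound on decode speed for a memory-bound model."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gb_s * GB / bytes_per_token

# Qwen3-30B-A3B activates only ~3B parameters per token (MoE) ...
moe = est_tokens_per_s(3, 4.5, 17)    # ~17 GB/s: assumed SBC memory bandwidth
# ... while a dense 30B touches all ~30B weights every token.
dense = est_tokens_per_s(30, 4.5, 17)

print(f"MoE (3B active): ~{moe:.0f} tok/s ceiling")
print(f"Dense 30B:       ~{dense:.0f} tok/s ceiling")
```

Under these assumed numbers the MoE model's ceiling is about 10x the dense one's, which lines up with why 30B-A3B is the model everyone in this thread is running.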
An Intel Arc Pro B50 in a dumpster PC would serve you much better for this model (not enough VRAM for a dense 30B, alas), gets close to 20 tokens a second, and is so much cheaper.
I mean, at this point it's more of a proof of concept with a shared blueprint. I could definitely see some home-automation hacker getting this running; hell, maybe I'll do it too if I have some spare time and want to build something like Alexa with customized stuff. It would still need text-to-speech and speech-to-text, but that's not really the topic of this setup. Even for professional use: if it's really usable, why not just spawn Qwen on ARM if that's cheaper? There are a lot of ways to read and leverage such a benchmark.
distributed-llama is great, I just wish it would work with more models. I've been happy with its ease of setup and ongoing maintenance compared to Exo, and with its performance vs llama.cpp's RPC mode.
Any pointers to what is SOTA for a cluster of hosts with CUDA GPUs that don't have enough VRAM for the full weights, but do have 10 Gbit low-latency interconnects?
If that problem gets solved, even if only with a batch approach that enables parallel batch inference resulting in high total tokens/s but low per-session speed, and for bigger models, then it would be a serious game changer for large-scale, low-cost AI automation without billions in capex. My intuition says it should be possible, so perhaps someone has done it or started on it already.
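The trade-off described above is just arithmetic; a toy example, with all numbers assumed:

```python
# Illustrative arithmetic for the batch-inference trade-off: low per-session
# speed can still add up to useful aggregate throughput for offline work.
# Every number here is an assumption for the sake of the example.

per_session_tok_s = 0.5      # slow individual stream
concurrent_sessions = 64     # sessions batched in parallel across the cluster
aggregate = per_session_tok_s * concurrent_sessions
print(f"Aggregate throughput: {aggregate:.0f} tok/s")

# A day of unattended batch processing at that rate:
tokens_per_day = aggregate * 86_400
print(f"~{tokens_per_day / 1e6:.1f}M tokens/day")
```

For interactive chat, 0.5 tok/s per session is useless; for overnight document processing or agent pipelines, millions of tokens a day from cheap hardware is a different value proposition entirely.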
Does Distributed Llama use RDMA over Converged Ethernet or is this roadmapped? I've always wondered if RoCE and Ultra-Ethernet will trickle down into the consumer market.
This is highly usable in an enterprise setting when the task benefits from near-human-level decision making and when $acceptable_latency < 1s meets decisions that can be expressed in <= 13 tokens of natural language.
Meaning that if you can structure a range of situations and tasks clearly in natural language with a pseudo-code type of structure and fit it in model context then you can have an LLM perform a huge amount of work with Human-in-the-loop oversight & quality control for edge cases.
Think of office jobs, white-collar work, where business process documentation, employee guides, and job aids already fully describe 40% to 80% of the work. These are the tasks most easily structured with scaffolding prompts and more specialized RLHF-enriched data, and an LLM can then perform those tasks more consistently.
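A minimal sketch of the pattern described above, in Python; every function name, task string, and threshold here is hypothetical:

```python
# Hypothetical sketch: an LLM handles the routine cases that the process
# documentation already covers, and anything it is not confident about is
# escalated to a human reviewer (human-in-the-loop for edge cases).

def classify(task: str) -> tuple[str, float]:
    """Stand-in for an LLM call that returns (decision, confidence)."""
    routine = {
        "refund under $50": ("approve", 0.97),
        "address change": ("approve", 0.95),
    }
    return routine.get(task, ("unclear", 0.40))

def process(tasks, threshold=0.9):
    """Route each task: auto-handle above the confidence threshold,
    otherwise queue it for a human."""
    auto, escalated = [], []
    for t in tasks:
        decision, conf = classify(t)
        (auto if conf >= threshold else escalated).append((t, decision))
    return auto, escalated

auto, escalated = process(["refund under $50", "novel legal dispute"])
print(len(auto), "handled automatically;", len(escalated), "sent to a human")
```

The point of the thought experiment: the more of a job that is already written down as procedure, the more of it fits into the `routine` branch of a loop like this.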
This is what I describe when I'm asked "But how will they do $X when they can't answer $Y without hallucinating?"
I explain the above capability, then I ask the person to do a brief thought experiment: How often have you heard, or yourself thought, something like "That is mind-numbingly tedious" and/or "a trained monkey could do it"?
In the end, I don't know anyone who is aware of the core capabilities in the structured natural-language sense above who doesn't see at a glance just how many jobs can easily go away.
I'm not smart enough to see where all the new jobs will be, or to be certain there will be as many of them; if I were, I'd start or invest in such businesses. But suppose not many new jobs get created. So what?
If the net productivity and output-- essentially the wealth-- of the global workforce stays the same or better with AI assistance and therefore fewer work hours, that means... what? Less work on average, per capita. More wealth per work hour worked, per capita, than before.
Work hours used to be longer; they can shorten again. The problem is getting there: overcoming not just the "sure, but only the CEOs will get wealthy" objection, but also the "full time means 40 hours a week minimum" attitude, held by more than just managers and CEOs.
It will also mean that our concept of the "proper wage" for unskilled labor that can't be automated will have to change too. Wait staff at restaurants, retail workers, countless low-end service workers in food and hospitality? They'll now be providing-- and giving up-- something much more valuable than outdated white-collar skills. They'll be giving their time to what I've heard described as "embodied work" (the term is jarring to my ears, but it is what it is, and I guess it fits). And anyway, I've long considered my time to be something I'll trade with a great deal more reluctance than my money, so I demand a lot of money for it when it's required, so that I can use that money to buy back more time (by not having to work) somewhere in the near future, even if it's just by covering the cost of getting groceries delivered instead of spending the time to go shopping myself.
Wow, this comment got away from me. But seeing Qwen3-30B-level quality at 13 tok/s on dirt-cheap hardware struck a deep chord of "heck, the global workforce could be rocked to the core by cheap, quality 13 tok/s." That isn't the sort of thing you can leave as a standalone drive-by on HN and have it be worth the seconds it takes to write. I'm probably wrong on a little or a lot of this, and seeing some ideas on how I'm wrong will be fun and interesting.
Fantastic! What are you using to run it, llama.cpp? I have a few extra OPi5's sitting around that would love some extra usage.
If we can get this down to a single Raspberry Pi, then we have crazy embedded toys and tools. Locally, at the edge, with no internet connection.
Kids will be growing up with toys that talk to them and remember their stories.
We're living in the sci-fi future. This was unthinkable ten years ago.
We're at the precipice of having a real "A Young Lady's Illustrated Primer" from The Diamond Age.
I'm curious about the applications though. Do people randomly buy 4xRPi5s that they can now dedicate to running LLMs?
Though at what quality?
More devices mean faster performance, leveraging tensor parallelism and high-speed synchronization over Ethernet.
The maximum number of nodes is equal to the number of KV heads in the model (#70).
I found this[1] article nice for an overview of the parallelism modes.
[1]: https://medium.com/@chenhao511132/parallelism-in-llm-inferen...
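A quick sketch of the node-count constraint mentioned above: the cluster size is capped by the model's KV-head count, and node counts are powers of two. The helper and the example head counts below are illustrative assumptions, not values read from any particular config:

```python
# Sketch of distributed-llama's node-count constraint: cluster sizes are
# powers of two, up to the model's number of KV heads.

def valid_cluster_sizes(num_kv_heads: int) -> list[int]:
    """Powers of two not exceeding the KV-head count."""
    n, sizes = 1, []
    while n <= num_kv_heads:
        sizes.append(n)
        n *= 2
    return sizes

# e.g. a model with 4 KV heads maxes out at a 4-node cluster,
# while one with 8 KV heads could span 8 nodes.
print(valid_cluster_sizes(4))
print(valid_cluster_sizes(8))
```

So whether four Pis is the ceiling or just a starting point depends on the particular model's attention configuration, not on how many boards you can afford.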