top | item 43886906

TYMorningCoffee | 10 months ago

Can the inference piece be partitioned over multiple hosts?

Edit: partitioned, or algorithmically structured, in a way that overcomes the network bottleneck

Maxious | 10 months ago

> prima.cpp is a distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices— laptops, desktops, phones, and tablets (GPU or no GPU, it’s all good). With it, you can run QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B right from your local home cluster!

https://github.com/Lizonghang/prima.cpp

happyPersonR | 10 months ago

Pretty sure llama.cpp can already do that
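For context, plain llama.cpp does ship a distributed path: its RPC backend lets a main host offload layers to `rpc-server` processes on other machines. A rough sketch of how that might look, assuming a build with RPC support enabled; the hostnames, ports, and model path here are placeholders, not from the thread:

```shell
# On each worker machine, start an RPC server exposing its local
# backend (CPU or GPU) on the LAN:
rpc-server --host 0.0.0.0 --port 50052

# On the main machine, point llama-cli at the workers; model layers
# are split across the listed RPC servers plus the local backend:
llama-cli -m ./models/llama-70b-q4.gguf \
    --rpc 192.168.1.10:50052,192.168.1.11:50052 \
    -ngl 99 -p "Hello"
```

Each forward pass ships activations between hosts, which is exactly where the network bottleneck the parent asks about shows up; prima.cpp's pitch is scheduling/placement that mitigates this rather than a naive layer split.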

TYMorningCoffee | 10 months ago

I forgot to clarify that I meant dealing with the network bottleneck