birktj | 1 year ago

I was wondering if anyone here has experimented with running a cluster of SBCs for LLM inference? E.g. the Radxa ROCK 5C has 32GB of memory plus an NPU and only costs about 300 euros. I'm not super up to date on the architecture of modern LLMs, but as far as I understand you should be able to split the layers between multiple nodes? It is not that much data that needs to be sent between them, right? I guess you won't get quite the same performance as a modern Mac or Nvidia GPU, but it could be quite acceptable and possibly a cheap way of getting a lot of memory.
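
Roughly, as a sanity check (the hidden size, precision and token rate below are assumed placeholder numbers, not measurements): with the layers split across nodes, each boundary only has to ship the hidden state once per generated token.

    # Back-of-the-envelope estimate of inter-node traffic for layer-split inference.
    # All numbers are illustrative assumptions, not benchmarks.
    hidden_dim = 4096        # e.g. an 8B-class model
    bytes_per_value = 2      # fp16 activations
    boundaries = 3           # 4 nodes -> 3 split points
    tokens_per_second = 20   # assumed decode speed

    per_token = hidden_dim * bytes_per_value            # ~8 KB per boundary
    total = per_token * boundaries * tokens_per_second  # bytes/s over the network
    print(f"{per_token} B per token per boundary, ~{total / 1e6:.2f} MB/s total")

So the decode-time traffic is tiny; prompt processing ships the whole prompt's activations at once, but even a few-thousand-token prompt is only a few tens of MB per boundary.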

On the other hand, I am wondering what the state of the art is in CPU + GPU inference. Prompt processing is both compute and memory constrained, but I think token generation afterwards is mostly memory bound. Are there any tools that support loading a few layers at a time into the GPU for initial prompt processing and then switch to CPU inference for token generation? Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (a few layers at a time so they fit in VRAM) and then switch to the CPU for the memory-bound token generation.
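
To make that concrete, here is a very rough pseudo-PyTorch sketch of what I mean (not the API of any existing tool, and it glosses over the KV cache, which would also have to be written back to system RAM):

    # Sketch only: assumes `layers` is a list of CPU-resident transformer blocks
    # (PyTorch modules) that can be called directly on a [seq_len, d_model] tensor.
    def prefill_streaming(layers, hidden, chunk_size=4):
        # Run the whole prompt through the model a chunk of layers at a time,
        # so only `chunk_size` layers' weights sit in VRAM at once.
        hidden = hidden.to("cuda")
        for i in range(0, len(layers), chunk_size):
            chunk = layers[i:i + chunk_size]
            for layer in chunk:
                layer.to("cuda")            # copy weights in over PCIe
            for layer in chunk:
                hidden = layer(hidden)      # big batched matmuls: worth the copy
            for layer in chunk:
                layer.to("cpu")             # free VRAM for the next chunk
        return hidden.cpu()

    def decode_step(layers, hidden):
        # One new token at a time: memory-bandwidth bound, so stay on the CPU.
        for layer in layers:
            hidden = layer(hidden)
        return hidden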

Eisenstein | 1 year ago

> I was wondering if anyone here has experimented with running a cluster of SBC for LLM inference? Ex. the Radxa ROCK 5C has 32GB of memory and also a NPU and only costs about 300 euros.

Look into RPC. Llama.cpp supports it.

* https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...
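
The basic flow (from memory, so check the flag names against llama.cpp's rpc example README) is to run rpc-server on each board and point the main binary at them with --rpc. Wrapped in Python just to keep it in one snippet; the addresses are made up:

    # Sketch with assumed addresses; verify flags against your llama.cpp build.
    import subprocess

    workers = ["192.168.1.11:50052", "192.168.1.12:50052"]   # the SBCs

    # On each SBC (run there, not on the head node):
    #   ./rpc-server -p 50052

    # On the head node: spread the layers across the RPC backends.
    subprocess.run([
        "./llama-cli",
        "-m", "model.gguf",
        "--rpc", ",".join(workers),
        "-ngl", "99",        # offload as many layers as the backends can hold
        "-p", "Hello",
    ])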

> Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (but a few layers at a time so they fit in VRAM) and then switch to the CPU when doing the memory bound token generation.

Moving layers over the PCIe bus to do this is going to be slow, which seems to be the issue with that strategy. I think the key is to use MoE and be smart about which layers go where. This project seems to be doing that with great results:

* https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...
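
The placement idea, as I understand it (rough sketch only, with assumed DeepSeek-style module names, not ktransformers' actual config format): everything every token touches stays in VRAM, while the routed experts, which hold most of the parameters but are only sparsely activated, stay in system RAM.

    # Sketch of the MoE placement idea, not ktransformers' API.
    # Assumes a DeepSeek-style layer layout: self_attn, mlp.gate, mlp.experts[...].
    def place_moe_layer(layer):
        layer.self_attn.to("cuda")               # hit by every token, relatively small
        layer.mlp.gate.to("cuda")                # the expert router, tiny
        if hasattr(layer.mlp, "shared_experts"):
            layer.mlp.shared_experts.to("cuda")  # always-on experts, if present
        for expert in layer.mlp.experts:         # bulk of the weights, but only a
            expert.to("cpu")                     #   few are active per token
        return layer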