birktj | 1 year ago
On the other hand, I am wondering what the state of the art is in CPU + GPU inference. Prompt processing is both compute and memory constrained, but I think token generation afterwards is mostly memory bound. Are there any tools that support loading a few layers at a time into a GPU for initial prompt processing and then switching to CPU inference for token generation? Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but it seems more efficient to me to run everything on the GPU initially (a few layers at a time so they fit in VRAM) and then switch to the CPU for the memory-bound token generation.
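The scheduling idea can be sketched as a toy simulation (all numbers and names hypothetical, not any real tool's API): during prefill, weights stream through limited "VRAM" one group of layers at a time while the whole prompt batch passes through each resident group; during decode, layers run where they already live in host RAM, so no weights cross the bus at all.

```python
# Toy scheduler: prefill streams layer groups through limited "VRAM",
# decode runs entirely on the "CPU". Counters track weight uploads.

N_LAYERS = 32
VRAM_LAYERS = 8          # hypothetical: how many layers fit on the GPU at once

def prefill(prompt_tokens):
    """Run the whole prompt through every layer, VRAM_LAYERS at a time.
    Each layer's weights cross the PCIe bus exactly once per prefill."""
    transfers = 0
    hidden = list(prompt_tokens)              # stand-in for activations
    for start in range(0, N_LAYERS, VRAM_LAYERS):
        group = range(start, min(start + VRAM_LAYERS, N_LAYERS))
        transfers += len(group)               # upload this group's weights
        for _layer in group:                  # "GPU" compute, batched over tokens
            hidden = [h + 1 for h in hidden]
    return hidden, transfers

def decode(hidden, n_new):
    """Token generation stays on the CPU: zero weight uploads."""
    out = []
    for _ in range(n_new):
        for _layer in range(N_LAYERS):        # "CPU" compute, one token at a time
            hidden = [h + 1 for h in hidden]
        out.append(hidden[-1])
    return out

hidden, uploads = prefill([0] * 16)
tokens = decode(hidden, n_new=4)
assert uploads == N_LAYERS   # every weight crossed the bus once, during prefill only
assert len(tokens) == 4
```

The point of the sketch is the transfer count: weight traffic is paid once, amortized over the whole (compute-heavy) prompt batch, and the memory-bound decode loop never touches the bus.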
Eisenstein | 1 year ago
Look into RPC. Llama.cpp supports it.
* https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...
> Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (but a few layers at a time so they fit in VRAM) and then switch to the CPU when doing the memory bound token generation.
Moving layers over the PCIe bus to do this is going to be slow, which seems to be the issue with that strategy. I think the key is to use MoE and be smart about which layers go where. This project seems to be doing that with great results:
* https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...
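The "which layers go where" idea for MoE can be sketched like this (a toy with made-up numbers, not ktransformers' actual implementation): keep the small always-used parts (attention, router) on the GPU, and the large expert weights in host RAM, where any given token only ever activates top-k of them.

```python
# Toy MoE placement sketch: dense parts "on GPU", experts "on CPU".
# Per token only TOP_K of N_EXPERTS fire, so the CPU touches a small
# fraction of the total expert weights each step.

N_EXPERTS = 8
TOP_K = 2

def route(x):
    """Hypothetical router: deterministically score and pick TOP_K experts."""
    scores = [(x * (i + 1)) % 7 for i in range(N_EXPERTS)]   # toy scoring
    ranked = sorted(range(N_EXPERTS), key=lambda i: scores[i], reverse=True)
    return ranked[:TOP_K]

def moe_step(x):
    chosen = route(x)                   # "GPU" side: tiny router computation
    y = sum(x + i for i in chosen)      # "CPU" side: only TOP_K experts run
    return y, chosen

y, chosen = moe_step(3)
assert len(chosen) == TOP_K             # only 2 of 8 expert weight sets touched
```

Because the expert weights dominate the parameter count but each token uses only a couple of them, the CPU's limited memory bandwidth goes much further than it would running dense layers of the same total size.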