top | item 38706680


chsasank | 2 years ago

This is basically a fork of llama.cpp. I created a PR to see the diff and added my comments on it: https://github.com/ggerganov/llama.cpp/pull/4543

One thing that caught my interest is this line from their readme:

> PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers.

Apple's Metal/M3 is perfect for this because the CPU and GPU share unified memory, so no data transfers are needed at all. Check out mlx from Apple: https://github.com/ml-explore/mlx
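The hot/cold split from the quoted readme can be sketched in a few lines. This is a toy NumPy illustration, not PowerInfer's actual C++ implementation: the activation frequencies are made up, and both partitions run on the CPU here, whereas PowerInfer keeps the hot rows resident on the GPU:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 32
W = rng.standard_normal((d_hidden, d_in))

# Hypothetical per-neuron activation frequencies from offline profiling.
freq = rng.random(d_hidden)
hot = freq > 0.7   # frequently activated neurons: would be preloaded onto the GPU
cold = ~hot        # rarely activated neurons: stay on the CPU

W_hot, W_cold = W[hot], W[cold]

def hybrid_ffn(x):
    """ReLU FFN layer computed as two partitions (both on CPU in this sketch)."""
    y = np.zeros(d_hidden)
    y[hot] = np.maximum(W_hot @ x, 0.0)    # dense matmul on the "hot" (GPU) rows
    y[cold] = np.maximum(W_cold @ x, 0.0)  # "cold" rows; a predictor could skip these
    return y

# The partitioned computation matches the unpartitioned layer exactly.
x = rng.standard_normal(d_in)
assert np.allclose(hybrid_ffn(x), np.maximum(W @ x, 0.0))
```

The point of the split is that only the hot rows need GPU memory; on unified-memory hardware like Apple Silicon, the same partitioning works without any copy between the two pools.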
