nbardy | 6 months ago
Look at vLLM. It's the leading open-source implementation of this.
But the idea is you can serve 5000 or so people in parallel.
You get about a 1.5-2x slowdown in per-token speed per user, but 2000-3000x higher throughput on the server.
The main insight is that memory bandwidth is the main bottleneck: the model weights have to be streamed from memory for every forward pass regardless of how many sequences are in flight. So if you batch requests and use a clever KV cache alongside the batching, you can drastically increase parallel throughput.
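A toy sketch of the batching idea, with made-up names (`Sequence`, `decode_step`) that are not vLLM's actual API; real systems like vLLM's PagedAttention store KV blocks in paged GPU memory and run real attention kernels, but the scheduling logic looks roughly like this:

```python
class Sequence:
    """One in-flight request with its own KV cache (stand-in for K/V tensors)."""
    def __init__(self, seq_id, prompt_len, max_new_tokens):
        self.seq_id = seq_id
        self.kv_cache = list(range(prompt_len))  # placeholder for cached K/V entries
        self.remaining = max_new_tokens

def decode_step(batch):
    """One batched forward pass: every live sequence emits one token.
    The model weights would be read from memory once per step regardless
    of batch size -- that amortization is why batching raises throughput."""
    finished = []
    for seq in batch:
        seq.kv_cache.append(len(seq.kv_cache))  # append this step's K/V entry
        seq.remaining -= 1
        if seq.remaining == 0:
            finished.append(seq)
    for seq in finished:
        batch.remove(seq)  # freed slots can be given to waiting requests
    return finished

# Serve 4 requests in one batch; with continuous batching, new requests
# could join the batch between steps as others finish.
batch = [Sequence(i, prompt_len=3, max_new_tokens=2 + i) for i in range(4)]
steps, done = 0, []
while batch:
    done.extend(decode_step(batch))
    steps += 1

print(steps, len(done))  # 5 batched steps, vs 2+3+4+5 = 14 steps run serially
```

The per-user latency is slightly worse (each step does more work), but the server-wide token throughput scales with batch size until memory for the KV caches runs out, which is exactly the trade-off described above.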