ansk | 1 year ago
Is batched inference for LLMs memory bound? My understanding is that sufficiently large batched matmuls will be compute bound, and flash attention has mostly removed the memory bottleneck in the attention computation. If so, the value proposition here -- as well as with other memorymaxxing startups like Groq -- is primarily on the latency side of things. Though my personal impression is that latency isn't really a huge issue right now, especially for text. Even OpenAI's voice models can (purportedly) be served at a latency that is a low multiple of network latency, and I expect there is room for improvement here, as this is essentially the first generation of real-time voice LLMs.
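A rough roofline estimate can put numbers on the "sufficiently large" part of this. The sketch below checks at what batch size a decode-step matmul crosses from memory bound to compute bound; the H100 figures and the 8192 model width are assumptions (approximate public spec-sheet numbers, not measurements):

```python
# Back-of-envelope roofline check: at what batch size does a decode-step
# matmul flip from memory bound to compute bound on an H100?
# Assumed (approximate spec-sheet) numbers, not measured:
PEAK_FLOPS = 989e12   # H100 SXM dense BF16 FLOP/s (approx)
PEAK_BW = 3.35e12     # HBM3 bandwidth in bytes/s (approx)

ridge = PEAK_FLOPS / PEAK_BW  # FLOP/byte needed to saturate compute

def arithmetic_intensity(batch, d_model, bytes_per_param=2):
    """FLOP/byte of a (batch x d) @ (d x d) matmul during decode.

    FLOPs: 2 * batch * d^2 (one multiply-add per weight per row).
    Bytes: dominated by streaming the d^2 weight matrix once;
    activation traffic (~2 * batch * d in and out) is small for batch << d.
    """
    flops = 2 * batch * d_model ** 2
    bytes_moved = bytes_per_param * d_model ** 2 \
        + 2 * bytes_per_param * batch * d_model
    return flops / bytes_moved

for b in (1, 32, 256, 512):
    ai = arithmetic_intensity(b, d_model=8192)
    bound = "compute" if ai >= ridge else "memory"
    print(f"batch={b:4d}  intensity={ai:6.1f} FLOP/B  "
          f"ridge={ridge:.0f}  -> {bound} bound")
```

With these numbers the ridge point is roughly 300 FLOP/byte, and arithmetic intensity grows roughly linearly with batch size, so the matmul becomes compute bound somewhere in the few-hundreds of batched rows.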
txyx303 | 1 year ago
If you have an H100 doing 100 tokens/sec per stream and you batch 1000 requests, you might get to 100K tok/sec in aggregate, but each user's request will still be emitting 100 tokens/sec, so the speed of the response stream is unchanged. If your output stream speed is slow, batching won't improve the user experience, even though it gives you higher chip utilization and higher overall throughput.
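This split between aggregate and per-user speed falls out of the memory-bound decode model directly: one decode step has to stream all the weights regardless of batch size, so the step time (and hence per-user tok/s) is flat until compute saturates, while aggregate throughput scales with the batch. A toy sketch, assuming a hypothetical 70B-parameter BF16 model and approximate H100 specs:

```python
# Toy model: per-user vs aggregate decode speed under batching.
# All numbers are assumptions for illustration, not measurements:
PARAMS = 70e9             # hypothetical 70B-parameter model
WEIGHT_BYTES = 2 * PARAMS # BF16 weights, 2 bytes each
PEAK_BW = 3.35e12         # H100 HBM3 bytes/s (approx)
PEAK_FLOPS = 989e12       # H100 dense BF16 FLOP/s (approx)

def decode_step_time(batch):
    """One decode step takes the slower of: streaming all weights from
    HBM once, or doing ~2*PARAMS FLOPs per sequence in the batch."""
    mem_time = WEIGHT_BYTES / PEAK_BW              # independent of batch
    compute_time = 2 * PARAMS * batch / PEAK_FLOPS  # scales with batch
    return max(mem_time, compute_time)

for b in (1, 100, 1000):
    t = decode_step_time(b)
    print(f"batch={b:5d}  per-user {1 / t:7.1f} tok/s  "
          f"aggregate {b / t:10.0f} tok/s")
```

In this model, going from batch 1 to batch 100 leaves per-user speed identical while aggregate throughput grows 100x; only once the batch pushes the step into the compute-bound regime does per-user speed start to degrade.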