qeternity|14 days ago
When an author is confused about something so elementary, I can’t trust anything else they write.
gchadwick|14 days ago
Reality is more complex. As context length grows, the KV cache becomes large and begins to dominate your total FLOPs (and hence bytes loaded). The issue with the KV cache is that you cannot batch it: it belongs to a single request, unlike the static layer weights, which are reused across every request in the batch.
Emerging sparse attention techniques can greatly relieve this issue, though the extent to which frontier labs deploy them is uncertain. Deepseek v3.2 uses sparse attention, though I don't know offhand how much it reduces KV cache FLOPs and the associated memory bandwidth.
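To see why the KV cache stops being amortizable, here is a back-of-the-envelope sketch. All model dimensions below are illustrative (roughly a 7B-class model with grouped-query attention), not any particular deployment:

```python
# Back-of-the-envelope: per-request KV cache bytes vs. shared weight bytes.
# Unlike the weights, these bytes must be streamed once per request in the
# batch, every decode step.
layers = 32
kv_heads = 8           # grouped-query attention
head_dim = 128
bytes_per_elem = 2     # fp16/bf16

def kv_cache_bytes(context_len):
    # K and V tensors, per layer, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context_len

weight_bytes = 14e9    # ~7B params at fp16, loaded once for the whole batch

for ctx in (4_096, 32_768, 131_072):
    print(f"ctx={ctx:>7}: KV cache per request = {kv_cache_bytes(ctx)/1e9:.2f} GB")
```

At long contexts the per-request KV cache approaches (or exceeds) the size of the shared weights themselves, which is the crossover the comment is describing.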
zozbot234|14 days ago
This is not really correct given how input-token caching works and the reality of subagent workloads. You can launch many parallel subagents that share a portion of their input tokens and batch across that shared prefix.
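A toy sketch of the savings from shared prefixes (the mechanism vLLM-style prefix caching exploits). The agent counts and token lengths here are made up for illustration:

```python
# If N subagents share a common prompt prefix, the prefix's KV cache can be
# computed and stored once and reused by all of them, instead of once per agent.

def total_kv_tokens(num_agents, shared_prefix_len, unique_suffix_len, share_prefix):
    if share_prefix:
        # one copy of the prefix, plus each agent's unique suffix
        return shared_prefix_len + num_agents * unique_suffix_len
    # naive: every agent stores the full prompt independently
    return num_agents * (shared_prefix_len + unique_suffix_len)

naive  = total_kv_tokens(8, 30_000, 2_000, share_prefix=False)
shared = total_kv_tokens(8, 30_000, 2_000, share_prefix=True)
print(naive, shared)  # 256000 vs 46000 tokens of KV cache
```

The larger the shared prefix relative to the unique suffixes, the closer the KV cache behaves like batched weights rather than per-request state.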
kouteiheika|14 days ago
Inference is memory-bound only at low batch sizes; at high batch sizes it becomes compute-bound. There's a threshold past which stuffing more requests into a batch slows down every individual request, even though it may still increase aggregate tokens/second across the whole batch.
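The threshold can be sketched with a roofline argument. The hardware numbers below are illustrative (roughly H100-class fp16), not a measurement:

```python
# Roofline sketch: decode is memory-bound until arithmetic intensity
# (FLOPs per byte moved) exceeds the hardware's compute/bandwidth ratio.
peak_flops = 1.0e15        # ~1 PFLOP/s dense fp16 (illustrative)
mem_bw     = 3.3e12        # ~3.3 TB/s HBM (illustrative)

def intensity(batch_size, bytes_per_weight=2):
    # Each weight byte loaded serves ~batch_size multiply-adds (2 FLOPs each)
    # per decode step, because the weights are shared across the batch.
    return 2 * batch_size / bytes_per_weight

ridge = peak_flops / mem_bw   # FLOPs/byte where compute and bandwidth balance
crossover = next(b for b in range(1, 4096) if intensity(b) >= ridge)
print(f"compute-bound above roughly batch size {crossover}")
```

This ignores the per-request KV cache traffic, which pushes the effective crossover around and is exactly why long contexts complicate the picture.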
qeternity|14 days ago
Also, there does not exist any batch size > 1 where per-request throughput equals bs=1. Any batching at all slows down every request in the batch.
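A toy latency model, assuming per-request KV cache reads on top of the shared weight stream (all numbers illustrative), shows both claims at once: per-request throughput falls with batch size while aggregate throughput rises:

```python
# Per decode step: time = max(time to stream bytes from HBM, time to do the math).
# Weights are read once per step; the KV cache is read once per request per step.
weight_bytes = 14e9                 # ~7B params at fp16
mem_bw, peak_flops = 3.3e12, 1.0e15  # illustrative H100-class figures
flops_per_token = 2 * 7e9           # ~2 FLOPs per param per decoded token

def step_time(batch, ctx=16_384, kv_bytes_per_token=131_072):
    kv_traffic = batch * ctx * kv_bytes_per_token   # per-request KV reads
    mem_t = (weight_bytes + kv_traffic) / mem_bw
    comp_t = batch * flops_per_token / peak_flops
    return max(mem_t, comp_t)

for b in (1, 64, 512):
    t = step_time(b)
    print(f"bs={b:>4}: per-request {1/t:8.1f} tok/s, aggregate {b/t:10.1f} tok/s")
```

In this model per-request throughput is strictly decreasing in batch size (every added request adds KV traffic the others must wait behind), while aggregate throughput still improves, which is the tradeoff operators actually tune.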
throwdbaaway|13 days ago
You don't have to work for a frontier lab to know that. You just have to be GPU poor.