top | item 47022857

(no title)

gchadwick | 16 days ago

> If copying user tokens was the bottle neck, batching would not achieve any speed up.

Reality is more complex. As context length grows your KV cache becomes large and will begin to dominate your total FLOPs (and hence bytes loaded). The issue with KV cache is you cannot batch it because only one user can use it, unlike static layer weights where you can reuse them across multiple users.

Emerging sparse attention techniques can greatly relieve this issue though the extent to which frontier labs deploy them is uncertain. Deepseek v3.2 uses sparse attention though I don't know off hand how much this reduces KV cache FLOPs and associated memory bandwidth.

discuss

order

zozbot234|16 days ago

> The issue with KV cache is you cannot batch it because only one user can use it

This is not really correct given how input token caching works and the reality of subagent workloads. You could launch many parallel subagents sharing some portion of their input tokens and use batching for that task.

foobar10000|15 days ago

2 things:

1. Parallel investigation : the payoff form that is relatively small - starting K subagents assumes you have K independent avenues of investigation - and quite often that is not true. Somewhat similar to next-turn prediction using a speculative model - works well enough for 1 or 2 turns, but fails after.

2. Input caching is pretty much fixes prefill - not decode. And if you look at frontier models - for example open-weight models that can do reasoning - you are looking at longer and longer reasoning chains for heavy tool-using models. And reasoning chains will diverge very vey quickly even from the same input assuming a non-0 temp.