(no title)
kiratp | 5 months ago
Things they could do that would not technically contradict that:
- Quantize KV cache
- Data-aware model quantization, where their own evals show "equivalent perf" but overall model quality suffers.
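For context on the KV-cache point: the cache stores a key and value vector for every past token, so its memory footprint grows with context length, and storing it in int8 instead of fp16 halves that footprint at the cost of some rounding error. A minimal sketch of symmetric per-token int8 quantization (illustrative only; names and shapes are assumptions, not any vendor's actual implementation):

```python
import numpy as np

def quantize_kv(kv: np.ndarray):
    """Symmetric per-token int8 quantization: returns (int8 data, fp32 scales)."""
    scales = np.abs(kv).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(kv / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_kv(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

# Toy cache: (num_tokens, head_dim) -- real caches add layer/head dimensions
kv = np.random.randn(8, 64).astype(np.float32)
q, s = quantize_kv(kv)
recovered = dequantize_kv(q, s)
max_err = np.abs(kv - recovered).max()
# int8 storage is 1 byte/element vs 2 for fp16; the rounding error is bounded
# by half a scale step per element, which evals may or may not surface.
```

This is exactly the kind of change that "would not technically contradict" a no-model-change claim: the weights are untouched, but accumulated cache rounding error can still degrade long-context quality.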
The simple fact is that deploying physical compute takes time, yet somehow they are serving more and more inference from a slowly growing pool of hardware. Something has to give...
cj | 5 months ago
Is training compute interchangeable with inference compute or does training vs. inference have significantly different hardware requirements?
If training and inference hardware is pooled together, I could imagine a model where training simply fills in any unused compute at any given time (?)
kiratp | 5 months ago
Also, if you pull too many resources from training your next model to make inference revenue today, you'll fall behind in the larger race.