top | item 44302932

(no title)

sethkim | 8 months ago

I run a batch inference/LLM data processing service and we do a lot of work around cost and performance profiling of (open-weight) models.

One odd disconnect that still exists in LLM pricing is the fact that providers charge linearly with respect to token consumption, but costs are actually quadratic with an increase in sequence length.

At this point, since a lot of models have converged around the same model architecture, inference algorithms, and hardware - the chosen costs are likely due to a historical, statistical analysis of the shape of customer requests. In other words, I'm not surprised to see costs increase as providers gather more data about real-world user consumption patterns.

discuss

diziet|8 months ago

Aren't advances in KV caching making compute cost not quite quadratic?