
jdcasale | 1 month ago

The math is obvious on this one. It's super well-documented that model performance on complex tasks scales (to some asymptote) with the amount of inference-time compute allocated.

LLM providers must dynamically scale inference-time compute based on current load because they have limited compute. Thus it's impossible for traffic spikes _not_ to cause some degradation in model performance (at least until/unless they acquire enough compute to saturate that asymptotic curve for every request under all demand conditions; it does not seem plausible that they are anywhere close to this).
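The argument above can be sketched as a toy model. Everything here is illustrative: the saturating-curve shape, the constants, and the assumption that capacity is split evenly across requests are all made up for the sake of the example, not taken from any provider.

```python
import math

def quality(compute_units: float, ceiling: float = 100.0, rate: float = 0.5) -> float:
    """Hypothetical benchmark score that rises with inference-time compute
    and saturates at an asymptote (`ceiling`). Constants are invented."""
    return ceiling * (1.0 - math.exp(-rate * compute_units))

def per_request_quality(capacity: float, concurrent_requests: int) -> float:
    """If a fixed total `capacity` is split evenly across concurrent requests,
    a traffic spike shrinks per-request compute, and quality falls with it."""
    return quality(capacity / concurrent_requests)

# Quiet period (10 concurrent requests) vs. spike (50), same total capacity:
quiet = per_request_quality(100.0, 10)
spike = per_request_quality(100.0, 50)
```

Under this toy model `quiet > spike` always holds, which is the whole argument: unless capacity is large enough that every request already sits on the flat part of the curve, load must move quality.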


YetAnotherNick|1 month ago

Umm. I run multiple benchmarks using APIs for my work, and the inference-time compute allotted has a clear correlation with the metrics. But the time of day certainly doesn't. If it were that straightforward, people could prove it very easily rather than relying on anecdotes.

They either overprovision servers during low demand, or they dynamically provision servers based on load.

SOLAR_FIELDS|1 month ago

Yes, every time I see some variant of this come up (and believe me, this has been coming up since before the GPT3.5 days) there’s never any actual data demonstrating that it’s the case. As you say, it should be completely trivial to run the exact same prompt multiple times per day and capture the output to demonstrate this.

But no one ever seems to do that; they're content to "feel" that this is the case instead.
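The experiment proposed here (same prompt, multiple times per day, compare by time of day) is easy to sketch. The `ask_model` function below is a hypothetical stand-in for a real provider API call, and the exact-match scorer is a deliberately crude placeholder for a real grader; the scheduling (cron or similar) is left out.

```python
import statistics
from datetime import datetime, timezone

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in: swap in your actual LLM provider's client here.
    raise NotImplementedError("call your LLM provider here")

def score(answer: str, expected: str) -> float:
    """Crude exact-match scoring; a real benchmark would use a proper grader."""
    return 1.0 if answer.strip() == expected else 0.0

def record_run(log: list, prompt: str, expected: str) -> None:
    """Run the fixed prompt once and append (UTC hour, score) to the log.
    Schedule this every hour for a few days to build the dataset."""
    answer = ask_model(prompt)
    log.append((datetime.now(timezone.utc).hour, score(answer, expected)))

def mean_score_by_hour(log: list) -> dict:
    """Mean score per UTC hour. A flat profile across hours is evidence
    against the time-of-day-degradation claim; a dip during peak hours
    would support it."""
    buckets: dict = {}
    for hour, s in log:
        buckets.setdefault(hour, []).append(s)
    return {h: statistics.mean(v) for h, v in sorted(buckets.items())}
```

For example, a synthetic log `[(3, 1.0), (3, 0.0), (14, 1.0)]` aggregates to `{3: 0.5, 14: 1.0}`. With enough runs per bucket, either outcome settles the question with data rather than anecdotes.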