(no title)
jdcasale | 1 month ago
LLM providers must dynamically scale inference-time compute based on current load because their compute is finite. It is therefore impossible for traffic spikes _not_ to cause some degradation in model performance, at least until/unless they acquire enough compute to saturate that asymptotic curve for every request under all demand conditions, and it does not seem plausible that they are anywhere close to that.
YetAnotherNick | 1 month ago
They either overprovision servers during periods of low demand, or they dynamically provision servers based on load.
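A minimal sketch of the load-based provisioning described above. All numbers and names here are hypothetical assumptions for illustration: `capacity_per_replica` and `headroom` would come from real benchmarks and SLOs.

```python
import math

def replicas_needed(requests_per_sec: float,
                    capacity_per_replica: float = 10.0,  # assumed req/s one replica sustains
                    headroom: float = 0.3,               # assumed 30% overprovisioning buffer
                    min_replicas: int = 2,
                    max_replicas: int = 100) -> int:
    """Return how many replicas to run for the observed load,
    keeping some headroom so a spike doesn't immediately degrade service."""
    target = requests_per_sec * (1.0 + headroom) / capacity_per_replica
    return max(min_replicas, min(max_replicas, math.ceil(target)))

print(replicas_needed(5))    # low demand: clamped to the floor of 2 replicas
print(replicas_needed(250))  # spike: 250 * 1.3 / 10 = 32.5 -> 33 replicas
```

The tradeoff the thread is debating lives in `headroom` and `max_replicas`: overprovisioning wastes money at low demand, while a hard capacity cap means severe spikes still force some form of degradation.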
SOLAR_FIELDS | 1 month ago
But no one ever seems to verify that; they seem content to “feel” that this is the case instead.