Isn't the inference cost of running these models at scale challenging? Currently it feels like small LLMs (1B-4B) can perform well on simpler agentic workflows. There are definitely constraints, but that's surely much easier than paying for big cloud clusters to run these tasks. I believe it distributes the cost more uniformly.
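For concreteness, here's a minimal sketch of what running a small model locally looks like, using Hugging Face transformers. The Qwen2.5-1.5B-Instruct checkpoint is just one example of the 1B-4B class (not something specific to this thread); swap in whatever checkpoint you actually use.

```python
# Minimal sketch: run a ~1.5B instruct model on a consumer GPU or CPU.
# Assumes `transformers`, `torch`, and `accelerate` are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # example model in the 1B-4B class
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # ~3 GB of weights at 1.5B params
    device_map="auto",          # uses the GPU if present, else CPU
)

prompt = "Summarize the last three shell commands I ran."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=128)
# Slice off the prompt tokens and decode only the model's reply.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

At this size the weights fit comfortably in the memory of a laptop GPU or even system RAM, which is the economic point being made: the cost is borne per-device rather than by a shared cluster.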
bigyabai|9 months ago
We'll see companies push for tiny on-device models as a novelty, but even the best of those aren't very good. I firmly believe that GPUs are going to stay relevant even as models scale down, since they're still the fastest and most power-efficient solution.