top | item 43920922

(no title)

This is a good article on the "fog of war" for GPU inference. Modal has been doing a great job of aggregating and disseminating info on how to think about high quality AI inference. Learned some fun stuff -- thanks for posting it.

> the majority of organizations achieve less than 70% GPU Allocation Utilization when running at peak demand — to say nothing of aggregate utilization. This is true even of sophisticated players, like the former Banana serverless GPU platform, which operated at an aggregate utilization of around 20%.

Saw this sort of thing at my last job. Was very frustrating pointing this out to people only for them to respond with ¯\_(ツ)_/¯. I posted a much less tactful article (read: rant) than the one by Modal, but I think it still touches on a lot of the little things you need to consider when deploying AI models: https://thelisowe.substack.com/p/you-suck-at-deploying-ai-mo...

discuss

charles_irl|9 months ago

Nice article! I had to restrain myself from ranting on our blog :)