LLaMA 7B in full precision (fp32) requires about 28GB of GPU RAM. I know little about AI. Can I estimate that a 770M-parameter model takes ~2.8GB? That would mean a vast range of devices could run the model locally.
thewataccount|2 years ago
Something like that, plus some overhead, but likely < 3GB.
That said, half precision would almost certainly work without a huge performance hit (so ~2GB), and 8-bit/4-bit quantization is possible but usually doesn't work super well on small-parameter-count models.
Also, you can still batch it through, but it'll be slower. In theory ~2GB including the overhead, though...
It's worth noting that its current performance is only for a specific category of tests, and it isn't the most useful model either, so it's not ChatGPT-on-your-phone just yet.
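The back-of-the-envelope math above can be sketched as a small helper. This is a rough estimate of weight memory only (parameters × bytes per parameter, plus a fudge factor); the 20% overhead figure and the specific parameter counts are illustrative assumptions, not measurements.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float, overhead: float = 0.2) -> float:
    """Rough GB needed for model weights alone, plus an assumed overhead factor."""
    return n_params * bytes_per_param * (1 + overhead) / 1e9

# Compare precisions for a 7B and a 770M model.
for name, params in [("7B", 7e9), ("770M", 770e6)]:
    for precision, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name} @ {precision}: ~{weight_memory_gb(params, nbytes):.1f} GB")
```

With zero overhead this reproduces the figures in the thread: 7B at fp32 is 28GB, 770M at fp32 is ~3.1GB, and 770M at fp16 lands around the 2GB mentioned above. Actual usage will also include activations and framework overhead, which this ignores.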
opyate|2 years ago
> Also you can still batch it through but it'll be slower.
Thanks!