A Mac Studio with an M1 Ultra (about 2800 USD used) is actually a really cost-effective way to run one. Its total system power consumption is really low, even while spitting out tokens at full tilt (<250W).
You can run a similarly sized model, Llama 2 70B, at the 'Q4_K_M' quantisation level in 44 GB of memory [1]. So you can just about fit it on 2x RTX 3090 (which you can buy, used, for around $1100 each).
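The 44 GB figure is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, assuming an effective rate of roughly 4.85 bits per weight for Q4_K_M (an assumption based on typical llama.cpp file sizes, not an exact spec):

```python
# Rough estimate of the file/VRAM size of a Q4_K_M-quantised model.
# The ~4.85 bits-per-weight figure is an assumption, not an exact spec:
# K-quants mix block scales and several sub-formats across tensors.
def quantised_size_gb(n_params_billion: float, bits_per_weight: float = 4.85) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, as model file sizes are usually quoted

print(round(quantised_size_gb(70), 1))  # ~42 GB, in line with the 44 GB quoted above
```

The small gap to 44 GB is plausibly the non-quantised tensors (embeddings, norms) that GGUF files keep at higher precision.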
Of course, you can buy quite a lot of hosted model API access or cloud GPU time for that money.
An RTX 3090 has 24GB of memory, and a quantized llama70b takes around 60GB. You can offload a few layers to the GPU, but most of them will run on the CPU at terrible speeds.
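The GPU/CPU split can be estimated the same way. A simplistic sketch, assuming the ~60 GB figure above is spread evenly across Llama 2 70B's 80 transformer layers and ignoring KV-cache and activation overhead:

```python
# Estimate how many layers of a ~60 GB quantised 70B model fit on one 24 GB
# card. Assumes (simplistically) uniform per-layer size and no KV-cache or
# activation overhead, so the real number of GPU layers will be lower.
MODEL_GB = 60
N_LAYERS = 80   # Llama 2 70B has 80 transformer layers
VRAM_GB = 24

per_layer_gb = MODEL_GB / N_LAYERS          # 0.75 GB per layer
gpu_layers = int(VRAM_GB // per_layer_gb)   # layers that fit on the GPU
print(gpu_layers, N_LAYERS - gpu_layers)    # 32 on GPU, 48 left on the CPU
```

In llama.cpp this split is what the `--n-gpu-layers` option controls; generation speed is then dominated by the CPU-resident majority of layers.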
summarity|2 years ago
michaelt|2 years ago
[1] https://huggingface.co/TheBloke/Llama-2-70B-GGUF
kkzz99|2 years ago
nullc|2 years ago
You can buy a 24GB gpu for $150-ish (P40).
trentnelson|2 years ago