A Mac Studio with an M1 Ultra (about 2800 USD used) is actually a really cost-effective way to run one. Its total system power consumption is really low, even while spitting out tokens at full tilt (<250W).
You can run a similarly sized model, Llama 2 70B, at the 'Q4_K_M' quantisation level in 44 GB of memory [1]. So you can just about fit it on 2x RTX 3090 (which you can buy, used, for around $1100 each).
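The 44 GB figure is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, assuming an effective rate of roughly 4.85 bits per weight for Q4_K_M (an assumption based on typical llama.cpp file sizes, not an exact spec):

```python
# Rough estimate of the file/VRAM size of a Q4_K_M-quantised model.
# The ~4.85 bits-per-weight figure is an assumption, not an exact spec:
# K-quants mix block scales and several sub-formats across tensors.
def quantised_size_gb(n_params_billion: float, bits_per_weight: float = 4.85) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, as model file sizes are usually quoted

print(round(quantised_size_gb(70), 1))  # ~42 GB, in line with the 44 GB quoted above
```

The small gap to 44 GB is plausibly the non-quantised tensors (embeddings, norms) that GGUF files keep at higher precision.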
Of course, you can buy quite a lot of hosted model API access or cloud GPU time for that money.
An RTX 3090 has 24GB of memory, and a quantized llama70b takes around 60GB. You can offload a few layers to the GPU, but most of them will run on the CPU at terrible speeds.
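The GPU/CPU split can be estimated the same way. A simplistic sketch, assuming the ~60 GB figure above is spread evenly across Llama 2 70B's 80 transformer layers and ignoring KV-cache and activation overhead:

```python
# Estimate how many layers of a ~60 GB quantised 70B model fit on one 24 GB
# card. Assumes (simplistically) uniform per-layer size and no KV-cache or
# activation overhead, so the real number of GPU layers will be lower.
MODEL_GB = 60
N_LAYERS = 80   # Llama 2 70B has 80 transformer layers
VRAM_GB = 24

per_layer_gb = MODEL_GB / N_LAYERS          # 0.75 GB per layer
gpu_layers = int(VRAM_GB // per_layer_gb)   # layers that fit on the GPU
print(gpu_layers, N_LAYERS - gpu_layers)    # 32 on GPU, 48 left on the CPU
```

In llama.cpp this split is what the `--n-gpu-layers` option controls; generation speed is then dominated by the CPU-resident majority of layers.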
summarity|2 years ago
michaelt|2 years ago
[1] https://huggingface.co/TheBloke/Llama-2-70B-GGUF
kkzz99|2 years ago
nullc|2 years ago
You can buy a 24GB gpu for $150-ish (P40).
trentnelson|2 years ago