I'm designing a new PC and I'd like to be able to run local models. It's not clear to me from posts online what the specs should be. Do I need 128GB of RAM? Or would a 16GB RTX 4060 be better? Or should I get a 4070 Ti? If anyone could point me toward some good guidelines I'd greatly appreciate it.
tharmas|1 year ago
Get as much VRAM as you can afford.
NVIDIA is also releasing new cards (the RTX 50 series) starting in late January 2025.
qingcharles|1 year ago
Some good info here if you dig around:
https://www.reddit.com/r/LocalLLaMA/
moffkalast|1 year ago
The things you need: memory bandwidth, memory capacity, compute. The more of each the better. The 4060 generally has very poor bandwidth (worse than the 3060) due to its limited bus, but being able to offload more is still generally better.
32GB systems can load 8B models at fp16, 12B at 8 bits, 30B at 4 bits, 70B at 2 bits (roughly speaking). 64GB would be a good minimum if you want to use 70B at 4 bits. Without significant offloading it will be very slow though.
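Those rules of thumb all fall out of one formula: weight memory is roughly parameter count times bits per weight, divided by 8. A quick sketch (ignores KV cache and runtime overhead, so real usage is somewhat higher):

```python
def model_mem_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB: params * bits-per-weight / 8.
    Ignores KV cache, activations, and framework overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

# Matches the sizes quoted above (roughly):
print(model_mem_gb(8, 16))   # 8B at fp16  -> 16.0 GB
print(model_mem_gb(12, 8))   # 12B at 8-bit -> 12.0 GB
print(model_mem_gb(30, 4))   # 30B at 4-bit -> 15.0 GB
print(model_mem_gb(70, 4))   # 70B at 4-bit -> 35.0 GB, hence 64GB RAM minimum
```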
If you want to process long contexts in a decent amount of time it's best to run models with flash attention which requires you to have the KV cache on the GPU. It also lets you use 4 bit cache, which quadruples the amount of context you can fit.
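To see why the 4-bit cache matters, here's a back-of-envelope KV cache calculation (the layer/head dimensions below are assumed, Llama-3-8B-like values, not from the thread):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bits: int) -> float:
    """KV cache size in GB: 2 (keys + values) entries per layer,
    per KV head, per position in the context window."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bits / 8 / 1e9

# Assumed dims: 32 layers, 8 KV heads (GQA), head_dim 128, 32k context
fp16_cache = kv_cache_gb(32, 8, 128, 32768, 16)  # ~4.3 GB at 16-bit
q4_cache   = kv_cache_gb(32, 8, 128, 32768, 4)   # ~1.1 GB at 4-bit
# Same VRAM budget holds 4x the context with a 4-bit cache.
```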
pulse7|1 year ago
A) 128GB RAM with the fastest Intel/AMD CPU, no GPU: you can run big/good models, but very slow (about 0.5 to 3 tokens/second)
B) Fastest Mac with 128GB/192GB: you can run big/good models with moderate speed (like 5-10 tokens/second)
C) 16/32GB RAM + RTX 4090 with 24GB VRAM: you can run smaller (but still good) models very fast - completely in VRAM (20-30 tokens/second)