opk | 1 year ago

Has anyone actually got this llama stuff to be usable on even moderate hardware? I find it just crashes because it doesn't find enough RAM. I've got 2G of VRAM on an AMD graphics card and 16G of system RAM and that doesn't seem to be enough. The impression I got from reading up was that it worked for most Apple stuff because the memory is unified and other than that, you need very expensive Nvidia GPUs with lots of VRAM. Are there any affordable options?

horsawlarway | 1 year ago

Yes. Although I suspect my definition of "moderate hardware" doesn't really match yours.

I can run 2b-14b models just fine on the CPU on my laptop (framework 13 with 32gb ram). They aren't super fast, and the 14b models have limited context length unless I run a quantized version, but they run.
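
For anyone who wants to try it, CPU-only inference is a couple of lines with llama-cpp-python. Minimal sketch - the model path is just a placeholder for whatever quantized GGUF you download:

    # pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
        n_ctx=2048,      # context window; bigger costs more RAM
        n_threads=8,     # set to your physical core count
        n_gpu_layers=0,  # 0 = pure CPU
    )

    out = llm("Explain unified memory in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])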

If you just want generation and it doesn't need to be fast... drop the $200 for 128gb of system ram, and you can run the vast majority of the available models (up to ~70b quantized). Note - it won't be quick (expect 1-2 tokens/second, sometimes less).
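
The back-of-the-envelope math on why 128gb covers most models (weights only - KV cache and runtime overhead come on top):

    def model_ram_gb(params_billions, bits_per_weight):
        # 1e9 params * (bits_per_weight / 8) bytes each, expressed in GB
        return params_billions * bits_per_weight / 8

    print(model_ram_gb(70, 4))   # ~35 GB  - a 4-bit 70b fits easily in 128gb
    print(model_ram_gb(70, 16))  # ~140 GB - unquantized fp16 would not fit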

If you want something faster while staying in the "low end" range - look at picking up a pair of Nvidia p40s (~$400), which will give you 48gb of vram and be faster for 2b to 7b models.

If you want to hit my level of "moderate", I use 2x 3090s (bought refurbed for ~$1600 a couple years ago) and they do quite a bit of work. For example, I get ~15t/s generation for 70b Q4 models, and 50-100t/s for 7b models. That's plenty usable for basically everything I want to run at home. They're faster than the m2 pro I was issued for work, and a good chunk cheaper (the m2 was in the 3k range).
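
For the curious, the two-card split is just a couple of knobs in llama-cpp-python. Something like this (path and split ratio are placeholders):

    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-70b.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,          # -1 = offload every layer to the GPUs
        tensor_split=[0.5, 0.5],  # weights split evenly across the two 3090s
        n_ctx=4096,
    )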

That said - the m1/m2 macs are generally pretty zippy here; I was quite surprised at how well they perform.

Some folks claim success with the k80s, but I haven't tried them. While 24gb of vram for under $100 seems nice (even if it's slow), the linux compatibility issues make me inclined to just go for the p40s right now.

I run some tasks on much older hardware (e.g. willow inference runs on an old 4gb gtx 970 just fine).

So again - I'm not really sure we'd agree on "moderate". I generally spend ~$1000 every 4-6 years to build a machine for gaming, and the machine you're describing would match the specs of one I'd have built 12+ years ago.

But you just need literal memory. Bumping to 32gb of system ram would unlock a lot of stuff for you (at low speeds) and costs $50. Bumping to 128gb only costs a couple hundred, and lets you run basically all of them (again - slowly).

zamadatix | 1 year ago

2G is pretty low, and the sizes of models you can get to run fast on that setup probably aren't particularly attractive. "Moderate hardware" varies, but you can grab a 12 GB RTX 3060 on ebay for ~$200. You can get a lot more system RAM for $200, but it'll be so slow compared to the GPU that I'm not sure I'd recommend it if you actually want to use things like this interactively.
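
On a 12 GB card the usual trick is partial offload - put as many layers as fit in VRAM on the GPU and leave the rest on CPU. A rough llama-cpp-python sketch (path and layer count are guesses; raise n_gpu_layers until you run out of VRAM, then back off):

    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/codellama-34b.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=20,  # guess - a 34b Q4 won't fully fit in 12 GB
        n_ctx=4096,
    )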

If "moderate hardware" is your average office PC then it's unlikely to be very usable. Anyone with a gaming GPU from the last several years should be workable though.

horsawlarway | 1 year ago

I'll second this, actually - $250 for a 12gb rtx 3060 is probably a better buy than $400 for 2x p40s.

It'd been a minute since I checked refurb prices, and $250 for the rtx 3060 12gb is a good price.

It's easier on the rest of the system than a two-card setup, and it's probably a drop-in replacement.

basilgohar | 1 year ago

I can run 7B models with Q4 quantization acceptably fast on a 7000-series AMD APU, without GPU acceleration. This is with DDR5-5600 RAM, which is the current bottleneck for performance.
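
Napkin math backs that up - generation is roughly memory-bandwidth-bound, since every token reads the full set of weights once (rounded, assumed numbers):

    model_gb = 7 * 0.5    # 7B params at Q4 ~ 0.5 bytes/param -> ~3.5 GB
    bandwidth_gbs = 90    # dual-channel DDR5-5600 ~ 90 GB/s peak (assumed)

    # Upper bound on generation speed if you're purely bandwidth-limited:
    print(bandwidth_gbs / model_gb, "tokens/sec ceiling")  # ~25 t/s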

Larger models work but slow down. I have 64GB of RAM, but I think 32 could work. 16GB is pushing it, but should be possible if you don't have anything else open.

Memory requirements depend on numerous factors. 2GB VRAM is not enough for most GenAI stuff today.