top | item 42310367


Muskyinhere | 1 year ago

No one is running LLMs on consumer NVIDIA GPUs or Apple MacBooks.

A dev who wants to run local models probably runs something that just fits on a proper GPU. For everything else, everyone uses an API key from whatever provider, because it's fundamentally faster.

Whether an affordable Intel GPU would be relevantly faster for inference is not clear at all.

A 4090 is at least double the speed of Apple's GPU.



treprinum | 1 year ago

A 4090 is 5x faster than an M3 Max 128GB according to my tests, but it can't even run inference on LLaMA-30B. The moment you hit that memory limit, inference is suddenly 30x slower than the M3 Max. So a basic GPU with 128GB of RAM would trash the 4090 on those larger LLMs.
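The memory cliff described above is easy to see with a back-of-envelope calculation (a sketch; the 30B parameter count is approximate and it counts weights only, ignoring the KV cache and activations):

```python
# Rough VRAM needed just to hold LLaMA-30B weights at different
# precisions. A 4090 has 24 GB; anything over that spills into
# system RAM and throughput falls off a cliff.
PARAMS = 30e9  # ~30B parameters (approximate)

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes needed for the weights alone."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_gb(bits):.0f} GB")
# fp16: ~60 GB -> far over 24 GB
# int8: ~30 GB -> still over 24 GB
# 4-bit: ~15 GB -> fits, with headroom for the KV cache
```

This is why the same card that flies on small models collapses on 30B at full precision: the bottleneck flips from GPU compute to PCIe/system-RAM bandwidth.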

skirmish | 1 year ago

Quantized 30B models should run in 24GB of VRAM. A quick search found people doing that at good speed: [1]

    I have a 4090, PCIe 3x16, DDR4 RAM.
    
    oobabooga/text-generation-webui
    using exllama
    I can load 30B 4bit GPTQ models and use full 2048 context
    I get 30-40 tokens/s
[1] https://old.reddit.com/r/LocalLLaMA/comments/14gdsxe/optimal...
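Reproducing a tokens/s figure like the one quoted above only needs a small timing harness (a sketch; `generate` is a hypothetical stand-in for whatever backend call actually produces tokens, e.g. an exllama generation call):

```python
import time

def tokens_per_second(generate, prompt: str, max_new_tokens: int = 128) -> float:
    """Time one generation call and report throughput.

    `generate` is any callable taking (prompt, max_new_tokens) and
    returning the number of tokens it actually produced.
    """
    start = time.perf_counter()
    n_tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Usage with a dummy generator (swap in a real model call):
dummy = lambda prompt, n: n  # pretends to emit n tokens
rate = tokens_per_second(dummy, "Hello", 64)
```

For a fair comparison across machines, keep the prompt length and context size fixed, since prompt processing and generation throughput differ.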

m00x | 1 year ago

Do you have the code for that test?