pajeets | 1 year ago

You need at least a 3090 for that.

kkielhofner | 1 year ago

llama.cpp and others can run purely on CPU[0]. Even production-grade serving frameworks like vLLM can[1].

There are a variety of other LLM inference implementations that can run on CPU as well.
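
As a minimal sketch of the CPU-only path, this is roughly what it looks like with the llama-cpp-python bindings; the GGUF path, context size, and thread count below are placeholders, not recommendations:

    # CPU-only inference with the llama-cpp-python bindings
    # (pip install llama-cpp-python). The GGUF path is a placeholder
    # for whatever quantized model you have downloaded locally.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/example.Q4_K_M.gguf",  # hypothetical file
        n_ctx=4096,      # context window
        n_threads=8,     # match your physical core count
        n_gpu_layers=0,  # 0 = pure CPU, no GPU offload
    )

    out = llm("Q: What is the capital of France? A:",
              max_tokens=32, stop=["\n"])
    print(out["choices"][0]["text"])

Quantized GGUF weights are what make this practical on CPU; the same model at fp16 would need several times the memory and bandwidth.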

[0] - https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#su...

[1] - https://docs.vllm.ai/en/v0.6.1/getting_started/cpu-installat...

pajeets | 1 year ago

Wait, this is crazy.

What model can I run with 1TB of RAM, and at how many tokens per second?

For instance, Nvidia Nemotron Llama 3.1 quantized: at what speed? I'll get a GPU too, but I'm not sure how much VRAM I need for the best bang for the buck.