
acoard | 1 year ago

> Because memory bandwidth is the #1 bottleneck for inference, even more than capacity.

But there are a ton of models I can't run locally at all due to VRAM limitations. I'd happily take being able to run those models more slowly. I know there are ways to run them on the CPU at speeds orders of magnitude slower, but ideally there's some middle ground.


aurareturn | 1 year ago

You can load giant models into ordinary RAM, such as on an Epyc system, but they're still mostly bottlenecked by low memory bandwidth.
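
That bottleneck is easy to quantify: for a dense model, generating each token has to stream essentially all of the weights through memory once, so decode throughput is bounded above by bandwidth divided by weight size. A rough back-of-envelope sketch in Python, where the model size and bandwidth figures are illustrative assumptions rather than measurements:

```python
# Back-of-envelope: dense-model decode speed is capped by
# memory_bandwidth / bytes_of_weights_read_per_token,
# since every generated token touches (nearly) all weights once.
# All figures below are illustrative assumptions, not benchmarks.

params = 70e9                   # assumed 70B-parameter dense model
bytes_per_param = 0.5           # assumed ~4-bit quantization
weight_bytes = params * bytes_per_param   # ~35 GB streamed per token

systems = {
    "GPU HBM (~3.3 TB/s)": 3.3e12,
    "12-ch DDR5 Epyc (~460 GB/s)": 460e9,
    "dual-ch desktop DDR5 (~80 GB/s)": 80e9,
}
for name, bandwidth in systems.items():
    print(f"{name}: <= {bandwidth / weight_bytes:.1f} tokens/s")
```

Under these assumptions the Epyc box tops out around 13 tokens/s and a desktop around 2, versus roughly 90 on GPU HBM: the model fits, but bandwidth sets the ceiling.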

elcomet | 1 year ago

You can offload tensors to CPU memory. It will make the model run much slower, but it will work.
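
For a concrete route to that middle ground: Hugging Face Transformers (via Accelerate) can split a model between GPU VRAM and CPU RAM automatically. A minimal sketch, where the model id and memory caps are placeholders:

```python
# Minimal sketch of GPU/CPU offloading with Hugging Face Transformers.
# The model id and memory caps are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-70b-model"  # hypothetical model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # let Accelerate place layers
    max_memory={0: "20GiB", "cpu": "96GiB"},  # cap GPU 0, spill rest to RAM
    torch_dtype="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Roughly speaking, layers that don't fit under the GPU cap are kept in RAM and moved to the GPU layer by layer during each forward pass, which is exactly the slow-but-working behavior described above.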