zyx321 | 6 months ago

There have been some theories floating around that the 128 GB version could be the best value for on-premises LLM inference. The RAM is split between CPU and GPU at a user-configurable ratio.

So this might be the holy grail of a "good enough GPU" plus "over 100 GB of VRAM", if the rest of the system can keep up.
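
Rough sizing math, as a sketch. The bytes-per-weight figures below are approximate GGUF quantization sizes, and the 4 GB overhead is just a placeholder; real overhead depends on context length and runtime:

    # Back-of-the-envelope: does a quantized model fit in a given VRAM budget?
    BYTES_PER_PARAM = {"F16": 2.0, "Q8_0": 1.07, "Q4_K_M": 0.57}  # approx GGUF sizes

    def fits(params_b: float, quant: str, vram_gb: float, overhead_gb: float = 4.0) -> bool:
        """True if a params_b-billion-parameter model at `quant` fits in vram_gb."""
        weights_gb = params_b * BYTES_PER_PARAM[quant]
        return weights_gb + overhead_gb <= vram_gb

    # e.g. a 70B model at Q8_0 in a ~120 GB GPU allocation:
    print(fits(70, "Q8_0", 120))  # True: ~75 GB of weights plus overhead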

yencabulator | 6 months ago

> The RAM is split between CPU and GPU at a user-configurable ratio.

I believe the fixed split thing is a historical remnant. These days, the OS can allocate memory for the GPU to use on the fly.
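On Linux with the amdgpu driver you can see both pools from sysfs: the fixed carve-out ("VRAM") and the dynamically allocatable system memory pool ("GTT"). A minimal sketch; the card0 path is an assumption, and the sizes vary by platform and BIOS setting:

    # Read the amdgpu memory pools from sysfs.
    from pathlib import Path

    DEV = Path("/sys/class/drm/card0/device")  # assumes the APU is card0

    def read_gb(name: str) -> float:
        return int((DEV / name).read_text()) / 1e9

    for pool in ("vram", "gtt"):
        total = read_gb(f"mem_info_{pool}_total")
        used = read_gb(f"mem_info_{pool}_used")
        print(f"{pool.upper()}: {used:.1f} / {total:.1f} GB")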

zyx321 | 6 months ago

It's not a fixed split. I don't know whether it can be changed live or whether it requires a reboot, but it's not hardwired.

I want to know if it's possible: 4 GB for Linux, a bit of room for the intermediate calculations, and then you can load a 122 GB model entirely into VRAM.

How would that perform in real life? Someone please benchmark it!
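
A first-order estimate while we wait for benchmarks, as a sketch: decode on large models is roughly memory-bandwidth bound, since every generated token reads all the weights once. The ~256 GB/s figure is an assumption about LPDDR5X bandwidth on this class of part; real throughput will be lower:

    # tokens/s upper bound ~= memory bandwidth / model size
    bandwidth_gbps = 256.0   # assumed LPDDR5X bandwidth, GB/s
    model_gb = 122.0         # the 122 GB model from the comment above

    tokens_per_s = bandwidth_gbps / model_gb
    print(f"~{tokens_per_s:.1f} tokens/s upper bound")  # ~2.1 tokens/s

So even if it fits, a dense 122 GB model would decode at only a couple of tokens per second; the "good enough GPU" question is really a memory bandwidth question.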