maartenh | 3 months ago
I bought a 12GB Nvidia card a year ago. In general I'm having a hard time finding the actual hardware requirements for any self-hosted AI model. Any tips/suggestions/recommended resources for that?
nsingh2 | 3 months ago
You'll also need to load inputs (images in this case) onto the GPU memory, and that depends on the image resolution and batch size.
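A back-of-envelope sketch of that input cost (the function name and defaults are my own; this assumes uncompressed fp16 tensors, 3 channels):

```python
def image_batch_bytes(batch, height, width, channels=3, bytes_per_elem=2):
    """Rough GPU memory for one batch of input images as fp16 tensors."""
    return batch * channels * height * width * bytes_per_elem

# 8 images at 1024x1024 in fp16:
mib = image_batch_bytes(8, 1024, 1024) / 2**20  # 48 MiB
```

So the raw inputs are usually small next to the weights, but activations scale with resolution and batch size too.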
selcuka | 3 months ago
The Q4_K_S quantized version of Microsoft Fara 7B is a 5.8GB download. I'm pretty sure it would work on a 12GB Nvidia card. Even the Q8 one (9.5GB) could work.
daemonologist | 3 months ago
You're not finding hardware specs because there are a lot of variables at play - the degree to which the weights are quantized, how much space you want to set aside for the KV cache, extra memory needed for multimodal features, etc.
My rule of thumb is 1 byte per parameter to be comfortable (running a quantization with somewhere between 4.5 and 6 bits per parameter and leaving some room for the cache and extras), so 7 GB for 7 billion parameters. If you need a really large context you'll need more; if you want to push it you can get away with a little less.
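The rule above works out like this (a minimal sketch; function names are mine):

```python
def weights_gb(params_billion, bits_per_weight):
    """Size of the quantized weights alone, in GB."""
    return params_billion * bits_per_weight / 8

def budget_gb(params_billion):
    """The 1-byte-per-parameter rule of thumb: weights plus headroom."""
    return params_billion * 1.0

# A 7B model at ~5 bits/weight: weights ~4.4 GB against a 7 GB budget,
# leaving ~2.6 GB for the KV cache and multimodal extras.
headroom = budget_gb(7) - weights_gb(7, 5)
```

The gap between the quantized weights and the 1-byte budget is what absorbs the cache and extras mentioned above.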
samus | 3 months ago
- If you have enough system RAM then your VRAM size almost doesn't matter as long as you're patient.
- For most models, running them at 16-bit precision is a waste unless you're fine-tuning. The difference from Q8 is negligible, and Q6 is still very faithful. In return, they need less memory and run faster.
- Users obviously need to share computing resources with each other. If this is a concern, then you need at minimum enough GPU memory that the whole model fits in VRAM; otherwise all the loading and unloading will royally screw up performance.
- Maximum context length is crucial to think about, since the cache for it has to be stored in memory as well, preferably in VRAM. The number of concurrent users therefore plays a role in which maximum context size you can offer. It is also possible to offload the cache to system RAM or to quantize it.
Rule of thumb: budget 1.5 × s, where s is the model size at the quantization level you're using. By that rule an 8B model is a good fit for a 12GB card, which is one of the main reasons this is such a common size class for LLMs.
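The 1.5 × s rule as a quick check (a sketch; the function name, the 6.6 GB figure for an 8B Q6 file, and the 1.5 factor default are assumptions from the comment above):

```python
def fits_in_vram(model_file_gb, vram_gb, factor=1.5):
    """1.5x the on-disk model size: weights plus KV cache and overhead."""
    return factor * model_file_gb <= vram_gb

# An 8B model at Q6 is roughly 6.6 GB on disk:
# 1.5 * 6.6 = 9.9 GB, which fits in 12 GB.
ok = fits_in_vram(6.6, 12)
# A 9.5 GB file would need ~14.25 GB by this rule, so no comfortable fit.
tight = fits_in_vram(9.5, 12)
```

Note this is a conservative budget; smaller contexts or a quantized KV cache can squeeze under it.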
rahimnathwani | 3 months ago
https://huggingface.co/microsoft/Fara-7B/tree/main
If you want to find models that fit on your GPU, the easiest way is probably to browse ollama.com/library
For a general purpose model, try this one, which should fit on your card:
https://ollama.com/library/gemma3:12b
If that doesn't work, the 4b version will definitely work.
jillesvangurp | 3 months ago
I wish I had more time to play with this stuff. It's so hard to keep up with all this.
baq | 3 months ago