mechagodzilla|2 months ago
I've been running the 'frontier' open-weight LLMs (mainly DeepSeek R1/V3) at home, and I find they're best for asynchronous interactions: give it a prompt and come back in 30-45 minutes to read the response. I run them on a dual-socket 36-core Xeon with 768GB of RAM, which typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.
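A quick back-of-the-envelope in Python of what those speeds mean for turnaround time; the 3,000-token response length is an assumed figure for a long R1-style reply, not something stated above:

    # Back-of-the-envelope: how long one reply takes at local CPU speeds.
    # response_tokens is an assumption; 1-2 tok/s is from the comment above.
    response_tokens = 3000          # assumed length of a long reasoning reply
    for tok_per_sec in (1.0, 2.0):  # measured range on the dual-socket Xeon
        minutes = response_tokens / tok_per_sec / 60
        print(f"{tok_per_sec} tok/s -> {minutes:.0f} min")
    # ~25-50 min per reply, consistent with the 30-45 minute wait above.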
Workaccount2|2 months ago
A less paranoid and much more economically efficient approach would be to just lease a server and run the models on that.
dimava|2 months ago
And you can only generate like $20 of tokens a month. Cloud tokens made on TPUs will always be cheaper and way faster than anything you can make at home.
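A rough sanity check on that figure, sketched in Python; the per-million-token prices are assumptions for illustration, not quotes from any provider:

    # Value of a month of nonstop local generation at 1-2 tok/s, priced
    # at assumed cloud API rates (price_per_m = $ per 1M output tokens).
    seconds_per_month = 60 * 60 * 24 * 30
    for tok_per_sec in (1.0, 2.0):
        tokens = tok_per_sec * seconds_per_month   # ~2.6M-5.2M tokens/month
        for price_per_m in (2.0, 8.0):             # assumed $/1M tokens
            print(f"{tok_per_sec} tok/s at ${price_per_m}/M "
                  f"-> ${tokens / 1e6 * price_per_m:.0f}/month")
    # Roughly $5-$40/month of equivalent API output, even running 24/7,
    # which is about the "$20 of tokens a month" claimed above.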
oceanplexian|2 months ago
None of them will keep your data truly private and offline.