> I decided to explore self-hosting some of my non-critical applications
Self-hosting static or almost-static websites is now really easy with a Cloudflare front. I just closed my account on SmugMug and published my images locally from my NAS; this costs no extra money since the photos were already on the NAS, and the NAS is already powered on 24/7.
The NAS I use is an Asustor, so it's not really Linux and you can't install whatever you want on it, but it has Apache, Python, and PHP with the SQLite extension, which is more than enough for basic websites.
Cloudflare free is like magic. Response times are near instantaneous and setup is minimal. You don't even have to configure an SSL certificate locally, it's all handled for you and works for wildcard subdomains.
And of course if one puts a real server behind it, like in the post, anything's possible.
You could also use OpenVPN or WireGuard and avoid having a man in the middle for no reason.
I have a VPN on a Raspberry Pi, and with that I can connect to my self-hosted cloud, dev/staging servers for projects, GitLab, etc. when I'm not on my home network.
For the people who self-host LLMs at home: what use cases do you have?
Personally, I have some notes and bookmarks that I'd like to scrape, then have an LLM summarize, generate hierarchical tags, and store in a database. For the notes part at least, I wouldn't want to give them to another provider; even for the bookmarks, I wouldn't be comfortable passing my reading profile to anyone.
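As a sketch of what that pipeline could look like (everything here is hypothetical: the `summarize` stub stands in for whatever local model you'd actually call, and the table layout is illustrative, not from any particular tool):

```python
import sqlite3

# Hypothetical sketch: summarize() stands in for a local LLM call,
# and the table layout is made up for illustration.
def summarize(text: str) -> tuple[str, list[str]]:
    """Stand-in for the LLM step: return a summary and hierarchical tags."""
    summary = text[:80]            # placeholder "summary"
    tags = ["inbox/unsorted"]      # placeholder hierarchical tag
    return summary, tags

def store(db: sqlite3.Connection, url: str, text: str) -> None:
    summary, tags = summarize(text)
    db.execute("INSERT INTO bookmarks(url, summary, tags) VALUES (?, ?, ?)",
               (url, summary, ",".join(tags)))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bookmarks(url TEXT, summary TEXT, tags TEXT)")
store(db, "https://example.com", "A long article about self-hosting...")
row = db.execute("SELECT url, tags FROM bookmarks").fetchone()
print(row)  # ('https://example.com', 'inbox/unsorted')
```

The point of keeping it local is exactly that the raw notes never leave the machine; only the `summarize` internals would change when swapping models.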
Llama 3.2 1B and 3B are really useful for quick tasks like creating quick scripts from some text and then pasting them to execute. It's super fast and replaces a lot of temporary automation needs. If you don't feel like investing time into automation, sometimes you can just feed the text into an LLM.
This is one of the reasons why I recently added a floating chat to https://recurse.chat/ to quickly access a local LLM. Here's a demo: https://x.com/recursechat/status/1846309980091330815
For me at least, the biggest feature of some self-hosted LLMs is that you can get them to be "uncensored": you can get them to tell you dirty jokes, or have the bias removed on controversial and politically incorrect subjects. Basically you have a freedom you won't get from most of the main providers.
I run Mistral Large on 2x A6000. Nine times out of ten the response is the same quality as GPT-4o. My employer does not allow the use of GPT for privacy-related reasons, so I just use a private Mistral for that.
I mostly use it to write quick scripts or generate text that follows some pattern. Also, getting it up and running with LM Studio is pretty straightforward.
I've been enjoying fine-tuning various models with various data, for example 17 years of my own tweets, and then just cranking up the temperature and letting the model generate random crap that cracks me up. Is that practical? Is joy practical? I think there's a place for it.
All the self-hosted LLM and text-to-image models come with some restrictions trained into them [1]. However, there are plenty of people who have made uncensored "forks" of these models where the restrictions have been "trained away" (mostly by fine-tuning).
You can find plenty of uncensored LLM models here: https://ollama.com/library
[1]: I personally suspect that many LLMs are still trained on WebText, derivatives of WebText, or using synthetic data generated by LLMs trained on WebText. This might be why they feel so "censored":
>WebText was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017. The corpus was subsequently cleaned
The implications of so much AI trained on content upvoted by 2015-2017 redditors is not talked about enough.
It has a sanitised output. You might want to look for "abliterated" models, where the general performance might drop a bit but the guard-rails have been diminished.
I’m curious about how good the performance with local LLMs is on ‘outdated’ hardware like the author’s 2060. I have a desktop with a 2070 super that it could be fun to turn into an “AI server” if I had the time…
I've been playing with some LLMs like Llama 3 and Gemma on my 2080 Ti. If it fits in GPU memory, the inference speed is quite decent.
However I've found quality of smaller models to be quite lacking. The Llama 3.2 3B for example is much worse than Gemma2 9B, which is the one I found performs best while fitting comfortably.
Actual sentences are fine, but it doesn't follow prompts as well and it doesn't "understand" the context very well.
Quantization brings down memory cost, but there seems to be a sharp decline below 5 bits for those I tried. So a larger but heavily quantized model usually performs worse, at least with the models I've tried so far.
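For a rough sense of what fits, the dominant term is just parameter count times the quantized width; the fixed overhead figure below is an assumption standing in for KV cache and activations, so treat the numbers as ballpark only:

```python
def model_vram_gb(params_billion: float, bits: int, overhead_gb: float = 1.0) -> float:
    """Very rough VRAM estimate: weights at the quantized width, plus a
    fixed allowance (an assumption) for KV cache and activations."""
    weight_gb = params_billion * bits / 8   # 1B params at 8-bit ~= 1 GB
    return weight_gb + overhead_gb

print(model_vram_gb(9, 5))   # Gemma2 9B at 5-bit -> 6.625, tight on an 8 GB card
print(model_vram_gb(3, 4))   # Llama 3.2 3B at 4-bit -> 2.5, fits easily in 6 GB
```

This matches the experience above: a 9B model at 5-bit just barely fits alongside a desktop's own VRAM use, while 3B-class models leave room to spare.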
So with only 6GB of GPU memory I think you either have to accept the hit on inference speed by only partially offloading, or accept fairly low model quality.
Doesn't mean the smaller models can't be useful, but don't expect ChatGPT 4o at home.
That said, if you've got a beefy CPU, it can be reasonable to have it do a few of the layers.
Personally I found Gemma2 9B quantized to 6 bit IIRC to be quite useful. YMMV.
If you want to set up an AI server for your own use, it's exceedingly easy to install LM Studio and hit the "serve an API" button.
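For anyone curious what talking to that API looks like: LM Studio's local server speaks the OpenAI-style chat-completions format. A minimal sketch using only the standard library (the port and the placeholder model name are assumptions; use whatever your server page shows):

```python
import json
import urllib.request

# The default port (1234) and placeholder model name are assumptions --
# adjust to what the LM Studio "serve an API" screen actually shows.
def build_request(prompt: str, model: str = "local-model") -> urllib.request.Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def ask(prompt: str) -> str:
    """Send the request and pull the text out of the OpenAI-style reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

req = build_request("Write a one-line hello world in Python.")
print(req.full_url)  # http://localhost:1234/v1/chat/completions
```

Because the format is OpenAI-compatible, most existing client libraries also work if you point their base URL at the local server.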
Testing performance this way, I got about 0.5-1.5 tokens per second with an 8 GB, 4-bit-quantized model on an old DL360 rack-mount server with 192 GB RAM and 2 E5-2670 CPUs. I got about 20-50 tokens per second on my laptop with a mobile RTX 4080.
I am using an old laptop with a GTX 1060 6 GB VRAM to run a home server with Ubuntu and Ollama. Because of quantization, Ollama can run 7B/8B models on an 8-year-old laptop GPU with 6 GB VRAM.
Last time I tried a local LLM was about a year ago, with a 2070S and a 3950X. Performance was quite slow for anything beyond Phi 3.5, and the small models' quality feels worse than what some providers offer for cheap or free, so it doesn't seem worth it with my current hardware.
Edit: I've loaded Llama 3.1 8B Instruct GGUF and got 12.61 tok/sec, and 80 tok/sec for 3.2 3B.
Why disable LVM for a smoother reboot experience? For encryption I get it since you need a key to mount, but all my setups have LVM or ZFS and I'd say my reboots are smooth enough.
Coolify is quite nice; I've been running some things with the v4 beta.
It reminds a bit of making web sites with a page builder. Easy to install and click around to get something running without thinking too much about it fairly quickly.
Problems are quite similar also, training wheels getting stuck in the woods more easily, hehe.
V4 beta is working well for me. Also, the new core dev Coolify hired mentioned in a tweet this week that they're fixing lots of bugs to get ready for V4 stable.
Can you use a self-hosted LLM that fits in 12 GB VRAM as a reasonable substitute for Copilot in VS Code? And if so, can you give it documentation and other code repositories to make it better at a particular language and platform?
Technically yes, but it will yield poor results. We did it internally at big corp n+1 and, frankly, it blows. Other than menial tasks, it's good for nothing but a scout badge.
How is Coolify different from Ollama? Is it better? Worse? I like Ollama because I can pull models and it exposes a REST API, which is great for development.
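For reference, here's roughly what hitting Ollama's REST API looks like from a script; a minimal stdlib-only sketch assuming the default port, with an example model name standing in for whatever you've actually pulled:

```python
import json
import urllib.request

# Ollama listens on port 11434 by default; /api/generate takes a model
# name you've pulled ("llama3.2:3b" here is just an example) and a prompt.
def generate_request(prompt: str, model: str = "llama3.2:3b") -> urllib.request.Request:
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def generate(prompt: str) -> str:
    """Send the prompt to a running Ollama instance and return its reply."""
    with urllib.request.urlopen(generate_request(prompt)) as resp:
        return json.load(resp)["response"]

req = generate_request("Summarize: self-hosting is fun.")
print(req.full_url)  # http://localhost:11434/api/generate
```

That REST surface is the appeal for development: any language that can POST JSON can drive the local model.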
Snark aside, even in Germany (where electricity is very expensive) it is more economical to self host than to pay for a subscription to any of the commercial providers.
I don’t know, it’s kind of amazing how good the lighter weight self hosted models are now.
Given a 16 GB system with CPU inference only, I'm hosting Gemma2 9B at q8 for LLM tasks and SDXL Turbo for image work, and besides the memory usage creeping up for a second or so while I invoke a prompt, they're basically undetectable in the background.
ghoomketu|1 year ago
Cloudflare is pretty strict about the HTML-to-media ratio and might suspend or terminate your account if you are serving too many images.
I've read far too many horror stories about this on HN, so please make sure what you're doing is allowed by their TOS.
seungwoolee518|1 year ago
However, do I need to install the CUDA toolkit on the host?
I haven't installed the CUDA toolkit when using a containerized platform (like Docker).
thangngoc89|1 year ago
The Nvidia driver + Nvidia Container Toolkit would do the job. You can check the official instructions at [0].
[0] https://docs.nvidia.com/datacenter/cloud-native/container-to...
alias_neo|1 year ago
I use a Tesla P4 for ML stuff at home; it's equivalent to a 1080 Ti and has a score of 7.1. A 2070 (they don't list the "Super") is a 7.5.
For reference, 4060 Ti, 4070 Ti, 4080 and 4090 are 8.9, which is the highest score for a gaming graphics card.
whitefables|1 year ago
It was so easy to get other non-AI stuff running!
vincentclee|1 year ago
https://github.com/Syllo/nvtop