
Why host your own LLM?

262 points | andy99 | 2 years ago | marble.onl | reply

132 comments

[+] waffletower|2 years ago|reply
I chose to self-host for a variety of reasons. One, I had been using gpt-3.5-turbo for a hobby project and had an abysmal accessibility experience -- I received repeated 429s with serial requests far sparser than the limits; at their most frequent, 10 times less often than the recommended threshold. When I adjusted further, the 429s kept returning. It reached a point where even llama.cpp models (on 20 cores) were definitely more performant. I received absolutely no response from customer support despite being an actual paying customer. I imagine OpenAI affiliates here will downvote this, but I found OpenAI's API accessibility to be one of the most terrible I have ever used.
[+] justanotherunit|2 years ago|reply
Where do you host your model? I am looking around for somewhere I can deploy one without ruining myself financially.
[+] visarga|2 years ago|reply
I have had similar issues with the Azure GPT-3.5, it would respond fast sometimes, and just hang for minutes other times, not even a 429. Just blocked randomly.
[+] pmontra|2 years ago|reply
How much are you spending to self host a LLM?
[+] rmason|2 years ago|reply
Just listened to Lex Fridman's three hour interview with George Hotz this weekend. He spoke about his new company, Tiny Corp.

Tiny Corp. will be producing the Tiny Box that lets you host your own LLM at home using TinyGrad software.

The tinybox

738 FP16 TFLOPS

144 GB GPU RAM

5.76 TB/s RAM bandwidth

30 GB/s model load bandwidth (big llama loads in around 4 seconds)

AMD EPYC CPU

1600W (one 120V outlet)

Runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks)

$15,000

Hardware startups are extremely difficult. But Hotz's other company Comma.ai is already profitable so it is possible. I find the guy extremely encouraging and he is always doing interesting stuff.
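Those specs are internally consistent: at the quoted load bandwidth, a 65B FP16 model loads in about the claimed four seconds. A back-of-the-envelope sketch, assuming 2 bytes per FP16 weight:

```python
# Sanity check of the tinybox spec sheet.
params = 65e9            # 65B-parameter LLaMA
bytes_per_param = 2      # FP16 = 2 bytes per weight
model_gb = params * bytes_per_param / 1e9   # total weight size in GB

load_bandwidth_gbs = 30  # quoted model load bandwidth, GB/s
load_seconds = model_gb / load_bandwidth_gbs

print(f"{model_gb:.0f} GB model, loads in ~{load_seconds:.1f} s")
# 130 GB / 30 GB/s gives roughly the "around 4 seconds" claim
```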

[+] LoganDark|2 years ago|reply
Is $15,000 really an "at home" sort of price?

(If money is no object, why not grab an oxide.computer rack? Assuming you have three-phase power, of course...)

[+] doctorpangloss|2 years ago|reply
George Hotz misread the NVIDIA Inception pricing as per unit instead of as a rebate, believing the GPUs are 80% cheaper than they actually are.
[+] lawn|2 years ago|reply
Why buy this instead of building your own setup with X 4090s?
[+] psyclobe|2 years ago|reply
Think of it like an AC unit: expensive but indispensable. I have always envisioned locally trained AI, and now it is happening!
[+] bananapub|2 years ago|reply
why are you valuing George Hotz (or Lex Fridman's) opinions in this at all?
[+] naillo|2 years ago|reply
Imagine how many 4090s you could buy and run in a cluster for $15,000 though
[+] garciasn|2 years ago|reply
We host our own LLM because:

1. We are not permitted, by rule, to send client data to unapproved third parties.

2. I generally do not trust third parties with our data, even if it falls outside of #1. Just look at the hoopla with Zoom; do you really want OpenAI further solidifying their grip on the industry with your data?

3. We have the opportunity to refine the models ourselves to get better results than offered out of the box.

4. It's fun work to do and there's a confluence of a ton of new and existing nerdy technology to learn and use.

[+] zzleeper|2 years ago|reply
Stupid question from an outsider, but is it possible to grab a pretrained model (most likely one of the "camelids"), feed it your own data (I have about 1k documents that cannot leave my network) and use it for fast-but-sometimes-wrong information retrieval?
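What's being described here is usually built as retrieval-augmented generation: index the documents locally, retrieve the most relevant ones per query, and paste them into the prompt. A minimal stdlib-only sketch of the retrieval half, with a toy three-document corpus standing in for the private documents (real setups score with embeddings rather than word overlap):

```python
import math
from collections import Counter

def tokenize(text):
    return [w.strip(".,?!:") for w in text.lower().split()]

def score(query, doc):
    """Naive bag-of-words overlap score, length-normalized."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    overlap = sum((q & d).values())
    return overlap / math.sqrt(len(tokenize(doc)) + 1)

def retrieve(query, docs, k=2):
    """Return the k documents most relevant to the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

# Toy corpus standing in for the 1k private documents.
docs = [
    "Invoice 2023-04: payment terms are net 30 days.",
    "The office wifi password rotates every quarter.",
    "Llama models can be fine-tuned on private data.",
]

context = "\n".join(retrieve("what are the payment terms?", docs, k=1))
prompt = f"Answer using only this context:\n{context}\n\nQ: What are the payment terms?"
print(prompt)
```

The resulting prompt goes to the local model; because the model only paraphrases retrieved text, this is exactly the "fast-but-sometimes-wrong" trade-off the parent describes.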
[+] weinzierl|2 years ago|reply
What is your experience with quality? Even if you don't have the option to use a third-party LLM, the question is whether your self-hosted solution is good enough that your users (employees) will accept it. While you can forbid external solutions, in the end you can't force them to use your own solution - at least not in the long run.

I'm very curious what your experience is. Do you think self-hosting is good enough that users will accept it?

[+] rig666|2 years ago|reply
I host an LLM because it's cheaper for my use case. Too many people focus on how an LLM interfaces with users, but I believe the best, most reliable use for an LLM is analyzing free-form text and putting that data into quantifiable fields or tags. Tasks like this, which would have taken interns or overseas laborers weeks to months, can now finally be automated.
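That extraction workflow typically amounts to prompting the model for JSON and validating what comes back. A hedged sketch; `complete` is a hypothetical stand-in for whatever calls your self-hosted model (llama.cpp, vLLM, etc.), stubbed out here so the plumbing runs:

```python
import json

PROMPT = """Extract the following fields from the text and reply with JSON only:
{{"customer": str, "amount": float, "sentiment": "positive"|"negative"|"neutral"}}

Text: {text}
JSON:"""

def extract_fields(text, complete):
    """Turn free-form text into quantifiable fields via an LLM.

    `complete` is whatever calls your self-hosted model and
    returns the raw completion string.
    """
    raw = complete(PROMPT.format(text=text))
    fields = json.loads(raw)   # hallucinated non-JSON fails loudly here
    assert set(fields) == {"customer", "amount", "sentiment"}
    return fields

# Stub standing in for a real model, so the plumbing can be exercised:
fake_model = lambda prompt: '{"customer": "Acme", "amount": 119.5, "sentiment": "negative"}'
print(extract_fields("Acme Corp is furious about the $119.50 overcharge.", fake_model))
```

The `json.loads` plus schema check is the guard against hallucination: a malformed or off-schema answer raises instead of silently polluting the dataset.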
[+] unshavedyak|2 years ago|reply
I've had a similar thought. I want to feed LLMs (and friends) messy data from my house and let them un-mess it as best they can. A big hurdle in managing home data (chat logs, emails, browser history, etc.) is making use of it. I don't want to have to tag all of my data. LLMs seem really attractive to me for that.

I have the urge to toy with the idea, but I also find "prompt engineering" very unattractive. It feels like something I'd have to re-tailor for any new model I switch to. Not very reusable and difficult to make "just work".

[+] celestialcheese|2 years ago|reply
Exactly this. It's so fast to spin up classifiers now when it used to take weeks to get something working.

What LLMs are you using?

[+] hubraumhugo|2 years ago|reply
Absolutely, just look at the number of manual data entry jobs on Upwork. IMO one of the superpowers of LLMs is not generating text or images, it's understanding and transforming unstructured data.
[+] victorbjorklund|2 years ago|reply
What type of analysis do you do on the text? And how is the performance/cost of running vs more specialized models trained for the task?
[+] wcedmisten|2 years ago|reply
How accurate is an LLM for this task? I was thinking of using one for analyzing free form PDF text to find a specific element, but I was worried about hallucinations.
[+] jackthetab|2 years ago|reply
I assume asking for "quantifiable fields" is akin to requesting "return the data in JSON format", yes?

How do you do the tagging bits, though?

[+] andai|2 years ago|reply
Cheaper than GPT-3? Can you give a comparison of the costs?
[+] neilv|2 years ago|reply
It takes searching and experimenting to figure out what works, and to avoid some of the sketchier stuff (and to lean towards things you could legally use for a startup), but I'm pretty happy with my current home setup, on an old PC with RTX 3090 and 64GB main RAM.

8-bit quantized uncensored Llama 2 13B, doing 50 generated tokens/second, using CPU+GPU including 17GB of the 3090's 24GB VRAM.

I also have a quantized 70B currently running CPU-only, but I might later be able to speed that up with some CUDA or OpenCL offloading.

This is on Debian Stable (like usual), albeit currently with closed Nvidia CUDA stack, and necessarily with the closed Llama 2 that I can only fine-tune atop. (I'm hoping that some scientific/academic non-profit/govt effort will be able to muster fully open models in the future.)

One of the main reasons I picked Llama 2 was the relatively friendly licensing (and Meta is earning lots of goodwill with that). With this licensing, and the performance I'm getting, in theory, I could even shoestring bootstrap an indie startup with low online LLM demands, from a single consumer hardware box in the proverbial startup garage or kitchen table. (Though I'd try to get affordable cloud compute first.)

[+] thrwayaistartup|2 years ago|reply
I am about to start working on a non-profit project -- not a startup, but similar in terms of resources dedicated to the project and how we hope it will scale.

One of our big questions is whether it makes sense to rent or to buy for training/finetuning/RLHF. The advantage of renting is obvious: I don't think that this phase of the project will last very long, and if it turns out that the idea is a success we'll have no problem securing funding for perma-improvement infra.

The possible advantage of buying is that we would then have the hardware available for inference hosting. We do expect some amount of demand in perpetuity. Having that ongoing cost as small as possible would allow us to continue serving the "clients" we KNOW would benefit a lot from our service with minimal recurring revenue.

[+] rig666|2 years ago|reply
Just a suggestion, but there are 4-bit quantized models that are even smaller and faster than the 8-bit ones. Your average 13B 4-bit model is only about 8-9 GB of VRAM. Using this, I bet you can get a much higher-parameter model on the 3090.
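The 8-9 GB figure follows from a simple bits-per-weight calculation. A rough sketch; it ignores the KV cache and runtime overhead, which add a couple of GB on top of the weights:

```python
def weight_gb(n_params_b, bits):
    """Approximate memory for the model weights alone, in GB."""
    return n_params_b * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"13B at {bits}-bit: ~{weight_gb(13, bits):.1f} GB of weights")
# 16-bit: 26 GB, 8-bit: 13 GB, 4-bit: 6.5 GB
# plus a few GB of KV cache/overhead -> the ~8-9 GB quoted above
```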
[+] dealuromanet|2 years ago|reply
Whoa, 50 tokens/second locally sounds amazing. Any recommendations on guides or documentation for setting up the stack to run on hardware like that?
[+] elorant|2 years ago|reply
I self-host an LLM (Vicuna 13B) for two reasons. One is cost, and the second is privacy. I don't want OpenAI or any other provider knowing what I'm working on, because they could replicate it. I'm not saying that they would, considering there could be thousands of business cases for using an LLM, but why risk it? By running it locally I have one less thing to worry about.
[+] jmorgan|2 years ago|reply
One benefit of self-hosting LLMs is the wide range of fine-tuned models available, including uncensored models. A popular one over the last weeks was Llama 2 Uncensored by George Sung: https://ollama.ai/blog/run-llama2-uncensored-locally

A few more:

- Wizard Vicuna 13B uncensored

- Nous Hermes Llama 2

- WizardLM Uncensored llama2

[+] flemhans|2 years ago|reply
How could you _not_?

I'm only waiting for decent LLMs that run locally before I start really using them.

No way I'd feed my code, customer data, personal info, secrets, emails, etc., to some dubious cloud machine which is already excellent at extracting valuable or juicy bits from what I'm feeding it.

[+] jonahbenton|2 years ago|reply
+1

I have a little quip I use to troll people who say, in effect: "But contract law! Your contract says your data is yours even when it is on someone's cloud servers." I say: I know, I know, but remember, "possession is 9/10ths of the law."

I do believe that in 99.(many 9s)% of cases no admin with visibility cares about any given customer's stuff, but if they do, and if it matters, by then it's too late.

[+] gdsdfe|2 years ago|reply
I've been thinking about hosting my own LLM to see if I can hyper-customize it to me, basically, kinda like an AI companion. My main issue is building the hardware; there's so much fluff in that space, it's hard to know what to get and what works well together.
[+] ajcp|2 years ago|reply
- Intel Core i9-11900KF 3.5 GHz 8-Core Processor

- Corsair H150i PRO 47.3 CFM Liquid CPU Cooler

- MSI MPG Z590 GAMING EDGE WIFI ATX LGA1200 Motherboard

- G.Skill Ripjaws V 64 GB (4 x 16 GB) DDR4-3600 CL18 Memory

- Samsung 970 Evo Plus 1 TB M.2-2280 PCIe 3.0 X4 NVME Solid State Drive

- MSI GeForce RTX 3090 TI SUPRIM X 24G GeForce RTX 3090 Ti 24 GB Video Card

- Corsair Carbide Series 275R ATX Mid Tower Case

- Corsair RM1000x (2021) 1000 W 80+ Gold Certified Fully Modular ATX Power Supply

- Microsoft Windows 11 Pro

On this setup I've been able to run every model 13B and below with 0 issue. Even been able to fine-tune Llama 2 13B using my own data (emails, SMS, FB messages, WhatsApp, etc.) with pretty fun results!

[+] thatcherthorn|2 years ago|reply
I haven't tried self-hosting due to hesitation around the general drudgery I've experienced in the past trying to host other ML models.

Find a repo. Follow the install instructions. What is this weird error? A library issue..? Maybe it's my OS..?

It always seems to be tedious compared to open projects in other domains. Maybe that can't be solved.

[+] smcleod|2 years ago|reply
I replied to a comment in another post yesterday on this - https://news.ycombinator.com/item?id=37120346

Honestly the easiest way that “just works” is to use LM Studio which you can run locally https://lmstudio.ai/

Obviously you’ll have faster results if you have a fancy gaming GPU or something like the M2 Max/Ultra but you don’t need those to have a play and see if it interests you.

[+] vorpalhex|2 years ago|reply
Llama models have been pretty easy to host. StableDiffusion was a real nightmare when it came out (and still is at times).

Docker has an initial threshold you have to get over, but once you do, everything becomes very easy. How you end up using Docker matters very little once you get the concepts.

[+] bheadmaster|2 years ago|reply
In my experience, people specialized in machine learning are usually researchers and mathematicians, not engineers. Writing a package that will work on any random person's hardware and system is a non-trivial engineering task.
[+] _pdp_|2 years ago|reply
You should run your own LLM if you can. Just keep in mind that many hobby users simply cannot do that, and they represent the majority of LLM users - not the majority of power users. These people will struggle to use LLMs without some technical support; not because they cannot learn (of course they can), but mostly because it is not their priority. LLMs as a technology need to be made more widely accessible: by making them open source, so folks can run their own instances if they decide to, but also by hosting them as a service for those who simply do not have the skills or desire to run them themselves.
[+] ilaksh|2 years ago|reply
A few things holding me back for now:

- I use LLMs for code generation for a startup and they are not competitive for that yet.

- Most of the popular open models are non-commercial.

- The only practical way I know of to get large custom datasets for training is to have OpenAI's models generate them, and they forbid this in their terms of service.

Having something that's truly open and closer to GPT-4 for code generation will probably happen within less than a year (I hope) and will be a game changer for self-hosting.

[+] Bostonian|2 years ago|reply
ChatGPT is powerful, but it gives you different answers to the same question from one session to the next. And research found that overall performance can vary over time, sometimes for the worse. So you may host your own LLM for reproducibility.

I have not tried public LLMs myself. Do they give reproducible results?
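Self-hosting does make reproducibility possible, because you control decoding: temperature-0 (greedy) decoding is deterministic, while temperature sampling is reproducible only if you pin the seed yourself. A toy illustration with made-up next-token logits:

```python
import math
import random

logits = {"yes": 2.0, "maybe": 1.5, "no": 0.5}  # toy next-token scores

def greedy(logits):
    """Temperature-0 decoding: always pick the highest-scoring token."""
    return max(logits, key=logits.get)

def sample(logits, temperature, rng):
    """Temperature sampling: draw proportionally to exp(logit / T)."""
    weights = [math.exp(l / temperature) for l in logits.values()]
    return rng.choices(list(logits), weights=weights)[0]

# Greedy is reproducible across runs and sessions:
assert greedy(logits) == greedy(logits) == "yes"
# Sampling is reproducible only with a pinned seed:
print(sample(logits, 0.8, random.Random(42)) == sample(logits, 0.8, random.Random(42)))
```

Hosted APIs typically sample with an unpinned seed (and may swap model versions underneath you), which is why the same question yields different answers session to session.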

[+] blackcat201|2 years ago|reply
I host my own LLM not because I need it now, but to have the luxury of a fallback if OpenAI runs out of money.
[+] sourcecodeplz|2 years ago|reply
I run Llama 7b with CPU only. It is fun when my Internet goes down and I have nothing else to do.
[+] kordlessagain|2 years ago|reply
They are a bit of an "offline" network, in a way.
[+] d_sem|2 years ago|reply
Whether or not it's "better" to host your own, the open-source community's work trimming state-of-the-art LLMs down to the parts that matter and improving efficiency will be good for everyone.
[+] talham|2 years ago|reply
Thanks for writing the article. Any recommended links on HOW to host your own LLM?