m4r71n|9 months ago
What is everyone using their local LLMs for primarily? Unless you have a beefy machine, you'll never approach the level of quality of proprietary models like Gemini or Claude, but I'm guessing these smaller models still have their use cases, just not sure what those are.
Rotundo|9 months ago
DennisP|9 months ago
https://www.adweek.com/media/a-federal-judge-ordered-openai-...
diggan|9 months ago
Of course, it still isn't at the same level as Codex itself; the model Codex uses is just way better, so naturally it gets better results. But Devstral (as I currently use it) is able to make smaller changes and refactors, and I think once I evolve the software a bit more, it can start making larger changes too.
brandall10|9 months ago
And why not just use OpenHands, which it was designed around, and which I presume can also do all those things?
barnabee|9 months ago
ativzzz|9 months ago
Can it be solved locally with locally running MCP servers? Or maybe it's a system API, like reading your calendar or checking your email. Otherwise it identifies the best cloud model and sends the prompt there.
Basically, Siri if it were good.
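Here's a bare-bones sketch of that routing idea; the `local_model`/`cloud_model` clients and the keyword triage are hypothetical placeholders, not anything from this thread:

```python
# Toy "local-first, escalate to cloud" router. The model clients and the
# triage heuristic are placeholders for illustration only.

LOCAL_TASKS = {"calendar", "email", "timer", "note"}  # things handled on-device

def triage(prompt: str) -> str:
    """Crude keyword triage; in practice this decision is the hard part."""
    return "local" if any(word in prompt.lower() for word in LOCAL_TASKS) else "cloud"

def answer(prompt: str, local_model, cloud_model) -> str:
    if triage(prompt) == "local":
        return local_model.generate(prompt)  # small on-device model or system API
    return cloud_model.generate(prompt)      # frontier model over the network
```

The catch, as the reply below points out, is that `triage()` is exactly the piece that needs the most intelligence.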
eddythompson80|9 months ago
That idea makes so much sense on paper, but it's not until you start implementing it that you realize why no one does it (including Siri). "Some tasks are complex and better suited to a giant model, but small models are perfectly capable of handling simple, limited tasks" makes a ton of sense, but the component best equipped to evaluate that decision is the smarter component of your system. At which point, you might as well have had it run the task.
It's like assigning the intern to triage your work items.
When actually implementing the application with that approach, every time you encounter an "AI miss" you (understandably) blame the small model, and eventually you give up and delegate yet another scenario to the cloud model.
Eventually you feel you're artificially handcuffing yourself compared to literally everybody else by trying to ship something on a 1B model. You have the worst of all options: a crappy model with lots of hiccups that is still (by far) the most resource-intensive part of your application, making the whole thing super heavy, while you delegate more and more to the cloud model.
The local LLM scenario is going to be driven entirely by privacy concerns (for which there is no alternative; it's not like an E2EE LLM API could exist) or by cost concerns, if you believe you can run it cheaper.
drillsteps5|9 months ago
Running it locally helps me understand how these things work under the hood, which raises my value on the job market. I also play with various ideas that have an LLM on the backend (think LLM-powered web search, agents, things of that nature). I don't have to pay cloud providers, and I already had a gaming rig when LLaMA was released.
moffkalast|9 months ago
The average person in r/locallama has a machine that would make r/pcmasterrace users blush.
rollcat|9 months ago
ijk|9 months ago
- Experiments with inference-level control; you can't do the Outlines / Instructor stuff with most API services, can't use the advanced sampling strategies, etc. (They're catching up, but they're about 12 months behind what you can do locally.) A sketch of what this looks like follows this list.
- Small, fast, finetuned models; _if you know your domain well enough to train a model, you can outperform everything else_. General models usually win, if only due to ease of prompt engineering, but not always.
- Control over which model is being run. Some drift is inevitable as your real-world data changes, but when your model is also changing underneath you it can be harder to build something sustainable.
- More control over costs; this is the classic on-prem versus cloud decision. In most cases you just want to pay for the cloud... but we're not in ZIRP anymore, and a predictable power bill can trump sudden, unpredictable API bills.
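To make the first bullet concrete, here's a minimal sketch of the kind of constrained decoding Outlines enables against a locally loaded model (the model name is just an example, and the Outlines API has shifted between releases, so treat the exact calls as illustrative rather than authoritative):

```python
# Constrained decoding with Outlines over a local Hugging Face model.
# Model choice and exact API are illustrative; check the Outlines version you
# have installed, since the interface has changed across releases.
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")

# Force the output to be one of a fixed set of labels -- a guarantee most
# hosted APIs still can't make at the sampling level.
classify = outlines.generate.choice(model, ["positive", "negative", "neutral"])
label = classify("Review: 'The battery lasts two days.' Sentiment:")

# Or constrain generation to match a regex, e.g. an ISO date.
extract_date = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")
print(label, extract_date("The Bastille was stormed on what date? Answer:"))
```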
In general, the move to cloud services was originally a cynical OpenAI move to keep GPT-3 locked away. They've since built up a bunch of reasons to prefer the in-cloud models (heavily subsidized fast inference, the biggest and most cutting-edge models, etc.), so if you need the latest and greatest right now and are willing to pay, it's probably the right move for most businesses.
This is likely to change as we get models that can reasonably run on edge devices; right now it's hard to build an app or a video game that incidentally uses LLM tech, because user revenue is unlikely to exceed inference costs without a lot of careful planning or a subscription. Not impossible, but it definitely adds business challenges. Small models running on end-user devices open up an entirely new level of applications in terms of cost-effectiveness.
If you need the right answer, sometimes only the biggest cloud API model is acceptable. If you've got some wiggle room on accuracy and can live with occasionally getting a substandard response, then you've got a lot more options. The trick is that the things an LLM is best at are always going to be things where less than five nines of reliability is acceptable, so even though the biggest models are more reliable on average, there are many tasks where you might be just fine with a small, fast model that you have more control over.
teleforce|9 months ago
It's an AI-driven chat system designed to support students in the Introduction to Computing course (ECE 120) at UIUC, offering assistance with course content, homework, or troubleshooting common problems.
It serves as an educational aid integrated into the course's learning environment via the UIUC Illinois Chat system [2].
Personally I've found it really useful that it surfaces the relevant portions of the course study materials (for example, the slides directly related to the discussion) so students can check the sources and the veracity of the answers provided by the LLM.
It seems to me that RAG is the killer feature for local LLMs [3]. It directly addresses the main pain point of LLM hallucinations and helps LLMs stick to the facts (a rough sketch of the idea follows the references below).
[1] Introduction to Computing course (ECE 120) Chatbot:
https://www.uiuc.chat/ece120/chat
[2] UIUC Illinois Chat:
https://uiuc.chat/
[3] Retrieval-augmented generation [RAG]:
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
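For what it's worth, here is a bare-bones sketch of the retrieve-then-generate loop RAG refers to; the slide snippets, the toy `embed()`, and the `llm` callable are placeholders, not anything from the UIUC system:

```python
# Minimal retrieve-then-generate (RAG) loop. Swap the placeholder embedding and
# the `llm` callable for a real embedding model and a local LLM in practice.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: normalized character-frequency vector."""
    v = np.zeros(128)
    for ch in text.lower():
        v[ord(ch) % 128] += 1
    return v / (np.linalg.norm(v) + 1e-9)

docs = [  # stand-ins for course slides / study materials
    "Slide 12: A latch stores one bit and is level-sensitive.",
    "Slide 31: A flip-flop samples its input on a clock edge.",
]
doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(question: str, k: int = 1) -> list[str]:
    scores = doc_vecs @ embed(question)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question: str, llm) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"
    return llm(prompt)  # llm = any local completion function (llama.cpp, Ollama, etc.)
```

Because the retrieved snippets are shown alongside the answer, students can check the model's claims against the actual slides, which is the hallucination-control point above.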
staticcaucasian|9 months ago
ozim|9 months ago
Mostly I use it for testing tools and integrations via the API so as not to spend money on subscriptions. When I see something working, I switch it to a proprietary one to get the best results.
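This is easy to do because most local servers (llama.cpp's server, Ollama, LM Studio) expose an OpenAI-compatible endpoint, so switching is usually just a base-URL change. A rough sketch, assuming such a server is already running on localhost (port and model names are examples):

```python
# Point the same client at a local OpenAI-compatible server while developing,
# then flip to the hosted API once the integration works.
import os
from openai import OpenAI

if os.getenv("USE_LOCAL", "1") == "1":
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
    model = "local-model"   # whatever the local server has loaded
else:
    client = OpenAI()       # reads OPENAI_API_KEY from the environment
    model = "gpt-4o-mini"   # example hosted model

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(resp.choices[0].message.content)
```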
nomel|9 months ago
qingcharles|9 months ago
The stuff you can run on reasonable home hardware (e.g. a single GPU) isn't going to blow your mind. You can get pretty close to GPT-3.5, but it'll feel dated and clunky compared to what you're used to.
Unless you have already spent big $$ on a GPU for gaming, I really don't think buying GPUs for home makes sense, considering the hardware and running costs, when you can go to a site like vast.ai and borrow one for an insanely cheap amount to try it out. You'll probably get bored and be glad you didn't spend your kids' college fund on a rack of H100s.
kbelder|9 months ago
mixmastamyk|9 months ago
> (MoE) divides an AI model into separate sub-networks (or "experts"), each specializing in a subset of the input data, to jointly perform a task.
ijk|9 months ago
What you typically end up with in memory-constrained environments is that the core shared layers sit in fast memory (VRAM, ideally) and the rest sit in slower memory (system RAM or even a fast SSD).
MoE models are typically very shallow-but-wide in comparison with dense models, so they end up being faster than an equivalent dense model, because they're ultimately running through fewer layers for each token.
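As a rough illustration of why only a slice of the weights is touched per token, here's a toy top-k MoE layer in NumPy; the dimensions, router, and experts are made up for the example:

```python
# Toy mixture-of-experts layer: a router picks top-k experts per token, so only
# a small fraction of the expert weights is needed for any given token.
import numpy as np

d_model, n_experts, top_k = 64, 8, 2
rng = np.random.default_rng(0)
router_w = rng.normal(size=(d_model, n_experts))
experts = rng.normal(size=(n_experts, d_model, d_model))  # one weight matrix per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) activations for a single token."""
    logits = x @ router_w
    chosen = np.argsort(logits)[-top_k:]                   # top-k experts for this token
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    # Only the chosen experts' weights need to sit in fast memory for this token;
    # the rest can stay in system RAM or on disk until something routes to them.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

print(moe_layer(rng.normal(size=d_model)).shape)  # -> (64,)
```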
cratermoon|9 months ago
notfromhere|9 months ago