
The Coming of Local LLMs

235 points | yarapavan | 2 years ago | nickarner.com

204 comments

[+] yacine_|2 years ago|reply
I was able to run LLaMA on my personal machine to do some labeling on my documents, as a test of its capabilities. It was instruct-tuned, 30B parameters.

4 example labels, and I had a binary classifier in seconds. Sure, semantic text classifiers were possible for a while, but making it accessible changes everything. Giving anyone who can use a spreadsheet the power of a local LLM (or, basically free LLMs) can make them much, much more productive. A lot of office work is clicking through sheets and doing manual labeling.

It's truly wild what is becoming accessible! Really excited to see the next gen software that the open community comes up with :)
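The labeling setup described here can be sketched as a few-shot prompt. Everything below is illustrative: the example labels are made up, and the commented-out model call shows one possible local runner (llama-cpp-python), not what this commenter actually used.

```python
# Sketch: four example labels turned into a few-shot prompt for a
# local instruct-tuned model. The model invocation is left as a
# comment; any local runner would work.

EXAMPLES = [
    ("Invoice #4411 overdue, please remit payment", "finance"),
    ("Team offsite agenda for next week", "not-finance"),
    ("Q3 budget reconciliation spreadsheet attached", "finance"),
    ("Reminder: submit your conference talk proposal", "not-finance"),
]

def build_prompt(document: str) -> str:
    """Combine the labeled examples and a new document into one prompt."""
    lines = ["Classify each document as finance or not-finance.", ""]
    for text, label in EXAMPLES:
        lines.append(f"Document: {text}\nLabel: {label}\n")
    lines.append(f"Document: {document}\nLabel:")
    return "\n".join(lines)

prompt = build_prompt("Expense report for March travel")
# Hypothetical usage with llama-cpp-python:
#   from llama_cpp import Llama
#   llm = Llama(model_path="llama-30b-instruct.gguf")
#   label = llm(prompt, max_tokens=3)["choices"][0]["text"].strip()
print(prompt)
```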

[+] rcme|2 years ago|reply
LLMs as general purpose classifiers is a really big deal, especially because you can give them fuzzy instructions. I know people are worried about LLMs and spam, but I think LLMs may provide an opportunity to elevate online discourse by being more efficient at filtering out spam and low quality commentary.
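The "fuzzy instructions" workflow might look like this; `ask_model` is a placeholder for whatever local model you run, and the stub below exists only to make the sketch self-contained.

```python
# State the moderation policy in plain language and parse a yes/no
# answer, instead of training a dedicated filter.

INSTRUCTION = (
    "You are a forum moderator. Answer 'yes' if the comment below is "
    "spam or very low quality, otherwise answer 'no'.\n\n"
    "Comment: {text}\nAnswer:"
)

def is_spam(text: str, ask_model) -> bool:
    """Classify a comment; ask_model(prompt) returns a raw completion."""
    raw = ask_model(INSTRUCTION.format(text=text)).strip().lower()
    return raw.startswith("yes")

# Stub standing in for a real LLM: flags comments with many links.
def stub_model(prompt: str) -> str:
    return "yes" if prompt.count("http") > 2 else "no"

print(is_spam("Buy now http://a http://b http://c", stub_model))  # True
```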
[+] tyingq|2 years ago|reply
Agree, though plain old Bayesian classifiers have been able to handle some significant portion of that office work for a long time. And not much ever came of it for everyday stuff outside of spam filters.

Maybe both the buzz factor and broader applicability means it's more likely to happen this go around?
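For scale, the "plain old Bayesian classifier" fits in a screenful of stdlib Python. A minimal naive Bayes sketch with made-up training data (Laplace smoothing, log probabilities):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word tallies
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihood with add-one smoothing
            score = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.train("meeting notes attached", "work")
nb.train("quarterly report draft", "work")
nb.train("win free money now", "spam")
nb.train("free prize claim now", "spam")
print(nb.classify("claim your free money"))  # spam
```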

[+] bob1029|2 years ago|reply
> Sure, semantic text classifiers were possible for a while, but making it accessible changes everything.

Binary classification can actually take you all the way in terms of classification if you are clever with set theory. It's also one of the most traceable & deterministic ways to understand how the natural language is being interpreted at each step.

The amount of performance required to run something like an SVM is laughable compared to what is required to run even baby-tier LLMs. If you can reduce the cost of running models to a <1ms invocation over a few megabytes of black box, you can easily test thousands of these per-user-query. Re-training and iterating is much more enjoyable for these reasons. You also don't need any GPUs for this.

At the end of the day, the quality of your data will be the biggest issue with older techniques. LLMs can bandaid all sorts of weird things that crop up in the real world and aren't present in the training data. SVMs cannot tolerate requests delivered in the format of Shakespeare (if unexpected). In a well-controlled domain, you would probably be able to get away with much cheaper options that are also more flexible.
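A toy illustration of the set-theory point: compose cheap binary classifiers into a multi-way decision by intersecting the outcomes each one produces. The keyword predicates here merely stand in for trained SVMs, and every name is invented.

```python
# Two binary classifiers (stand-ins for SVMs) whose combined outcomes
# partition inputs into four traceable routes.

def mentions_money(text):
    return any(w in text.lower() for w in ("invoice", "$", "refund"))

def is_question(text):
    return text.rstrip().endswith("?")

ROUTES = {
    # route -> required binary outcomes (an intersection of predicates)
    "billing-question":  {"money": True,  "question": True},
    "billing-statement": {"money": True,  "question": False},
    "general-question":  {"money": False, "question": True},
    "other":             {"money": False, "question": False},
}

def route(text):
    outcome = {"money": mentions_money(text), "question": is_question(text)}
    for name, required in ROUTES.items():
        if outcome == required:
            return name

print(route("Can I get a refund for this invoice?"))  # billing-question
```

Each decision is fully inspectable: you can see exactly which predicate fired, which is the traceability advantage the comment describes.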

[+] czbond|2 years ago|reply
Can you explain what you set up for your test? I'm following this "space", but the exact, simple pipelines are eluding me.
[+] ticviking|2 years ago|reply
The big thing for me and many others is the ability to use the tool without sending NDA data to a 3rd party.

The potential amplifying power of that is enormous.

[+] alden5|2 years ago|reply
What makes it so much better than normal text classification, for me, is that it doesn't require tons of training data to accurately classify text. Using it to parse Craigslist posts I might find interesting showed very promising results, although it's fairly slow on my base M1 machine.
[+] syntaxing|2 years ago|reply
Super curious how you did this! Doesn't a 30B model require a hefty computer to run locally (assuming you're running a non-quantized version)?
[+] hospitalJail|2 years ago|reply
What model are you using? What program are you using?

Curious how you run the model then interface with it.

[+] cs702|2 years ago|reply
I expect we will see the biggest jump in performance if (when) consumer-grade coprocessors like mobile GPUs start incorporating attention layers as a primitive building block at the hardware level, e.g., with instructions and memory layouts engineered specifically to make ultra-low-precision (say, 4-bit) transformer layers as compute- and memory-efficient as possible on consumer devices. That seems almost inevitable to me.
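The ultra-low-precision idea can be illustrated with a toy symmetric 4-bit quantizer in plain Python; no hardware involved, just the precision/error trade-off that such instructions would exploit. The weight values below are made up.

```python
# Symmetric 4-bit quantization: map floats to integers in [-8, 7]
# with a single per-tensor scale factor, then reconstruct.

def quantize4(weights):
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize4(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.07, -0.21]
q, s = quantize4(w)
w2 = dequantize4(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w2))
print(q, round(max_err, 3))
```

Each weight now fits in 4 bits instead of 16 or 32, which is where the memory-efficiency claim comes from.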
[+] 1827162|2 years ago|reply
I found this very liberating: I can finally type whatever I want into the LLM without the possibility of the government knowing what I am writing. Just being able to do that, without the watchful eye of the state monitoring you, is amazing.
[+] seydor|2 years ago|reply
you have to check your screen's firmware for that
[+] anentropic|2 years ago|reply
Apple should get working on a version of the Neural Engine that is useful for these models, and remove the 3GB size limit [1] to take full advantage of the 'unified' memory architecture. Game changer.

It's a waste of die space currently (on MacBook at least; I'm sure they find uses for it on the iPhone).

[1] https://github.com/smpanaro/more-ane-transformers/blob/main/...

[+] leetharris|2 years ago|reply
It's not a waste on Mac; it will dynamically switch between GPU and NPU whenever CoreML is called. There are a decent number of applications that use CoreML.

But I do agree it should be improved!

[+] Torkel|2 years ago|reply
Things are moving so fast in this space I felt like I needed a [March] in the title on this one :)
[+] turnsout|2 years ago|reply
Too real; early March even felt different from late March. :D
[+] whimsicalism|2 years ago|reply
It appears there is this genre of articles pretending that LLaMA or its RLHF-tuned variants are somehow even close to being an alternative to ChatGPT. Spending more than a few moments interacting with even the larger instruct-tuned variants of these models quickly dispels that idea. Why do these takes on open-source AI remain so popular? What is the driving force?

I've posted this before, but it seems like this genre is just getting more and more popular - and more and more untethered from any actual metrics of how good these models are.

[+] emrah|2 years ago|reply
It's great that they got LLMs running on resource-constrained devices, but are they any good? Or I should ask: with the limited resources they get, what are they good for?
[+] kbrkbr|2 years ago|reply
From my experience with llama.cpp and oobabooga's web UI I can say they are amazing, at least on my gaming PC. I'm absolutely astonished at the speed and quality of LLaMA, Alpaca, Galactica and Vicuna (the >10B-parameter ones).

Make no mistake, it's for tinkerers who do not expect each prompt to be answered in a human-like way.

I see them as creativity and thought-testing tools, and also as tools for knowledge exploration.

[+] jrm4|2 years ago|reply
As I see these things come out, it feels like there's not a lot of discussion on which hardware works (that isn't one of the fancy new Macs). As in, there might be a lot of graphics cards out there that could be used here? Is it still Nvidia-only, or is AMD a possibility? Maybe I'm missing something about how the tech works?
[+] boppo1|2 years ago|reply
30B LLaMA needs a 3090 or 4090. For 13B I think you can get away with a 3080/4080. If you have 64 GB of RAM and a beefy CPU you can run even 65B, but boy is it slow.

13B is pretty meh, but 30B is great, if not quite ChatGPT. But I can ask it why my high school geometry teacher was such a cunt and it will happily discuss the matter without reservation. Very therapeutic.
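The hardware figures quoted above follow from simple arithmetic: parameter count times bytes per weight. A rough sketch of that math (it ignores activation and KV-cache overhead, so real requirements run somewhat higher):

```python
# Approximate model size in decimal gigabytes for a given parameter
# count and weight precision.

def model_gb(params_billion, bits_per_weight):
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for params in (13, 30, 65):
    print(f"{params}B @ fp16: {model_gb(params, 16):.1f} GB, "
          f"4-bit: {model_gb(params, 4):.1f} GB")
```

This is why 30B at fp16 (about 60 GB) needs CPU RAM or aggressive quantization, while a 4-bit 13B fits comfortably on a consumer GPU.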

[+] layer8|2 years ago|reply
It would be nice to be able to run an LLM-driven spamassassin on a VPS for acceptable cost.
[+] londons_explore|2 years ago|reply
I don't think it will help. Actual friends occasionally send me mail that says "test" from a random account. And spammers do too... There is no way to separate them.
[+] gigel82|2 years ago|reply
I don't understand why people are so excited to build this big thing on top of LLaMA, which is closed-source and severely license-restricted, and we now know for a fact that Meta is going after users with the legal hammer.

I'm sure if we'd pool resources together we could build a truly open alternative worthy of building on top of.

[+] la64710|2 years ago|reply
One simple thing these LLMs cannot do yet: you can't simply point an LLM at a URL and have it start scraping, i.e., follow the hyperlinks and start consuming the content. I'm not an AI guy, but I guess this has to do with the context limitations of most models? How did OpenAI train on all that internet data up to 2021? I think this will be a very popular feature for LLMs, and I seriously hope it's OSS whenever it comes out.
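The usual workaround for the context limit is to do the crawling outside the model: fetch a page yourself, extract the links and text, and split the text into context-sized chunks to feed in one at a time. A stdlib sketch (the fetch is stubbed with an inline sample page; real code would use urllib or similar):

```python
from html.parser import HTMLParser

class LinkAndText(HTMLParser):
    """Collect hyperlinks (for further crawling) and visible text."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")
    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def chunk(words, max_words):
    """Split a word list into pieces small enough for a context window."""
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

page = '<p>Local models are improving fast.</p><a href="/next">more</a>'
parser = LinkAndText()
parser.feed(page)
words = " ".join(parser.text).split()
print(parser.links, chunk(words, 3))
```

Each chunk is then summarized or queried separately, with the collected links driving the next round of fetches.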
[+] waffletower|2 years ago|reply
While many NLP-related Apple ML job listings have been added since this article was written, there were already several recent listings at the time of its writing. I feel that Apple does not focus well on intangible technologies (products that can't be readily carried or worn), given their boutique product-development focus, but I have some hope that they can overcome this bias somewhat and see how far behind they are.
[+] txomon|2 years ago|reply
To clear up a doubt that seems to be spreading around the internet: the LLaMA model weights weren't "leaked" AFAIK, but rather access was explicitly given to researchers, isn't that right?

I know the article goes on to speak about something else, but I'm not sure why this claim that the LLaMA model weights were leaked, as in unintentionally made available, keeps being made.

[+] ftxbro|2 years ago|reply
My understanding is that researchers could ask for access to weights, but then also they were leaked so that anyone could get them without asking. There is another layer, where Facebook seems to accept it on some level (I mean they don't have a choice anymore anyway); they put a cheeky comment in the open pull request instead of closing it.
[+] turmeric_root|2 years ago|reply
The model weights were only shared by FB with people who applied for research access. GitHub repos containing links to the model weights have been taken down by FB.
[+] causi|2 years ago|reply
LLaMA isn't there and probably never will be, but the possibility of running something equivalent to ChatGPT has certainly made me reconsider my GPU purchases. I wonder whether, in the end, Nvidia's CUDA advantage or AMD's larger amount of memory will prove more important once we do get it.
[+] croes|2 years ago|reply
Local LLMs remove the data protection problem but open the door for malicious use on a larger scale.
[+] binkHN|2 years ago|reply
This is wonderful. As hardware and software continue to improve, everything seems to find a way to run on ever-smaller devices. I guess your own pocket AGI is not too far away after all.
[+] 1827162|2 years ago|reply
By the way, I was thinking of something along the lines of a powerful FPGA with direct access to large quantities of very fast NAND flash, likely many chips in parallel, which would save having to load the model into RAM. It could run directly from NAND flash, which opens up the possibility of using very large models.

Power consumption would not be an issue if it's used sporadically throughout the day; it's not like it needs to run continuously.

There is still the issue of NAND flash read disturb, which I haven't fully looked into yet.
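On a general-purpose OS, mmap already approximates this run-from-storage idea: map the weights file into the address space and let the kernel page in only the regions that are touched, rather than loading the whole model into RAM up front (this is roughly how llama.cpp's mmap mode works). A toy sketch with an invented weights file:

```python
import mmap
import os
import struct
import tempfile

# Write a toy "weights file" of 1000 float32 values.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000f", *[i * 0.5 for i in range(1000)]))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Read only the slice a layer needs; untouched pages stay on disk.
    offset = 100 * 4  # weight #100, 4 bytes per float32
    (w,) = struct.unpack_from("<f", mm, offset)
    mm.close()

print(w)  # 50.0
```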