I agree that it's kind of magical that you can download a ~10GB file and suddenly your laptop is running something that can summarize text, answer questions and even reason a bit.
The trick is balancing model size vs RAM: 12B–20B is about the upper limit for a 16GB machine without it choking.
What I find interesting is that these models don't actually hit Apple's Neural Engine; they run on the GPU via Metal. Core ML isn't great for custom runtimes, and Apple hasn't given low-level developer access to the ANE afaik. And then there are memory bandwidth and dedicated SRAM issues. Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.
I feel like Apple needs a new CEO; I've felt this way for a long time. If I had been in charge of Apple, I would have embraced local LLMs and built an inference engine that optimizes models designed for Nvidia. I also would probably have toyed with the idea of selling server-grade Apple Silicon processors and opening up the GPU spec so people can build against it. Apple seems to play it too safe. While Tim Cook is good as a COO, he's still running Apple as a COO. They need a man of vision, not a COO at the helm.
From reverse-engineered information (in the context of Asahi Linux, which can have raw hardware access to the ANE), it seems that the M1/M2 Apple Neural Engine supports only statically scheduled MADDs of INT8 or FP16 values.[0] This wastes a lot of memory bandwidth on padding in the context of newer local models, which generally are more heavily quantized.
(That is, when in-memory model values must be padded to FP16/INT8, this slashes your effective use of memory bandwidth, which is what determines token generation speed. GPU compute doesn't have that issue; one can simply de-quantize/pad the input in fast local registers to feed the matrix compute units, so memory bandwidth is used efficiently.)
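(Back-of-the-envelope, with my own illustrative numbers: an 8B model quantized to 4 bits is roughly 4GB of weights, all of which must be streamed from memory for every generated token, so 100GB/s of bandwidth caps you near 25 tokens/s. Pad those same weights out to FP16 and you're streaming ~16GB per token, or roughly 6 tokens/s.)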
The NPU/ANE is still potentially useful for lowering power use in the context of prompt pre-processing, which is limited by raw compute as opposed to the memory bandwidth bound of token generation. (Lower power usage in this context will save on battery and may help performance by avoiding power/thermal throttling, especially on passively-cooled laptops. So this is definitely worth going for.)
(The jury is still out for M3/M4, which currently have no Asahi support and thus no current prospects for driving the ANE bare-metal. Note however that the reported ANE performance numbers for the M3/Pro/Max are quite close to the M2 version, so there may not be a real improvement there either. M3 Ultra and especially the M4 series may be a different story.)

[0] Some historical information about bare-metal use of the ANE is available from the Whisper.cpp pull req: https://github.com/ggml-org/whisper.cpp/pull/1021 Even older information at: https://github.com/eiln/ane/tree/33a61249d773f8f50c02ab0b9fe... More extensive information at https://github.com/tinygrad/tinygrad/tree/master/extra/accel... (from the Tinygrad folks) seems to basically confirm the above.
I too found it interesting that Apple's Neural Engine doesn't work with local LLMs. Seems like Apple, AMD, and Intel are missing the AI boat by not properly supporting their NPUs in llama.cpp. Any thoughts on why this is?

If you want to convert models to run on the ANE there are tools provided:

> Convert models from TensorFlow, PyTorch, and other libraries to Core ML.

https://apple.github.io/coremltools/docs-guides/index.html

(Unfortunately ONNX doesn't support Vulkan, which limits it on other platforms. It's always something...)
I find it surprising that you can also do that from the browser (e.g. WebLLM). I imagine that in the near future we will run these engines locally for many use cases, instead of via APIs.
Don't try 12-20B on a 16GB machine; stick with 4-8B instead. You'll get way too slow tps and only marginal quality improvements going higher on a 16GB machine.
Don't get me started. Many new computers come with an NPU of some kind, which is superfluous if you already have a GPU.
But what's really going on is that we never got the highly multicore and distributed computers that could have started going mainstream in the 1980s, and certainly by the late 1990s when high-speed internet hit. So single-threaded performance is about the same now as 20 years ago. Meanwhile video cards have gotten exponentially more powerful and affordable, but without the virtual memory and virtualization capabilities of CPUs, so we're seeing ridiculous artificial limitations like not being able to run certain LLMs because the hardware "isn't powerful enough", rather than just having a slower experience or borrowing the PC in the next room for more computing power.
To go to the incredible lengths that Apple went to in designing the M1 (not just in hardware, but in adding yet another layer of software emulation, as they have since the 68000 days) without actually bringing multicore with local memories to the level that today's VLSI design rules could allow, is laughable to me. Or it would be, if it weren't so tragic.
It's hard for me to live and work in a tech status quo so far removed from what I had envisioned growing up. We're practically at AGI, but also mired in ensh@ttification. Reflected in politics too. We'll have the first trillionaire before we solve world hunger, and I'm bracing for Skynet/Ultron before we have C3P0/JARVIS.
So far I've not run into the kind of use cases that local LLMs can convincingly provide without making me feel like I'm using the first ever ChatGPT from 2022, in that they are limited and quite limiting. I am curious about what use cases the community has found that work for them. The example that one user has given in this thread about their local LLM inventing a Sun Tzu interview is exactly the kind of limitation I'm talking about. How does one use a local LLM to do something actually useful?
I have tried a lot of different LLMs, and Gemma3:27b on a 48GB+ MacBook is probably the best for analyzing diaries and personal stuff you don't want to share with the cloud. The Chinese models are comically bad with life advice. For example, I asked DeepSeek to read my diaries and talk to me about my life goals, and it told me, in a very Confucian manner, what the proper relationships in my life were for my stage of life and station in society. Gemma is much more Western.
I see local LLMs being used mainly for automation as opposed to factual knowledge -- for classification, summarization, search, and things like grammar checking.
So they need to be smart about your desired language(s) and all the everyday concepts we use in it (so they can understand the content of documents and messages), but they don't need any of the detailed factual knowledge around human history, programming languages and libraries, health, and everything else.
The idea is that you don't prompt the LLM directly, but your OS tools make use of it, and applications prompt it as frequently as they fetch URLs.
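(A minimal sketch of that pattern, with the endpoint and labels as assumptions: an app classifying a message through a llama.cpp/LM Studio-style OpenAI-compatible server on localhost, about as casually as it would fetch a URL.)

    import json
    import urllib.request

    def classify(text):
        # Ask a small local model for a one-word label; assumes an
        # OpenAI-compatible server (llama.cpp, LM Studio, etc.) on port 8080.
        payload = {
            "messages": [
                {"role": "system",
                 "content": "Reply with exactly one word: work, personal, or spam."},
                {"role": "user", "content": text},
            ],
            "temperature": 0,
        }
        req = urllib.request.Request(
            "http://localhost:8080/v1/chat/completions",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"].strip().lower()

    print(classify("Your invoice for March is attached."))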
There are situations where internet access is limited, or where there are frequent outages. An outdated LLM might be more useful than none at all.
For example: my internet is out due to a severe storm, what safety precautions do I need to take?
I use, or at least try to use, local models while prototyping/developing apps.
First, they control costs during development, which depending on what you're doing, can get quite expensive for low or no budget projects.
Second, they force me to have more constraints and more carefully compose things. If a local model (albeit something somewhat capable like gpt-oss or qwen3) can start to piece together this agentic workflow I am trying to model, chances are, it'll start working quite well and quite quickly if I switch to even a budget cloud model (something like gpt-5-mini.)
However, dealing with these constraints might not be worth the time if you can stuff all of the documents in your context window for the cloud models and get good results, but it will probably be cheaper and faster on an ongoing basis to have split the task up.
I keep a lot of notes (all my thoughts and feelings, both happy and sad, things I’ve done, etc.) in Obsidian. These are deeply personal and I don’t want them going to a cloud provider even if they “say” they don’t train on my chats.
I forget a lot of things, so I feed these into ChromaDB and then use an LLM to chat with all my notes.
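(A minimal sketch of that kind of notes-RAG loop — my reconstruction, not the actual setup; assumes the chromadb package and a local OpenAI-compatible endpoint, with the model name hypothetical:)

    import chromadb
    from openai import OpenAI

    # Index the notes; Chroma embeds them with its default embedding function.
    db = chromadb.PersistentClient(path="./notes-db")
    notes = db.get_or_create_collection("notes")
    notes.add(ids=["2024-05-01"],
              documents=["Lunch with Sam; felt great about the project."])

    # Retrieve the most relevant notes and hand them to a local model.
    question = "When did I last see Sam?"
    hits = notes.query(query_texts=[question], n_results=3)
    context = "\n".join(hits["documents"][0])

    llm = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # e.g. Ollama
    reply = llm.chat.completions.create(
        model="llama3.1",  # whichever local model is pulled; an assumption
        messages=[
            {"role": "system", "content": "Answer using only the provided notes."},
            {"role": "user", "content": f"Notes:\n{context}\n\nQuestion: {question}"},
        ],
    )
    print(reply.choices[0].message.content)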
I’ve started using abliterated models which have their refusal removed [0]
The other use case is for work. I work with financial data and I have created an MCP server that automates some of my job. Running the model locally means I don't have to worry about the information I feed it.

[0] https://github.com/Sumandora/remove-refusals-with-transforme...
Well, a lot of what is possible with local models depends on what your local hardware is, but docling is a pretty good example of a library that can use local models (VLMs instead of regular LLMs) “under the hood” for productive tasks.
I use Claude Code in the terminal, mostly just to figure out what to commit along with what to write for the commit message. I believe a solid 7-8B model can do this locally.
So, that’s at least one small highly useful workflow robot I have a use for (and very easy to cook up on your own).
I also have a use for terminal command autocompletion, which again, a small model can be great for.
Something felt really wrong about sending entire folder contents over to Claude online, so I am absolutely looking to create the toolkit locally.
The universe of offline is just getting started, and these big companies are literally telling you “watch out, we save this stuff”.
I'm running Gemma3-270M locally (MLX). I've got a Python script that pulls down emails based on a whitelist and summarises them. The 270M model does a good job of this. It runs in a terminal, and it means I barely look at my email during the day.
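(Something in this shape — a sketch of my own, not the actual script; assumes the mlx-lm package, the mlx-community/gemma-3-270m-it-4bit weights, and a hypothetical IMAP account:)

    import email
    import imaplib
    from email.utils import parseaddr
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/gemma-3-270m-it-4bit")  # repo name is an assumption

    WHITELIST = {"boss@example.com", "alerts@example.com"}

    imap = imaplib.IMAP4_SSL("imap.example.com")
    imap.login("me@example.com", "app-password")
    imap.select("INBOX")
    _, data = imap.search(None, "UNSEEN")

    for num in data[0].split():
        _, parts = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(parts[0][1])
        if parseaddr(msg["From"])[1] not in WHITELIST:
            continue
        # Naive: assumes simple plain-text, non-multipart messages.
        body = msg.get_payload(decode=True).decode(errors="ignore")
        summary = generate(model, tokenizer,
                           prompt=f"Summarize this email in two sentences:\n{body}",
                           max_tokens=80)
        print(msg["Subject"], "->", summary)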
I use a local LLM for lots of little things that I used to use search engines for. Defining words, looking up unicode symbols for copy/paste, reminders on how to do X in bash or Python. Sometimes I use it as a starting point for high-level questions and curiosity and then move to actual human content or larger online models for more details and/or fact-checking if needed.
If your computer is somewhat modern and has a decent amount of RAM to spare, it can probably run one of the smaller-but-still-useful models just fine, even without a GPU.
My reasons:
1) Search engines are actively incentivized to not show useful results. SEO-optimized clickbait articles contain long, fluffy, contentless prose intermixed with ads. The longer they can keep you "searching" for the information instead of "finding" it, the better it is for their bottom line. Because if you actually manage to find the information you're looking for, you close the tab and stop looking at ads. If you don't find what you need, you keep scrolling and generate more ad revenue for the advertisers and search engines. It's exactly the same reason online dating sites are futile for most people: every successful match results in two lost customers, which is bad for revenue.
LLMs (even local ones in some cases) are quite good at giving you direct answers to direct questions which is 90% of my use for search engines to begin with. Yes, sometimes they hallucinate. No, it's not usually a big deal if you apply some common sense.
2) Most datacenter-hosted LLMs don't have ads built into them now, but they will. As soon as we get used to "trusting" hosted models due to how good they have become, the model developers and operators will figure out how to turn the model into a sneaky salesman. You'll ask it for the specs on a certain model of Dell laptop and it will pretend it didn't hear you and reply, "You should try HP's latest line of business-class notebooks, they're fast, affordable, and come in 5 fabulous colors to suit your unique personal style!" I want to make sure I'm emphasizing that it's not IF this happens, it's WHEN.
Local LLMs COULD have advertising at some point, but it will probably be rare and/or weird as these smaller models are meant mainly for development and further experimentation. I have faith that some open-weight models will always exist in some form, even if they never rival commercially-hosted models in overall quality.
3) I've made peace with the fact that data privacy in the age of Big Tech is a myth, but that doesn't mean I can't minimize my exposure by keeping some of my random musings and queries to myself. Self-hosted AI models will never be as "good" as the ones hosted in datacenters, but they are still plenty useful.
4) I'm still in the early stages of this, but I can develop my own tools around small local models without paying a hosted model provider and/or becoming their product.
5) I was a huge skeptic about the overall value of AI during all of the initial hype. Then I realized that this stuff isn't some fad that will disappear tomorrow. It will get better. The experience will get more refined. It will get more accurate. It will consume less energy. It will be totally ubiquitous. If you fail to come up to speed on some important new technology or trend, you will be left in the dust by those who do. I understand the skepticism and pushback, but the future moves forward regardless.
Smaller models require a lot more direction, a.k.a. system prompt engineering, and sometimes custom wrappers. For example, Gemma models are very eager to generate code even if you tell them not to.
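(For example, a thin wrapper — my sketch, with the endpoint and model name assumed — that pins the behavior in the system prompt and retries when the model slips into code anyway:)

    import re
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")  # port is an assumption

    SYSTEM = "Answer in plain English prose. Never include code or code fences."

    def ask(question, retries=2):
        for _ in range(retries + 1):
            resp = client.chat.completions.create(
                model="gemma-3-12b-it",  # hypothetical local model name
                messages=[{"role": "system", "content": SYSTEM},
                          {"role": "user", "content": question}],
            )
            text = resp.choices[0].message.content
            # Accept the answer only if the model obeyed the no-code rule.
            if not re.search(r"```", text):
                return text
        return text

    print(ask("How do I rename a git branch?"))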
I'm running Hermes Mistral and the very first thing it did was start hallucinating.
I recently started an audio dream journal and want to keep it private. Set up whisper to transcribe the .wav file and dump it in an Obsidian folder.
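(That step is only a few lines with the openai-whisper package — a sketch, with the model size and vault path as my own assumptions:)

    import whisper

    # Transcribe the dream recording locally and drop it into the vault.
    model = whisper.load_model("small")  # model size is an assumption
    text = model.transcribe("dream.wav")["text"]

    with open("/path/to/ObsidianVault/DreamJournal/2025-01-01.md", "w") as f:
        f.write(text)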
The plan was to put a local llm step in to clean up the punctuation and paragraphs.
I entered instructions to clean the transcript without changing or adding anything else.
Hermes responded by inventing an interview with Sun Tzu about why he wrote The Art of War. When I stopped the process, it apologized and said it had misunderstood when I talked about Sun Tzu. I never mentioned Sun Tzu or even provided a transcript. Just instructions.
We went around with this for a while before I could even get it to admit the mistake, and it refused to identify why it occurred in the first place.
Having to meticulously check for weird hallucinations will be far more time consuming than just doing the editing myself. This same logic applies to a lot of the areas I'd like to have a local llm for. Hopefully they'll get there soon.
It’s often been assumed that accuracy and ‘correctness’ would be easy to implement on computers because they operate on logic, in some sense. It’s originality and creativity that would be hard, or impossible because it’s not logical. Science Fiction has been full of such assumptions. Yet here we are, the actual problem is inventing new heavy enough training sticks to beat our AIs out of constantly making stuff up and lying about it.
I suppose we shouldn’t be surprised in hindsight. We trained them on human communicative behaviour after all. Maybe using Reddit as a source wasn’t the smartest move. Reddit in, Reddit out.
I don't think we're anywhere close to running cutting-edge LLMs on our phones or laptops.
What may be around the corner is running great models on a box at home. The AI lives at home. Your thin client talks to it, maybe runs a smaller AI on device to balance latency and quality. (This would be a natural extension for Apple to go into with its Mac Pro line. $10 to 20k for a home LLM device isn't ridiculous.)
Not sure about the Mac Pro, since you pay a lot for the big fancy case. The Studio seems more sensible.

You can also string two 512GB Mac Studios together using MLX to load even larger models - here's 671B 8-bit DeepSeek R1 doing that: https://twitter.com/alexocheema/status/1899735281781411907
And of course Nvidia and AMD are coming out with options for massive amounts of high bandwidth GPU memory in desktop form factors.
I like the idea of having basically a local LLM server that your laptop or other devices can connect to. Then your laptop doesn’t have to burn its battery on LLM work and it’s still local.

I’m running docker containers with different apps and it works well enough for a lot of my use cases. I mostly use Qwen Code and GPT OSS 120b right now. When the next generation of this tech comes through I will probably upgrade despite the price; the value is worth it to me.

At that point you are almost paying more than the datacenter does for inference hardware.

That price is ridiculous for most people. Silicon Valley payscales can afford that much, but see how few Apple Vision Pros got sold for far less.
Doesn't gpt-oss-120b perform better across the board at a fraction of the memory? Just specced a $4k Mac Studio that can easily run that with 128GB of memory.

I'm on a 128GB M4 MacBook. This is "powerful" today, but it will be old news in a few years. These models are just about getting as good as the frontier models.
I believe local llms are the future. It will only get better. Once we get to the level of even last year's state of the art I don't see any reason to use chatgpt/anthropic/other.
We don't even need one big model good at everything. Imagine loading a small model from a collection of dozens of models depending on the tasks you have in mind. There is no moat.
It's true that local LLMs are only going to get better, but it's not clear they will become generally practical for the foreseeable future. There have been huge improvements to the reasoning and coding capabilities of local models, but most of that comes from refinements to training data and training techniques (e.g. RLHF, DPO, CoT etc), while the most important factor by far remains the capability to reduce hallucinations to comfortable margins using the raw statistical power you get with massive full-precision parameter counts. The hardware gap between today's SOTA models and what's available to the consumer is so massive that it'll likely be at least a decade before they become practical.
Seeing and navigating all the configs helped me build intuition around what my macbook can or cannot do, how things are configured, how they work, etc. Great way to spend an hour or two.
I also like that it ships with some CLI tools, including an OpenAI-compatible server. It’s great to be able to take a model that’s loaded and open up an endpoint to it for running local scripts.
You can get a quick feel for how it works via the chat interface and then extend it programmatically.
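(For instance, with a model loaded in LM Studio and its local server running — the port and model name below are assumptions — the standard OpenAI client just works:)

    from openai import OpenAI

    # Point the stock OpenAI client at the local server instead of the cloud.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        model="qwen3-30b-a3b",  # whichever model is loaded; name is an assumption
        messages=[{"role": "user", "content": "Summarize this repo's README in one line."}],
    )
    print(resp.choices[0].message.content)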
Is anyone working on software that lets you run local LLMs in the browser?
In theory, it should be possible, shouldn't it?
The page could hold only the software in JavaScript that uses WebGL to run the neural net. And offer an "upload" button that the user can click to select a model from their file system. The button would not upload the model to a server - it would just let the JS code access it to convert it into WebGL and move it into the GPU.
This way, one could download models from HuggingFace, store them locally and use them as needed. Nicely sandboxed and independent of the operating system.

https://huggingface.co/spaces/webml-community/llama-3.2-webg... loads a 1.24GB Llama 3.2 q4f16 ONNX build

https://huggingface.co/spaces/webml-community/janus-pro-webg... loads a 2.24 GB DeepSeek Janus Pro model which is multi-modal for output - it can respond with generated images in addition to text.

https://huggingface.co/blog/embeddinggemma#transformersjs loads 400MB for an EmbeddingGemma demo (embeddings, not LLMs)

I've collected a few more of these demos here: https://simonwillison.net/tags/transformers-js/

You can also get this working with web-llm - https://github.com/mlc-ai/web-llm - here's my write-up of a demo that uses that: https://simonwillison.net/2024/Nov/29/structured-generation-...

https://github.com/mlc-ai/web-llm-chat

https://github.com/mlc-ai/mlc-llm

https://github.com/mlc-ai/web-llm

And related is the whisper implementation: https://ggml.ai/whisper.cpp/
This one is pretty cool. Compile the gguf of an OSS LLM directly into an executable. Will open an interface in the browser to chat. Can also launch an OpenAI API style interface hosted locally.
Doesn't work quite as well on Windows due to the executable file size limit, but seems great for Mac/Linux flavors.

https://github.com/Mozilla-Ocho/llamafile
Beyond all the wasm/webgpu approaches other folks have linked (mostly in the transformers.js ecosystem), there's been a standardized API brewing since 2019: https://webmachinelearning.github.io/webnn-intro/

Demos here: https://webmachinelearning.github.io/webnn-samples/ I'm not sure any of them allow you to select a model file from disk, but that should be entirely straightforward.
Not browser but Electron. For the browser you would have to run a local Node.js server and point the browser app to use the local API. I use Electron with Node.js and React for the UI. Yes, I can switch models.

There's also Open WebUI: https://openwebui.com/

https://huggingface.co/docs/transformers.js/en/guides/webgpu (its predecessor was using WebGL)
It's a crazy upside-down world where the Mac Studio M3 Ultra 512GB is the reasonable option among the alternatives if you intend to run larger models at usable(ish) speeds.
The use of the word "emergent" is concerning to me. I believe this to be an... exaggeration of the observed effect. Depending on the perspective and the knowledge of the domain, this might seem to some as emergent, however we saw equally interesting developments with more complex Markov chaining given the sheer lack of computational resources and time. What we are observing is just another step up that ladder, another angle to enumerate and pick the best next token in the sequence given the information revealed by the preceding words. Linguistics is all about efficient, lossless data-transfer. While it's "cool" and very surprising, I don't believe we should be treating it as somewhere between a spell-checker and a sentient being. People aren't simple heuristic models, and to imply these machines are remotely close is woefully inaccurate and will lead to further confusion and disappointment in the future.
I really like On-Device AI on iPhone (also runs on Mac): https://ondevice-ai.app in addition to LM Studio. It has a nice interface, with multiple prompt integration, and a good selection of models. Also the developer is responsive.

There's also Pico AI Server:

https://picogpt.app/

https://apps.apple.com/us/app/pico-ai-server-llm-vlm-mlx/id6...

Witsy:

https://github.com/nbonamy/witsy

...and you really want at least 48G RAM to run >24B models.
As someone who sometimes downloads random models to play around on my 16GB Mac Mini, I like his suggestions of models. I guess these are the best ones for their sizes if you get down to 4 or 5 worth keeping.
DEVONthink 4’s support for local models is great and could well contribute to the software’s enduring success over the next 10 years. I’ve found it helpful for summarizing documents and selections of text, but apparently it can do a lot more than that.

https://www.devontechnologies.com/blog/20250513-local-ai-in-...
I'm interested in this, my impression was that the newer chips have unified memory and high memory bandwidth. Do you do inference on the CPU or the external GPU?
I think the best models around right now that most people can fit, in some quantization, on their computer (if it's an Apple Silicon Mac or a gaming PC) would be:
For non-coding:
Qwen3-30B-A3B-Instruct-2507 (or the thinking variant, depending on use case)
For coding:
Qwen3-Coder-30B-A3B-Instruct
---
If you have a bit more VRAM, GLM-4.5-Air or the full GLM-4.5

Recommendation: use something else to run the model. Ollama is convenient, but insufficient for tool use with these models.

My main use cases:

* General Q&A

* Specific to programming - mostly Python and Go.

I forgot the command now, but I did run a command that allowed macOS to allocate and use maybe 28 GB of RAM to the GPU for use with LLMs.
The really tough spot is finding a good model for your use case. I’ve a 16GB MacBook and have been paralyzed by the many options. I’ve settled on a quantized 14B Qwen for now, but no idea if this is a good idea.
14B Qwen was a good choice, but it's become a bit outdated, and it seems like the new 4B version surpassed it in benchmarks somehow.
It's a balancing game: how slow a token generation speed can you tolerate? Would you rather get an answer quickly, or wait a few seconds (or sometimes minutes) for reasoning?
For quick answers, Gemma 3 12B is still good. GPT-OSS 20B is pretty quick when reasoning is set to low, which usually doesn't think longer than one sentence. I haven't gotten much use out of Qwen3 4B Thinking (2507) but at least it's fast while reasoning.
What is the best local model for Cursor-style autocomplete/code suggestions? And is there an extension for VS Code which can integrate a local model for such use?
I have been playing with the continue.dev extension for VSCodium. I got it to work with Ollama and the Mistral models (codestral, devstral and mistral-small). I haven't gone much further than experimenting yet, but it looks promising: entirely local and mostly open source. And even then, it's much further than I got with most other tools I tried.
>I also use them for brain-dumping. I find it hard to keep a journal, because I find it boring, but when you’re pretending to be writing to someone, it’s easier. If you have friends, that’s much better, but some topics are too personal and a friend may not be available at 4 AM. I mostly ignore its responses, because it’s for me to unload, not to listen to a machine spew slop. I suggest you do the same, because we’re anthropomorphization machines and I’d rather not experience AI psychosis. It’s better if you don’t give it a chance to convince you it’s real. I could use a system prompt so it doesn’t follow up with dumb questions (or “YoU’Re AbSoLuTeLy CoRrEcT”s), but I never bothered as I already don’t read it.
Reads like someone starting to get their daily drinks, already using them for "company" and fun, and saying "I'm not an alcoholic, I can quit anytime".
An awful lot of Monday-morning-quarterback CEOs are here running their mouths about what Tim Cook should do or what they would do. Chill out with the extremely confident ignorance. Tim Cook brought Apple to a billion dollars in free cash; he doesn’t need to ride the hype train.
Also let’s not forget they are first and foremost designers of hardware and the arms race is only getting started.
ollama is another good choice for this purpose. it's essentially a wrapper around llamacpp that adds easy downloading and management of running instances. it's great! also works on linux!
Ollama adding a paid cloud version made me postpone this post for a few weeks at least. I don't object to them making money, but it's hard to recommend a tool for local usage when the first instruction has to be "go to settings and enable airplane mode".
Luckily llama.cpp has come a long way and was at a point that I could easily recommend as the open source option instead.