wcallahan | 6 months ago
A few things I noticed:

- It's only fast with small context windows and small total token counts; once you're past ~10k tokens you're basically queueing everything for a long time.
- MCPs/web search/URL fetch have already become a very important part of interacting with LLMs; when they're not available the LLM's utility is greatly diminished.
- A lot of CLI/TUI coding tools (e.g., opencode) were not working reliably offline with the model at the time, despite being set up before going offline.
That’s in addition to the other quirks others have noted with the OSS models.
XCSme|6 months ago
I think 99% of web searches lead to the same 100-1k websites. I assume it's only a few GBs to keep a local copy of those, though that raises copyright concerns.
Aurornis|6 months ago
LLMs call out to external websites when something isn’t commonly represented in training data, like specific project documentation or news events.
diggan|6 months ago
Even though LM Studio uses llama.cpp as a runtime, performance differs between them. With LM Studio 0.3.22 Build 2 and the CUDA llama.cpp (Linux) v1.45.0 runtime I get ~86 tok/s on an RTX Pro 6000, while with llama.cpp compiled from 1d72c841888 (Aug 7 10:53:21 2025) I get ~180 tok/s, almost 100 tok/s more, both running lmstudio-community/gpt-oss-120b-GGUF.
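Runtime gaps like this are easy to check yourself: time a fixed-length generation against each backend and divide. A minimal sketch, assuming a hypothetical `generate(prompt, n_tokens)` callable that wraps whichever client (LM Studio, llama.cpp server, etc.) you're testing:

```python
import time

def measure_tok_s(generate, prompt: str, n_tokens: int) -> float:
    """Time one fixed-length generation and return tokens per second.

    `generate` is any callable that produces exactly `n_tokens` tokens;
    wrap your runtime's client call in it. This measures wall-clock
    throughput, so run it a few times and compare medians.
    """
    start = time.perf_counter()
    generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

Running the same prompt and token budget through both runtimes with this harness makes the ~86 vs ~180 tok/s comparison directly reproducible.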
gigatexal|6 months ago
I'm spec'ing out a Mac Studio with 512GB of RAM because I can window-shop and wish, but I think the trend for local LLMs is getting really good.
Do we know WHY OpenAI even released them?
diggan|6 months ago
Regulations, and trying to earn back the goodwill of developers who run local LLMs, which had been slowly eroding since it's been a while (GPT-2, 2019) since they last released weights to the public.
lavezzi|6 months ago
Enterprises can now deploy them on AWS and GCP.
zackify|6 months ago
Yes, all of this has been known since the M4 came out. The memory bandwidth is too low.
Try using it for real tasks with tools like Cline or opencode: the contexts get long, and processing them is too slow to be practical.
Aurornis|6 months ago
The M4 Max with 128GB of RAM (the part used in the comment) has over 500GB/sec of memory bandwidth.
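For intuition, decode speed on a bandwidth-bound machine is roughly capped by how fast the active weights can be streamed per generated token. A back-of-the-envelope sketch; the bandwidth, active-parameter, and quantization figures below are assumptions for illustration, not measurements:

```python
# Memory-bandwidth-bound generation reads the active weights once per token,
# so tok/s <= bandwidth / bytes_read_per_token. Illustrative numbers only.
bandwidth_gb_s = 546.0     # assumed M4 Max unified-memory bandwidth (GB/s)
active_params_b = 5.1      # assumed active params per token for gpt-oss-120b (billions, MoE)
bytes_per_param = 0.5      # assumed ~4-bit quantization (MXFP4)

bytes_per_token_gb = active_params_b * bytes_per_param   # GB streamed per token
ceiling_tok_s = bandwidth_gb_s / bytes_per_token_gb      # ~214 tok/s upper bound
print(round(ceiling_tok_s))
```

Real throughput lands well below this ceiling (KV-cache reads, activation compute, and prompt processing all eat into it), but the ratio explains why bandwidth, not RAM capacity, is the limiting spec for decode speed.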