top | item 44822195

wcallahan | 6 months ago

I just used GPT-OSS-120B on a cross Atlantic flight on my MacBook Pro (M4, 128GB RAM).

A few things I noticed:

- it's only fast with small context windows and a small total token count; past ~10k tokens you're basically queueing everything for a long time
- MCPs/web search/URL fetch have already become a very important part of interacting with LLMs; when they're not available, the LLM's utility is greatly diminished
- a lot of CLI/TUI coding tools (e.g., opencode) were not working reliably offline with the model at this time, despite being set up before going offline

That’s in addition to the other quirks others have noted with the OSS models.

XCSme|6 months ago

I know there was a downloadable version of Wikipedia (not that large). Maybe soon we'll have a lot of data stored locally and expose it via MCP, then the AIs can do "web search" locally.

I think 99% of web searches lead to the same 100-1k websites. I assume it's only a few GBs to have a copy of those locally, though this raises copyright concerns.

Aurornis|6 months ago

The mostly static knowledge content from sites like Wikipedia is already well represented in LLMs.

LLMs call out to external websites when something isn’t commonly represented in training data, like specific project documentation or news events.

conradev|6 months ago

Are you using Ollama or LMStudio/llama.cpp? https://x.com/ggerganov/status/1953088008816619637

diggan|6 months ago

> LMStudio/llama.cpp

Even though LM Studio uses llama.cpp as a runtime, the performance differs between them. With LM Studio 0.3.22 (Build 2) and the CUDA llama.cpp (Linux) v1.45.0 runtime I get ~86 tok/s on an RTX Pro 6000, while with llama.cpp compiled from 1d72c841888 (Aug 7, 2025) I get ~180 tok/s, almost 100 tok/s more, both running lmstudio-community/gpt-oss-120b-GGUF.

fouc|6 months ago

What was your iogpu.wired_limit_mb set to? By default only ~70% or ~90GB of your RAM will be available to your GPU cores unless you change your wired limit setting.
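
A minimal sketch of the arithmetic and the sysctl involved, assuming a 128GB machine and a 90% target (the setting requires root and resets at reboot; the exact default fraction varies by macOS version):

```shell
# Compute a GPU wired-memory cap that leaves ~10% of RAM for the OS.
# The 128GB figure matches the machine in the top comment; adjust for yours.
total_mb=$((128 * 1024))            # unified memory in MB
limit_mb=$((total_mb * 90 / 100))   # ~90% of RAM
echo "proposed iogpu.wired_limit_mb=${limit_mb}"
# Apply on macOS (requires root):
#   sudo sysctl iogpu.wired_limit_mb=${limit_mb}
```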

MoonObserver|6 months ago

M2 Max processor. I saw 60+ tok/s on short conversations, but it degraded to 30 tok/s as the conversation got longer. Do you know what actually accounts for this slowdown? I don’t believe it was thermal throttling.

summarity|6 months ago

Physics: You always have the same memory bandwidth. The longer the context, the more bits will need to pass through the same pipe. Context is cumulative.

torginus|6 months ago

Inference time is quadratic in context size: each new token attends to every token before it.
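
Both replies describe the same effect; a toy cost model makes it concrete (the layer/head/precision numbers below are illustrative placeholders, not the real gpt-oss config):

```python
# Each decode step must re-read the entire KV cache, so per-token memory
# traffic grows linearly with context length; summed over a whole
# generation, total attention traffic grows quadratically.

def kv_bytes_per_token(context_len, n_layers=36, n_kv_heads=8,
                       head_dim=64, bytes_per_elem=2):
    """Bytes of K and V read from the cache for one decode step."""
    # 2x for K and V, across every layer and every cached position.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

# Ten times the context means ten times the traffic per generated token:
assert kv_bytes_per_token(10_000) == 10 * kv_bytes_per_token(1_000)

# Total traffic for generating N tokens from scratch is sum(1..N) ~ N^2/2,
# which is why long conversations keep getting slower, not just slow.
```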

gigatexal|6 months ago

M3 Max 128GB here and it’s mad impressive.

I'm spec'ing out a Mac Studio with 512GB of RAM because I can window-shop and wish, but I think the trend for local LLMs is getting really good.

Do we know WHY openAI even released them?

diggan|6 months ago

> Do we know WHY openAI even released them?

Regulations, and trying to win back the goodwill of developers using local LLMs, which had been slowly eroding since it's been a while (GPT-2, 2019) since they last released weights to the public.

Epa095|6 months ago

If the new GPT-5 is actually better, then this OSS version is not really a threat to OpenAI's income stream, but it can be a threat to their competitors.

lavezzi|6 months ago

> Do we know WHY openAI even released them?

Enterprises can now deploy them on AWS and GCP.

mich5632|6 months ago

I think this is the difference between compute-bound prefill (a CPU has a high bandwidth/compute ratio) and decode. The time to first token is below 0.5s, even for a 10k context.
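
A back-of-envelope sketch of that asymmetry, with made-up round numbers (~5B active MoE parameters at ~4-bit, 500 GB/s of memory bandwidth, 30 TFLOP/s of compute; none of these are measured figures):

```python
# Prefill is compute-bound: the whole prompt is processed in one batched
# pass, so the weights are read once and reused across all prompt tokens.
# Decode is bandwidth-bound: every new token re-reads all active weights.
active_params = 5e9                  # assumed active parameters (MoE)
weight_bytes = active_params * 0.5   # ~4-bit quantized weights, in bytes
bandwidth = 500e9                    # assumed memory bandwidth, bytes/s
compute = 30e12                      # assumed usable FLOP/s

prefill_tok_per_s = compute / (2 * active_params)  # batched, compute-bound
decode_tok_per_s = bandwidth / weight_bytes        # serial, bandwidth-bound

print(f"prefill ~{prefill_tok_per_s:.0f} tok/s, "
      f"decode ceiling ~{decode_tok_per_s:.0f} tok/s")
```

With these assumed numbers, prefill moves through tokens an order of magnitude faster than decode, which is why time to first token stays low even when generation is slow.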

zackify|6 months ago

You didn’t even mention how it’ll be on fire unless you use low power mode.

Yes all this has been known since the M4 came out. The memory bandwidth is too low.

Try using it for real tasks with tools like Cline or opencode: the contexts get too long and processing is too slow to be practical.

Aurornis|6 months ago

> Yes all this has been known since the M4 came out. The memory bandwidth is too low.

The M4 Max with 128GB of RAM (the part used in the comment) has over 500GB/sec of memory bandwidth.

radarsat1|6 months ago

How long did your battery last?!

woleium|6 months ago

Planes have power sockets now, but I do wonder how much jet fuel a whole plane full of GPUs would consume in electricity and air conditioning (assuming the system could handle it, which seems unlikely).