
Open models by OpenAI

2124 points | lackoftactics | 7 months ago | openai.com

https://openai.com/index/introducing-gpt-oss/

876 comments

[+] cco|7 months ago|reply
The lede is being missed imo.

gpt-oss:20b is a top-ten model on MMLU (right behind Gemini 2.5 Pro), and I just ran it locally on my MacBook Air M3 from last year.

I've been experimenting with a lot of local models, both on my laptop and on my phone (Pixel 9 Pro), and I figured we'd be here in a year or two.

But no, we're here today. A basically frontier model, running for the cost of electricity (free with a rounding error) on my laptop. No $200/month subscription, no lakes being drained, etc.

I'm blown away.

[+] int_19h|7 months ago|reply
I tried 20b locally and it couldn't reason a way out of a basic river crossing puzzle with labels changed. That is not anywhere near SOTA. In fact it's worse than many local models that can do it, including e.g. QwQ-32b.
[+] captainregex|7 months ago|reply
I'm still trying to understand: what is the biggest group of people that uses (or will use) local AI? Students who don't want to pay but somehow have the hardware? Devs who are price-conscious and want free agentic coding?

Local, in my experience, can't even pull data from an image without hallucinating (Qwen 2.5 VL in that example). Hopefully local/small models keep getting better and devices get better at running bigger ones.

It feels like we do it because we can more than because it makes sense, which I am all for! I just wonder if I'm missing some kind of major use case all around me that justifies chaining together a bunch of Mac Studios or buying a really great graphics card. Tools like exo are cool and the idea of distributed compute is neat, but what edge cases truly need it so badly that it's worth all the effort?

[+] dongobread|7 months ago|reply
How up to date are you on current open weights models? After playing around with it for a few hours I find it to be nowhere near as good as Qwen3-30B-A3B. The world knowledge is severely lacking in particular.
[+] datadrivenangel|7 months ago|reply
Now to embrace Jevons paradox and expand usage until we're back to draining lakes so that your agentic refrigerator can simulate sentience.
[+] decide1000|7 months ago|reply
The model is good and runs fine, but if you want to be blown away again try Qwen3-30B-A3B-2507. It's 6 GB bigger, but the response is comparable or better and much faster to run. gpt-oss:20b gives me 6 tok/sec while Qwen3 gives me 37 tok/sec. Qwen3 is not a reasoning model, though.
[+] parhamn|7 months ago|reply
I just tested 120B from the Groq API on agentic stuff (multi-step function calling, similar to claude code) and it's not that good. Agentic fine-tuning seems key, hopefully someone drops one soon.
[+] turnsout|7 months ago|reply
The environmentalist in me loves the fact that LLM progress has mostly been focused on doing more with the same hardware, rather than horizontal scaling. I guess given GPU shortages that makes sense, but it really does feel like the value of my hardware (a laptop in my case) is going up over time, not down.

Also, just wanted to credit you for being one of the five people on Earth who knows the correct spelling of "lede."

[+] mathiaspoint|7 months ago|reply
It's really training not inference that drains the lakes.
[+] jwr|7 months ago|reply
gpt-oss:20b is the best performing model on my spam filtering benchmarks (I wrote a despammer that uses an LLM).

These are the simplified results (total percentage of correctly classified E-mails on both spam and ham testing data):

    gpt-oss:20b                                  95.6%
    gemma3:27b-it-qat                            94.3%
    mistral-small3.2:24b-instruct-2506-q4_K_M    93.7%
    mistral-small3.2:24b-instruct-2506-q8_0      92.5%
    qwen3:32b-q4_K_M                             89.2%
    qwen3:30b-a3b-q4_K_M                         87.9%
    gemma3n:e4b-it-q4_K_M                        84.9%
    deepseek-r1:8b                               75.2%
    qwen3:30b-a3b-instruct-2507-q4_K_M           73.0%

I'm quite happy, because it's also smaller and faster than gemma3.
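For the curious, scoring a benchmark like this is straightforward. A minimal sketch of the harness logic, assuming a one-word-verdict prompt (all function names here are hypothetical, not from the actual despammer):

```python
# Hypothetical sketch of scoring an LLM spam-filter benchmark. The model is
# asked for a one-word verdict per message; accuracy is the percentage of
# correctly classified messages over both spam and ham test sets.

def build_prompt(email_text: str) -> str:
    # Constrain the model to a one-word answer to keep parsing trivial.
    return ("Classify this e-mail as exactly one word, SPAM or HAM.\n\n"
            f"{email_text}\n\nAnswer:")

def parse_verdict(reply: str) -> str:
    # Models sometimes append punctuation or extra words; strip before comparing.
    word = reply.strip().split()[0].strip(".,!").upper()
    return "spam" if word == "SPAM" else "ham"

def accuracy_pct(predictions: list[str], labels: list[str]) -> float:
    # Total percentage correct, matching the table above.
    correct = sum(p == t for p, t in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```

Feeding each test e-mail through `build_prompt`, a local model, and `parse_verdict`, then calling `accuracy_pct`, would reproduce numbers of the shape shown in the table.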

[+] npn|7 months ago|reply
It is not a frontier model. It's only good for benchmarks. Tried some tasks and it is even worse than gemma 3n.
[+] MattSayar|7 months ago|reply
What's your experience with the quality of LLMs running on your phone?
[+] vonneumannstan|7 months ago|reply
>no lakes being drained

When you imagine a lake being drained to cool a datacenter do you ever consider where the water used for cooling goes? Do you imagine it disappears?

[+] latexr|7 months ago|reply
I tried their live demo. It suggests three prompts, one of them being “How many R’s are in strawberry?” So I clicked that, and it answered there are three! I tried it thrice with the same result.

It suggested the prompt. It’s infamous because models often get it wrong, they know it, and still they confidently suggested it and got it wrong.

[+] raideno|7 months ago|reply
How much RAM is in your MacBook Air M3? I have the 16 GB version and I was wondering whether I'll be able to run it or not.
[+] black3r|7 months ago|reply
Can you please give an estimate of how much slower/faster it is on your MacBook compared to comparable models running in the cloud?
[+] lend000|7 months ago|reply
For me the game changer here is the speed. On my local Mac I'm finally getting generation rates faster than I can process the output (~96 tok/s), and the quality has been solid. I had previously tried some of the distilled Qwen and DeepSeek models and they were just way too slow for me to seriously use.
[+] snthpy|7 months ago|reply
For me the biggest benefit of open weights models is the ability to fine tune and adapt to different tasks.
[+] SergeAx|7 months ago|reply
Did you mean "120b"? I am running 20b model locally right now, and it is pretty mediocre. Nothing near Gemini 2.5 Pro, which is my daily driver.
[+] benreesman|7 months ago|reply
You're going to freak out when you try the Chinese ones :)
[+] syntaxing|7 months ago|reply
Interesting, these models are better than the new Qwen releases?
[+] foundry27|7 months ago|reply
Model cards, for the people interested in the guts: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...

In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:

- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.

- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing. They're using some kind of gated SwiGLU activation, which the card describes as "unconventional" because of its clamping and the residual connections that implies. Again, not using any of Deepseek's "shared experts" (for general patterns) + "routed experts" (for specialization) architectural improvements, Qwen's load-balancing strategies, etc.

- the most interesting thing IMO is probably their quantization solution. They quantized >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we've also got Unsloth with their famous 1.58-bit quants :)
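Back-of-envelope numbers for the attention and quantization points above (head_dim, layer count, and the fp16 KV-cache assumption are my illustrative guesses, not card values):

```python
# Rough arithmetic for the architecture described above. head_dim, n_layers
# and fp16 KV-cache storage are illustrative assumptions, not card values.

# 1) MXFP4 weight footprint: 116.8B params at 4.25 bits/param.
params = 116.8e9
weight_gb = params * 4.25 / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~62 GB, leaving headroom on an 80 GB GPU

# 2) GQA KV-cache saving: only the 8 KV heads are cached, so the cache is
#    64/8 = 8x smaller than full multi-head attention would require.
n_kv_heads, head_dim, n_layers, ctx = 8, 64, 36, 131_072
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * 2  # K+V, 2 bytes each (fp16)
print(f"~{kv_bytes / 2**30:.0f} GiB KV cache at full 131K context")  # ~9 GiB
```

The banded-window layers shrink the effective cache further, since sparse layers only need the last 128 tokens of K/V.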

All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
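The Top-4-of-128 routing described above boils down to a softmax over the k largest router logits. A minimal sketch (toy dimensions, not the real kernel, which fuses routing with expert dispatch):

```python
import numpy as np

# Minimal Top-k MoE routing sketch: pick the 4 highest-scoring of 128 experts
# for each token and renormalize their scores. d_model is an illustrative toy
# size; a real router would sit inside every MoE layer.
n_experts, top_k, d_model = 128, 4, 16
rng = np.random.default_rng(0)
router_w = rng.standard_normal((d_model, n_experts))

def route(token):
    logits = token @ router_w
    top = np.argsort(logits)[-top_k:]            # indices of the 4 selected experts
    w = np.exp(logits[top] - logits[top].max())  # softmax over just those 4
    w /= w.sum()
    return top, w

experts, weights = route(rng.standard_normal(d_model))
# Only top_k / n_experts = 4/128 of the expert parameters run per token,
# which is how 116.8B total parameters yield only ~5.1B active.
```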

[+] ClassAndBurn|7 months ago|reply
Open models are going to win long-term. Anthropic's own research has to use OSS models [0]. China is demonstrating how quickly companies can iterate on open models, giving smaller teams access to a model's abilities, and the chance to augment them, without paying the training cost.

My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for the foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what it is planned right now.

N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.

There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.

Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.

[0] https://www.anthropic.com/research/persona-vectors > We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.

[+] x187463|7 months ago|reply
Running a model comparable to o3 on a 24GB Mac Mini is absolutely wild. Seems like yesterday the idea of running frontier (at the time) models locally or on a mobile device was 5+ years out. At this rate, we'll be running such models in the next phone cycle.
[+] deviation|7 months ago|reply
So this confirms a best-in-class model release within the next few days?

From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?

[+] ticulatedspline|7 months ago|reply
Even without an imminent release it's a good strategy. They're getting pressure from Qwen and other high-performing open-weight models. Without a horse in the race they could fall behind in an entire segment.

There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.

[+] winterrx|7 months ago|reply
GPT-5 coming Thursday.
[+] famouswaffles|7 months ago|reply
Even before today, it's been clear for the last week or so, for a couple of reasons, that GPT-5's release was imminent.
[+] bredren|7 months ago|reply
Undoubtedly. It would otherwise reduce the perceived value of their current product offering.

The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.

Despite the loss of face for the lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.

[+] logicchains|7 months ago|reply
> I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it

Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.

[+] semitones|7 months ago|reply
You hit the nail on the head!!!
[+] henriquegodoy|7 months ago|reply
Seeing a 20B model competing with o3's performance is mind-blowing. Just a year ago, most of us would've called this impossible: not just the intelligence leap, but getting this level of capability in such a compact size.

I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.

[+] timmg|7 months ago|reply
Orthogonal, but I just wanted to say how awesome Ollama is. It took 2 seconds to find the model and a minute to download and now I'm using it.

Kudos to that team.

[+] simonw|7 months ago|reply
Just posted my initial impressions, took a couple of hours to write them up because there's a lot in this release! https://simonwillison.net/2025/Aug/5/gpt-oss/

TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs. Will be interesting to see if independent benchmarks resolve in that direction as well.

The 20B model runs on my Mac laptop using less than 15GB of RAM.

[+] benreesman|7 months ago|reply
I'm a well-known OpenAI hater, but there's haters and haters, and refusing to acknowledge great work is the latter.

Well done OpenAI, this seems like a sincere effort to do a real open model with competitive performance, usable/workable licensing, a tokenizer compatible with your commercial offerings, it's a real contribution. Probably the most open useful thing since Whisper that also kicked ass.

Keep this sort of thing up and I might start re-evaliating how I feel about this company.

[+] sadiq|7 months ago|reply
Looks like Groq (at 1k+ tokens/second) and Fireworks are already live on openrouter: https://openrouter.ai/openai/gpt-oss-120b

$0.15/M tokens in / $0.60-0.75/M tokens out

edit: Now Cerebras too, at 3,815 tps, for $0.25/M in / $0.69/M out.

[+] artembugara|7 months ago|reply
Disclaimer: probably dumb questions

so, the 20b model.

Can someone explain to me what I would need to do in terms of resources (GPU, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

Also, is this model better/comparable for information extraction compared to gpt-4.1-nano, and would it be cheaper to host myself 20b?

[+] sabaimran|7 months ago|reply
Super excited to see these released!

Major points of interest for me:

- In the "Main capabilities evaluations" section, the 120b outperforms o3-mini and approaches o4-mini on most evals. The 20b model is also decent, passing o3-mini on one of the tasks.

- AIME 2025 is nearly saturated with large CoT

- CBRN threat levels kind of on par with other SOTA open source models. Plus, demonstrated good refusals even after adversarial fine tuning.

- Interesting to me how a lot of the safety benchmarking runs on trust, since methodology can't be published too openly due to counterparty risk.

Model cards with some of my annotations: https://openpaper.ai/paper/share/7137e6a8-b6ff-4293-a3ce-68b...

[+] matznerd|7 months ago|reply
Thanks OpenAI for being open ;) Surprised there are no official MLX versions and only one mention of MLX in this thread. MLX basically converts the models to take advantage of Mac unified memory for a 2-5x increase in performance, enabling Macs to run what would otherwise take expensive GPUs (within limits).

So FYI to anyone on a Mac, the easiest way to run these models right now is LM Studio (https://lmstudio.ai/); it's free. You just search for the model; usually third-party groups like mlx-community or lmstudio-community have MLX versions within a day or two of a release. I go for the 8-bit quantizations (4-bit is faster, but quality drops). You can also convert to MLX yourself...

Once you have it running in LM Studio, you can chat there in their chat interface, or you can run it through an API that defaults to http://127.0.0.1:1234
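A minimal sketch of hitting that local endpoint: it speaks the OpenAI chat-completions format, and the model name below is just a placeholder for whatever you loaded in LM Studio.

```python
import json
import urllib.request

# Minimal sketch of calling LM Studio's local OpenAI-compatible server.
# "gpt-oss-20b" is an illustrative placeholder model name.
BASE = "http://127.0.0.1:1234/v1"

def build_payload(prompt: str, model: str = "gpt-oss-20b") -> dict:
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str) -> str:
    req = urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Give me one sentence about MLX.")  # needs LM Studio serving locally
```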

You can run multiple models that hot swap and load instantly and switch between them etc.

It's surprisingly easy, and fun. There are actually a lot of cool niche models coming out, like this tiny high-quality search model released today as well (and which has an official MLX version): https://huggingface.co/Intelligent-Internet/II-Search-4B

Other fun ones are Gemma 3n, which is multi-modal; a larger one that is actually a solid model but takes more memory, the new Qwen3 30B A3B (Coder and Instruct); and Pixtral (Mixtral vision with full-resolution images). Looking forward to playing with this model and seeing how it compares.

[+] IceHegel|7 months ago|reply
Listed performance of ~5 points less than o3 on benchmarks is pretty impressive.

Wonder if they feel the bar will be raised soon (GPT-5) and feel more comfortable releasing something this strong.

[+] bluecoconut|7 months ago|reply
I was able to get gpt-oss:20b wired up to claude code locally via a thin proxy and ollama.

It's fun that it works, but the prefill time makes it feel unusable. (2-3 minutes per tool-use / completion). Means a ~10-20 tool-use interaction could take 30-60 minutes.

(This editing a single server.py file that was ~1000 lines, the tool definitions + claude context was around 30k tokens input, and then after the file read, input was around ~50k tokens. Definitely could be optimized. Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help)

[+] Leary|7 months ago|reply
GPQA Diamond: gpt-oss-120b: 80.1%, Qwen3-235B-A22B-Thinking-2507: 81.1%

Humanity’s Last Exam: gpt-oss-120b (tools): 19.0%, gpt-oss-120b (no tools): 14.9%, Qwen3-235B-A22B-Thinking-2507: 18.2%

[+] CraigJPerry|7 months ago|reply
I just tried it on OpenRouter but I was served by Cerebras. Holy... 40,000 tokens per second. That was SURREAL.

I got a 1.7k token reply delivered too fast for the human eye to perceive the streaming.

n=1 for this 120b model, but I'd rank the reply #1, just ahead of Claude Sonnet 4, for a boring JIRA-ticket-shuffling type challenge.

EDIT: The same prompt on gpt-oss, despite being served 1000x slower, wasn't as good but was in a similar vein. It wanted to clarify more and as a result only half responded.

[+] mythz|7 months ago|reply
Getting great performance running gpt-oss on 3x A4000's:

    gpt-oss:20b = ~46 tok/s
More than 2x faster than my previous leading OSS models:

    mistral-small3.2:24b = ~22 tok/s 
    gemma3:27b           = ~19.5 tok/s
Strangely getting nearly the opposite performance running on 1x 5070 Ti:

    mistral-small3.2:24b = ~39 tok/s 
    gpt-oss:20b          = ~21 tok/s
Where gpt-oss is nearly 2x slower than mistral-small 3.2.