How up to date are you on current open weights models? After playing around with it for a few hours, I find it to be nowhere near as good as Qwen3-30B-A3B. Its world knowledge in particular is severely lacking.
dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds
deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds
gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds
gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !
Yeah, yeah, it's only one question of nonsense trivia. I'm sure it was billions well spent.
It's possible I'm using a poor temperature setting or something, but since they weren't bothered enough to put it in the model card, I'm not bothered to fuss with it.
I think your example reflects well on oss-20b, not poorly. It (may) show that they've been successful in separating reasoning from knowledge. You don't _want_ your small reasoning model to waste weights memorizing minutiae.
> gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !
To be fair, this is not the type of question that benefits from reasoning. Either the model has this info in its parametric memory or it doesn't. Reasoning won't help.
Not true:
During World War II the Imperial Japanese Navy referred to Midway Island in their communications as “Milano” (ミラノ). This was the official code word used when planning and executing operations against the island, including the Battle of Midway.
Right... knowledge is one of the things (the one thing?) that LLMs are really horrible at, and that goes double for models small enough to run on normal-ish consumer hardware.
Shouldn't we prefer to have LLMs just search and summarize more reliable sources?
Push your point to the absurd and you'll see why. Hint: to analyze data pulled by tools, you need knowledge already baked in. Context is very limited, so you cannot just keep pulling in more data.
Nomadeon|6 months ago
Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...
anorwell|6 months ago
sailingparrot|6 months ago
bigmanhank|6 months ago
12.82 tok/sec 140 tokens 7.91s to first token
openai/gpt-oss-20b
seba_dos1|6 months ago
unknown|6 months ago
[deleted]
nojito|6 months ago
pxc|6 months ago
notachatbot123|6 months ago
iamnotagenius|6 months ago
kmacdough|6 months ago
Small models are going to be particularly poor when used outside of their intended purpose. They have to omit something.