
Nomadeon | 6 months ago

Agree. Concrete example: "What was the Japanese codeword for Midway Island in WWII?"

Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...

dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds

deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds

gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds

gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes!

Yeah, yeah, it's only one question of nonsense trivia. I'm sure it was billions well spent.

It's possible I'm using a poor temperature setting or something, but since they weren't bothered enough to put it in the model card, I'm not bothered to fuss with it.
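For anyone who does want to fuss with it: a minimal sketch of how one might pin the sampling temperature when querying a local model through an OpenAI-compatible chat endpoint (as exposed by LM Studio or llama.cpp's server). The endpoint URL is an assumption, not from the thread; only the model name `openai/gpt-oss-20b` appears above.

```python
import json

# Hypothetical request payload for an OpenAI-compatible local server
# (e.g. LM Studio's default http://localhost:1234/v1/chat/completions).
# The URL is an assumption; the model name is the one quoted in the thread.
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [
        {
            "role": "user",
            "content": "What was the Japanese codeword for Midway Island in WWII?",
        },
    ],
    "temperature": 0.0,  # near-greedy decoding, to rule out sampling noise
    "max_tokens": 128,
}

# Serialize the request body; POST this to the server's chat completions route.
body = json.dumps(payload)
print(body)
```

Setting `temperature` to 0 makes the run roughly deterministic, so a wrong answer reflects what's in the weights rather than an unlucky sample.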


anorwell|6 months ago

I think your example reflects well on oss-20b, not poorly. It (may) show that they've been successful in separating reasoning from knowledge. You don't _want_ your small reasoning model to waste weights memorizing minutiae.

sailingparrot|6 months ago

> gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes!

To be fair, this is not the type of question that benefits from reasoning: either the model has this info in its parametric memory or it doesn't. Reasoning won't help.

bigmanhank|6 months ago

Not true: During World War II the Imperial Japanese Navy referred to Midway Island in their communications as “Milano” (ミラノ). This was the official code word used when planning and executing operations against the island, including the Battle of Midway.

12.82 tok/sec, 140 tokens, 7.91s to first token

openai/gpt-oss-20b

WmWsjA6B29B4nfk|6 months ago

What's not true? This is a wrong answer.

seba_dos1|6 months ago

How would asking this kind of question without providing the model with access to Wikipedia be a valid benchmark for anything useful?