top | item 42774819

Fergusonb | 1 year ago

These benchmarks have even the small models absolutely demolishing Sonnet-3.5, which doesn't reflect my subjective experience.

It still seems to me that these models are 'dumb' and often don't understand what I'm asking, whereas Claude's intuition is much stronger.

R1 14B even feels weaker to me than Qwen 2.5 14B.

Primary use-case is web technology / coding. Maybe I'm prompting it incorrectly?

Workaccount2 | 1 year ago

There is a frustrating gap between benchmarks and real world ability.

o1 or even o3 might be able to crack academic-level math problems, but I still wouldn't trust it to correctly fill out a McDonald's application using a PDF of my resume and a calendar of my availability.

pclmulqdq | 1 year ago

A lot of that has to do with certainty. The GPTs and Claudes will be replacing graduate-level research assistant jobs and other jobs that are very high skill but have soft success criteria long before they replace travel agents, which are low skill but have very hard criteria for success.

Havoc | 1 year ago

The reasoning models are much better suited to questions that have definite answers and a conclusion to arrive at, i.e. exactly what benchmarks ask, rather than "make me a todo list app" or whatever.

It's a bit like how you get instruct-tuned models and chat-tuned ones. It's not really that one is worse than the other; they're just aimed at different uses.

cpldcpu | 1 year ago

These benchmarks are mostly focused on math, which benefits a lot from improved CoT and is also less sensitive to the reduced knowledge in smaller models.

Vibes are important in this case...