chaxor | 1 year ago

Also important: they do have a 'not attempted' or 'do not know' type of response, though the article doesn't really discuss how it is used.

As it has been for decades now, the 'NaN' type of answer in NLP is important, adds great capability, and is often glossed over.

bcherry | 1 year ago

A little glossed over, but they do point out that the most important improvement o1 has over gpt-4o is not its "correct" score improving from 38% to 42% but actually its "not attempted" going from 1% to 9%. The improvement is even more stark for o1-mini vs gpt-4o-mini: 1% to 28%.

They don't really describe what "success" would look like, but it seems to me like the primary goal is to minimize "incorrect" rather than to maximize "correct". The mini models would get there by maximizing "not attempted", with the larger models having much higher "correct". Then both model sizes could hopefully reach 90%+ "correct" when given access to external lookup tools.
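
To make the "minimize incorrect" reading concrete, here is a rough Python sketch using only the o1 and gpt-4o percentages quoted in this thread. The "incorrect" share is not quoted anywhere above; it is derived under the assumption that "correct", "not attempted" and "incorrect" sum to 100%. The function names are made up for illustration, and the mini models are left out because their "correct" scores aren't given here.

```python
# Percentages quoted in the thread. "incorrect" is NOT quoted; it is derived
# below under the assumption that the three categories sum to 100%.
scores = {
    "gpt-4o": {"correct": 38, "not_attempted": 1},
    "o1": {"correct": 42, "not_attempted": 9},
}


def incorrect_pct(s):
    """Share of questions answered wrongly (assumed remainder of 100%)."""
    return 100 - s["correct"] - s["not_attempted"]


def correct_given_attempted(s):
    """Fraction of attempted questions that were answered correctly."""
    attempted = 100 - s["not_attempted"]
    return s["correct"] / attempted


for name, s in scores.items():
    print(name, incorrect_pct(s), round(correct_given_attempted(s), 3))
```

If that sum-to-100% assumption holds, "incorrect" falls from 61% for gpt-4o to 49% for o1, a 12-point drop, which is a much bigger move than the 4-point gain in "correct".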