noddybear | 11 months ago

This is true - there are simpler benchmarks on which these models already saturate planning. We were motivated to create a broader-spectrum eval that tests multiple capabilities at once and remains viable into the future.

noosphr | 11 months ago

That's fair enough, but you should test other frontier model types to see if the benchmark makes sense for them.

For example, the shortest-path benchmark is largely useless when you look at reasoning models. Since they have the equivalent of scratch paper to work through their answers, the limitation becomes their context length rather than any innate ability to reason.