nicebyte | 1 year ago

How did you draw that conclusion from reading the contents of the link? This is a benchmark.

> We evaluate model performance and find that frontier models are still unable to solve the majority of tasks.
