phatfish | 9 days ago

Does anyone know what this "APEX-Agents benchmark for long time horizon investment banking, consulting and legal work" actually evaluates?

That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.
