
girvo|10 days ago

Benchmarks are basically meaningless at this point, in my experience. If they mattered and told the whole story, those Chinese open models would be stomping the competition right now. Instead, they're merely decent when you use them in anger for real work.

I'll withhold judgement until I've tried to use it.

phatfish|10 days ago

Does anyone know what this "APEX-Agents benchmark for long time horizon investment banking, consulting and legal work" actually evaluates?

That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.

avereveard|10 days ago

What's your opinion of glm5, if you've had a chance to use it?

girvo|10 days ago

I haven’t yet, though I will be this weekend!