agreed, and i'd go further - the harness is where evaluation actually happens, not in some separate benchmark suite. the model doesn't know whether it succeeded at a web task. the harness has to verify DOM state, check that the right element was clicked, and confirm the page transitioned correctly. right now most harnesses just check "did the model say it was done", which is why pass rates on benchmarks don't translate to production reliability. the interesting harness work is building verification into the loop itself, not bolting it on as an afterthought.
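
to make "verification in the loop" concrete, here's a rough sketch. everything in it is hypothetical: `Page` is a toy stand-in for browser state rather than any real browser API, and `Check`, `verify`, `run_step` and `click` are names made up for illustration. the point is only that the step result is decided by post-conditions on observed state, and the model's own "done" claim is recorded but never trusted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Page:
    """Toy stand-in for observed browser state (hypothetical, not a real API)."""
    url: str
    dom: Dict[str, str] = field(default_factory=dict)  # element id -> text
    last_clicked: Optional[str] = None


@dataclass
class Check:
    """A named post-condition evaluated against the page after an action."""
    name: str
    predicate: Callable[[Page], bool]


def verify(page: Page, checks: List[Check]) -> List[str]:
    """Return the names of the checks that failed."""
    return [c.name for c in checks if not c.predicate(page)]


def run_step(page: Page, action: Callable[[Page], None],
             checks: List[Check], model_says_done: bool) -> dict:
    """Apply the action, then verify state; the model's claim is only logged."""
    action(page)
    failed = verify(page, checks)
    return {"claimed": model_says_done, "verified": not failed, "failed": failed}


def click(element_id: str) -> Callable[[Page], None]:
    """Simulated click: only the 'submit' button triggers a page transition."""
    def act(page: Page) -> None:
        page.last_clicked = element_id
        if element_id == "submit":
            page.url = "/confirmation"
            page.dom["status"] = "Order placed"
    return act


checks = [
    Check("clicked submit", lambda p: p.last_clicked == "submit"),
    Check("navigated to confirmation", lambda p: p.url == "/confirmation"),
    Check("status shown", lambda p: p.dom.get("status") == "Order placed"),
]

# the model claims success in both runs; only the second one actually verifies
bad = run_step(Page(url="/cart"), click("cancel"), checks, model_says_done=True)
good = run_step(Page(url="/cart"), click("submit"), checks, model_says_done=True)
print(bad)
print(good)
```

in a real harness the checks would read from the live DOM instead of a dict, but the shape stays the same: a run where `claimed` is true and `verified` is false is exactly the case a "did the model say it was done" harness scores as a pass.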
