We can measure this by looking at the same harness applied to different models, e.g. the very plain Terminus: https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

Models have improved dramatically even with the same harness.

jwpapi|13 days ago

I mean that the way it tackles a task at its core is generated differently, like an inner harness, through the system prompt or something deeper. For example, instead of answering instantly, it goes through pre-defined steps: deciding which strategy to use, splitting the task, using thinking tokens, using tools, etc.
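
To make the "inner harness" idea concrete, here is a rough sketch of what such a pre-defined step sequence might look like if it were written as explicit code. Everything in it is hypothetical: the names run_inner_harness, choose_strategy, split_task and Trace are made up for illustration, the "think" step is only a placeholder for spending thinking tokens, and no real model or agent framework is claimed to work this way.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Records which pre-defined steps ran, in order."""
    steps: List[str] = field(default_factory=list)


def choose_strategy(task: str) -> str:
    # Hypothetical rule: tasks joined by " and " get decomposed,
    # everything else is handled as a single unit.
    return "decompose" if " and " in task else "direct"


def split_task(task: str) -> List[str]:
    # Naive split on " and "; a real system would do something smarter.
    return [part.strip() for part in task.split(" and ") if part.strip()]


def run_inner_harness(
    task: str,
    tools: Dict[str, Callable[[str], str]],
    trace: Trace,
) -> List[str]:
    """Walk a task through fixed steps instead of answering instantly."""
    trace.steps.append("choose_strategy")
    if choose_strategy(task) == "decompose":
        trace.steps.append("split_task")
        subtasks = split_task(task)
    else:
        subtasks = [task]

    results: List[str] = []
    for sub in subtasks:
        # Placeholder for the "use thinking tokens" step.
        trace.steps.append("think")
        # Assumption: the first word of a subtask names the tool to use.
        tool_name = sub.split()[0]
        tool = tools.get(tool_name)
        if tool is not None:
            trace.steps.append(f"tool:{tool_name}")
            results.append(tool(sub))
        else:
            results.append(f"answer: {sub}")
    return results
```

The point of the sketch is only that the step order is fixed ahead of time, whether it lives in an outer harness, a system prompt, or the model's training; it says nothing about where that behavior actually comes from.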