Impressive that they publish and acknowledge the (tiny, but existent) drop in performance on SWE-Bench Verified between Opus 4.5 and 4.6. Obviously such a small drop in a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to center on managing larger context). But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to preempt the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.
epolanski|24 days ago
I had two different PRs with some odd edge case (thankfully caught by tests). 4.5 kept running in circles, creating test files and running `node -e` or `python 3` scripts all over, and couldn't make progress.
4.6 thought and thought for around 10 minutes in both cases and found a two-line fix for a very complex, hard-to-catch regression in the data flow, without having to test anything, just by thinking.
SubiculumCode|24 days ago
tedsanders|24 days ago