aluzzardi | 2 days ago
Models are evolving fast. If your experience is older than a few months, I encourage you to try again.
I mean this with the best intentions: it's seriously mind-boggling. We started doing this with Sonnet 4.0 and the relevance was okay at best. Then in September we shifted to Sonnet 4.5 and it's been night and day.
Every single model released since then (Opus 4.5, 4.6) has meaningfully improved the quality of results.
whoami4041|2 days ago
shad42|2 days ago
But it's night and day to fix your CI when someone (in this case an agent) has already dug into the logs and the test code and proposed options for a fix. We have several customers asking us to automate the rest (all the way to merging code), but we haven't done it for the reasons you mention. Although I am sure we'll get there sometime this year.