Interesting, I've had similar issues. It seems to be very clumsy when using its internal tooling. I've seen diffs where it accidentally garbled significant amounts of code, which it then had to go in and manually fix. It's also introduced bugs into features it wasn't supposed to be touching, and when I asked it why it was making changes to the other code, it answered that it had failed to copy-paste large blocks of code correctly.
sheepscreek|9 days ago
The problem is again that you can’t ever fully trust that an agent did exactly what you asked, in exactly the manner you had hoped.
It works much like dealing with a human collaborator. Trust takes time to build. Over time you learn the other individual’s weaknesses and compensate for them.
What makes it a bit challenging right now is the pace of innovation. By the time we get used to a model’s personality, a new update comes out that alters it in unknown ways. Now you’re back to square one.
I’ve been experimenting with asking one frontier model to check on another’s work. That’s proven to be better than doing nothing. Usually they’ll have some genuinely useful feedback.
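The cross-checking workflow above can be sketched in a few lines. This is an illustrative example only, not the commenter's actual setup: the `anthropic` client library, the model name, and the helper functions are all assumptions I'm making to show the shape of the idea (one model's diff gets handed to a second model with a review prompt).

```python
def build_review_prompt(task: str, diff: str) -> str:
    """Compose a prompt asking a second model to audit another model's diff."""
    return (
        "Another model produced the diff below for this task.\n"
        f"Task: {task}\n\n"
        f"Diff:\n{diff}\n\n"
        "List any bugs, unrelated changes, or garbled edits you find."
    )


def cross_check(task: str, diff: str) -> str:
    """Send the first model's output to a second frontier model for review.

    Assumes `pip install anthropic` and an ANTHROPIC_API_KEY in the
    environment; the model name is a placeholder.
    """
    import anthropic

    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=1024,
        messages=[{"role": "user", "content": build_review_prompt(task, diff)}],
    )
    return resp.content[0].text
```

The useful part is less the API call than the framing: the reviewer model sees only the task and the diff, so it has no incentive to defend the first model's choices.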