htormey | 1 year ago

I don’t trust anecdotes on Twitter because every time I’ve tried a hyped-up agent, it’s been more expensive and time-consuming than just using GitHub Copilot with Claude/ChatGPT and putting up a PR myself.

Hence I’m skeptical of people making claims about a product I can’t try out myself. It’s unclear whether the tasks they are doing and the way they are using agents are relevant to the work I do, which is usually working on a team of engineers shipping code on a complex code base.

For AI I tend to put a lot more weight on benchmarks, such as SWE-bench, which is why I wrote an article about it:

https://www.stepchange.work/blog/why-do-ai-software-engineer...

SWE-bench is mostly small Python tasks, evaluated solely by unit tests, that require changes of fewer than 15 lines to a single file. Devin fails at most of those, and in the ones it gets right it ignores all sorts of libraries and conventions used in the rest of the code base.

I’m optimistic that agents will improve dramatically in a few years, but today Devin is not good at making larger changes that build on one another, like features.
