
(no title)

htormey | 1 year ago

AI software engineers like Devin and SWE-agent are frequently compared to human software engineers. However, SWE-bench, the benchmark on which this comparison is based, covers only Python tasks, most of which involve single-file changes of 15 lines or fewer, and it relies solely on unit tests to evaluate correctness. My aim is to give you a framework for assessing whether AI's progress on this benchmark is relevant to your organization's work.
