wbharding|2 years ago
We hope it leads dev teams, and AI Assistant builders, to adopt measurement & incentives that promote reused code over newly added code. Especially for those poor teams whose managers think LoC should be a component of performance evaluations (around 1 in 3, according to GH research), the current generation of code assistants makes it dangerously easy to hit tab, commit, and seed future tech debt. As Adam Tornhill eloquently put it on Twitter, "the main challenge with AI assisted programming is that it becomes so easy to generate a lot of code that shouldn't have been written in the first place."
That said, the significance of our research is currently limited in that it does not directly measure which code was AI-authored -- it only charts the correlation between code quality over the last 4 years and the proliferation of AI Assistants. We hope GitHub (or other AI Assistant companies) will consider partnering with us on follow-up research to directly measure code quality differences between code that is "completely AI suggested," "AI suggested with human change," and "written from scratch." We would also like the next iteration of our research to directly measure how bug frequency is changing with AI usage. If anyone has other ideas for what they'd like to see measured, we welcome suggestions! We endeavor to publish a new research paper every ~2 months.
oooyay|2 years ago
imo, this is just replacing one silly measure with another. Code reuse can be powerful within a code base but I've witnessed it cause chaos when it spans code bases. That's to say, it can be both useful and inappropriate/chaotic and the result largely depends on judgement.
I'd rather we start grading developers based on the outcomes of software. For instance, their organizational impact compared to their resource footprint, or errors generated by a service that are not derivative of a dependent service/infra. A programmer is responsible for much more than just the code they write; the modern programmer is a purposefully bastardized amalgamation of:
- Quality Engineer / Tester
- Technical Product Manager
- Project Manager
- Programmer
- Performance Engineer
- Infrastructure Engineer
Edit: Not to say anything of your research; I'm glad there are people who care so deeply about code quality. I just think we should be thinking about how to grade a bit differently.
zemo|2 years ago
> Not to say anything of your research
The second statement isn't true just because you want it to be true. The first statement renders it untrue.
> I'd rather us start grading developers based on the outcomes of software. For instance, ... errors generated by a service
yeah you should click through and read the whitepaper and not just the summary. The authors talk about similar ideas. For example, from the paper:
> The more Churn becomes commonplace, the greater the risk of mistakes being deployed to production. If the current pattern continues into 2024, more than 7% of all code changes will be reverted within two weeks, double the rate of 2021. Based on this data, we expect to see an increase in Google DORA's "Change Failure Rate" when the “2024 State of Devops” report is released later in the year, contingent on that research using data from AI-assisted developers in 2023.
The authors are describing one measurable signal while openly expressing interest in the topics you're mentioning. The thing is: what's in this paper is a leading indicator, while what you're talking about is a lagging indicator. There's no clear hypothesis as to why, for example, increased code churn would reduce the number of production incidents or the mean time to resolution for those incidents.
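To make the leading-indicator idea concrete: the paper's two-week revert metric boils down to counting, per commit, whether it was backed out within 14 days of landing. Here's a toy sketch of that calculation over an in-memory commit log; the field names (`committed_at`, `reverted_at`) are hypothetical, not GitClear's actual schema, and a real pipeline would derive them from git history.

```python
from datetime import datetime, timedelta

# Toy commit log. `reverted_at` is when (if ever) the change was backed out.
commits = [
    {"sha": "a1", "committed_at": datetime(2024, 1, 1), "reverted_at": datetime(2024, 1, 5)},
    {"sha": "b2", "committed_at": datetime(2024, 1, 2), "reverted_at": None},
    {"sha": "c3", "committed_at": datetime(2024, 1, 3), "reverted_at": datetime(2024, 2, 20)},
    {"sha": "d4", "committed_at": datetime(2024, 1, 4), "reverted_at": None},
]

def revert_rate(commits, window=timedelta(days=14)):
    """Fraction of commits reverted within `window` of landing."""
    reverted = sum(
        1 for c in commits
        if c["reverted_at"] is not None
        and c["reverted_at"] - c["committed_at"] <= window
    )
    return reverted / len(commits)

print(revert_rate(commits))  # 0.25 -- only "a1" was reverted inside two weeks
```

The point of a metric like this is that it surfaces trouble weeks before incident counts or MTTR (the lagging outcomes) move.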
lolinder|2 years ago
So, would a more accurate title for this be "New research shows code quality has declined over the last four years"? Did you do anything to control for other possible explanations, like the changing tech economy?
nephrenka|2 years ago
There is actual AI benchmarking data in the Refactoring vs Refuctoring paper: https://codescene.com/hubfs/whitepapers/Refactoring-vs-Refuc...
That paper benchmarked the performance of the most popular LLMs on refactoring tasks on real-world code. The study found that the AI only delivered functionally correct refactorings in 37% of the cases.
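One common way to operationalize "functionally correct refactoring" is differential testing: feed the original and the refactored implementation the same inputs and check that the outputs agree. A minimal sketch (the two `slope_*` functions are hypothetical stand-ins, not examples from the paper):

```python
import random

# Original and (hypothetical) AI-refactored versions of the same function:
# the mean of consecutive differences in a list of numbers.
def slope_original(points):
    total = 0
    for i in range(1, len(points)):
        total += points[i] - points[i - 1]
    return total / (len(points) - 1)

def slope_refactored(points):
    # A tidier rewrite -- but does it behave identically?
    deltas = [b - a for a, b in zip(points, points[1:])]
    return sum(deltas) / len(deltas)

def behaves_identically(f, g, trials=1000):
    """Differential test: same random inputs must give (nearly) equal outputs."""
    for _ in range(trials):
        xs = [random.uniform(-100, 100) for _ in range(random.randint(2, 20))]
        if abs(f(xs) - g(xs)) > 1e-9:
            return False
    return True

print(behaves_identically(slope_original, slope_refactored))  # True
```

Random-input equivalence checks like this catch gross behavioral breaks, though they can't prove equivalence, which is one reason a human reviewer still has to stay in the loop.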
AI-assisted coding is genuinely useful, but we (of course) need to keep skilled humans in the loop and set realistic expectations beyond any marketing hype.