top | item 39168841

(no title)

wbharding | 2 years ago

Original research author here. It's exciting to find so many thinking about long-term code quality! The 2023 increase in churned & duplicated (aka copy/pasted) code, alongside the reduction in moved code, was certainly beyond what we expected to find.

We hope it leads dev teams, and AI Assistant builders, to adopt measurement & incentives that promote reused code over newly added code. Especially for those poor teams whose managers think LoC should be a component of performance evaluations (around 1 in 3, according to GH research), the current generation of code assistants make it dangerously easy to hit tab, commit, and seed future tech debt. As Adam Tornhill eloquently put it on Twitter, "the main challenge with AI assisted programming is that it becomes so easy to generate a lot of code that shouldn't have been written in the first place."

That said, our research significance is currently limited in that it does not directly measure what code was AI-authored -- it only charts the correlation between code quality over the last 4 years and the proliferation of AI Assistants. We hope GitHub (or other AI Assistant companies) will consider partnering with us on follow-up research to directly measure code quality differences in code that is "completely AI suggested," "AI suggested with human change," and "written from scratch." We would also like the next iteration of our research to directly measure how bug frequency is changing with AI usage. If anyone has other ideas for what they'd like to see measured, we welcome suggestions! We endeavor to publish a new research paper every ~2 months.

discuss

order

oooyay|2 years ago

> We hope it leads dev teams, and AI Assistant builders, to adopt measurement & incentives that promote reused code over newly added code.

imo, this is just replacing one silly measure with another. Code reuse can be powerful within a code base but I've witnessed it cause chaos when it spans code bases. That's to say, it can be both useful and inappropriate/chaotic and the result largely depends on judgement.

I'd rather us start grading developers based on the outcomes of software. For instance, their organizational impact compared to their resource footprint or errors generated by a service that are not derivative of a dependent service/infra. A programmer is responsible for much more than just they code they right; the modern programmer is a purposefully bastardized amalgamation of:

- Quality Engineer / Tester

- Technical Product Manager

- Project Manager

- Programmer

- Performance Engineer

- Infrastructure Engineer

Edit: Not to say anything of your research; I'm glad there are people who care so deeply about code quality. I just think we should be thinking about how to grade a bit differently.

zemo|2 years ago

> this is just replacing one silly measure with another

> Not to say anything of your research

The second statement isn't true just because you want it to be true. The first statement renders it untrue.

> I'd rather us start grading developers based on the outcomes of software. For instance, ... errors generated by a service

yeah you should click through and read the whitepaper and not just the summary. The authors talk about similar ideas. For example, from the paper:

> The more Churn becomes commonplace, the greater the risk of mistakes being deployed to production. If the current pattern continues into 2024, more than 7% of all code changes will be reverted within two weeks, double the rate of 2021. Based on this data, we expect to see an increase in Google DORA's "Change Failure Rate" when the “2024 State of Devops” report is released later in the year, contingent on that research using data from AI-assisted developers in 2023.

The authors are describing one measurable signal while openly expressing interest in the topics you're mentioning. The thing is: what's in this paper is a leading indicator, while what you're talking about is a lagging indicator. There's not really a clear hypothesis as to why, for example, increased code churn would reduce the number of production incidents, the mean time to resolution of dealing with incidents, etc.

lolinder|2 years ago

> That said, our research significance is currently limited in that it does not directly measure what code was AI-authored -- it only charts the correlation between code quality over the last 4 years and the proliferation of AI Assistants

So, would a more accurate title for this be "New research shows code quality has declined over the last four years"? Did you do anything to control for other possible explanations, like the changing tech economy?

nephrenka|2 years ago

> our research significance is currently limited in that it does not directly measure what code was AI-authored

There is actual AI benchmarking data in the Refactoring vs Refuctoring paper: https://codescene.com/hubfs/whitepapers/Refactoring-vs-Refuc...

That paper benchmarked the performance of the most popular LLMs on refactoring tasks on real-world code. The study found that the AI only delivered functionally correct refactorings in 37% of the cases.

AI-assisted coding is genuinely useful, but we (of course) need to keep skilled humans in the loop and set realistic expectations beyond any marketing hype.