top | item 28281995

bhuga | 4 years ago

Is there equivalent empirical data from real programmers?

That is to say, you have code prompts here, let Copilot fill in the gaps, and rate that code. Is there a study that uses the same prompts with a selection of programmers to see if they do better or worse?

I'm curious because in my testing of copilot, it often writes garbage. But if I'm being honest, often, so do I.

I feel like Twitter's full of cheap shots at Copilot's bad outputs, but many of them don't seem to be any worse than common human errors. I would really like to see how Copilot stands up to the existing human competition, especially on the axis of security, which is more objectively measurable than general "quality".

kiwih | 4 years ago

Yes, the work definitely lends itself to the question "is this better or worse than an equivalent human developer?" That is quite a difficult question to answer, although I agree that simply giving a large number of humans the same prompts could be insightful. However, you would then be rating against an aggregate of humans rather than an individual (whereas there is only "the" one Copilot). Also, knowing research, you would really be comparing against a random corpus of student answers, as it is usually students who participate in a study like this.

Nonetheless, we think that simply having a quantification of Copilot's outputs is useful, as it can definitely provide an indicator of how risky it might be to provide the tool to an inexperienced developer that might be tempted to accept every suggestion.

laumars | 4 years ago

Rather than comparing against students in lab conditions, I'd be more interested to see a comparison of students with access to Stack Overflow et al. vs. students with access to just Copilot. I.e., is a junior developer more likely to trust bad suggestions found online or bad suggestions made by Copilot?