top | item 36177691

alan-stark | 2 years ago

The abstract says: "...we present metrics from our large-scale deployment of CodeCompose that shows its impact on Meta's internal code authoring experience over a 15-day time window, where 4.5 million suggestions were made by CodeCompose. Quantitative metrics reveal that (i) CodeCompose has an acceptance rate of 22% across several languages, and (ii) 8% of the code typed by users of CodeCompose is through accepting code suggestions from CodeCompose. Qualitative feedback indicates an overwhelming 91.5% positive reception for CodeCompose."

In other terms, out of 4.5 million suggestions about 80% were off, yet there is 91% positive reception. That's 3.6 million rejected suggestions that potentially distracted programmers from doing their work. Yet users are happy. Is there a contradiction in these figures?
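A quick back-of-the-envelope check of those figures (using the abstract's 22% acceptance rate rather than the rounded 80% rejection):

```python
# Sanity check of the abstract's numbers: 4.5M suggestions, 22% accepted.
suggestions = 4_500_000
acceptance_rate = 0.22

accepted = suggestions * acceptance_rate
rejected = suggestions - accepted

print(f"accepted: {accepted:,.0f}")  # ~990,000
print(f"rejected: {rejected:,.0f}")  # ~3,510,000
```

So "3.6 million rejected" rounds 78% up to 80%; the exact figure is closer to 3.5 million.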

alan-stark|2 years ago

Reading these answers reminded me why I love HN - actually thoughtful perspectives :) Guess a lot boils down to two variables - (a) suggestion UX quality and (b) definition of 'rejection' event. I skimmed through the paper and it turns out that 91% figure is based on feedback from 70 people, and anonymous feedback wasn't allowed. So, 'overwhelming 91% favorable' can be paraphrased to '64 people out of the total 16k user base said they liked it'. Would be interesting to see indirect metrics like retention on day 15.

idiotsecant|2 years ago

Quite an insightful comment. In an institution that large it's surprising there were only 64 brown nosers. I expect out of 16k captive audience employees you could probably get 64 people to give a positive opinion of replacing paychecks with meta store scrip.

moonchrome|2 years ago

It's easy to:

- anticipate when the suggestions are likely to be useless and not even bother

- scan the proposals to see if they are what you want in cases it's useful

It's a boilerplate generator and you're happy when it saves you tedious mental effort.

pydry|2 years ago

>It's a boilerplate generator and you're happy when it saves you tedious mental effort.

On the other hand the person trying to track down a subtle bug afterwards might be a little less happy at having to wade through oceans of boilerplate.

fnordpiglet|2 years ago

I’d say it’s hard to argue with the positive impression of the engineer using it. If they find its suggestions helpful, it’s not a distraction, it’s helpful.

Using GitHub copilot daily I find its suggestions often nonsense but interesting to see regardless. Often for boilerplate it’s spot on and it saves me dozens of lines of typing. But it also suggests stuff on every key stroke, much of which I just type through, similar to intellisense. Assuming Meta's code thingy is better, I would find myself in that 91%, as I’m already there with what’s available to the general public.

My only gripe, fwiw, with copilot in vscode is it interferes with intellisense. Often I want to see the code completion from both, but copilot jumps in before intellisense and the intellisense never renders, and I use it as an inline api reference. Sometimes it’s so frustrating I have to turn off copilot. But, copilot is generally useful enough that I reenable it once I’ve understood the api stuff I’m unsure of. There’s some escape-backspace-period dance I can do that sometimes lets intellisense win. I’ve not dug deeply enough into vscode configuration to know if there’s some parameter to tweak the race conditions. I’d note that when intellisense renders first, copilot still renders its suggestions, but the other way around doesn’t work.

rychco|2 years ago

I treat it the same way I do pre-LLM LSP suggestions, which is basically inline documentation lookup. ‘Oh what was that function name for inserting something at the end? PushB- no, InsertAft- no, App - end! Yea that’s it’

In this case it gave me 3 suggestions but I only accepted 1. I could see this taking 5-10 suggestions for an LLM when it’s not something as straightforward as a function name. It’s still very useful despite this low acceptance rate.

pavlov|2 years ago

I think the 8% number better explains why users were so overwhelmingly happy. Assuming the suggestions in general are not distractingly wrong, then 8% of code automatically written is a decent amount of time saved researching solutions.

layer8|2 years ago

But those 8% come from only 22% of suggestions being accepted, which means the 78% of suggestions that are not accepted correspond to an equivalent of over 28% of all code written. Not sure that having to spend the time evaluating an additional 28% of code in vain amounts to an overall win.

Though I guess the success rates when using Stack Overflow aren’t too dissimilar.
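The parent's 28% estimate checks out, assuming accepted and rejected suggestions are of similar average length (an assumption the paper doesn't state either way):

```python
# Check layer8's estimate: if accepted suggestions make up 8% of typed code
# and only 22% of suggestions are accepted, how much code-equivalent do the
# rejected suggestions represent?
share_from_accepted = 0.08  # 8% of typed code came via accepted suggestions
acceptance_rate = 0.22      # 22% of all suggestions were accepted

# Code-equivalent of ALL suggestions shown, as a fraction of typed code
all_suggestions_equiv = share_from_accepted / acceptance_rate
rejected_equiv = all_suggestions_equiv * (1 - acceptance_rate)

print(f"{rejected_equiv:.1%}")  # ~28.4%
```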

visarga|2 years ago

Interesting that 91% find it useful but only 8% of the code is generated by the LLM. This is even with an LLM tuned on the internal codebase. This will give a mild boost but not replace anyone.

cloudking|2 years ago

Have you tried GitHub Copilot? You don't have to accept the code suggestions, so they don't really distract you or get in the way once you get used to the UX.

tablatom|2 years ago

I find them extremely distracting. Evaluating a suggestion is, for me, an entirely different mental process from the creative process I’m in the middle of. The tagline that copilot helps you stay in the flow is very much not my experience.

I am well aware that others are having a different experience with it.

skybrian|2 years ago

It’s a different system, but it seems interesting to compare with what Google does for code review suggestions [1].

> The final model was calibrated for a target precision of 50%. That is, we tuned the model and the suggestions filtering, so that 50% of suggested edits on our evaluation dataset are correct. In general, increasing the target precision reduces the number of shown suggested edits, and decreasing the target precision leads to more incorrect suggested edits. Incorrect suggested edits take the developers time and reduce the developers’ trust in the feature. We found that a target precision of 50% provides a good balance.

Also, it seems like if the suggestions are too good then they’ll be blindly trusted and if they’re too bad they’ll be ignored?

Where to set the balance likely depends on the UI. For a web search, how many results do you click on?

[1] https://ai.googleblog.com/2023/05/resolving-code-review-comm...

seanmcdirmid|2 years ago

A lot of the time, suggestions are provided but not used because you already knew the answer and typed fast enough not to take them.

afro88|2 years ago

Think of it like traditional code completion. It's mostly wrong but still useful. You either type through it, or tab/arrow to select the correct completion.

AI code completion (like Github Copilot) is like this. Still a time saver overall, even with a low acceptance rate.

YetAnotherNick|2 years ago

If you take a random question from Stack Overflow, my guess is that 80% of them don't have a correct answer, yet I am very happy Stack Overflow exists.

Mountain_Skies|2 years ago

I've had Bing provide me with code from SO that was from the question, which was code that was explicitly stated to not work and the poster wanted to know what was wrong with it. Bing's AI didn't understand this and claimed it was a solution.

joshuamorton|2 years ago

The UX is really important, and the paper covers it: this is super spiffy tab completion, so even if it's wrong a lot, reading is faster than typing, and having something autofill `x if x is not None else ''` correctly even 5% of the time is nice.

meling|2 years ago

I was thinking the same; it feels like the acceptance rate is a bit low, but maybe not… I wonder what the numbers are for Copilot?

anigbrowl|2 years ago

It's not like programmers normally get everything right first time.

6510|2 years ago

What if it makes you a 1.2x dev?