
Diff Models – A New Way to Edit Code

237 points | sadiq | 3 years ago | carper.ai | reply

199 comments

[+] pavlov|3 years ago|reply
Somehow these GitHub-trained ML code assistants sadden me.

My idea of enjoyable high-quality programming isn’t to dip a spoon into an ocean of soup made of other people’s random design decisions and bugs accumulated over fifteen years, hoping to get a spoonful without hidden crunchy insect bits.

I know the soup is nutritious and healthy 98% of the time, and eating it saves so much time compared to preparing a filet mignon myself. But it’s still brown sludge.

[+] credit_guy|3 years ago|reply
Take a look at the average faces of women across different countries [1]. They are all strikingly beautiful.

By averaging, a lot of imperfections get diluted away.

Like in Anna Karenina: "happy families are all alike; each unhappy one is unhappy in its own way". The defects are idiosyncratic; the commonalities are good.

[1] https://fstoppers.com/portraits/average-faces-women-around-w...

[+] pjc50|3 years ago|reply
Given all the discussion about "supply chain security", heading in this direction is surprising. I guess it means we automate away the creative part and leave the humans to the duller work of validation. Everyone's going to become a software tester.
[+] rileymat2|3 years ago|reply
I agree with you; however, the work I see is not that.

What I see is a person who copies and pastes crap around until it works and calls it a day. I think code assistants can and will compete with that.

[+] boredemployee|3 years ago|reply
Since everyone has different goals and opinions on this, many people will see it differently.

I love to solve _problems_ and to help people with them, but sometimes I just hate writing the code to solve them. I wish my computer could have a clear picture of the solution in my mind so I didn't have to write a single line of code and could focus on the creative part of problem solving.

[+] GuB-42|3 years ago|reply
I am not much into ML code assistants either, though it may change in the future as technology becomes better and more reliable.

But I don't buy the "joy of writing code" argument. Coding is all about making a computer work for you, and I think that taming AIs to be more efficient without letting them introduce random crap will become both important and enjoyable. The techniques we have now are too crude for that, but they will improve. Keep in mind that even if you are writing C, you are already working at a high level, using libraries and compilers other people wrote, bugs included.

Now, there is a certain charm to "hands-on" programming, but if that's what you're after, go get an Amiga and make a few demos. It won't pay the bills, but it can be fun.

[+] netr0ute|3 years ago|reply
I don't know about this. I would actually expect the opposite: the soup is just fine 98% of the time, and the last 2% is sweet because that's the "good stuff" you can get creative on, where the AI doesn't know how to help you.
[+] derefr|3 years ago|reply
> an ocean of soup made of other people’s random design decisions and bugs accumulated over fifteen years

As a human programmer, is this not what your own brain looks like? What are you doing to the information you take in that allows you to avoid regurgitating the "crunchy insect bits" of your own training corpus?

[+] gfodor|3 years ago|reply
These tools don't introduce design decisions or the other things that really constitute most of the "art" of programming. They just help you with the lowest-level bits of moving data around. This is probably a temporary condition, but your concern seems misplaced given where the tools are today.
[+] carlbarrdahl|3 years ago|reply
What if you could inspire the assistant with code you like and it would generate in that style? For example choose a few repos with code-bases you want to mimic, give it a set of instructions (and perhaps structure), and it generates code for it.

Maybe something like GPT, style transfer, and OpenAPI combined.

[+] indeyets|3 years ago|reply
Well, these are not tools for art-level programming. But they help improve the productivity of commercial programming a lot. Different genre.
[+] nbardy|3 years ago|reply
This isn't how large language models work. Deep features are much richer than this. It's not random; the models have their own sense of taste, and you can easily control it with comments specifying what you care about in the code you are going to write.
[+] SergeAx|3 years ago|reply
It's okay, you don't have to use AI assistants to program.
[+] neximo64|3 years ago|reply
And yet it is so useful. It is just an assistant. It's quite unlike soup, since you can easily alter it.
[+] nikau|3 years ago|reply
It's a bold assumption that most code these days isn't just a series of copy-pasted fragments from Stack Overflow anyway.
[+] Zetobal|3 years ago|reply
Eh... I don't have the desire to do all the plumbing in my house and neither in my code.
[+] williamcotton|3 years ago|reply
How much of your identity is made up of “programmer”? Are you proud or hesitant to tell people you’re a programmer? Do you identify as a “painter” or anything else? How often do you compare yourself to other programmers and feel bad?
[+] RjQoLCOSwiIKfpm|3 years ago|reply
Prepare for household appliances - washing machines etc. - doing strange things randomly.

Prepare for the same thing with electronics which you didn't consider as containing much software before - central heating units, AC units, fridges, stoves, light switches, LED light bulbs, vacuum cleaners, electric shavers, electric toothbrushes, kids toys, microwave ovens, really anything which consumes electricity.

Prepare for the support of the vendors of those appliances not taking phone calls anymore, only text communication.

Prepare for the support not understanding the random problems you encounter.

Prepare for the answers you get from support being similarly random.

And maybe, with unknown probability, prepare for your house burning down with nobody able to tell you why.

[+] Kwantuum|3 years ago|reply
A lot of the comments seem to talk about the inevitable AI event horizon but unless I'm misreading this article the results are flat out bad. Even the 6 billion parameters model barely scratches a 50% success rate on a tiny problem that is trivial to fix for any human with basic knowledge of programming. Note the log scale of the graph.
[+] startupsfail|3 years ago|reply
From the safety perspective (which may become important soon), it is perhaps a very bad idea to allow easy execution/injection of arbitrary code into random places with little review.

One of the first steps of a misaligned/unhelpful/virus type of system attempting to secure its presence would likely be gaining inference/GPU/TPU compute access. Code injection is one such vector; there are multiple others.

When designing such systems, please do keep that in mind. Make sure code changes are properly signed and the originating models are traceable.

Same applies to datasets generated by models.
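The "make sure code changes are properly signed and the originating models are traceable" idea can be sketched with nothing but the Python standard library. This is a minimal illustration, not an established scheme; the key handling and record fields are assumptions for the example:

```python
import hashlib
import hmac
import json

# Hypothetical signing key; in practice this would come from a secrets manager
# or be replaced by proper public-key signatures (e.g. git commit signing).
SIGNING_KEY = b"replace-with-a-real-secret"


def sign_patch(diff_text: str, model_id: str) -> dict:
    """Attach a MAC and model provenance to a generated diff."""
    payload = json.dumps({"model": model_id, "diff": diff_text}, sort_keys=True)
    mac = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"model": model_id, "diff": diff_text, "mac": mac}


def verify_patch(record: dict) -> bool:
    """Recompute the MAC over model + diff and compare in constant time."""
    payload = json.dumps(
        {"model": record["model"], "diff": record["diff"]}, sort_keys=True
    )
    expected = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["mac"])


record = sign_patch("@@ -1 +1 @@\n-foo\n+bar\n", "diff-codegen-350m-v2")
assert verify_patch(record)
```

Any pipeline applying model-generated patches could then refuse records whose MAC fails, and keep the `model` field for provenance.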

[+] jakear|3 years ago|reply
Excellent. This is the beginning of the end for the cohort of people writing clear, descriptive commit messages. All your knowledge is soon to be acquired and commodified by the Man with the GPU.

I on the other hand will survive: what sense is an AI to make of such classic messages as David Bowie's excellent "ch-ch-changes!", the five "fix CI maybe???"s in a row, or the eternal "fuck this shit"?

[+] PoignardAzur|3 years ago|reply
We're still at the beginning with these tools, but they're already demonstrating some really exciting capabilities.

Something I haven't seen explored much: navigation help. One of the things that takes me the most time when coding is remembering which file / module / function I need to edit next and jumping to it.

An autocomplete engine that suggested jump locations instead of tokens could help me stay in the flow much longer, with fewer worries about whether I'm introducing subtle bugs by relying on the AI too much.

[+] abhijeetpbodas|3 years ago|reply
On a philosophical level, AI for writing code has always seemed redundant to me. Here's why:

1. Humans create programming languages which machines can understand. OK.

2. Humans build tools (LSP, treesitter, tags, type checkers and others) to help humans understand code better. OK.

3. Humans build (AI) programs which run on machines so that the computer can understand... computer programs???

Aren't computers supposed to be able to understand code already? Wasn't the concept of "computer code" created so as to have something the computer could understand? Isn't making an (AI) program to help the computer understand computer programs re-inventing the wheel?

(Of course, I get that I use the terms "understand" and "computer programs" very loosely here!)

[+] manmal|3 years ago|reply
As long as we don't have "level 5" code generation (no human oversight necessary), we need the code to be human-readable. Afterwards, sure, why not produce assembly directly. Still, it might be more practical to produce platform-independent code instead - you'll only need to train one model instead of one per platform.
[+] semitones|3 years ago|reply
The benefit here is that the machine can execute what the AI produces, and humans can understand it / modify it if they need to.
[+] jrvarela56|3 years ago|reply
The impact of context on LLM performance makes higher-level languages a must for AI-generated programs. The AI doesn't 'understand' code the way a 'computer' does - it understands it the way we do, using text to express logic.

Arguably, we would benefit from even higher level abstractions so the LLM can fit more logic in a single prompt/output.

[+] divs1210|3 years ago|reply
Good point!

Maybe a future AI could generate machine code that could be "disassembled" into higher-level languages.

Not sure if that would be better.

[+] elcomet|3 years ago|reply
Yeah, you do. Machines execute code but don't understand it.
[+] lettergram|3 years ago|reply
I view programming as a trade. I’ve spent years honing my skills, I pass wisdom to junior engineers as I can. I review code and provide detailed alternatives.

My concern with AI across all fields is that people won't gain the fundamental skills necessary for pushing the bounds of what's possible. Certainly, tools like this could produce good results. However, the underlying humans are still providing the training data. More importantly, humans are setting the trajectory of development.

If humans are no longer capable of pushing the AI systems, then the AI systems will either cease to improve, or they will learn to play off each other. In highly complex systems like many programs, I suspect they'll play off each other and settle into local minima/maxima. I.e., because the "game" (program development) can be iterative, they'll constantly improve the code. But because the AI systems don't interact with all data (particularly real-world data), when a customer shows a sad face at some UI/UX, they won't develop a new feature that matches the customer's desires.

Where I fear this will leave us is a class of less-skilled engineers and overly optimized AI. Basically, stuck in development.

[+] ilaksh|3 years ago|reply
Since I am building a website https://aidev.codes to do programming based on natural language descriptions, this is extremely relevant to me.

OpenAI has an 'edit' endpoint, but it's 'in beta' and limited to 10-20 requests per minute. They do not acknowledge support requests about this. Azure OpenAI also has this endpoint, I think, but they ignore me as well.

So for my edits, just like everything else, I have been relying on text-davinci-003, since it has much more feasible rate limits. I have just been having it output the full new file, but maybe this unified-diff approach can be leveraged.

Does anyone know what would be the easiest way to run their 6B diff model against my own prompts for my service? Maybe Hugging Face?
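For anyone trying the checkpoints locally, the fiddly part is reproducing the prompt layout the diff models expect. A small helper, sketched from the `<NME>`/`<BEF>`/`<MSG>`/`<DFF>` format shown in the example elsewhere in this thread (the exact spacing/newlines are an assumption and worth checking against the model card):

```python
def build_diff_prompt(filename: str, file_contents: str, commit_message: str) -> str:
    """Assemble a diff-model prompt: file name, current file body, and the
    commit message describing the desired change. The model is then asked to
    continue after <DFF> with a unified diff implementing the message."""
    return (
        f"<NME> {filename}\n"
        f"<BEF> {file_contents}"
        f"<MSG> {commit_message}\n"
        f"<DFF>\n"
    )


prompt = build_diff_prompt("app.py", "print('hi')\n", "Add a greeting")
# Feeding `prompt` to a checkpoint such as CarperAI/diff-codegen-350m-v2 via
# transformers' AutoModelForCausalLM / AutoTokenizer (as in the example
# downthread) would be the next step; a GPU is strongly recommended.
```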

[+] moconnor|3 years ago|reply
All that, only to end with "no meaningful improvement over the Salesforce CodeGen model", is a bit disappointing.

Negative results are interesting in their own right. I'd rather read about why this isn't better at the 6B-parameter level than see a hand-wave that, well, the samples are more diverse and, look, the 350M model is better.

[+] youssefabdelm|3 years ago|reply
Yeah, I felt the same way. Although perhaps at a larger scale the fine-tuning can make a bigger difference? The results go against this hypothesis, but OpenAI states that GPT-3 needs only 200 examples, so who knows. In fact, I wonder how well GPT-3 would do against this when fine-tuned on just 200 examples.
[+] mortehu|3 years ago|reply
I wrote the program between <BEF> and <MSG>, and it generated the following output:

  <NME> diff_model.py
  <BEF> import argparse

  import torch
  import transformers

  def main():
      argparser = argparse.ArgumentParser()
      argparser.add_argument('--checkpoint', default='CarperAI/diff-codegen-2b-v2', choices=['CarperAI/diff-codegen-6b-v2', 'CarperAI/diff-codegen-2b-v2', 'CarperAI/diff-codegen-350m-v2'], help='Model to use')
      args = argparser.parse_args()

      model = transformers.AutoModelForCausalLM.from_pretrained(args.checkpoint)
      tokenizer = transformers.AutoTokenizer.from_pretrained(args.checkpoint)

      # Use CUDA.
      model = model.cuda()

      text = '<NME> diff_model.py\n<BEF> '
      text += open("diff_model.py").read()
      text += '<MSG> Print a message after loading the model\n<DFF>\n'

      completion = model.generate(tokenizer.encode(text, return_tensors="pt").cuda(), max_new_tokens=400)

      completion = tokenizer.decode(completion[0], skip_special_tokens=True)

      print(completion)


  if __name__ == '__main__':
      main()
  <MSG> Print a message after loading the model
  <DFF>
  <DFF> @@ -1,4 +1,5 @@
   '''
  +import sys
   import argparse

   import torch
  @@ -8,6 +9,7 @@ def main():
       tokenizer = transformers.AutoTokenizer.from_pretrained(args.checkpoint)

       # Use CUDA.
  +    print("CUDA:", torch.cuda.is_available())
       model = model.cuda()

       text = '<NME> diff_model.py\n<BEF> '
  @@ -10,3 +12,4 @@ def main():
       completion = tokenizer.decode(completion[0], skip_special_tokens=True)

       print(completion)
  +    sys.exit(0)
  \ No newline at end of file
  [\ No newline at end of file repeated many times]
It takes about 150 seconds to run on a 3090 Ti when the model is already on disk.
[+] Epa095|3 years ago|reply
Maybe this can give a boost to languages like Idris or F*, where you can specify much stronger types than in normal languages (at the price that you might have to prove the types manually). The types can help "tame" the AI-generated code, and the AI can help generate the proofs.

I also wonder if it could be useful in creating Coq proofs!

[+] parasti|3 years ago|reply
I skimmed the post, but it seems not much was said about how the original diffs are generated. Git generates diffs only on request, with varying accuracy depending on the options given. Sometimes the diff completely fails to capture the intent of the change - it shows a path from A to B, but not in any semantically meaningful way.
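The "path from A to B without the intent" point is easy to demonstrate with Python's stdlib difflib: swapping two functions is semantically a move, but the serialized diff only records line deletions and insertions (filenames here are illustrative):

```python
import difflib

# Two versions of the same tiny module: the functions are merely reordered.
before = ["def a():\n", "    pass\n", "\n", "def b():\n", "    pass\n"]
after = ["def b():\n", "    pass\n", "\n", "def a():\n", "    pass\n"]

# unified_diff emits +/- line edits; nothing in the output says "moved".
diff = list(difflib.unified_diff(before, after, fromfile="old.py", tofile="new.py"))
print("".join(diff))
```

Git has the same issue: the textual patch is one of many valid edit scripts (options like `--diff-algorithm` change which one you get), and a model trained on such diffs inherits that arbitrariness.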
[+] ec109685|3 years ago|reply
2022: engineers with 3 jobs

2023: engineers with their own AI model, typing “#fixed bugs” and spending the rest of the day by the pool.

[+] Jackson__|3 years ago|reply
I'm not sure if I'm just imagining it, but there seems to be a lot more negative push-back online to this than there was for Copilot.

It makes me wonder if it's related to recent protests in other creative fields in response to AI models, or just a weird dislike of openly released model weights?

[+] abdnafees|3 years ago|reply
Why now? I mean, it's been only 20-odd years since modern programming became popular, and that's not a lot. Let people learn how to code, make mistakes, and then learn from those mistakes. Pre-cooked meals are not as good as home-cooked goodness.
[+] indeyets|3 years ago|reply
So, is it loosely the same as Copilot? I understand the approach is a tad different, but the results of converting natural-language descriptions into code changes should be comparable.

And both are trained on a large corpus of GitHub sources.

Is there a way to test it somehow? Public API maybe?

[+] pklausler|3 years ago|reply
How good are these LLMs going to be at debugging code, as opposed to writing it?