top | item 33487596

sj4nz|3 years ago

This isn't my experience with Copilot's suggestions. I've literally been able to have Copilot suggest a complete unit test based on a novel structure I hand-coded myself and a few words describing the test. The constants are often wrong, but it saves minutes of fiddling with the syntax for unit tests and assertions.

These are not quotations from other people's code but something emerging from the deep structures of language and programming-language semantics. That said, I suspect that if you supplied enough of a snippet from another source you could coax Copilot into suggesting code learned from that source, though it would likely be washed out by other code in the corpus where the meanings coincided.

jacoblambda|3 years ago

Worth noting with models like Copilot: if you deliberately give it an input similar to the training contents, odds are it'll reiterate it near-verbatim.

The main issue is that while you can use copilot to create "new"/transformative code, it's also trivial to get it to pump out licensed works in a form where you could claim "I didn't know it was taken from x project with y license because the tool made it for me".

I personally have no problem with Copilot in concept; however, doing it (or any other AI-model-based text/graphics tool) without infringing on people's copyrights is practically an unsolved problem (excluding just pre-licensing the training data ahead of time).

withinboredom|3 years ago

I mean, you can prompt me (or any other engineer) to spit out copyrighted code. FizzBuzz comes to mind… as do a number of algorithms I’ve written in the past which belong to my past employers…
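To make the point concrete, here's roughly what any engineer (or a model) would produce when prompted for FizzBuzz — the implementation below is my own illustrative Python version, but it's so canonical that essentially the same code exists in thousands of repositories, which is exactly what makes provenance slippery:

```python
def fizzbuzz(n):
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:          # divisible by both 3 and 5
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out

print(fizzbuzz(15))
```

Nobody could plausibly claim copyright infringement over this even though it was "learned" from somewhere; the hard question is where on the spectrum between FizzBuzz and a verbatim GPL-licensed function a model's output stops being generic.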

I really think we are entering interesting territory that will likely open quite a can of worms.

heavyset_go|3 years ago

Companies in spaces outside software development already pay a lot of money for datasets to train models on. On top of that, they spend a lot of money on labelling and whatnot.

Software is unique in that there is a cultural trend to share source code, so that makes it easy to compile into "free" datasets.

I wouldn't say it's an unsolved problem; it's just that there are no incentives to compile or pay for datasets when Microsoft already has petabytes of code to train on. If anything, I expect Microsoft to sell datasets based on GitHub repositories if Copilot-like models survive this lawsuit and are commoditized.