top | item 33739385

Generative AI profits off your code. Make them pay for it

40 points | evashang | 3 years ago | paytotrain.ai | reply

54 comments

[+] ipsum2|3 years ago|reply
If you put your code on Github, it's bound by the TOS, which states (https://docs.github.com/en/site-policy/github-terms/github-t...):

> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

Doesn't that contradict the purpose of this website? Is this performance art?

Also, why is this website so secretive? Why not publish the license on the website?

> Who is PayToTrain created by and why?

> PayToTrain is created by a small group of developers and attorneys who are passionate about open source software and ensuring that developers are properly compensated for their work. The website and service are provided completely free of charge.

Edit: PayToTrain looks like a non-disclosed ad and/or project from legalist . com.

[+] sanxiyn|3 years ago|reply
I agree the Humans Only Clause does not prevent Microsoft from training Copilot on code from GitHub due to GitHub's terms of service, but I think it does prevent, say, Salesforce from training its CodeGen model.

So if the clause is widely adopted, it may be good for Microsoft and bad for Salesforce. If you want to reward Microsoft and punish Salesforce, it may be a good idea.

[+] echelon|3 years ago|reply
It shouldn't even hinge on a TOS.

If Microsoft loses this case, it actually means Microsoft wins and we all lose.

Who has a large enough corpus of training data? Only institutional copyright holders.

This is probably going to play out like Oracle vs. Google when Google suddenly realized that they should lose and intentionally threw the case.

I'm so worried about this case. Treating copyrighted training data as fair use, and letting models learn as a child might learn from a book or movie, is the best way to proceed. It widens the playing field for both development and disruption.

[+] eternalban|3 years ago|reply
You're casting doubt on this with a frontal assault. I read your post and wanted to check out the 'show' that was implied, but immediately my eyes fell upon this:

Add our “Humans Only Clause” to your MIT license. Your code is still open source — for human developers only.

Sorely disappointed that there is no entertainment involved. That's actually a pretty cool idea.

So GitHub doesn't have (could be wrong) a default license grant or an over-riding licensing agreement. Your project, your license. If you change the license of your project, that is entirely your choice.

As to the Q of whether we should be generous to our corporate masters or take this opportunity to stick it to the man, get rewarded for our mind products, and be compensated for being geeks! Society does owe us something, does it not? /G

It's worth having a discussion about it, imo.

[+] anothernewdude|3 years ago|reply
Yeah, my content, but I don't necessarily have the ability to give them this, because people uploading code to GitHub don't always hold the rights needed to grant any license on the code they upload.
[+] Shindi|3 years ago|reply
A lot of people learned to code from completely free materials on the internet. In fact, it's actually amazing how many free learning resources there are on the internet. That's how we create a good internet ecosystem, not by making everyone pay for every little thing.

Teaching AI how to code is a continuation of building this ecosystem because people will use these generative AI coding tools, lowering the bar to code.

[+] rklaehn|3 years ago|reply
But if the AI has been trained on open source licensed code and occasionally can produce verbatim copies of said code, shouldn't the weights of the AI then also become open source under the same license?

You could argue that training is a form of compilation, and the weights are a derivative work.

[+] googlryas|3 years ago|reply
Regardless of any legal issues surrounding this, an important consideration is that modifying any well-known license will effectively mean your work will never get used by any large organization. No one wants to spend n hours of legal services vetting a custom license. They want to say "GPLv2, MIT, BSD are cool, everything else is banned".
[+] briga|3 years ago|reply
It certainly raises some interesting questions. After a model has been trained, is there really any surefire way to prove that these models are profiting from your individual code? How is this different from, say, search indexing in Google? Imagine Wikipedia wants to sue Google for stealing their content. Google essentially keeps a mirror of Wikipedia and uses that data to serve up better search results (sometimes). But is there legal ground to stand on in such a situation? It seems hard to prove that Company X made Y dollars off your individual code or text, therefore Company X owes you money.

In any case it raises some other questions about intellectual property as a whole. If you can sue an AI model for profiting off your intellectual property, why can't you sue a human for the same? Say you read a book one day, and are so inspired that you go ahead and write a new book. Imagine you publish that new book and sell millions of copies. Are you entitled to pay royalties to the author who gave you inspiration? It seems to me that unless you're plagiarizing large chunks of the original work verbatim, you probably shouldn't be forced to owe the original author much of anything. LLMs do plagiarize, but they do so somewhat inconsistently due to their non-deterministic output (just like humans!).

[+] cmeacham98|3 years ago|reply
I see that your lawyers have reviewed the licensing terms. Great!

However, have they reviewed Microsoft's claims that their use of code for Copilot is open source? And if they have, is there somewhere I can read that analysis?

Now, Microsoft could be wrong on that claim, but until someone convinces me otherwise I'm going to assume their lawyers did their due diligence and they're correct. If Microsoft is correct about that, it doesn't matter what you put in your license, and thus this is useless.

[+] cmeacham98|3 years ago|reply
I don't know how I missed this but I can't edit my comment now, I meant to say

"their use of code for Copilot is open source" -> "their use of code for Copilot is __fair use__"

[+] NavinF|3 years ago|reply
Yeah this is about as effective as making a FB post saying "I hereby revoke Meta's permission to use my photos". It's a surprisingly common theme on /r/oldpeoplefacebook
[+] rapjr9|3 years ago|reply
A lot of web pages already have a copyright notice, so why doesn't that stop AIs from training on the contents of web pages? Can you train AIs on patents? Or on binary executables? Or from cameras that observe the public? Or from public legal document repositories? Or from LexisNexis data? Or your personal health tracking wearable data? What exactly is ok or not ok? AIs can train on just about anything. I have long had a suspicion that most sites with large user populations use their collected data to train a stock market algorithm (in addition to advertising algorithms). What rules apply to government use of AI trained on all the data governments collect?
[+] neilv|3 years ago|reply
> Who is PayToTrain created by and why?

> PayToTrain is created by a small group of developers and attorneys who are passionate about open source software and ensuring that developers are properly compensated for their work. The website and service are provided completely free of charge.

That doesn't answer the obvious implied questions of how much anyone should trust this effort -- such as whether it might sell users out to the infringers, either individually, or as a "self-regulation" model that the infringers can point to in upcoming legal battles. And I'd be surprised if the attorneys didn't realize that.

[+] jongjong|3 years ago|reply
This is a great idea but I don't understand why I need to provide this service full write access to my public repos. Shouldn't full read access suffice?
[+] oadster|3 years ago|reply
We wanted to be able, in the future, to automatically add/append the license text to files. Adding the licenses one by one would be tedious, so coming up with a way to bulk-add the licenses seems essential.
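For illustration, the bulk-add step amounts to an idempotent text transformation per license file. A minimal sketch (hypothetical helper, with placeholder clause text, not the actual PayToTrain implementation or wording):

```python
# Hypothetical sketch: idempotently append a "Humans Only Clause" to a
# license file's text. CLAUSE_MARKER and CLAUSE_TEXT are placeholders,
# not the real PayToTrain clause.
CLAUSE_MARKER = "Humans Only Clause"
CLAUSE_TEXT = (
    "Humans Only Clause: use of this Software to train generative AI\n"
    "systems is prohibited absent payment of the licensing fee.\n"
)

def append_clause(license_text: str) -> str:
    """Return license_text with the clause appended, unless it is
    already present (so running a bulk job twice is harmless)."""
    if CLAUSE_MARKER in license_text:
        return license_text  # already added; keep the operation idempotent
    return license_text.rstrip("\n") + "\n\n" + CLAUSE_TEXT
```

A bulk job would then just read each repo's LICENSE file, run it through `append_clause`, and write the result back, which is why write access is being requested.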
[+] rklaehn|3 years ago|reply
This is a very clear-cut case of a copyright violation by GitHub Copilot:

https://twitter.com/docsparse/status/1581461734665367554?lan...

In this instance the weights of the AI system just contain an obfuscated copy of the original source code.

[+] sva_|3 years ago|reply
As others have said before, it is fairly likely that the copyright violation happened before Copilot learned from it, namely by people copying this code and publishing it under a different license.
[+] tarunmuvvala|3 years ago|reply
Thanks for taking the initiative. There is very little awareness of the implications of this.

I am not a developer, but I understand how generative AI leverages your work to make things easier for someone else.

A similar thing needs to be done for images too.

[+] s-lambert|3 years ago|reply
>I am not a developer, but I understand how generative AI leverages your work to make things easier for someone else.

To me this sounds like it's antithetical to open source software because the point of making software open source is so that other people can leverage your work. It shouldn't matter if it's done through generative AI or through a human's brain.

[+] lost_tourist|3 years ago|reply
I think the issue is that the work is meant to be usable by people for free, but now it's usable for a fee paid to Microsoft, despite open source licenses that help keep the software free and not a boon only to huge corporations. This seems like a sneaky way of getting around the licensing. Maybe we need a GPLv4 that covers usage by AI models.
[+] rklaehn|3 years ago|reply
Interesting.

I would be fine with an AI being trained on my code, provided that the weights of said AI would then be published under the same license as my code.

Is there a license for that?

[+] carom|3 years ago|reply
Anyone have the text of the clause? I clicked the get started link but it wanted me to log in with my GitHub. I did not see it in the FAQ.
[+] oadster|3 years ago|reply
"Use of the Software by any person to train, teach, prompt, populate, or otherwise further or facilitate any so-called generative artificial intelligence, generative algorithm, generative adversarial network, generative model, or similar or related activity (or to attempt to perform any of the foregoing acts or activity), whether in connection with any so-called machine learning, deep learning, neural network, or similar or related framework, system, or model or otherwise, is strictly prohibited and beyond the limited scope of this license, absent prior payment to licensor of the licensing fee of the amount of ____"
[+] evashang|3 years ago|reply
TL;DR our lawyers wrote a clause to protect open source code from being used by generative AI companies for profit. You can find it here: paytotrain.ai

A legal grey area exists as to whether publicly available creations (code or art) can be used as training data for generative AI projects without infringing their creators' underlying copyrights. Other types of claims, such as violation of license agreements and DMCA violations, require proof of damages to substantiate.

The legal solution we’ve identified is to add a specific damages amount to the license itself — a licensing fee. The failure to pay such a fee would cause the creator to suffer damages in the amount of the fee. By embedding a licensing fee into a traditional open-source license, a creator can solve the proof-of-damages issue that could otherwise limit a claim under the DMCA or for breach of contract, and limit the fee to generative AI companies.

That’s why we built the Humans Only Clause. If you don’t want your code used by Copilot in this way, the Humans Only Clause can help strengthen your protections from use for training purposes. It’s a simple addition to your existing open source license to keep it free use and open source for other developers, but to prevent use without attribution by generative AI companies.

You can access the Humans Only Clause and insert it into your GitHub repo by going to PayToTrain.ai — we also built a payments form where you can set your own licensing fee depending on how valuable you believe your repo to be. If we get enough people using this clause, there’s a good chance we can assemble a separate class for a future class action, where each user gets significantly higher damages than what’s available statutorily under existing DMCA lawsuits.

On a philosophical level, we believe that the open source community is based on principles of taking and giving back to the collective. AI-based programming assistants strip away any attribution while drawing from the underlying contributions of the community. We want the open source community to continue to be open source, but we don’t want big companies to profit on our code.

If you’re interested, check it out: paytotrain.ai. We’d love to hear what you think

[+] dQw4w9WgXcQ|3 years ago|reply
> AI-based programming assistants strip away any attribution while drawing from the underlying contributions of the community ... we believe that the open source community is based on principles of taking and giving back to the collective

How is this behavior different than 90% of human coders? Most SW devs scream if you ask them to pay for something, whether it's apps, code, TV, movies, etc. But they will happily try to build startups on top of a vast mountain of free code. I really don't care if my code gets sucked up by the AI vacuum; humans have been doing that quite well for a while now.

[+] bugfix-66|3 years ago|reply
Thank you. Could you paste the Humans Only Clause here so we can read it?

Here was my attempt to write a clause prohibiting language model training/inference:

https://bugfix-66.com/7a82559a13b39c7fa404320c14f47ce0c304fa...

  3. Use in source or binary form for the construction or operation
     of predictive software generation systems is prohibited.

How does the Humans Only Clause fix the flaws in my attempt?

The Humans Only Clause adds an explicit licensing fee, and what else?

How is the clause worded?