item 40131107

ff317 | 1 year ago

There are two issues here still, IMHO:

1) The LLM owners really can't guarantee that it won't directly plagiarize without attribution or licensing. Your code may contain a unique algorithm or method for solving something, and when someone asks the right question, your code may simply be the only answer it knows to give.

2) While the code being used as training input was open source and visible to the public to learn from, the models being built often aren't. It seems unethical to train from public data yet keep the resulting weights private and charge for access to use the trained weights.

arghwhat | 1 year ago

For the first aspect, neither can a human, and it's incredibly hard to decide if something is plagiarized or fair-use/inspiration. There are several things to consider:

1) These tools are generally used in a pair-programming fashion, and in that function, the output can be considered similar to when you ask a coworker on Slack and they paste you a snippet, or when you browse GitHub and read someone else's implementation (without having the LICENSE text within your field of view at all times). A possible violation would then only occur once the snippets in question are incorporated into your codebase and distributed in ways that violate the original license.

2) One could argue that sharing the snippet with you was a form of redistribution, but I would not consider this to apply if a human did it, and would therefore not apply it to machines either; nor do I think that is what people generally consider redistribution of an open source project. The GPL technically has a clause about second-hand violations, but I do not think that one holds here.

It should also be noted that licenses like MIT only require the copyright and permission notice to be included in copies or substantial portions of the software, so smaller snippets are always fine. Humans also do not bother attributing smaller copy-paste blocks - we'd run out of storage linking to all the Stack Overflow answers!

3) The issue gets a bit hairier when the machine reproduces large/important portions of projects with no hint as to its source, license or ways to do proper attribution, but even then I'd consider the violation to occur only if included verbatim into a project which is then redistributed under incompatible terms.

4) Even when code is largely identical, it is generally only an issue if the code is a unique invention, not if it trivially follows for a skilled practitioner of the trade. That is a principle in many areas of law, including patent law.

For the second aspect, I do not see any importance in the fact that the trained model is not public. A person studying open-source projects does not upload a brain dump afterwards, and others only directly benefit from their experience (their "weights") if they decide to teach the subject. Nor is every project they write afterwards with that knowledge necessarily open source; it is only public if they choose to make it public. Licenses generally do not restrict private or internal usage, including modification and derivative works. It is redistribution they trigger on (with some caveats for things like the AGPL).

(I would of course like the model to be public for the betterment of mankind, but that's different from the legal aspect of it.)