lambdaxyzw | 1 year ago

>How can corporations be stealing anything from an open source project?

The code is published under a license that allows some use cases and prohibits others. For example, the GPL is famous for being viral. Using it to teach an LLM that spits out "unlicensed" code is basically laundering copyright.

hiatus|1 year ago

Using it to train an LLM seems orthogonal to the output of the LLM. For instance, they could have their LLM include a link to the license. Merely training an LLM on the data does not seem to be against the spirit of GPL or Apache license.

mrweasel|1 year ago

Someone could easily create such a license: free to use and distribute, $10,000 per line used for AI model training.

I'll very naively assume that Amazon, OpenAI, Google and others check licenses before feeding data to their models. I'll stop assuming that when one of these companies admits that they don't actually care and that it's not profitable for them to respect licenses.

carom|1 year ago

The LLM is quite literally a derivative work of GPL code. At the very least, there is an argument in such a case that the derivative function (the model weights) should conform to the same license.

slindsey|1 year ago

I've heard AI advocates talk about a "right to read" or "right to learn", meaning that we have the right to read something, internalize it, and use it. Therefore, why shouldn't an AI have the same right? The difference, to me, seems to be that the AI has the ability to regurgitate it in whole.

I can read a book, learn about the concepts, then use or repeat those concepts. The AI can do the same. But is it really "learning"? It may be just spewing out pieces of the content without any understanding. In which case it's a copyright violation, right?

barfbagginus|1 year ago

You need to do more than include a link to the license to comply. You need to include the entire source code needed to compile the derived system.

For an LLM that would include:

1. Training data

2. Training code and metrics

3. Hyperparameter settings

4. Output weights

Anything less is really just a misinterpretation of the nature of open source's provision for studying, modifying, and recompiling the LLM.

TL;DR: these companies MUST license the LLM under the AGPL and provide all the necessary code as described above. Companies that refuse will be raided by open source copyright trolls, if we're lucky and a little mischievous.