I think in this regard it works just fine. If the laws move to say that "learning from data" while not reproducing it is "stealing", then yes, you reading others' code and learning from it is also stealing.
If I can't feed a news article into a classifier to teach it to predict whether I would like that article, that's not a world I want to live in. And yes, it's exactly the same thing as what you are accusing LLMs of.
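To make the comparison concrete, here is a toy sketch of the kind of preference classifier the comment describes: a bag-of-words Naive Bayes model trained on a few labeled headlines. The training examples, labels, and function names are all hypothetical, invented for illustration; any real recommender would be far more involved.

```python
# Toy "do I like this article?" classifier: Naive Bayes over bag-of-words.
# Purely illustrative; all training data here is made up.
from collections import Counter, defaultdict
import math

def train(examples):
    """examples: list of (text, label). Returns per-label word counts and label counts."""
    counts = defaultdict(Counter)
    labels = Counter()
    for text, label in examples:
        labels[label] += 1
        counts[label].update(text.lower().split())
    return counts, labels

def predict(counts, labels, text):
    """Pick the label with the highest log-probability for the given text."""
    words = text.lower().split()
    vocab = {w for c in counts.values() for w in c}
    total = sum(labels.values())
    best, best_score = None, float("-inf")
    for label in labels:
        score = math.log(labels[label] / total)  # class prior
        n = sum(counts[label].values())
        for w in words:
            # Laplace smoothing so unseen words don't zero out the score
            score += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

examples = [
    ("new compiler optimizations in rust", "like"),
    ("deep dive into database internals", "like"),
    ("celebrity gossip roundup", "dislike"),
    ("royal family gossip special", "dislike"),
]
counts, labels = train(examples)
print(predict(counts, labels, "rust database internals"))  # → like
```

The model "learns from" the articles in the sense at issue: it keeps word statistics, not copies of the text, which is the analogy the comment is drawing.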
They should be subject to laws the same way humans are. If they substantially reproduce code they had access to then it's a copyright violation. Just like it would be for a human doing the same. But highly derived code is not "stolen" code, neither for AI nor for humans.
Me teaching my brain someone’s way of syntactically expressing procedures is analogous to AI developers teaching their model that same mode of expression.
It's not your reading that would be illegal, but your copying. This is a well-documented area of the law, and there are concrete answers to your questions.
To me, the argument is: an LLM learning from GPL code == creating a derivative of the GPL code, just "compressed" within the LLM. The LLM then goes on to create more derivatives, or is itself distributed (with the embedded GPL code).
Yes, I provide it as a service to my employer. It's called a job. Guess what? When I read code I learn from it and my brain doesn't care what license that code is under.
If the product is the result of compiling all the open source code out in the wild into an LLM, it can be argued that the derived product, the LLM itself, must follow the licensing requirements of the source code used.
The AI companies don't care much about this. When the time comes, they will open their models or stop using sources that don't meet the appropriate licensing. Their current concern is learning how to build the best models and winning the race to become the dominant AI provider - who cares if they need to use polluted sources to reach their goal. They will fix it later.