halfeatenscone | 2 years ago
I actually wrote a Wikipedia article on the intersection of copyright law and deep learning models the other day (https://en.wikipedia.org/wiki/Artificial_intelligence_and_co...). I was hoping to include a section on the copyrightability of model weights, but sadly I couldn't find any coverage in reliable sources.
kmeisthax|2 years ago
Let's start with a non-AI example: compilers. If I have the source code for Linux, I can compile it, but I don't own the kernel binaries I made. This is because the compiler is a purely mechanical process, not a tool of human creativity. Copyright in the US attaches to creativity from human authors. So the source code would be the creative work, not the binaries.
We don't normally talk about this because ownership over the source code still flows through to the binaries. Your permission to copy that Linux binary is downstream of Linus having granted you permission to do so under the GPL. If you had instead copied, say, the NT kernel, you would be infringing the copyright on the NT kernel source code by distributing binaries of it.
So now let's go to AI land. You've collected a bunch of training data and dumped it into a linear algebra blender. That's like compiling source code: the ML trainer program adds no creativity or authorship, so you haven't gained any ownership over the data. Remember: this training data is scraped off the Internet from other people's work. Fair use merely makes it non-infringing to do this; it does not mean you own the result.
There are two avenues by which Meta could still get US copyright over the language model:
- They could train a model on data they created themselves, and use their ownership of that training data to get ownership of the model.
- They could assert ownership over the compilation of training data.
Compilation ownership is kind of weird. Basically, in the US, you can make a compilation of other people's work and own just the compilation itself, not the works in it. Like, say, a "Top 10 Songs I Like" playlist[0]. But even then the creativity and authorship rules still apply. These models are not being trained by having humans manually select specific works that would do well in the model. They scrape the Internet and train on everything[1]. In fact, they usually don't even use their own scrapes; they use Common Crawl, LAION-5B, and/or The Pile.
Finding out whether any of this is right would require someone to actually share LLaMA, get sued by Facebook, and then assert this legal theory, all while hoping that Facebook does not assert any other legal claims, such as misappropriation of trade secrets, which might actually stick.
[0] Or in a particularly egregious example, someone copyrighting their Magic: The Gathering deck in protest of this nonsense.
[1] Stable Diffusion at least uses an "aesthetics score", but AFAIK that's generated by an AI so also not copyrightable.
shaky-carrousel|2 years ago
Because if you do that, then all I have to do to pirate a book is train a model on that book and sell the trained model as mine, which does not make sense.
williamcotton|2 years ago
shaky-carrousel|2 years ago
I guess what I want to say is that in this matter of AI, you can't have your cake and eat it too. If you want to have copyright over your weights, be prepared to also pay for the rights to the content your weights were based on.
And I think nobody in the AI world wants to go down that road.
williamcotton|2 years ago
nl|2 years ago