top | item 35459287


I don't think this follows. Calling it a derivative work already feels like a stretch, but even granting that framing, the use is clearly transformative and therefore likely to be considered fair use in the US.

I actually wrote a Wikipedia article on the intersection of copyright law and deep learning models the other day (https://en.wikipedia.org/wiki/Artificial_intelligence_and_co...). I was hoping to include a section on the copyrightability of model weights, but sadly I could not find any coverage in reliable sources.


kmeisthax|2 years ago

So, while the poster you replied to has the wrong reasoning, they got to the right conclusion. Mostly.

Let's start with a non-AI example: compilers. If I have the source code for Linux, I can compile it, but I don't own the kernel binaries I made. This is because the compiler is a purely mechanical process, not a tool of human creativity. Copyright in the US attaches to creativity from human authors. So the source code would be the creative work, not the binaries.

We don't normally talk about this because ownership over the source code still flows through to the binaries. Your permission to copy that Linux binary is downstream of Linus having granted you permission to do so under the GPL. If you had instead copied, say, the NT kernel, you would be infringing the copyright on the NT kernel source code by distributing binaries of it.

So now let's go to AI land. You've collected a bunch of training data and dumped it into a linear algebra blender. That's like compiling source code: the ML trainer program adds no creativity or authorship, so you haven't gained any ownership over the data. Remember: this training data is scraped off the Internet from other people's work. Fair use merely makes it non-infringing to do this; it does not mean you own the result.

There are two avenues by which Meta could still get US copyright over the language model:

- They could make a model with their own training data that they made, and use their ownership over the training data to get ownership over the model.

- They could assert ownership over the compilation of training data.

Compilation ownership is kind of weird. Basically, in the US, you can make a compilation of other people's work and own solely that. Like, say, a "Top 10 Songs I Like" playlist[0]. But even then the creativity and authorship rules still apply. These models are not being trained by having humans manually select specific works that would do well in the model. They scrape the Internet and train on everything[1]. In fact, they usually don't even use their own scrapes; they use Common Crawl, LAION-5B, and/or The Pile.

Finding out whether any of this is right would require someone to actually share LLaMA, get sued by Facebook, and then assert this legal theory, all while hoping that Facebook does not assert any other legal claims, such as misappropriation of trade secrets, which might actually stick.

[0] Or in a particularly egregious example, someone copyrighting their Magic: The Gathering deck in protest of this nonsense.

[1] Stable Diffusion at least uses an "aesthetics score", but AFAIK that's generated by an AI so also not copyrightable.

shaky-carrousel|2 years ago

You explain it better than I do. My idea is that, when it comes to AI, you can go with either "this thing is simply an array of numbers, I'm not infringing anyone's copyright by creating this model" or "this creation is mine, I made it, you cannot use it". You can't have both, because you basically cannot create a program that can spit out copyrighted work and then claim that the thing is yours. That is not going to fly.

Because if you do that, then all I have to do to pirate a book is train a model on that book and sell the trained model as mine, which does not make sense.

williamcotton|2 years ago

These weights are clearly transformative; they are fair use and are covered by copyright.

shaky-carrousel|2 years ago

As another user said, the process is mechanical, so I'm not sure it can be thought of as a derivative work.

I guess what I want to say is that in this matter of AI, you can't have your cake and eat it too. If you want to have copyright over your weights, be prepared to also pay for the rights to the content your weights were based on.

And I think nobody in the AI world wants to go down that avenue.

williamcotton|2 years ago

That’s not true. There are plenty of copyrighted collections, such as lists, where the individual items in the list are not copyrighted.

nl|2 years ago

This should include https://arxiv.org/abs/2303.15715