top | item 39077899

(no title)

enord | 2 years ago

Well then I could have been much clearer because I meant something like the latter.

An ML model can neither have nor be in breach of copyright so any discussion about how it works, and how that relates to how people work or “learn” is besides the point.

What actually matters is firstly details about collation of source material, and later the particular legal details surrounding attribution. The last part involves breaking new ground legally speaking and IANAL so I will reserve judgement. The first part, collation of source material for training is emphatically not unexplored legal or moral territory. People are acting like none of the established processes apply in the case of LLMs and handwave about “learning” to defend it.

discuss

Ukv|2 years ago

> and how that relates to how people work or “learn” is besides the point

It is important (for the training and generation stages) to distinguish between whether the model copies the original works or merely infers information from them - as copyright does not protect against the latter.

> The first part, collation of source material for training is emphatically not unexplored legal or moral territory.

Similar to as in Authors Guild v. Google, Inc. where Google internally made entire copies of millions of in-copyright books:

> > While Google makes an unauthorized digital copy of the entire book, it does not reveal that digital copy to the public. The copy is made to enable the search functions to reveal limited, important information about the books. With respect to the search function, Google satisfies the third factor test

Or in the ongoing Thomson Reuters v. Ross Intelligence case where the latter used the former's legal headnotes for training a language model:

> > verbatim intermediate copying has consistently been upheld as fair use if the copy is "not reveal[ed] to the public."

That it's an internal transient copy is not inherently a free pass, but it is something the courts take into consideration, as mentioned more explicitly in Sega v. Accolade:

> > Accolade, a commercial competitor of Sega, engaged in wholesale copying of Sega's copyrighted code as a preliminary step in the development of a competing product [yet] where the ultimate (as opposed to direct) use is as limited as it was here, the factor is of very little weight

And, given training a machine learning model is a considerably different purpose than what the images were originally intended for, it's likely to be considered transformative; as in Campbell v. Acuff-Rose Music:

> > The more transformative the new work, the less will be the significance of other factors

enord|2 years ago

Listen, most website and book-authors want to be indexed by google. It brings potential audience their way, so most don’t make use of their _right_ to be de-listed. For these models, there is no plausible benefit to the original creators, and so one has to argue they have _no_ such right to be “de-listed” in order to get any training data currently under copyright.