JackC | 1 year ago

Here's the full decision, which (like most decisions!) is largely written to be legible to non-lawyers: https://storage.courtlistener.com/recap/gov.uscourts.ded.721...

The core story seems to be: Westlaw writes and owns headnotes that help lawyers find legal cases about a particular topic. Ross paid people to translate those headnotes into new text, trained an AI on the translations, and used those to make a model that helps lawyers find legal cases about a particular topic. In that specific instance the court says this plan isn't fair use. If it were fair use, one could presumably just pay people to translate headnotes directly and make a Westlaw competitor, since translating headnotes is cheaper than writing new ones. And conversely, if it isn't fair use, where's the harm? The court notes, for example, that no copyright violation was necessary for interoperability, and one can still pay people to write fresh headnotes from caselaw and create the same training set.

The court emphasizes "Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today." But I'm not sure "generative" is that meaningful a distinction here.

You can definitely see how AI companies will be hustling to distinguish this from "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents." It's not quite the same, the connection is less direct, but it's not totally different.

anon373839|1 year ago

This is an interesting opinion, but there are aspects of it that I doubt will stand the test of time.

One aspect is the court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim, because the editorial decision to quote the material itself shows a “creative spark”. It really isn’t workable, in law specifically, for copyright to attach to the mere selection of a quote from a case to represent that case’s holding on an issue. After all, we would expect many lawyers analyzing the case independently to converge on the same quotes!

The key fact underlying all of this, I think, is that when Ross paid human annotators to write their own versions of the headnotes, they really did crib from West’s wholesale rather than doing their own independent analysis. Source text was paraphrased using curiously similar language to West’s paraphrasing. That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.

The case has very little to say about the more commonly posed question of whether copyright is infringed in large-scale language modeling.

AnthonyMouse|1 year ago

> That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.

The "competing product" thing is probably the most extreme part of this opinion.

The most important fair use factor is whether the use competes with the original work, but this is generally understood to mean direct competition, i.e. if you translate someone else's book from English to French and want to sell the translation, the translation is going to be in direct competition for sales to people who speak both English and French. The customer is going to use the copy claiming fair use as a direct substitute for the original work, instead of buying it.

This court is trying to extend that to anything downstream from it, which seems crazy. For example, "multiple copies for classroom use" is one of the explicit examples of fair use from the copyright statute, but schools are obviously teaching people intending to go into competition with the original author, and in general the idea that you can't read something if you ever intend to write something to sell in competition with it seems absurd and in contradiction to the common practices in reverse engineering.

But this is also a district court opinion that isn't even binding on other courts, so we'll see what happens if it gets appealed.

pigbearpig|1 year ago

"court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim"

That is the opposite of the ruling. The judge said the ones that summarize and pick out the important parts are copyrightable, and specifically excluded the headnotes that quote the court opinion verbatim.

The judge:

"But I am still not granting summary judgment on any headnotes that are verbatim copies of the case opinion (for reasons that I explain below)"

bee_rider|1 year ago

> It really isn’t workable — in law specifically - for copyright to attach to the mere selection of a quote from a case to represent that case’s holding on an issue. After all, we would expect many lawyers analyzing the case independently to converge on the same quotes!

I guess it depends on how long the source is, and how long the collection of quotes is, if we’d expect multiple lawyers to converge on the same solution. I don’t think it is totally obvious, though…

I’m also not sure if that’s a generally good test. It seems great for, like, painting. But I wouldn’t be surprised if we could come up with a photography scene where most professionals would converge on the same shot…

zozbot234|1 year ago

If close paraphrase can be detected, this ought to be proof enough that some non-trivial element of creativity was involved in the original text. Purely functional and necessary elements are not protected by copyright, even when they would otherwise be creative (this is technically known as the 'scènes à faire' doctrine), and surely a "quote" which is unavoidable because it factually and unquestionably is the core of the ruling would have to fall under that.

fncypants|1 year ago

I think this is the best takeaway. This case and its outcome are restricted to its facts. Most of the LLM activity today is very different from what happened here.

singleshot_|1 year ago

My experience using Westlaw Keycites at work is that they’re not primarily created by fishing a quote out of a holding, but instead by synthesizing a rule. If I want a summary, I read the Keycite; if I want a money quote, I root around in the case linked to the Keycite.

Have you seen different? I’m curious what area of law you practice and in what state, for comparison’s sake.

6stringmerc|1 year ago

The crux is Fair Use and until lobbyists change the four factor test, AI training has an uphill battle in court. It’s a very disliked observation in this forum, but I stand by my principles on this one because the courts see it my way. Derivative works, especially by artificial means, simply fail the test miserably and that’s the truth.

greggyb|1 year ago

Collections of essays or poems are considered copyrightable. This seems analogous enough to me.

fsckboy|1 year ago

>the court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim, because the editorial decision to quote the material itself shows a “creative spark” ... when Ross paid human annotators to write their own versions of the headnotes, they really did crib from West’s wholesale rather than doing their own independent analysis

... so it follows that it was then Ross's annotators showing the creative spark

reissbaker|1 year ago

I'll quote a longer portion of the opinion about generative AI, because I think it makes the opposite of your point:

Ross’s use is not transformative. Transformativeness is about the purpose of the use. “If an original work and a secondary use share the same or highly similar purposes, and the second use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.” Warhol, 598 U.S. at 532–33. It weighs against fair use here. Ross’s use is not transformative because it does not have a “further purpose or different character” from Thomson Reuters’s. Id. at 529.

Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw. It is undisputed that Ross’s AI is not generative AI (AI that writes new content itself). Rather, when a user enters a legal question, Ross spits back relevant judicial opinions that have already been written. D.I. 723 at 5. That process resembles how Westlaw uses headnotes and key numbers to return a list of cases with fitting headnotes.

I think it's quite relevant that this was not generative AI: the reason that mattered is that "transformative" use biases towards Fair Use exemptions from copyright. However, this wasn't creating new content or giving people a new way to understand the data: it was just used in a search engine, much like Westlaw provided a legal search engine. The judge is pointing out that the exact implementation details of a search engine don't grant Fair Use.

This doesn't make a ruling about generative AI, but I think it's a pretty meaningful distinction: writing new content seems much more "transformative" (in a literal sense: the old content is being used to create new content) than simply writing a similar search engine, albeit one with a better search algorithm.

BoorishBears|1 year ago

I came here to point this out, and it's especially clear if you contextualize this with the original decision from September: https://www.ded.uscourts.gov/sites/ded/files/opinions/20-613...

They were doing semantic search using embeddings/rerankers.

Reading both decisions together reinforces the point: if they had trained a model on the Bulk Memos and generated novel text instead of doing direct searches, there likely would have been enough indirection to prevent summary judgment, and this would have gone to a jury, as the September decision states.

In other words, from their comment:

> But I'm not sure "generative" is that meaningful a distinction here.

The judge would not seem to agree at all.

qingcharles|1 year ago

Westlaw's headnotes are primarily just snippets of the case with tags attached. They are really crappy. I hate them. Some lawyers love them.

Westlaw protects them because they are the "value add." Otherwise their business model is "take published decisions the court is legally bound to provide for free and sell them to you."

An LLM today could easily recreate the headnotes in a far superior manner from scratch with the right prompt. I don't even think hallucinations would factor in on such a small task that was well regulated, but you can always just asterisk the headnotes and put a disclaimer on them.

Tteriffic|1 year ago

Exactly. Why use the headnotes at all?

I always thought they were obviously copyrightable. Plus they’re not close to perfect either.

AlexCoventry|1 year ago

> You can definitely see how AI companies will be hustling to distinguish this from "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents." It's not quite the same, the connection is less direct, but it's not totally different.

Surely creating a general-purpose AI is transformative, though? Are you anticipating that AI companies will be sued for contributory infringement, because customers are using a general-purpose AI to compete with companies which created parts of the training data?

llamaimperative|1 year ago

IMO yes. The entire purpose of copyright law is to protect the incentive to create new material. A huge portion of the value prop of AI is that it captures the incentive normally bound for the creators of the training material (i.e. the whole point is you can ask the AI and not even see, never mind pay, the originator).

Ajedi32|1 year ago

Interestingly, almost the entirety of the judge's opinion seems to be focused on the question of whether the translated notes are subject to copyright. It seems to completely ignore the question of whether training an AI on copyrighted material constitutes making a copy of that work in the first place. Am I missing something?

The judge does note that no copyrighted material was distributed to users, because the AI doesn't output that information:

> There is no factual dispute: Ross’s output to an end user does not include a West headnote. What matters is not “the amount and substantiality of the portion used in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public for which it may serve as a competing substitute.” Authors Guild, 804 F.3d at 222 (internal quotation marks omitted). Because Ross did not make West headnotes available to the public, Ross benefits from factor three.

But he only does so as part of an analysis of whether there's a valid fair use defense for Ross's copying of the headnotes, ignoring the obvious (to me) point: if no copyrighted material was distributed to end users, how can this even be a violation of copyright in the first place?

unyttigfjelltol|1 year ago

Ross evidently copied and used the text itself. It's like Ross creating an unauthorized volume of West's books, perhaps with a twist.

Obscurity ≠ legal compliance.

brookst|1 year ago

How would training on copyrighted material be infringement in a way that merely producing the training material (but not iterating through training) would not be?

kevin_thibedeau|1 year ago

There were data brokers who literally paid people to transcribe phone books before OCR was a viable option. That was protected, as data isn't copyrightable. It isn't hard to argue that case law metadata is no different even though it includes textual descriptions (themselves taken from public documents).

musicale|1 year ago

> "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents"

This is a good distillation. A bit like "we trained our system on various works of art and music, and now it is being sold as a service that competes with the original artists and musicians."

bsder|1 year ago

AI has yet to demonstrate that it can do anything different from what a group of people could sit down and do. Sure, the AI may be able to do it faster, but there hasn't yet been anything demonstrated that exceeds what humans can do.

If it would be illegal for a group of people to do something, it is also going to be illegal for an AI to do so.

Why is that so surprising?

echelon|1 year ago

If the copyright holders win, the model giants will just license.

This effectively kills open source, which can't afford to license and won't be able to sublicense training data.

This is very bad for democratized access to and development of AI.

The giants will probably want this. The giants were already purchasing legacy media content enterprises (Amazon and MGM, etc.), so this will probably further consolidation and create extreme barriers to entry.

If I were OpenAI, I'd probably be very happy right now. If I were a recent batch YC AI company, I'd be mortified.

dkjaudyeqooe|1 year ago

License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

To the contrary, this just means companies can't make money from these models.

Those using models for research and personal use wouldn't be infringing under the fair use tests.

mvdtnz|1 year ago

Open source model builders are no more entitled to rip off content owners than anyone else. I couldn't possibly care any less if this impacts "democratized access" to bullshit generators. At least if the big boys license the content then the rightful owners get paid (and have the option to opt out).

vkou|1 year ago

I don't have either a data center, or every single copyrighted work in history to import as training data to train my open source model.

Whether or not OpenAI is found to be breaking the law will be utterly irrelevant to actual open AI efforts.

JoshTriplett|1 year ago

> If the copyright holders win, the model giants will just license.

No, they won't. The biggest models want to train on literally every piece of human-written text ever written. You can pay to license small subsets of that at a time. You can't pay to license all of it. And some of it won't be available to license at all, at any price.

If the copyright holders win, model trainers will have to pay attention to what they train on, rather than blithely ignoring licenses.

pabs3|1 year ago

Open source models can crowdsource open source training data. This was done for RNNoise for example.

alberto-m|1 year ago

> Here's the full decision, which (like most decisions!) is largely written to be legible to non-lawyers

For me (Italian) this is amazing! Most Italian judges and lawyers write in a purposely obscure fashion, as if they wanted to keep the plebs away from their holy secrets. This document instead begs to be read; some parts are more in the style of a novel than of a technical document.

dkjaudyeqooe|1 year ago

> The court emphasizes "Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today." But I'm not sure "generative" is that meaningful a distinction here.

Also, when the judge makes that statement, it looks like he misunderstands the nature of the AI system and the inherent generative elements it includes.

echoangle|1 year ago

How is the system inherently generative?

Tanjreeve|1 year ago

Does it really matter what the judge calls it when the ruling is about its end effects and outcomes?