
AI Data Laundering

304 points | marceloabsousa | 3 years ago | waxy.org

113 comments


moyix|3 years ago

The Authors Guild v Google decision about Google Books seems relevant:

> In late 2013, after the class action status was challenged, the District Court granted summary judgement in favor of Google, dismissing the lawsuit and affirming the Google Books project met all legal requirements for fair use. The Second Circuit Court of Appeal upheld the District Court's summary judgement in October 2015, ruling Google's "project provides a public service without violating intellectual property law." The U.S. Supreme Court subsequently denied a petition to hear the case.

[...]

> The court's summary of its opinion is:

[...]

> Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use.

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

This doesn't touch on the ethics of course – at minimum I think allowing people to exclude themselves or their work from a dataset is necessary.

VanTheBrand|3 years ago

I would argue (as the court did) that Google's use is transformative because the end result, "book search," is in a different marketplace from "books." The end result / output of these generative AI systems trained on stock media and art is... "stock media and art."

That's kind of what this whole article is about. Merely training the systems for research is arguably fair use, but creating the entire commercial pipeline might not be, and the "loophole" here is claiming no responsibility for the training at the center of it because that step was technically done by a 3rd party (...funded by the final creator of the full pipeline).

russellbeattie|3 years ago

An important part of the opinion (on the wiki page you linked to) is completely missing in the case of AI datasets:

> It generates new audiences and creates new sources of income for authors and publishers.

This is definitely not the case for artists and photographers, who don't benefit at all from the transformative nature of the AI output, and in fact are significantly harmed since it dilutes the uniqueness of their work by allowing anyone to imitate their style. Though to my knowledge "style" isn't protected by copyright - only trademark - I can't imagine there won't be lawsuits about this in the future.

That one artist who complained that people can't find his original work online now because of so many imitated pics is definitely exhibit A in terms of direct harm.

9wzYQbTYsAIc|3 years ago

> the revelations do not provide a significant market substitute for the protected aspects of the originals

It does seem like generative AI systems do provide a significant market substitute, so this ruling probably wouldn't apply in court.

edit: see https://news.ycombinator.com/item?id=33194623 for some initial thoughts on how this problem (and others) could be rectified.

For example, with a database of protected works and self-censorship algorithms for generative AI systems, conscientiously objecting creatives could have a mechanism for excluding their works.
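A minimal sketch of how such an exclusion mechanism could work at dataset-assembly time (the registry, names, and hashing scheme are all hypothetical, not any existing system):

```python
import hashlib

# Hypothetical opt-out registry: content hashes of works whose creators
# have asked to be excluded from training datasets.
OPT_OUT_REGISTRY = {
    hashlib.sha256(b"protected artwork bytes").hexdigest(),
}

def content_hash(data: bytes) -> str:
    """Fingerprint a work by its raw bytes (exact-match only; a real
    system would need perceptual hashing to survive crops or re-encodes)."""
    return hashlib.sha256(data).hexdigest()

def filter_dataset(items: list[bytes]) -> list[bytes]:
    """Drop any item whose fingerprint appears in the opt-out registry."""
    return [d for d in items if content_hash(d) not in OPT_OUT_REGISTRY]

dataset = [b"protected artwork bytes", b"freely licensed image bytes"]
print(len(filter_dataset(dataset)))  # prints 1: the opted-out work is dropped
```

Exact-byte hashing is the trivial case; any serious version of this would need perceptual hashing and metadata matching, since re-encoding a file changes its hash.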

dangerface|3 years ago

> Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals.

So is digitizing a copyrighted VHS and hosting it via torrents also fair use? It's transformative, the public display of the video is limited, and there is no market for VHS.

I don't get it. What's the difference, other than Google having deeper pockets than me?

authpor|3 years ago

> I think allowing people to exclude themselves or their work from a dataset is necessary.

Or they could open it all up for everybody and stop protecting the rights of dead people (authors who died less than 70 years ago).

Then again, that would make the publishers starve... but why pretend publishing corporations need food?

echelon|3 years ago

Do we allow artists to withhold their works from the minds of eager, learning children? [1]

Tell me how ML is different than the mind of a toddler ravenous for new information.

For every billion dollar start-up using data at scale, there are tens of thousands more researchers and hobbyists doing the exact same, producing wonderful results and advances.

If we stop this growth dead in its tracks, other countries more willing to look past IP laws will jump ahead. And if Stability locks away their secret sauce, some new party will come along and give away the keys to the kingdom yet again.

You can't block the signal. Except, of course, by legislating against it in some Luddite hope we can prevent the future from happening.

Instead of worrying that careers will end, we should look at this as the end of specialization. No longer do we need to spend 20,000 hours learning one thing to the exclusion of all the others we would like to try. Now we'll be able to clearly articulate ourselves with art, music, poetry. We'll become powerful beings of thought and expression.

Humans aren't the end or the peak of evolution. We should be excited to watch this unfold.

[1] Maybe Disney would like you to pay more for a premium learning plan for your child, but thankfully that's not (yet) possible.

dkural|3 years ago

This reminds me of the Jedi Mind trick of Uber of waving a smartphone to argue that labor & other laws all of a sudden don't apply to them, to the detriment of the public that'll now shoulder the costs.

nojvek|3 years ago

Big Tech has really big datasets, especially Google. With YouTube, Photos, Music, Gmail, Docs, Maps, Books, Waymo, Search… they have giant multimodal datasets that capture the essence of all human knowledge. They have 10+ products with more than a billion users each, creating data for them.

If Google Brain/DeepMind were to crack AGI, it would make Google/Alphabet crazy rich, to the detriment of millions of YouTubers, book authors, musicians, and drivers.

AI will concentrate power and wealth to fewer individuals.

mirker|3 years ago

Ads companies getting rich off of AGI seems a bit sensational when they're already getting rich off of the boring type of AI. They already got rich years ago indexing the web and all the data we have.

noduerme|3 years ago

I've got a couple of examples of Stable Diffusion replicating watermarks, along with similar swatches of imagery, into scenes from the same prompt [1]. A single case of this should be enough to file a massive lawsuit if the art were recognizable to its creator.

[1] https://news.ycombinator.com/item?id=33061707

danielbln|3 years ago

The model learns all attributes of the images it's trained on, including that some have a watermark. The fact that it generates a watermark in some images doesn't mean the output is a 1:1 copy of a training image; it just means that, to the model, some images seem to have a watermark, so it will add one sometimes. Often you can just add "no watermark" (or add it as a negative prompt with some weight) and re-use the same seed to get the same image without the watermark.

killjoywashere|3 years ago

> It’s currently unclear if training deep learning models on copyrighted material is a form of infringement

What? It's clearly a derived work.

wodenokoto|3 years ago

I'm pretty sure I can count the number of words in Harry Potter without breaking copyright law.

It is absolutely not clear when a statistical model stops counting n-grams and starts making a derived work.
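A toy illustration of that spectrum (the text here is a made-up stand-in, not the actual book): counting is plainly just statistics, yet the very same counts define a generative model that can regurgitate its source:

```python
from collections import Counter

text = "the boy who lived the boy who waited".split()

# Pure statistics: counting words and bigrams is clearly not a derived work.
word_counts = Counter(text)
bigram_counts = Counter(zip(text, text[1:]))
print(word_counts["the"])              # 2
print(bigram_counts[("boy", "who")])   # 2

# But the same counts define a generative model: greedily following the
# most frequent continuation starts to reproduce the source verbatim.
def generate(start: str, length: int) -> list[str]:
    out = [start]
    for _ in range(length - 1):
        candidates = {b: c for (a, b), c in bigram_counts.items() if a == out[-1]}
        if not candidates:
            break
        out.append(max(candidates, key=candidates.get))
    return out

print(" ".join(generate("the", 4)))  # "the boy who lived" — the source re-emerges
```

The counting step and the generation step use exactly the same data structure; the legal line, wherever it is, runs somewhere between the two calls.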

alar44|3 years ago

Alright, then every piece of music you've ever heard is also a derived work. Unless the composer grew up in a void.

gfd|3 years ago

Was this term coined on HN? I remember first seeing it (used in an AI context) from this 2019 comment under "Cool stuff that's still completely unregulated": https://news.ycombinator.com/item?id=21167689

Most of the predictions in that first comment came true.

lachlan_gray|3 years ago

William Gibson mentions data laundering as an illicit activity in the Neuromancer books! It’s plausible that the phrase itself was coined there

learndeeply|3 years ago

> But then Meta is using those academic non-commercial datasets to train a model, presumably for future commercial use in their products. Weird, right?

This is a very strong and likely inaccurate presumption.

nerdponx|3 years ago

Is it? Maybe they have their own internal version they are using, but who's to say they aren't fine tuning the model and applying it somewhere?

RosanaAnaDana|3 years ago

The horse seems well out of the gate.

VanTheBrand|3 years ago

The horse is out of the gate on photocopiers, and before them printing presses, but that doesn't make it legal to use them on works you don't have the rights to copy.

Havoc|3 years ago

The whole thing is a mess, but frankly I doubt this genie can be put back in the bottle.

9wzYQbTYsAIc|3 years ago

I think it is an a priori fact that the cat is out of the bag.

The existing publicly available datasets, algorithms, and weighted models certainly should be expected to be permanently in the hands of some non-law-abiding parties at this point.

I think that it will be important to ensure that we have symmetric information, going forward, otherwise trying to put the genie back in the bottle may just end up further disadvantaging those that try to follow the rules.

ROTMetro|3 years ago

...said the music industry about samplers in the 1990s.

bo1024|3 years ago

The Flickr example is wild. How was nobody sued for that!?

krab|3 years ago

Are we heading towards voiding most of current copyrights or is there a way out of this mess with another patch to the laws?

theGnuMe|3 years ago

It’s definitely fair use. One question I have, though: is Mickey Mouse protected by copyright, by trademark, or by something else? I assume someone other than Disney can’t sell Mickey’s likeness, or is that wrong for art? And what if the AI makes a movie?

patcon|3 years ago

Not sure laundering is the right term.

Laundering private things through the commons feels not as shady as laundering in private networks. The commons benefits too.

It's more like open source than money laundering.