
Ask HN: DALL-E was trained on watermarked stock images?

266 points | whycombinetor | 3 years ago | reply

I just got a Dall-E render with a very intact "gettyimages" watermark on it. I'm no legal expert on whether you have to own the license to something to use it as training input to your AI model, but surely you can't just... use stock photos without paying for the license? Maybe I'm just old fashioned.

Prompt: "king of belgium giving a speech to an audience, but the audience members are cucumbers"

All 4 results (all no good as far as the prompt is concerned): https://ibb.co/gz5RDkB

Fullsize of the one with the watermark https://ibb.co/DzGR063

227 comments

[+] dlg|3 years ago|reply
I am not a lawyer, but I've had to argue about copyright with several.

In the United States, there are two bits of case law that are widely cited and relevant: Kelly v. Arriba Soft Corp. (9th Cir.) found that making thumbnails of images for use in a search engine was sufficiently "transformative" that it was ok. Another case, Perfect 10 (9th Cir.), found that thumbnails for image search and cached pages were also transformative.

OTOH, cases like Infinity Broad. Corp. v. Kirkwood found that retransmission of radio broadcasts over telephone lines is not transformative.

If I understand correctly, the US courts' fair use test has four factors, with transformativeness falling under the first: (1) the purpose and character of the use, (2) the nature of the copyrighted work, (3) the amount and substantiality of the copying, and (4) the harm to the market.

I'd think that training a neural network on artwork--including copyrighted stock photos--is almost certainly transformative. However, as you show, a neural network might be overtrained on a specific image and reproduce it too perfectly--that image probably wouldn't fall under fair use.

There are also questions of whether they violated the CFAA or some agreement when crawling the images (though hiQ v. LinkedIn makes it seem like it's very possible to do legally), and whether they reproduced Getty's logo in a way that violates trademark law (are they using it in trade in a way that could cause confusion, though?).

[+] chrismorgan|3 years ago|reply
All large-scale public machine learning stuff depends on being exempt from copyright restrictions under the fair use doctrine. Look at my responses in all of the threads about Copilot + GPL for more info about that application of it: https://hn.algolia.com/?query=chrismorgan+copilot+gpl&type=c....

When that is finally tried in court, if it fails to any meaningful extent at all (including going all the way up to Supreme Courts as it doubtless will), then Copilot is dead, DALL·E is dead, GPT-3 is dead, all of these things will be immediately discontinued in at least the affected jurisdictions, at least until such a time as they get the laws changed or judgements overturned.

[+] webwielder2|3 years ago|reply
These are the absolute worst DALL-E images I've seen. Do people generally just share the amazing ones while most of the output is actually complete shite? Like Instagram presenting the top 1% of people's lives.
[+] ehsankia|3 years ago|reply
Top 1% is a bit exaggerated, but there is definitely a lot of not-so-good output. I find that Dall-E does especially poorly with underspecified prompts, unlike something like Midjourney, which can produce visually pleasing images for even the most abstract concepts. Dall-E tends to do better with concrete and specific prompts.

Here's an example of an underspecified prompt: "Stressful Shapes"

Dall-E: https://i.imgur.com/JBkSh0y.png

Midjourney: https://i.imgur.com/C02Zq3i.png

On the other hand, here's a specific prompt: "nerdy yellow duck reading a magical book full of spells"

Dall-E: https://i.imgur.com/FMKZ8zc.png

Midjourney: https://i.imgur.com/lpsg6af.png

[+] yreg|3 years ago|reply
OP constructed a horrible prompt. First of all, using King Philippe is against the ToS, so let's go with a generic "king".

Let's not confuse the AI with "buts"; just say that he is giving the speech to cucumbers.

Lastly, specify some style, because this would probably not work out as a photo.

My single try is not bad at all and it could definitely be improved.

https://labs.openai.com/s/3OUmUxKefJCeLhAk4hkeKX4V

[+] NoMoreBro|3 years ago|reply
There is always some cherry-picking, but prompt engineering is an art in itself; you get better and better as you work at it. I just started this experiment https://www.instagram.com/unshushproject (or, without Instagram, https://unshush.com) and spent A LOT of hours and patience becoming good at it. Now I'm very proud of my results and I'm working on doing better.

It's a bit risky to invest too much time because every generator is different and they change the underlying model frequently (see yesterday's Midjourney beta), but if you do it out of passion or curiosity there is no problem.

Now I'm experimenting with a local installation of Stable Diffusion (well, not really "local" because I have an old computer) and the prompt is only one of the things you can tweak. There are num_inference_steps, guidance_scale and other parameters.
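For anyone who wants to try tweaking those knobs, here's a minimal sketch of what that looks like, assuming the Hugging Face diffusers library (the model id, prompt, and parameter values below are just illustrative examples, not what I actually use):

    # Minimal Stable Diffusion sketch using the Hugging Face diffusers library.
    # Model id and parameter values are illustrative, not recommendations.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",   # example checkpoint
        torch_dtype=torch.float16,
    ).to("cuda")                            # needs a GPU; CPU works but is much slower

    result = pipe(
        "nerdy yellow duck reading a magical book full of spells",
        num_inference_steps=50,   # more denoising steps: slower, usually cleaner
        guidance_scale=7.5,       # how strongly the image follows the prompt
    )
    result.images[0].save("duck.png")

Roughly, a higher guidance_scale makes the output follow the prompt more literally, while more inference steps trade time for (usually) cleaner images.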

[+] smlacy|3 years ago|reply
Prompt engineering can help a lot, but yes, you're basically right: people are generating many, many images and sharing only the best ones with the fewest artifacts.

For simple prompts with little additional guidance, all the diffusion image generators I've seen/used will produce output roughly like what the author linked most of the time. There are always a few gems, and homing in via prompt engineering helps immensely.

[+] grumbel|3 years ago|reply
Um, have you read the prompt? The weird look is simply the result of "the audience members are cucumbers". The crazier your prompt is, the worse the results will generally be.

On top of that, DALL-E2 generally has issues with anything involving multiple objects. A single person will render fine; groups of people will generally produce artifacts. Attributes also get spread across all objects in the scene, not just the ones you specified in your prompt, so doing anything more complex requires manual uncropping and inpainting, not just a single prompt.

Anyway, if you avoid the obvious weak spots and holes in the training set, DALL-E2 output is for the most part pretty amazing out of the box. It's really more a top 50% than a top 1%.

The biggest bias when it comes to published DALL-E2 images is the prompts. Most prompts you see online are not the actual prompts, but funny descriptions made by a human after the fact. The actual prompts are often much longer and sometimes completely different.

[+] pdntspa|3 years ago|reply
I've been reading some folks saying that "prompt engineering" is a legit future vocation in a world where AI has taken over a lot of creative work

And from my experience, getting high-quality output from AIs takes a bit of finesse. Not unlike crafting a good Google query.

so... yes

[+] grungegun|3 years ago|reply
For diversity, DALL-E 2 has a random chance of injecting "women" or "black" after a prompt. When this happens, at least for me, it generally destroys the quality of the images. Probably "king" was identified as a gendered word. You can find some discussion of this on the subreddit r/dalle2. Sometimes the images are just poor on their own, but in this case OpenAI is doing additional tampering.

A Twitter user figured out which words they were injecting by generating a lot of images with the starting prompt "A sign being held that says ".

[+] tkgally|3 years ago|reply
As I mentioned here a couple of weeks ago [1], I tested DALL-E with prompts for paintings and drawings in three standard genres: still life, landscape, and portrait. The prompts for portraits yielded a lot of grotesquely unacceptable faces, but almost all of the DALL-E output for the still lifes and landscapes was perfectly fine.

[1] https://news.ycombinator.com/item?id=32433821

[+] JimDabell|3 years ago|reply
They are the worst I’ve seen as well.

Yes, people tend to share the best of the best. However these results seem especially bad, like bottom 10% bad.

[+] gojomo|3 years ago|reply
Of course people are more likely to share the best images – or in this case, the one most illustrative of their concern (about watermarks).

Also: my sense is that getting the best results often requires a lot of extra coaching with style/detail words. As we can't see the prompt here, we don't know what sort of style/details were requested. GIGO.

[+] smileybarry|3 years ago|reply
It definitely requires some very detailed descriptions and sifting through the results to find a good one. Once I regenerated a prompt as well, because the existing 4 were just not that good. But I did get some great ones, at a pretty good usable:unusable ratio.
[+] BrainVirus|3 years ago|reply
People here, as always, get hung up on legalese bullshit, but miss the overall picture.

The dynamic in play is highly questionable. Countless artists and photographers put effort into creating their works. They put their work online to get some attention and recognition. A company comes along, scrapes all of it, and starts selling access to a model that generates output that looks highly derivative. The original cohort of artists and photographers not only gets zero money or attention from this new endeavor, they are now in competition with the resulting model.

In short, someone whose work was essential to building a thing gets no benefits and possibly even gets (financially) harmed by that thing. Just because this gets verbally labeled "fair use" doesn't make it fair.

Additional point:

Just a few years ago a bunch of tech companies were talking about "data dignity". Somehow, magically, this (marketing) term is no longer used anywhere.

[+] xg15|3 years ago|reply
Reminds me of the discussion about GitHub Copilot using the entirety of GitHub as training data. I was honestly baffled by how many people, even experts in the field, saw use as training data as non-infringing, with the corollary that it's apparently perfectly legal to "copyright-wash" a work by feeding it to an AI and having that AI generate a slightly different but extremely similar work.

Considering how strict and heavy-handed copyright enforcement has been otherwise, this has added to my belief that copyright in practice is really just enforcement of the interests of whatever industry has the most power at a given time: when entertainment and content generation was the biggest revenue generator, copyright couldn't be strict enough; now all the money is on AI and suddenly loopholes the size of barn doors pop up.

[+] rich_sasha|3 years ago|reply
Written laws are vague, practical verdicts are based on case law, cases are won by better-funded lawyers, rich industries prevail.

It's a bit of an exaggeration but maybe not too much.

[+] Karunamon|3 years ago|reply
"Copyright washing" seems a lot like clean room reverse engineering to me; this is usually done by having one person read the copyrighted code and describe what it does to another person, who then designs an implementation based on the description.

At least, I can't see a substantial difference in the result.

[+] wongarsu|3 years ago|reply
These loopholes are purely theoretical until tested in court. At some point a generative AI will hurt the wrong company, and they will either make a public spectacle out of it in court or, if they see no chance of winning, lobby Congress to introduce laws that make the case winnable.
[+] ShamelessC|3 years ago|reply
> but surely you can't just... use stock photos without paying for the license?

They aren't hosting the infringing content. Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).

The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.

[+] BeefWellington|3 years ago|reply
> Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).

"Probably" is doing a lot of heavy lifting in that sentence.

As for "_learned_", that's pretty debatable considering it's reproducing recognizable trademark infringement.

> The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.

I have no idea why anyone would assume the "move fast and break things" disruption mindset that pervades tech companies these days, especially in spaces like ML/"AI", would mean they considered the legality, ethics, or good business sense of their training dataset.

As with Copilot, I suspect the DALL-E terms of use puts the onus on the user to avoid using infringing items.

[+] kej|3 years ago|reply
If they had been paying for the images upfront, wouldn't you expect them to train the model on the non-watermarked versions?
[+] gricardo99|3 years ago|reply
If they paid for access, or permission, why train on the watermarked versions?

I’m guessing they assumed fair use and there will be lawsuits.

[+] chrismorgan|3 years ago|reply
I would be very surprised if OpenAI paid anything for these, because it would set a precedent that copyright applied to training data, which would be fatal down the road. (The only argument they could possibly mount in their defence would be that they wanted to train on the original images without watermarks.)
[+] whywhywhywhy|3 years ago|reply
What if my dataset is just the one Getty image I don't want to pay for?
[+] bhedgeoser|3 years ago|reply
What if I write a machine learning algorithm that only generates images it has seen in the training dataset, with one pixel slightly different?
[+] sulam|3 years ago|reply
I think it’s amusing that many commenters here are perfectly willing to defend DALL-E, but mention Copilot and the discussion looks radically different.
[+] cercatrova|3 years ago|reply
Based on the new scraping ruling with LinkedIn [0], anything that is "open gate" (as in, accessible without logging in) can be scraped and (I assume) be used by neural networks. The onus, it appears, is to not use it to generate copyrighted works, like Iron Man from Marvel, just as one can use Photoshop as a tool but is still barred from making and selling an Iron Man digital painting.

[0] https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/1...

[+] resoluteteeth|3 years ago|reply
> Based on the new scraping ruling with LinkedIn [0], anything that is "open gate" (as in, accessible without logging in) can be scraped and (I assume) be used by neural networks.

The ruling you are linking to is about whether scraping violates the Computer Fraud and Abuse Act.

This isn't really applicable here. First of all, that's a separate issue from copyright. Just because scraping publicly accessible data doesn't violate the CFAA doesn't mean that suddenly all images posted on the internet are public domain, or that you can use copyrighted images from websites for whatever you want, for example.

Furthermore, how copyright applies to training neural networks on copyrighted works is an open question right now.

[+] olliej|3 years ago|reply
I would assume that for cases like this it is more a matter of whether you can redistribute copyrighted work that has not had any of the usual "creative use" things applied, rather than whether the original scraping was protected.
[+] im3w1l|3 years ago|reply
I remember when people used to say IANAL. Innocent times, when we thought there was an objective law and lawyers knew it. But that's not how these things work. The truth is that no one knows. Ultimately a bunch of people will decide how they feel about it: well-read legal scholars trying really hard to be fair, but still just people. No one can predict with full certainty which way it will go.
[+] otoburb|3 years ago|reply
>>No one can predict with full certainty which way it will go.

Until somebody tries to float a trial balloon (case) in court.

[+] jcims|3 years ago|reply
Legally wouldn't it just boil down to the license on the watermarked image?

BTW you can add 'royalty free' to the prompt to get rid of those most of the time (lol?).

[+] trention|3 years ago|reply
My personal opinion is that it's unethical (and possibly illegal, in a subset of cases) to train models on data without the explicit consent of the creators of that data. And that really encompasses all data: generative models were not a thing when said data was created, and no matter how it was licensed before, explicit consent to use it for model training must be obtained from the creators themselves.

That being said, arguments about copyright are just a fig leaf as far as I am concerned. The outcome of whether this is allowed or not will depend on the net impact of using those models on the job market and whether society will be willing to tolerate it.

[+] gojomo|3 years ago|reply
You may want to use the native 'Share' option, especially on the one with the watermark.

You'll get a public link, at `labs.openai.com` rather than some random image-sharing site, which will show the image & the prompt used to generate it (including a credit to "your-first-name × DALL·E").

[+] RcouF1uZ4gsC|3 years ago|reply
What is interesting is a human analogy.

Say you were an artist who went to every art show and museum and studied all the art there.

If you produced a work of art solely from memory that contained large portions of other people's copyrighted art, would that still fall under copyright/require licensing?

[+] whycombinetor|3 years ago|reply
Precedent in music says sometimes yes. The "Blurred Lines" lawsuit found that Pharrell and Robin Thicke were liable to the tune of $7m for producing a work of art solely from memory that copied the "signature phrases, hooks, bass lines, keyboard chords, harmonic structures and vocal melodies" of a Marvin Gaye song. https://en.wikipedia.org/wiki/Pharrell_Williams_v._Bridgepor... https://www.npr.org/2015/03/11/392375390/-7-million-verdict-...
[+] alcolade|3 years ago|reply
Another human analogy could be: you take a photo from every art show and museum, and use those for reference as you paint.
[+] 8note|3 years ago|reply
Alternatively, if you memorized some GPL code, can you write a copy of it and put it under a proprietary licence?
[+] egypturnash|3 years ago|reply
There are definitely lines to be crossed. Let me tell you about one of them.

There is a comics creator named Keith Giffen. He's done a lot of solid work over the years for DC and Marvel; there's a playful love of the medium and its history that flows through a lot of his work. At first his style was pretty middling: nothing terrible, nothing to really stand out from the pack. Then one day his work changed dramatically - he got a lot more daring in spotting his blacks, inking with a heavier brush, and doing a lot of panels that were a closeup of a backlit head with rim lighting, eyes and teeth standing out in white. It was grounded in observation but had a lot of fresh ways to abstract a scene in the service of story. It was like nothing else on the racks and really striking.

It was also completely swiped from the work of an Argentinian artist named José Muñoz. Pick up one of Muñoz's shadow-drenched crime stories, put it next to one of Giffen's superhero tales, and you could clearly see the influence. And it wasn't just influence (influence is okay): Giffen had started entirely cloning Muñoz's style, completely dropping all his other influences in the process. Muñoz was not happy when he heard about this, and neither were other artists in the field of comics. Influence is one thing, everyone's influenced by other artists, and if you're familiar with an artist's influences you can tell. But dropping all your other influences to start drawing almost exactly like a new one? That's just not done.

Giffen got a lot of shit for this. He quit comics for a couple of years afterwards, and when he came back he had a new look. He still does the Shadowy Muñoz Face now and then, but it's more along the lines of one of the many things he's borrowed from his multiple influences rather than one of the ways he was wholesale ripping off Muñoz.

"Style theft" is completely legal in the eyes of the court. There was nothing legally actionable going on here. But in the court of his fellow artists, Giffen was judged, and found guilty.

There's a range here. Nobody's going to care if you pick up a collection of Winsor McCay's pioneering 1905 comic strip "Little Nemo" and do a dream-themed story that borrows his distinctive panel composition, lettering, and inking choices. Nobody's going to care if you do one drawing that precisely lifts Mike Mignola's heavy use of black and thin, clear lines. If you do superheroes long enough then you're pretty much obligated to do at least one story that emulates Jack Kirby as closely as you can. If you worked as someone's assistant for half a decade then you are very much allowed to bust out a perfect rendition of their style at any point in your entire life. But there is definitely a line you can cross, where every artist (and a lot of non-artists) who sees a side-by-side view of what you're doing and what you're swiping from will say "dude, not cool, stop swiping their style".

These image generators actively encourage adding the names of prominent, living artists to your prompts to get the results you want. Is this crossing the same line Keith Giffen did?

[+] donkarma|3 years ago|reply
Yeah, except this artist won't go around painting watermarks.
[+] _trampeltier|3 years ago|reply
If you read the licence from Getty, it says you are not allowed to use Getty pictures for ML.
[+] userbinator|3 years ago|reply
This interesting era of AI will surely teach us the meaning of that old phrase "great artists steal", or more subtly rephrased, "everything is a derived work".
[+] Geonode|3 years ago|reply
It doesn't matter. I could put a Getty watermark on anything. Getty would have to show that a generated image was at least in part the same as their image.
[+] surfacedetail|3 years ago|reply
I'm finding it amusing that everyone immediately assumes infringement; OpenAI is a company that will not be inviting lawsuits.

We can't know what licensing happens behind closed doors; my guess is that OpenAI has an agreement with Getty. Take a look at the licensing on this Observer piece: it's been licensed by Getty, which would indicate that Getty are happy with scraping.

https://www.theguardian.com/commentisfree/2022/aug/20/ai-art...

Besides, this is not infringement in principle; the AI has been trained to think that high-quality news images have watermarks.

[+] registeredcorn|3 years ago|reply
I don't care much for what the laws say. If the only way someone's service can work is by ingesting the work of someone else without compensation, and then competing with that same person, that is wrong.

If a company reverse engineers a competitor's product, they still buy the product to tear it apart and figure out how it works.

If a student learns from their teacher, then goes on to sell a similar kind of work as what their teacher makes, at least the student paid for the classes.

This arrangement offers none of that. As long as theft is illegal, this should be too. I'd call it parasitic, but it isn't; this is a parasite whose sole intent is to kill the host.