Some truly impressive results. I'll make my usual point here when a fancy new generative model comes out, and I'm sure some of the other commenters have alluded to this. The examples shown are likely from a set of well-defined (read: lots of data, high bias) input classes for the model. What would be really interesting is how the model generalizes to /object concepts/ that have yet to be seen, and which have abstract relationships to the examples it has seen. Another commenter here mentioned "red square on green square" working, but "large cube on small cube" not working. Humans are able to infer and understand such abstract concepts with very few examples, and this is something AI isn't as close to as it might seem.
Wow. This is amazing. Although I wish they documented how much compute and data was used to get these results.
I absolutely believe we'll crack the fundamental principles of intelligence in our lifetimes. We now have the capability to process all public data available on the internet (of which Wikipedia is a huge chunk). We have so many cameras and microphones (one in each pocket).
It's also scary to think of it going wrong (the great filter for the Fermi paradox). However, I'm optimistic.
The brain uses only 20 watts of power to do all its magic. The entire human body is built from 700MB of data in DNA. The fundamental principles of intelligence are within reach if we look at it from that perspective.
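The 700MB figure is easy to sanity-check with a rough back-of-envelope, assuming ~3 billion base pairs and 2 bits per base (raw, uncompressed information content):

```python
# Back-of-envelope for the DNA figure: the human genome has roughly
# 3 billion base pairs, and each base (A, C, G, or T) carries 2 bits.
BASE_PAIRS = 3_000_000_000
BITS_PER_BASE = 2

total_bytes = BASE_PAIRS * BITS_PER_BASE // 8
print(f"~{total_bytes / 1e6:.0f} MB uncompressed")  # ~750 MB
```

So the commenter's 700MB is in the right ballpark, before any compression.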
Right now GPT-3 and DALL-E seem to use an insane amount of computation to achieve what they do. My prediction is that by 2050, we'll have pretty good intelligence in our phones, with deep language and visual understanding of the world around us.
This is simultaneously amazing and depressing, like watching someone set off a hydrogen bomb for the first time and marveling at the mushroom cloud it creates.
I really find it hard to understand why people are optimistic about the impact AI will have on our future.
The pace of improvement in AI has been really fast over the last two decades, and I don't feel like it's a good thing. Compare the best text generator models from 10 years ago with GPT-3. Now do the same for image generators. Now project these improvements 20 years into the future. The amount of investment this work is getting grows with every such breakthrough. It seems likely to me we will figure out general-purpose human-level AI in a few decades.
And what then? There are so many ways this could turn into a dystopian future.
Imagine for example huge mostly-ML operated drone armies, tens of millions strong, that only need a small number of humans to supervise them. Terrified yet? What happens to democracy when power doesn't need to flow through a large number of people? When a dozen people and a few million armed drones can oppress a hundred million people?
If there's even a 5% chance of such an outcome (personally I think it's higher), then we should be taking it seriously.
I wish this was available as a tool for people to use! It's neat to see their list of pregenerated examples, but it would be more interesting to be able to try things out. Personally, I get a better sense of the powers and limitations of a technology when I can brainstorm some functionality I might want, and then see how close I can come to creating it. Perhaps at some point someone will make an open source version.
The way this model operates is the equivalent of machine learning shitposting.
Broke: Use a text encoder to feed text data to an image generator, like a GAN.
Woke: Use text and image tokens as the same input, and decode text and images as the same output.
And yet, due to the magic of Transformers, it works.
From the technical description, this seems feasible to clone given a sufficiently robust dataset of images, although the scope of the demo output implies a much more robust dataset than the ones Microsoft has offered publicly.
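The "same input / same output" trick can be sketched concretely. Below is a minimal, framework-free illustration: the sequence lengths and vocabulary sizes follow the DALL-E paper, but the helper functions themselves are hypothetical, and the pad handling is deliberately simplified (real BPE reserves a dedicated pad id).

```python
# A minimal sketch of the "single stream" idea behind DALL-E: text BPE
# tokens and image tokens (indices into a discrete VAE codebook) share
# one vocabulary, so a single autoregressive model predicts both.
TEXT_VOCAB = 16384               # text BPE vocabulary size
IMAGE_VOCAB = 8192               # dVAE codebook size
TEXT_LEN, IMAGE_LEN = 256, 1024  # 256 text tokens, then a 32x32 image grid
PAD = 0                          # simplified: this sketch just uses 0

def to_single_stream(text_ids, image_ids):
    """Concatenate text and image tokens into one training sequence."""
    assert len(text_ids) <= TEXT_LEN and len(image_ids) == IMAGE_LEN
    text = list(text_ids) + [PAD] * (TEXT_LEN - len(text_ids))
    # Offset image ids past the text vocabulary so the two never collide.
    return text + [i + TEXT_VOCAB for i in image_ids]

def split_stream(stream):
    """Inverse: recover the text ids and the image codebook ids."""
    text = [t for t in stream[:TEXT_LEN] if t != PAD]
    image = [t - TEXT_VOCAB for t in stream[TEXT_LEN:]]
    return text, image
```

One transformer trained over streams like this learns text and image structure jointly, which is what makes the "shitposting" architecture work.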
"Teapot in the shape of brain coral" yields the opposite. The topology is teapot-esque; the texture is composed of coral-like appendages. Sorry if this is overly semantic; I just happen to be in a deep dive into Shape Analysis at the moment ;)
> DALL·E appears to relate the shape of a half avocado to the back of the chair, and the pit of the avocado to the cushion.
That could be human bias recognizing features the generator yields implicitly. Most of the images appear as "masking" or "decal" operations, rather than a full style transfer. In other words, the expected outcome of "soap dispenser in the shape of hibiscus" would resemble a true hybridized design, like an haute couture bottle of eau de toilette made to resemble rose petals.
> a living room with two white armchairs and a painting of the colosseum. the painting is mounted above a modern fireplace.
With the ability to construct complex 3D scenes, surely the next step would be for it to ingest YouTube videos or TV/movies and be able to render entire scenes based on a written narration and dialogue.
The results would likely be uncanny or absurd without careful human editorial control, but it could lead to some interesting short films, or fan-recreations of existing films.
In spite of the close architectural resemblance to VQ-VAE-2, it definitely pushes the text-to-image synthesis domain forward. I'd be curious to see how well it can perform in a multi-object image setting, which currently presents the main challenge in the field. Also, I wouldn't be surprised if these results were limited to OpenAI's scale of computing resources.
All in all, great progress in the field. The pace of development here is simply staggering, considering that a few years back we could hardly generate any image in high fidelity.
I'm not sure how to feel, because I had this exact same thought. The evolution of porn from 320x200 EGA on a BBS, to usenet (alt.binaries.pictures.erotica, etc.) on XVGA (on an AIX term), to the huge pool of categories on today's porn sites, which eventually became video and bespoke cam performers... Is this going to be some new weird kind of porn that Gen Alpha normalizes?
Does this address NLP skeptics' concerns that Transformer models don't "understand" language?
If the AI can actually draw an image of a green block on a red block, and vice versa, then it clearly understands something about the concepts "red", "green", "block", and "on".
The root cause of skepticism has always been that while Transformers do exceptionally well on finite-sized tasks, they lack any fully recursive understanding of the concepts.[0]
A human can learn basic arithmetic, then generalize those principles to bigger-number arithmetic, then go from there to algebra, then calculus, and so on, successively building on previously learned concepts in a fully recursive manner. Transformers are limited by the exponential size of their network. So GPT-3 does very well with 2-digit addition and okay with 2-digit multiplication, but can't abstract to 6-digit arithmetic.
DALL-E is an incredible achievement, but doesn't really do anything to change this fact. GPT-3 can have an excellent understanding of a finite sized concept space, yet it's still architecturally limited at building recursive abstractions. So maybe it can understand "green block on a red block". But try to give it something like "a 32x16 checkerboard of green and red blocks surrounded by a gold border frame studded with blue triangles". I guarantee the architecture can't get that exactly correct.
The point is that, in some sense, GPT-3 is a technical dead end. We've had to exponentially scale up the size of the network (12B parameters) to make the same complexity gains that humans make with linear training. The fact that we've managed to push it this far is an incredible technical achievement, but it's pretty clear that we're still missing something fundamental.

[0] https://arxiv.org/pdf/1906.06755.pdf
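The "linear training" point can be made concrete: the grade-school carry algorithm, once learned from small examples, extends unchanged to numbers of any length - exactly the generalization a fixed-depth network struggles to make. A minimal sketch:

```python
from itertools import zip_longest

def add_digits(a, b):
    """Grade-school addition over digit lists (most-significant first).
    The same carry rule, applied position by position, works for numbers
    of any length - nothing about it is specific to 2 or 6 digits."""
    result, carry = [], 0
    # Walk from the least-significant digit, propagating the carry.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=0):
        carry, digit = divmod(da + db + carry, 10)
        result.append(digit)
    if carry:
        result.append(carry)
    return result[::-1]
```

A learner that internalizes this rule from 2-digit examples gets 6-digit addition for free; a model that has to fit each sequence length separately does not.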
Try "a large block on a small block". As the authors have also noted in their comments, the success rate is nearly zero. One may wonder why. Maybe because that's something you rarely see in photos? In the end, it doesn't "understand" the meaning of the words.
I think it is safe to say that learning a joint distribution of vision + language is fully possible at this stage, as demonstrated by this work.
But 'understanding' itself needs to be further specified in order to even be tested.
What strikes me most is the fidelity of the generated images, matching the SOTA from the GAN literature with much more variety, without using the GAN objective.
It seems the Transformer might be the best neural construct we have right now for learning any distribution, given more than enough data.
There are examples on twitter showing it doesn't really understand spatial relations very well. Stuff like "red block on top of blue block on top of green block" will generate red, green, and blue blocks, but not in the desired order.
According to your definition of understanding, this program understands something about the concept RED. But the code is just dealing with arbitrary values in memory (e.g. RED = 0xFF0000)
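The parent's point can be put in code. The deliberately dumb, hypothetical sketch below "draws" a green block on a red block correctly by the stated criterion, yet RED is nothing but an integer in memory:

```python
# A program that manipulates "RED" correctly without any plausible
# claim to understanding it - the color is just an arbitrary value.
RED = 0xFF0000
GREEN = 0x00FF00

def draw_block(color, on_top_of=None):
    """Hypothetical renderer stub: returns a nested scene description."""
    scene = {"block": color}
    if on_top_of is not None:
        scene["on"] = on_top_of
    return scene

# "A green block on a red block" - the nesting encodes "on",
# but does the program understand "on"?
scene = draw_block(GREEN, on_top_of=draw_block(RED))
```

Whether DALL-E's relation to "red" is different in kind from this, or only in degree, is exactly the disagreement in this thread.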
In various episodes of Star Trek: The Next Generation, the crew asks the computer to generate some environment or object with relatively little description. It's a storytelling tool of course, but looking at this, I can begin to imagine how we might get there from here.
Same for me. It's like the feeling you get in a dream where things seem normal and you think you're awake, then suddenly you notice something wrong about the room, something impossible.
I know exactly what you mean. Like if you had to see it in real life you'd see something horrible just out of shot. For some reason that's amplified with the furniture.
I really do think AI is going to replace millions of workers very quickly, but just not in the order that we used to think of. We will replace jobs that require creativity and talent before we replace most manual factory workers, as hardware is significantly more difficult to scale up and invent than software.
At this point I have replaced a significant number of creative workers with AI for personal usage. For example:
- I use desktop backgrounds generated by VAEs (VD-VAE)
- I use avatars generated by GANs (StyleGAN, BigGAN)
- I use and have fun with written content generated by transformers (GPT-3)
- I listen to and enjoy music and audio generated by autoencoders (Jukebox, Magenta project, many others)
- I don't purchase stock images or commission artists for many things I previously would have, when a GAN already exists that makes the class of image I want
All of this has happened in the last year or so for me, and I expect that within a few more years this will be the case for vastly more people and in a growing number of domains.
Not to undermine this development, but so far, no surprise, AI depends on vast quantities of human-generated data. This leads us to a loop: if AI replaces human creativity, who will create novel content for the next generation of AI? Will AI also learn to break through conventions, to shock and rewrite the rules of the game?
It’s like efficient market hypothesis: markets are efficient because arbitrage, which is highly profitable, makes them so. But if they are efficient, how can arbitrageurs afford to stay in business? In practice, we are stuck in a half-way house, where markets are very, but not perfectly, efficient.
I guess in practice, the pie for humans will keep on shrinking, but won’t disappear too soon. Same as horse maintenance industry, farming and manufacturing, domestic work etc. Humans are still needed there, just a lot less of them.
I believe that AI will accelerate creativity. This will have a side effect of devaluing some people's work (like you mentioned), but it will also increase the value of some types of art and, more importantly, make it possible to do things that were impossible before, or allow small teams and individuals to produce content that was previously prohibitively expensive.
There still needs to be some sort of human curation, lest bad/rogue output sink the entire AI-generated industry. (In the case of DALL-E, OpenAI's new CLIP system is intended to mitigate the need for cherry-picking, although judging from the final demo it's still qualitative.)
The demo inputs here for DALL-E are curated and utilize a few GPT-3 prompt engineering tricks. I suspect that for typical unoptimized human requests, DALL-E will go off the rails.
Frankly, I think the "AI will replace jobs that require X" angle of automation is borderline apocalyptic conspiracy porn. It's always phrased as if the automation simply stops at making certain jobs redundant. It's never phrased as if the automation lowers the bar to entry from X to Y for /everyone/, which floods the market with crap and makes people crave the good stuff made by the top 20%. Why isn't it considered as likely that this kind of technology will simply make the best 20% of creators exponentially more creatively prolific in quantity and quality?
I wouldn't say many of those things are creativity-driven. They are more like automatic asset generation.
One use case for such a model would be in the gaming industry, to generate large amounts of assets quickly. This process alone takes years, and gets more and more expensive as gamers demand higher and higher resolution.
AI can make this process much more tenable, bringing down the overall cost.
You are probably right. Still, there is hope that this is just a prelude to getting closer to a Transmetropolitan box (assuming we can ever figure out how to make an AI box that can make physical items based purely on information given by the user).
Now we just have to wait for Hugging Face to create an open-source implementation. So much for openness, I guess: if you go on Microsoft Azure you can use closed AI.
Does anyone have any insight on how much it would cost for OpenAI to host an online, interactive demo of a model like this? I'd expect a lot - even just for inference - based on the size of the model and the expected virality of the demo, but I have no reference points for quantifying it.
The name DALL-E is terrific though!
dj_mc_merlin | 5 years ago:
Another good example is the "collection of glasses" on the table. It makes both glassware and eyeglasses!
reubens | 5 years ago:
You try drawing a snail made of harp! Seriously! DALL-E did an incredible job.
[0] https://www.technologyreview.com/2021/01/05/1015754/avocado-...
dj_mc_merlin | 5 years ago:
I can't believe it. How does it put the baby daikon radish in the tutu?
wccrawford | 5 years ago:
https://github.com/openai/
irrational | 5 years ago:
Donald Trump is Nancy Pelosi's and AOC's step-brother in a three-way in the Lincoln Bedroom.
TigeriusKirk | 5 years ago:
https://twitter.com/peabody124/status/1346565268538089483
CyberRabbi | 5 years ago:
Prompt: a Windows GUI executable that implements a scientific calculator.
sushisource | 5 years ago:
> - I listen to and enjoy music and audio generated by autoencoders (Jukebox, Magenta project, many others)
Really, you've "replaced" normal music and books with these? Somehow I doubt that.
ErikAugust | 5 years ago:
Couldn't any creator of images that a model was trained on sue for copyright infringement?
Or do great artists really just steal (just at a massive scale)?
RealSpaceMonkey | 5 years ago:
Do you have a GPT-3 key?