Some truly impressive results. I'll make my usual point here when a fancy new generative model comes out, and I'm sure some of the other commenters have alluded to this. The examples shown are likely from a set of well-defined (read: lots of data, high bias) input classes for the model. What would be really interesting is how the model generalizes to /object concepts/ that have yet to be seen, and which have abstract relationships to the examples it has seen. Another commenter here mentioned "red square on green square" working, but "large cube on small cube" not working. Humans are able to infer and understand such abstract concepts with very few examples, and this is something AI isn't as close to as it might seem.
Wow. This is amazing. Although I wish they documented how much compute and data was used to get these results.
I absolutely believe we'll crack the fundamental principles of intelligence in our lifetimes. We now have the capability to process all public data available on the internet (of which Wikipedia is a huge chunk). We have so many cameras and microphones (one in each pocket).
It's also scary to think of it going wrong (the great filter for the Fermi paradox). However, I'm optimistic.
The brain uses only 20 watts of power to do all its magic. The entire human body is built from 700MB of data in DNA. The fundamental principles of intelligence are within reach if we look at it from that perspective.
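The 700MB figure is easy to sanity-check with a rough back-of-envelope, assuming ~3 billion base pairs and 2 bits per base (raw, uncompressed information content):

```python
# Back-of-envelope for the DNA figure: the human genome has roughly
# 3 billion base pairs, and each base (A, C, G, or T) carries 2 bits.
BASE_PAIRS = 3_000_000_000
BITS_PER_BASE = 2

total_bytes = BASE_PAIRS * BITS_PER_BASE // 8
print(f"~{total_bytes / 1e6:.0f} MB uncompressed")  # ~750 MB
```

So the commenter's 700MB is in the right ballpark, before any compression.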
Right now GPT-3 and DALL-E seem to use an insane amount of computation to achieve what they do. My prediction is that by 2050, we'll have pretty good intelligence in our phones, with deep language and visual understanding of the world around us.
This is simultaneously amazing and depressing, like watching someone set off a hydrogen bomb for the first time and marveling at the mushroom cloud it creates.
I really find it hard to understand why people are optimistic about the impact AI will have on our future.
The pace of improvement in AI has been really fast over the last two decades, and I don't feel like it's a good thing. Compare the best text generator models from 10 years ago with GPT-3. Now do the same for image generators. Now project these improvements 20 years into the future. The amount of investment this work is getting grows with every such breakthrough. It seems likely to me we will figure out general-purpose human-level AI in a few decades.
And what then? There are so many ways this could turn into a dystopian future.
Imagine for example huge mostly-ML operated drone armies, tens of millions strong, that only need a small number of humans to supervise them. Terrified yet? What happens to democracy when power doesn't need to flow through a large number of people? When a dozen people and a few million armed drones can oppress a hundred million people?
If there's even a 5% chance of such an outcome (personally I think it's higher), then we should be taking it seriously.
I wish this was available as a tool for people to use! It's neat to see their list of pregenerated examples, but it would be more interesting to be able to try things out. Personally, I get a better sense of the powers and limitations of a technology when I can brainstorm some functionality I might want, and then see how close I can come to creating it. Perhaps at some point someone will make an open source version.
The way this model operates is the equivalent of machine learning shitposting.
Broke: Use a text encoder to feed text data to an image generator, like a GAN.
Woke: Use text and image tokens as the same input, and decode text and images as the same output.
And yet, due to the magic of Transformers, it works.
From the technical description, this seems feasible to clone given a sufficiently robust dataset of images, although the scope of the demo output implies a much more robust dataset than the ones Microsoft has offered publicly.
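The "same input / same output" trick can be sketched concretely. Below is a minimal, framework-free illustration: the sequence lengths and vocabulary sizes follow the DALL-E paper, but the helper functions themselves are hypothetical, and the pad handling is deliberately simplified (real BPE reserves a dedicated pad id).

```python
# A minimal sketch of the "single stream" idea behind DALL-E: text BPE
# tokens and image tokens (indices into a discrete VAE codebook) share
# one vocabulary, so a single autoregressive model predicts both.
TEXT_VOCAB = 16384               # text BPE vocabulary size
IMAGE_VOCAB = 8192               # dVAE codebook size
TEXT_LEN, IMAGE_LEN = 256, 1024  # 256 text tokens, then a 32x32 image grid
PAD = 0                          # simplified: this sketch just uses 0

def to_single_stream(text_ids, image_ids):
    """Concatenate text and image tokens into one training sequence."""
    assert len(text_ids) <= TEXT_LEN and len(image_ids) == IMAGE_LEN
    text = list(text_ids) + [PAD] * (TEXT_LEN - len(text_ids))
    # Offset image ids past the text vocabulary so the two never collide.
    return text + [i + TEXT_VOCAB for i in image_ids]

def split_stream(stream):
    """Inverse: recover the text ids and the image codebook ids."""
    text = [t for t in stream[:TEXT_LEN] if t != PAD]
    image = [t - TEXT_VOCAB for t in stream[TEXT_LEN:]]
    return text, image
```

One transformer trained over streams like this learns text and image structure jointly, which is what makes the "shitposting" architecture work.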
"Teapot in the shape of brain coral" yields the opposite. The topology is teapot-esque; the texture is composed of coral-like appendages. Sorry if this is overly semantic; I just happen to be in a deep dive into Shape Analysis at the moment ;)
> DALL·E appears to relate the shape of a half avocado to the back of the chair, and the pit of the avocado to the cushion.
That could be human bias recognizing features the generator yields implicitly. Most of the images appear as "masking" or "decal" operations, rather than a full style transfer. In other words, the expected outcome of "soap dispenser in the shape of hibiscus" would resemble a true hybridized design, like an haute couture bottle of eau de toilette made to resemble rose petals.
> a living room with two white armchairs and a painting of the colosseum. the painting is mounted above a modern fireplace.
With the ability to construct complex 3D scenes, surely the next step would be for it to ingest YouTube videos or TV/movies and be able to render entire scenes based on a written narration and dialogue.
The results would likely be uncanny or absurd without careful human editorial control, but it could lead to some interesting short films, or fan-recreations of existing films.
In spite of the close architectural resemblance to VQ-VAE-2, it definitely pushes the text-to-image synthesis domain forward. I'd be curious to see how well it can perform in a multi-object image setting, which currently presents the main challenge in the field. Also, I wouldn't be surprised if these results were limited to OpenAI's scale of computing resources.
All in all, great progress in the field. The pace of development here is simply staggering, considering that a few years back we could hardly generate any image in high fidelity.
I'm not sure how to feel, because I had this exact same thought. The evolution of porn from 320x200 EGA on a BBS, to usenet (alt.binaries.pictures.erotica, etc.) on XVGA (on an AIX term), to the huge pool of categories on today's porn sites, which eventually became video and bespoke cam performers... Is this going to be some new weird kind of porn that Gen Alpha normalizes?
Does this address NLP skeptics' concerns that Transformer models don't "understand" language?
If the AI can actually draw an image of a green block on a red block, and vice versa, then it clearly understands something about the concepts "red", "green", "block", and "on".
The root cause of skepticism has always been that while Transformers do exceptionally well on finite-sized tasks, they lack any fully recursive understanding of the concepts.[0]
A human can learn basic arithmetic, then generalize those principles to bigger-number arithmetic, then go from there to algebra, then calculus, and so on, successively building on previously learned concepts in a fully recursive manner. Transformers are limited by the exponential size of their network. So GPT-3 does very well with 2-digit addition and okay with 2-digit multiplication, but can't abstract to 6-digit arithmetic.
DALL-E is an incredible achievement, but doesn't really do anything to change this fact. GPT-3 can have an excellent understanding of a finite sized concept space, yet it's still architecturally limited at building recursive abstractions. So maybe it can understand "green block on a red block". But try to give it something like "a 32x16 checkerboard of green and red blocks surrounded by a gold border frame studded with blue triangles". I guarantee the architecture can't get that exactly correct.
The point is that, in some sense, GPT-3 is a technical dead end. We've had to exponentially scale up the size of the network (12B parameters) to make the same complexity gains that humans make with linear training. The fact that we've managed to push it this far is an incredible technical achievement, but it's pretty clear that we're still missing something fundamental.

[0] https://arxiv.org/pdf/1906.06755.pdf
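The "linear training" point can be made concrete: the grade-school carry algorithm, once learned from small examples, extends unchanged to numbers of any length - exactly the generalization a fixed-depth network struggles to make. A minimal sketch:

```python
from itertools import zip_longest

def add_digits(a, b):
    """Grade-school addition over digit lists (most-significant first).
    The same carry rule, applied position by position, works for numbers
    of any length - nothing about it is specific to 2 or 6 digits."""
    result, carry = [], 0
    # Walk from the least-significant digit, propagating the carry.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=0):
        carry, digit = divmod(da + db + carry, 10)
        result.append(digit)
    if carry:
        result.append(carry)
    return result[::-1]
```

A learner that internalizes this rule from 2-digit examples gets 6-digit addition for free; a model that has to fit each sequence length separately does not.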
Try "a large block on a small block". As the authors have also noted in their comments, the success rate is nearly zero. One may wonder why. Maybe because that's something you rarely see in photos? In the end, it doesn't "understand" the meaning of the words.
I think it is safe to say that learning a joint distribution of vision + language is fully possible at this stage, as demonstrated by this work.
But 'understanding' itself needs to be further specified in order to even be tested.
What strikes me most is the fidelity of the generated images, matching the SOTA from the GAN literature with much more variety, without using the GAN objective.
It seems the Transformer might be the best neural construct we have right now for learning any distribution, given more than enough data.
There are examples on twitter showing it doesn't really understand spatial relations very well. Stuff like "red block on top of blue block on top of green block" will generate red, green, and blue blocks, but not in the desired order.
According to your definition of understanding, this program understands something about the concept RED. But the code is just dealing with arbitrary values in memory (e.g. RED = 0xFF0000)
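The parent's point can be put in code. The deliberately dumb, hypothetical sketch below "draws" a green block on a red block correctly by the stated criterion, yet RED is nothing but an integer in memory:

```python
# A program that manipulates "RED" correctly without any plausible
# claim to understanding it - the color is just an arbitrary value.
RED = 0xFF0000
GREEN = 0x00FF00

def draw_block(color, on_top_of=None):
    """Hypothetical renderer stub: returns a nested scene description."""
    scene = {"block": color}
    if on_top_of is not None:
        scene["on"] = on_top_of
    return scene

# "A green block on a red block" - the nesting encodes "on",
# but does the program understand "on"?
scene = draw_block(GREEN, on_top_of=draw_block(RED))
```

Whether DALL-E's relation to "red" is different in kind from this, or only in degree, is exactly the disagreement in this thread.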
In various episodes of Star Trek: The Next Generation, the crew asks the computer to generate some environment or object with relatively little description. It's a storytelling tool of course, but looking at this, I can begin to imagine how we might get there from here.
Same for me. It's like the feeling you get in a dream where things seem normal and you think you're awake, then suddenly you notice something wrong about the room, something impossible.
I know exactly what you mean. Like if you had to see it in real life you'd see something horrible just out of shot. For some reason that's amplified with the furniture.
I really do think AI is going to replace millions of workers very quickly, but just not in the order that we used to think of. We will replace jobs that require creativity and talent before we replace most manual factory workers, as hardware is significantly more difficult to scale up and invent than software.
At this point I have replaced a significant number of creative workers with AI for personal usage. For example:
- I use desktop backgrounds generated by VAEs (VD-VAE)
- I use avatars generated by GANs (StyleGAN, BigGAN)
- I use and have fun with written content generated by transformers (GPT-3)
- I listen to and enjoy music and audio generated by autoencoders (Jukebox, Magenta project, many others)
- I don't purchase stock images or commission artists for many things I previously would have, when a GAN already exists that makes the class of image I want
All of this has happened in the last year or so for me, and I expect that within a few more years this will be the case for vastly more people and in a growing number of domains.
Not to undermine this development, but so far, no surprise, AI depends on vast quantities of human-generated data. This leads us to a loop: if AI replaces human creativity, who will create novel content for the next generation of AI? Will AI also learn to break through conventions, to shock and rewrite the rules of the game?
It’s like efficient market hypothesis: markets are efficient because arbitrage, which is highly profitable, makes them so. But if they are efficient, how can arbitrageurs afford to stay in business? In practice, we are stuck in a half-way house, where markets are very, but not perfectly, efficient.
I guess in practice, the pie for humans will keep on shrinking, but won’t disappear too soon. Same as horse maintenance industry, farming and manufacturing, domestic work etc. Humans are still needed there, just a lot less of them.
I believe that AI will accelerate creativity. This will have a side effect of devaluing some people's work (like you mentioned), but it will also increase the value of some types of art and, more importantly, make it possible to do things that were impossible before, or allow small teams and individuals to produce content that was previously prohibitively expensive.
There still needs to be some sort of human curation, lest bad/rogue output sink the entire AI-generated industry. (In the case of DALL-E, OpenAI's new CLIP system is intended to mitigate the need for cherry-picking, although judging from the final demo it's still qualitative.)
The demo inputs here for DALL-E are curated and utilize a few GPT-3 prompt engineering tricks. I suspect that for typical unoptimized human requests, DALL-E will go off the rails.
Frankly, I think the "AI will replace jobs that require X" angle of automation is borderline apocalyptic conspiracy porn. It's always phrased as if the automation simply stops at making certain jobs redundant. It's never phrased as if the automation lowers the bar to entry from X to Y for /everyone/, which floods the market with crap and makes people crave the good stuff made by the top 20%. Why isn't it considered as likely that this kind of technology will simply make the best 20% of creators exponentially more creatively prolific in quantity and quality?
I wouldn't say many of those things are creativity-driven. They are more like automatic asset generation.
One use case for such a model would be in the gaming industry, to generate large amounts of assets quickly. This process alone takes years, and gets more and more expensive as gamers demand higher and higher resolution.
AI can make this process much more tenable, bringing down the overall cost.
You are probably right. Still, there is hope that this is just a prelude to getting closer to a Transmetropolitan box (assuming we can ever figure out how to make an AI box that can make physical items based purely on information given by the user).
Now we just have to wait for Hugging Face to create an open-source implementation. So much for openness, I guess: if you go on Microsoft Azure you can use closed AI.
Does anyone have any insight on how much it would cost for OpenAI to host an online, interactive demo of a model like this? I'd expect a lot - even just for inference - based on the size of the model and the expected virality of the demo, but I have no reference points for quantifying it.
The name DALL-E is terrific though!
dj_mc_merlin | 5 years ago:
Another good example is the "collection of glasses" on the table. It makes both glassware and eyeglasses!
reubens | 5 years ago:
You try drawing a snail made of harp! Seriously! DALL-E did an incredible job.
[0] https://www.technologyreview.com/2021/01/05/1015754/avocado-...
dj_mc_merlin | 5 years ago:
I can't believe it. How does it put the baby daikon radish in the tutu?
wccrawford | 5 years ago:
https://github.com/openai/
irrational | 5 years ago:
Donald Trump is Nancy Pelosi's and AOC's step-brother in a three-way in the Lincoln Bedroom.
TigeriusKirk | 5 years ago:
https://twitter.com/peabody124/status/1346565268538089483
CyberRabbi | 5 years ago:
Prompt: a Windows GUI executable that implements a scientific calculator.
sushisource | 5 years ago:
> - I listen to and enjoy music and audio generated by autoencoders (Jukebox, Magenta project, many others)
Really, you've "replaced" normal music and books with these? Somehow I doubt that.
ErikAugust | 5 years ago:
Couldn't any creator of images that a model was trained on sue for copyright infringement?
Or do great artists really just steal (just at a massive scale)?
RealSpaceMonkey | 5 years ago:
Do you have a GPT-3 key?