I'm sure it is, but "gluing things together" coherently in response to a text prompt is a stupendous achievement. It's not AGI, but it's miles ahead of where we were even a few years ago and opens the door to automating a class of jobs I don't think anyone back then believed could be automated, short of AGI.
I ended up reading the book Blindsight (Peter Watts) that's been floating around in comments recently. A major theme in the book is intelligence and its relation to consciousness (including whether consciousness is even beneficial). If you agree with the idea, you'd consider that DALL-E is indeed intelligent even though it appears to be a "Chinese Room". Humans would be "gluing things together" in just the same way, but with this odd introspective ability that makes it seem different.
The most important thing I think DALL-E shows is that it has a model of our world and culture. It's not intelligence, but it is knowledge.
Google can give you endless pictures of giraffes if you search for them. But it can only connect you to what exists. It doesn't know things, it knows OF things.
DALL-E has knowledge of the concept of a giraffe, and can synthesize an endless number of never-before-seen giraffes for you. It actually knows what a giraffe is.
> DALL-E's difficulty in juxtaposing wildly contrastive image elements suggests that the public is currently so dazzled by the system’s photorealistic and broadly interpretive capabilities as to not have developed a critical eye for cases where the system has effectively just ‘glued’ one element starkly onto another, as in these examples from the official DALL-E 2 site:
Yes the public is so dazzled by this massive leap in capability that it hasn't developed a critical eye for minor flaws.
Yeah we get it. It's not instantly perfect. But the fact that people aren't moaning that it can't put a tea cup in a cylinder isn't because everyone stupidly thinks it is perfect, it's because not everyone is a miserable naysayer.
I don't understand how my brain isn't just gluing things together either. I don't personally feel like I'm actually experiencing the understanding of anything.
Yes, DALL-E is very impressive to see and can have a number of actual practical uses.
But fear of AGI is huge currently. The more impressive non-AGI things we see, the more worried people naturally become that we're reaching the "dawn" of AGI, with all the disturbing implications that this might have. (A lot of people are afraid an AGI might escape the control of its creator and destroy humanity. I think that's less likely, but I think an AGI under the control of its creator could destroy or devastate humanity, so I'd agree AGI is a worry.)
That DALL-E doesn't understand object-relationships should be obvious to people who know this technology, but a lot of people seem to need it spelled out. And they probably need it spelled out why this implies it's not AGI. But that would be several more paragraphs for me.
Just think what this could do for a game experience like Scribblenauts. Just being able to glue a fixed number of concepts in a huge number of ways...game designers are going to have to learn how to leverage ML.
I know very little about this topic, but one thing that strikes me about the argument of AI being far away from real intelligence due to just "gluing things together" is that it is non-obvious to me how we as intelligent creatures aren't just extremely sophisticated gluing machines.
Their research showed that Dall-E had the most success with real-world stuff it had been trained on. Is this surprising? I mean, if I didn't know much about iguanas I'd also have a hard time representing them.
I have a phrase I'd like to coin in contrast to AI. "Artificial Bullshit". AB.
I of course mean "bullshit" in the highly technical sense defined by Frankfurt [1]. The defining feature that separates a bullshitter from a liar is that a liar knows and understands the truth and intentionally misrepresents the matters of fact to further their aims, whereas a bullshitter is wholly unconcerned with the truth of the matters they are discussing and is only interested in the social game aspect of the conversation. Bullshit is far more insidious than a lie, for bullshit can (and often does) turn out to be coincident with the truth. When that happens the bullshitter goes undetected and is free to infect our understanding with more bullshit made up on the spot.
DALL-E generates the images it thinks you want to see. It is wholly unconcerned with the actual objects rendered that are the ostensible focus of the prompt. In other words, it's bullshitting you. It was only trained on how to get your approval, not to understand the mechanics of the world it is drawing. In other words, we've trained a machine to have daddy issues.
A profoundly interesting question (to me) is if there's a way to rig a system of "social game reasoning" into ordinary logical reasoning. Can we construct a Turing Tarpit out of a reasoning system with no true/false semantics, a system only designed to model people liking/disliking what you say? If the answer is yes, then maybe a system like DALL-E will unexpectedly gain real understanding of what it is drawing. If not, systems like DALL-E will always be Artificial Bullshit.
[1] http://www2.csudh.edu/ccauthen/576f12/frankfurt__harry_-_on_...
I think you're right, but I would qualify that the AI is bullshitting in the same way that a child's drawing of a stick figure, house, and smiling sun is bullshit designed to get approval. The AI is giving symbols--very visually stunning ones, to be sure, but symbols nonetheless--of what it is prompted to create, just like a child learns that "circle with lines coming out of it" is a symbol that can be read as "sun" and praised by adults.
To me, Dall-E seems analogous to a film production team that produces visual imagery reflecting a script written by a screenwriter. By the above reasoning, would that team be producing "bullshit"? I think most people would think not, because the goal isn't to communicate objective truth about the world, but rather something plausible, interesting, entertaining, etc. (unless it is a documentary).
I also think distinguishing bullshit from lying depends heavily on internal mental thoughts, goals, and intentions. Isn't talking about Dall-E this way personification and ascribing some level of consciousness?
> It is wholly unconcerned with the actual objects rendered that are the ostensible focus of the prompt.
I disagree. To the extent that the training data are images of actual objects, recreating images of actual objects is the only thing DALL-E cares about.
If we define "caring" about something as changing behavior to cause that to happen, then a neural network doesn't "care" about inference at all, because inference never changes the network's behavior.
It also doesn't know or care about your approval. It only cares about minimizing the loss function.
(But now that you bring this up, I think it would be really interesting to create a network that, after training initially on training data, began interacting with people and continued training to maximize approval.)
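To make that distinction concrete, here's a minimal PyTorch sketch (toy model and data, nothing DALL-E-specific; just the claim in code):

```python
import torch
import torch.nn as nn

# Toy network standing in for something like DALL-E's decoder.
model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x, target = torch.randn(8, 16), torch.randn(8, 4)

# Training: the only phase where the network "cares" in the sense above,
# i.e. its behavior changes to push the loss down.
opt.zero_grad()
loss = loss_fn(model(x), target)
loss.backward()  # gradients of the loss w.r.t. the weights
opt.step()       # weights move; behavior changes

# Inference: weights are frozen. By that definition of "caring", nothing
# is cared about here - an output is produced and the model is unchanged.
with torch.no_grad():
    y = model(torch.randn(1, 16))
```

Continuing to update the weights from human approval signals after deployment, as suggested above, is roughly what reinforcement-learning-from-human-feedback setups do.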
I reached essentially the same conclusion after playing with GPT-3 for a while. It spins out mountains of convincing and impressive bullshit, but you can’t actually trust anything it says because it is disconnected from right, wrong, correct and incorrect. Reading too much of what it outputs is dangerous because it basically is feeding white noise into your perception and experience of the world.
I'm a big fan of Frankfurt's "On Bullshit", and love the reference.
I think there's one significant distinction between a normal human bullshitter that Frankfurt originally envisioned, and the AI practicing Artificial Bullshit. The bullshitter knows there is truth and intentionally disregards it; whereas the AI is blind to the concept. I guess this is "mens rea" in a sense, the human is conscious of their guilt (even if they're apathetic towards it), whereas DALL-E is just a tool that does what it is programmed to do.
I do like this application of "bullshit" though, and will keep it in mind going forward.
> Bullshit is far more insidious than a lie, for bullshit can (and often does) turn out to be coincident with the truth. When that happens the bullshitter goes undetected and is free to infect our understanding with more bullshit made up on the spot.
If the bullshit is turning out to be true, what's the issue with more of it? If it's not true but still believed and so causing problems, what's the practical difference between it and an undetected lie that makes it more insidious?
A system can learn to do all kinds of interesting things by trying to optimize getting rewards.
See: https://www.deepmind.com/publications/reward-is-enough
What you call bullshit I call imagination. Both humans and AI need it. Humans use imagination to plan ahead. AlphaGo was generating moves to plan ahead.
Dall-E and GPT-3 are not being used as agents, they are just tool AIs. They have a narrow task - generating images and text. Agents on the other hand need to learn how to act in the environment, while learning to understand the world at the same time.
As far as I know, the human brain is just a "social game reasoning" optimizer that we try (and fail) to use to do actual logical reasoning. The zillion cognitive biases we have are the clue: we don't do logic, we have biases and sometimes stumble upon logic.
DALL-E either doesn't generate images you want to see, or if it does, it does a bad job, because it generates many images you don't want to see.
In other words, the claim you've set up is basically unfalsifiable, given that there's no way to form strong counterevidence from its outputs. (I would argue that if there was, we'd already have it in the vast majority of outputs that aren't images people want.)
If I were to refine what you're saying, it's that DALL-E is constrained to generating images that make sense to the human visual system in a coherent way. That constraint is a far cry from what you'd need to claim it is "bullshitting", though, since it operates at a very low level in terms of constraining outputs.
I agree, but I disagree about one aspect. For the most part, humans don't use reason all that much or all that deeply. We usually use intuitive thinking, and there is research showing that immediate intuitive responses are often better than the result of long thinking. More negatively, is QAnon belief, or even Trump election claim belief, about reason? Or is it about associations between words and concepts, especially when those concepts are believed in by the people the believer tends to trust and associate with?
In other words, the takeaway here may not be that GPT-3 spews bullshit. It may be that most of the time, human "thinking" is a less-nuanced, biological version of GPT-3.
Slight side tangent but reading this article it hit me how much this generation of work may be reinforcing English as the global language for generations to come. It seems like we are headed towards a phase of technology where learning how to feed well-crafted prompts into the AI system will be a highly valuable productivity skill. And since the major AI systems seem to be built around English, that would make English fluency even more valuable than it already is. I’m sure that’s obvious to non-native speakers who have worked hard to master English, I just hadn’t thought of it before.
Less likely but still interesting, I wonder if the way we’re building these models will at some point begin to layer on top of each other such that English as it is used now becomes something deeply embedded in AI, and whether that will evolve with the spoken language or not. It’s funny to imagine a future where people would need to master an archaic flavor of English to get the best results working with their AI helpers.
Also worth noting that the internet has massively accelerated the importance of English already.
As an ESL speaker who grew up on the internet, Norwegian was more or less useless to me outside school and family. Most of my time was spent on the internet, reading and writing lots of English. Norwegian Wikipedia is pretty much useless unless you don't know English. That's still true today for the vast majority of articles, but back then it was universally the case.
There were Norwegian forums, but with a population of just 4 million and change at the time, they were never as interesting or active as international/American forums and IRC channels.
In fact I'd say Norwegian is only my native language in spoken form, whereas English feels more natural to me to write and read. Doesn't help that Norwegian has two very divergent written forms, either.
I even write my private notes in English, even though I will be the only one reading them.
Perhaps, but another possibility is that the more advanced models all end up being polyglots. The state of the art in machine translation already uses a single model trained on multiple languages[1], which results in better translations between languages it doesn't have a lot of examples for. If the same principle applies to other types of models, then training them on every possible dataset available regardless of language might yield better results. That could result in models that are fluent in hundreds of languages. (I'd be curious as to whether DALL-E understands prompts in languages other than English, has anyone tried?)
[1]: https://ai.googleblog.com/2019/10/exploring-massively-multil...
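As a concrete example of the single-model approach, here's a sketch using the open M2M-100 checkpoint on Hugging Face (the model name and the "no" language code for Norwegian are from memory, so treat them as assumptions to double-check):

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# One model covering ~100 languages, no English pivot required.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "no"  # Norwegian in, per the thread's example
encoded = tokenizer("Jeg skriver notatene mine på engelsk.", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("en"),  # English out
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

The interesting property is that low-resource pairs benefit from everything the model saw in high-resource ones, which is the same transfer you'd hope for in a multilingual text-to-image model.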
But the same work is also removing language barriers at the same time, with really good translation tools. I'd guess that being fluent in English will not be as important as it is now.
Edit: the same work = transformer based language models
I think DALL-E is clearly just gluing things together that it found in a massive dataset, and doesn't have any understanding of the underlying concepts. I thought it was easy to see the signs of this in examining its output. Same for GPT-3.
However, what's amazing about DALL-E and these other statistical, generative models to me is that it's made me think about how much of my daily thought processes are actually just gluing things together from some kind of fuzzy statistical model in my head.
When I see an acquaintance on the street, I don't carefully consider and "think" about what to say to them. I just blurt something out from some database of stock greetings in my head - which are probably based on and weighted by how people have reacted in the past, what my own friends have used similar greetings, and what "cool" people say in TV and other media in similar circumstances. "Hey man how's it going?"
If I was asked to draw an airplane, I don't "think" about what an airplane looks like from first principles - I can just synthesize one in my head and start drawing. There are tons of daily activities that we do like this that don't involve anything I'd call "intelligent thought." I have several relatives that, in the realm of political thought, don't seem to have anything more in their head than a GPT-3 model trained on Fox News (that, just like GPT-3, can't detect any logical contradictions between sentences).
DALL-E has convinced me that even current deep learning models are probably very close to replicating the performance of a significant part of my brain. Not the most important part or the most "human" part, perhaps. But I don't see any major conceptual roadblock between that part and what we call conscious, intelligent thought. Just many more layers of connectivity, abstraction, and training data.
Before DALL-E I didn't believe that simply throwing more compute at the AGI problem would one day solve it. Now I do.
I've been following the image generation field for a couple months now and while the answer to the title is "yes for most things" it is easily fixed. Use a better text encoder.
My favorite anecdote for showing how having a text encoder that actually understands the world is important to image generation is that when querying for "barack obama" on a model trained on a dataset that has never seen Barack Obama, the model somehow generates images of random black men wearing suits[1]. This is, in my non-expert opinion, a clear indication that the model's knowledge of the world is leaking through to the image generator. So if my understanding is right, as long as a concept can be represented properly in the text embeddings of a model, the image generation will be able to use it.
If my anecdote doesn't convince you, consider that one of Google's findings in the Imagen paper was that increasing the size of the text encoder had a much bigger effect not only on the quality of the image, but also on how faithfully the image follows the prompt, including the image generator being able to spell words.
I think the next big step in the text to image generation field, aside from the current efforts to optimize the diffusion models, will be to train an efficient text encoder that can generate high quality embeddings.
[1] Results of querying "barack obama" to an early version of cene555's imagen reproduction effort. https://i.imgur.com/oUo3QdF.png
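To make the "knowledge lives in the text embeddings" point concrete, here's a rough sketch using the public CLIP text encoder (an analogy for this pipeline stage, not the actual encoder from that reproduction; the diffusion side is only summarized in comments):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# The text encoder is where a prompt becomes vectors. The image generator
# only ever sees these vectors, so this is where "world knowledge" lives.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(
    ["barack obama", "an older man in a suit giving a speech"],
    padding=True, return_tensors="pt",
)
with torch.no_grad():
    out = encoder(**tokens)

# A diffusion model would cross-attend to the per-token vectors while
# denoising; here we just compare the pooled embeddings of the two prompts.
a, b = out.pooler_output
print(torch.cosine_similarity(a, b, dim=0))  # high similarity here would
                                             # explain "random men in suits"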
> when querying for "barack obama" on a model trained on a dataset that has never seen Barack Obama, the model somehow generates images of random black men wearing suits[1]. This is, in my non-expert opinion, a clear indication that the model's knowledge of the world is leaking through to the image generator.
That's super interesting. It's not just black men in suits either. It's older black men, with the American flag in the background who look like they might be speaking. Clearly the model has a pretty in-depth knowledge of the context surrounding Barack Obama.
I would say the image generation model is also doing a pretty great job at stitching those concepts together in a way that's coherent. It's not a random jumble. It's kind of what you would expect if you asked a human artist to draw a black American president.
It was one of the things highlighted by Google when they announced Imagen as a differentiator: https://imagen.research.google
The article touches on this but the headline is slightly deceptive.
The Fair Witness was a job that Heinlein made up for Stranger in a Strange Land. Fair Witnesses were supposed to reliably report what they saw without judgement - without injecting their subjective judgement into their report. The example exchange is: "Is that house over there brown?" "It is brown on this side."
Dall-E (and other ML systems) feel like fair witnesses for our cultural milieu. They basically find a series of weighted connections between every phrase we've thought to write down or say about all images and can blend between those weights on the fly. By any assessment it's an amazing feat - as is the feat to view their own work and modify it (though ofc it's from their coordinate system so one does expect it would work).
In one sense - asking if the machine "understands" is beside the point. It does not need to 'understand' to be impressive (or even what people claim when they're not talking to Vice media or something).
In another sense, even among humans, "understanding" is both a contested term and a height that we all agree we don't all reach all of the time. One can use ideas very successfully for many things without "understanding" them.
Sometimes people will, like, turn this around and claim that: because humans don't always understand ideas when they use them, we should say that ML algorithms are doing a kind of understanding. I don't buy it - the map is not the territory. How ML algorithms interact with semantics is wholly unlike how humans interact with them (even though the learning patterns show a lot of structural similarity). Maybe we are glimpsing a whole new kind of intelligence that humans cannot approach - an element of Turing Machine Sentience - but it seems clear to me that "understanding" in the Human Sentience way (whatever that means) is not part of it.
Now, let's be critical of possible reasons for this. It's important to remember two things: 1) Any NN has zero experience with the world beyond its training data. Things that seem obvious to us from our experience are not obvious to a system that has never experienced those things. And 2) DALL-E 2 was trained on image-caption pairs scraped from the internet, basically.
So it's quite possible the reason it doesn't understand things like "X under Y" very well is that its training set doesn't have a lot of captions describing positional information like that, as opposed to any failure in the architecture to even potentially understand these things.
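That hypothesis is cheap to sanity-check: count how often captions actually contain explicit spatial relations. A sketch with toy stand-in captions (a real check would run over something like a LAION shard):

```python
import re
from collections import Counter

SPATIAL = ["under", "below", "on top of", "behind", "inside",
           "to the left of", "to the right of"]

def spatial_counts(captions):
    """Count captions mentioning each explicit spatial relation."""
    counts = Counter()
    for cap in captions:
        low = cap.lower()
        for phrase in SPATIAL:
            if re.search(rf"\b{re.escape(phrase)}\b", low):
                counts[phrase] += 1
    return counts

# Toy stand-ins for scraped alt-text; note how little spatial language
# typical web captions contain.
captions = [
    "a cat sitting on top of a bookshelf",
    "red dress, size M, free shipping",
    "sunset over the mountains",
]
print(spatial_counts(captions))  # Counter({'on top of': 1})
```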
"a photo of 6 kittens sitting on a wooden floor. Each kitten is clearly visible. No weird stuff."
Like, let's start with the fact that there are 7 of them (2 of the 4 images from the prompt had 7 kittens). Now let's continue on with how awful they look.
The difference in image quality between DALL-E 2 asked for a single subject and DALL-E 2 asked for a group of stuff is startling.
And it's obvious, if you know how the tech works, why this is the case.
https://i.imgur.com/Oq62gQI.png
AI will never understand the actual context because not everything we feel/experience can be captured and communicated to a machine. For example, human language is incomplete and doesn’t encode every information because it doesn’t need to when used with other humans.
I think it’s a romantic notion to imagine that AI will not be a Chinese room.
Even human intelligence feels like a Chinese room. It's especially noticeable when using complicated devices like flight controls. I've been playing the MSFT Flight Simulator, and I don't fully understand the relationship between the different instruments. But I can still fly planes (virtually).
We’d be better off if we considered AI similar to an appliance like a microwave or a refrigerator. Does a fridge need to understand or taste what’s inside it to be helpful?
GPT/DALL-E/etc... All of these models are of course gluing things together in some manner, but who cares? That's the point, right? The AI pill I've taken is that you don't need AGI in order to make things that are useful for people and businesses. If you've ever run a business and had to dive into creatives for blogs, SEO content, social media posts, etc., then you spent an inordinate amount of time creating it or outsourced it, and in both cases the final copy is NOT going to win a literary prize any time soon, but it is absolutely enough to inform potential customers, start ranking on Google and start gaining social media clout. GPT will also not garner you a literary award, but it can absolutely get you quality copy, and users, customers, Google, and Facebook users will be none the wiser that you generated it with AI instead of paying a third party to hack it together for you.
(https://neuralmates.com/ I recently started putting together a web app to MVP this, and I hope to be able to integrate DALLE-2 soon to be able to start generating images for folks as well.)
Happens to me too -- it's a great way to make new things. However, the "creation" I'd argue happens when you look at the pile of random stuff and generate a new understanding, and decide that it is valuable. The difference between trash and art only exists in the head of the artist. Same thing happens with DALL-E output, really.
How does one define “understand their relationships”?
To me it is a matter of degrees and has multiple axes.
When my 6yo son draws a chair, it’s not the same as when Van Gogh draws one, which is different to when an expert furniture designer draws one. They all “understand” in different ways.
A machine can also “understand”. It might do it in different degrees and across different axes than the ones humans usually have, that’s all. How we transform that understanding into action is what is important, I think.
My intuition is that DALL-E is more a demonstration of how hard image synthesis is for humans, than how intelligent the algorithm is. The image generation models have orders of magnitude fewer parameters than the large language models.
Blake Lemoine claimed that Google's chatbot was sentient, which I disagreed with, and this article demonstrates why. AI can be optimized to respond in a way that can easily fool someone into thinking they are talking to a human, but at the end of the day sentience requires consciousness, and that is not something that can be digitally produced.
You can teach a parrot to respond to basic arithmetic, but it is not aware of the concept of math; rather, it is acting along pathways set to induce the desired response.
A truly conscious entity would simply have a mind of its own and would not do our bidding, just like any other human. It would be extremely selfish and apathetic; the idea that a bunch of GPUs sitting in a datacenter is sentient is sheer lunacy.
Blake Lemoine will not be the last; there are always those who seek the limelight with zero regard for authenticity. Such is sentient behavior.
I discovered something like this recently when I tried the prompt "man throwing his smartphone into a river," and for the life of me I could not get DALL-E to render the phone separated from the hand (I tried "like a boomerang," "tossing," "into an ocean," "like a baseball," etc). And then it occurred to me that by the training data, there are virtually no pictures of a person and a phone where the phone is separated! So DALL-E might have thought that the phone was just an appendage to the body, the way the hand is (which, what does this say about society!). I might as well have asked DALL-E to render someone throwing their elbow into a river.
Another interesting case is animal-on-animal interactions. A prompt like, "small french bulldog confronts a deer in the woods" often yields weird things like the bulldog donning antlers! As far as the algorithm is concerned, it sees a bulldog, ticking the box for it, and it sees the antlers, ticking the box for "deer." The semantics don't seem to be fully formed.
I dunno man, I punched that exact prompt ("man throwing his smartphone into a river") into DALL-E 2 just now, and in 2/4 samples the smartphone is clearly separate from the hand: labs.openai.com/s/uIldzs2efWWnm3i9XjsHI7 or labs.openai.com/s/jSk4qhAxSiL7QJo7zeGp6m9f
> The semantics don't seem to be fully formed.
Yes, not so much 'formed' as 'formed and then scrambled'. This is due to unCLIP, as clearly documented in the DALL-E 2 paper, and even clearer when you contrast it with the GLIDE paper (which DALL-E 2 is based on) or Imagen or Parti. Injecting the contrastive embedding to override a regular embedding trades off visual creativity/diversity against semantics, so if you insist on exact semantics, DALL-E 2 samples are only a lower bound on what the model can do. It does a reasonable job, better than many systems up until like last year, but not as good as it could be if you weren't forced to use unCLIP. You're only seeing what it can do after being scrambled through unCLIP. (This is why Imagen or Parti can accurately pull off what feel like absurdly complex descriptions - seriously, look at the examples in their papers! - but people also tend to describe them as 'bland'.)
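Schematically, the difference looks like this (stub nn.Modules standing in for the real networks; none of this is OpenAI's actual code):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):  # stands in for the frozen text tower
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(512, 512)
    def forward(self, t):
        return self.proj(t)

class Prior(nn.Module):  # unCLIP's extra step: text emb -> CLIP image emb
    def __init__(self):
        super().__init__()
        self.map = nn.Linear(512, 512)
    def forward(self, e):
        return self.map(e)

class Decoder(nn.Module):  # diffusion decoder, drastically simplified
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(512, 3 * 64 * 64)
    def forward(self, cond):
        return self.out(cond).view(-1, 3, 64, 64)

text_emb = torch.randn(1, 512)  # pretend this encodes the prompt

# GLIDE / Imagen / Parti style: the decoder conditions on the text embedding
# itself, so fine-grained semantics (counts, spatial relations) survive.
img_direct = Decoder()(TextEncoder()(text_emb))

# DALL-E 2 (unCLIP) style: the prior first maps the text embedding into CLIP
# image-embedding space. The contrastive embedding keeps the gist ("bulldog",
# "deer", "antlers") but loosens the bindings between them - "formed and
# then scrambled".
img_unclip = Decoder()(Prior()(TextEncoder()(text_emb)))
```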
Why is there this obsession with systems or algorithms having "understanding"? No one thinks these things have internal states equivalent to "understanding". "Understanding" or not, you can't deny the capability of these systems.
I don't see how the 'understanding of relationships' should be taken as the key intent of DALL-E 2.
Consider procedural generation: it can create abstractions of either utter beauty or utter garbage without understanding context. You need to guide it towards something meaningful.
Just the fact that DALL-E can 'glue things together' without the need for human inspiration - yet in a way where its output and intent can be understood by a human appraising it - is not only a feat in itself, but I would say its key feature.
Does anyone actually believe DALL-E "understands" what it's doing? For any reasonable definition of "understands", I assume most people would be skeptical.
So if we go with that, then yes, it just glues things together without understanding their relationship. I'd just be tempted to say it doesn't really matter that it doesn't understand, except maybe for some philosophical questions. It's still incredible based on its output.