I don't know if I would call it a "language" (which to me implies grammar and some level of composability), but I suspect most people have these hidden subconscious languages too.
Think about the "bouba"/"kiki" effect [1], or how the spells in Harry Potter sound recognizably "spell-like" even though spells aren't a real thing. (In the latter case, it's because they're phonetically Latin-adjacent.)
For example, here are some nonsense terms:
* Swith'eil Aerveid
* Karflon B43
* Hooboo skramp
If I were to ask you all which of these is a dangerous gas, which a sex act, and which an Elven city, I suspect there would be fairly wide agreement. Inventing words that are evocative in this way is a fundamental part of fiction writing, especially in fantasy, horror, and sci-fi. There's nothing profound going on here; it's just that language recognition is fuzzy and highly associative, even at the level of individual phonemes.
[1]: https://en.wikipedia.org/wiki/Bouba/kiki_effect
Immediately after realizing which term is which, I can't help but try to use them all in a sentence together, like "Those damn kids in Swith'eil Aerveid just sit around all day hooboo skramping and smoking their Karflon B43", which I think makes it even clearer how different the words "feel" from each other.
Swith'eil Aerveid, which I took to be the Elven city, immediately made me think of Morrowind. I think it's a perfectly fitting name for a Dunmer or Dwemer city. Maybe a dwarven ruin.
Also, there is evidence from studies of how bilingual people and polyglots speak, code-switch, translate, etc., that there might be some hidden meta-language in humans as well.
It seems obvious to me that something like this exists, shared across all Homo sapiens. What we now see as complex language, allowing for the broadcast of abstract thoughts and memes, must have co-evolved with human culture over a period of time that would appear long to us but is meaningless in terms of genetic drift and "meat evolution". But for that process to begin, there must have been an extremely basic structure to build on, which by definition must be in our DNA.
Everybody even has a non-verbal language they are mostly unaware of. It uses feelings with specific meanings, not words. You'll notice it the next time you know exactly what you want to say but can't recall a proper word in any of the languages you speak.
Yeah, that reminded me of the Rorschach test and other projective tests. That thread was an example of how people benefit from general education and erudition… and of how people need a bit more humility before putting forward grandiose claims.
The tweet is in response to a preliminary paper [1] [2] studying text found in images generated by prompts like "Two whales talking about food, with subtitles." DALL-E doesn't generate meaningful text strings in the images, but if you feed the gibberish text it produces ("Wa ch zod ahaakes rea.") back into the system as a prompt, you get semantically meaningful images, e.g., pictures of fish and shrimp.
[1] https://giannisdaras.github.io/publications/Discovering_the_...
[2] https://twitter.com/giannis_daras/status/1531693093040230402
I think the tweeter is being a bit too pedantic. Personally, after seeing this paper, I spent some time thinking about embeddings, manifolds, the structure of language, scientific naming, and what the decodings of points near the centers of clusters in embedding space look like (archetypes). I think making networks and asking them to explain themselves using their own capabilities is a wonderful idea that will turn out to be a fruitful area of research in its own right.
Given that DALL-E is a giant matrix multiplication that correlates fuzzy concepts in text to fuzzy concepts in images, wouldn't one expect hotspots of (to us) nonsensical correlations, e.g., between "apoploe vesrreaitais" and "bird"? Intuitively, it feels like an aspect of the no-free-lunch theorem.
Exactly this. At a high level, DALL-E maps text to a (continuous) matrix and then maps that matrix to an image (another matrix). All text inputs will map to _something_. DALL-E doesn't care whether a given mapping makes sense; it has been trained to produce high-quality outputs, not to ensure the validity of mappings.
None of this makes DALL-E any less impressive to me. High quality image generation is a truly amazing result. Results from foundational models (GPT-3, PaLM, DALL-E, etc) are so impressive that they're forcing us to reconsider the nature of intelligence and raise the bar. That's a sign of a job well done to me.
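A toy sketch of the "everything maps to something" point (everything here is made up: the hash-based embedding, the four concepts; a real model's encoder is learned, not hashed):

    import hashlib
    import numpy as np

    rng = np.random.default_rng(0)
    DIM = 64

    # Made-up stand-ins for learned concept embeddings.
    concepts = {name: rng.normal(size=DIM) for name in ["bird", "fish", "dog", "car"]}

    def toy_text_embedding(text: str) -> np.ndarray:
        # Every string maps deterministically to *some* vector; there is no null path.
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
        return np.random.default_rng(seed).normal(size=DIM)

    def nearest_concept(text: str) -> str:
        v = toy_text_embedding(text)
        # A cosine-similarity winner always exists, meaningful prompt or not.
        sims = {n: float(v @ c) / (np.linalg.norm(v) * np.linalg.norm(c))
                for n, c in concepts.items()}
        return max(sims, key=sims.get)

    print(nearest_concept("a bird on a branch"))    # lands on some concept
    print(nearest_concept("apoploe vesrreaitais"))  # ...and so does gibberish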
It makes sense that it would have weird connections, but the big claim here is that it outputs those connections as rendered text despite failing to output the actual text it was trained on and prompted with. That sounds very unexpected to me and would require a lot of evidence (which would be easy to cherry-pick), though this debunking wasn't convincing either.
Yeah. The problem here is that the network only has room for concepts, and hasn't been trained to see meaningless crap. Nor does it really have any way to respond with "This isn't a sentence I know", it just has to come up with an image that best matches whatever prompt it has been fed.
This feels like one of those topics where you'd really want a linguist: someone who understands the construction and evolution of language well enough to observe some of the underlying reasons why language is constructed the way it is. Because I guess that's partly what DALL-E is: it's trying to approximate that, and the interesting thing would be where it differs from real language rather than where it matches it. If I give it a made-up word that looks like a Latin phrase for a species of bird, then its behaving as if I'd given it a Latin phrase for a species of bird is pretty reasonable. If you said "Homo heidelbergensis" to me, I wouldn't know it was a species of prehistoric human, but I would feel pretty comfortable making that kind of leap.
I also think you could probably hire a team of linguists pretty cheap compared to a team of AI engineers.
I don't think this is related to language at all. First, let's ask: is there a way for DALL-E to refuse an output (as in, "this makes no sense")? Then, what would we expect the output for gibberish to look like? Isn't it still subject to filtering for the best "clarity" and the strongest signals? While I don't think these are collisions in the traditional sense of a hash collision, any input must produce a signal, as there is no null path, and what we see is in a sense the result of "collisions" with "legitimate" paths. Still, this may tell us something about the inner structure.
Also, there is no way for vocabulary to exist on its own without grammar, as these are two sides of the phenomenon we call language. Some signs of grammar had to emerge together with it, all at once. However…
----
Edit: Let's imagine a typical movie scene. Our nondescript individual points at himself and utters "Atuk" (yes, Ringo Starr!) and then points at his counterpart in this conversation, who utters "Carl Benjamin von Richterslohe". This involves quite an elaborate system of grammar, where we already know that we're asking for a designator, that this is not the designator for the act of pointing, and that by decidedly pointing at a specific object, we'd ask for a specific designator not a general one. Then, C.B. von Richterslohe, our fearless explorer, waves his hand over the backdrop of the jungle, asking for "blittiri" in an attempt to verify that this means "bird", for which Atuk readily points out a monkey. – While only nouns have been exchanged, there's a ton of grammar in this.
And we haven't even arrived at things like "a monkey sitting at the foot of a tree". That is mostly about the horizontal and vertical axes of grammar: the axes along which we align things and substitute one thing for another in a given position, which is ultimately what provides them with meaning (by way of which combinations and substitutions are legitimate and which are not).
Now, in light of this, the fact that specific compounds change their alleged "meaning" radically when aligned (composed) doesn't allow for high hopes of this being language.
I believe the original claim has substance behind it, and it was a very interesting, non-trivial observation.
Also, it proposes a very exciting emergent phenomenon that, if understood, could have deep consequences for our understanding of knowledge, language, and a lot of other related topics.
So his argument is that the text clearly maps to concepts in the latent space, but when composing them the results are unexpected, so it isn't language? Why isn't this better described as 'the rules of composition are unknown'?
These conversations so routinely devolve into crowdsourced attempts to define notoriously tricky words like "language" and "intelligence".
These absurdly big, semi-supervised transformers are predicting what the next pixel or word or Atari move is. They’re strikingly good at it. To accomplish this they build up a latent space where all the pictures of sunglasses and the word “shades” are cosine similar, and quite different to “dog” or a picture of a dog, and have an operator (in word2vec, addition, in DALL-E, something nonlinear) that can put sunglasses on a dog.
Is that latent space and all the embeddings into it a “language”? Who cares? It works and it’s fucking cool.
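That word2vec operator is easy to poke at directly. A minimal sketch using gensim's pretrained GloVe vectors (one small public checkpoint; exact numbers will vary by model):

    # pip install gensim
    import gensim.downloader as api

    # Small pretrained GloVe word vectors (downloaded on first use).
    kv = api.load("glove-wiki-gigaword-50")

    # Cosine similarity: "shades" should sit nearer "sunglasses" than "dog".
    print(kv.similarity("shades", "sunglasses"))
    print(kv.similarity("shades", "dog"))

    # The additive operator: king - man + woman lands near "queen".
    print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))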
It acts like a reverse Rorschach test: hand the subject a nonsensical picture and demand a caption. If you set the task to generate something no matter what, you get something no matter what.
It is trivial to make it reject gibberish prompts. Just use a generative model to estimate the probability of the input, it's what language models do by definition.
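A minimal sketch of that gate, assuming GPT-2 via Hugging Face transformers as the scorer and an entirely arbitrary perplexity threshold:

    # pip install torch transformers
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels == inputs, the loss is the mean per-token negative log-likelihood.
            loss = lm(input_ids=ids, labels=ids).loss
        return torch.exp(loss).item()

    THRESHOLD = 500.0  # arbitrary; would need tuning on real prompts
    for prompt in ["Two whales talking about food", "Apoploe vesrreaitais"]:
        ppl = perplexity(prompt)
        print(prompt, round(ppl, 1), "reject" if ppl > THRESHOLD else "accept")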
Is this (and the previous tweet) an ML-guys' discussion? My layman understanding of neural networks is that the core operation is basically kicking a figure down a hill and seeing where it ends up, except both the figure and the hill are N-dimensional objects, where N is too huge to comprehend. Of course some nonsensical figures end up at valid locations, but can you really expect some stable inner structure from the hill-figure interaction? I think it's unlikely that there is a place in the learning method to produce one. NNs can give interesting results, but they don't magically rewrite their own design yet.
It would still be interesting to see how the output changes with small changes to these inputs. If my vague understanding is at all close, this would reveal the "faces" that are noisier than the others. Not sure what that would give us, though.
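For what it's worth, the "kick a figure down the hill" picture is roughly gradient descent. A toy version (the quadratic "hill" below is invented, and N is tiny compared to real models):

    import numpy as np

    rng = np.random.default_rng(1)
    N = 100  # real models: too huge to comprehend

    # A fixed random quadratic bowl stands in for the "hill".
    A = rng.normal(size=(N, N))
    H = A.T @ A / N  # positive semi-definite curvature

    x = rng.normal(size=N)  # the "figure" we kick down the hill
    for _ in range(500):
        x -= 0.1 * (H @ x)  # step along the downhill gradient

    print(float(x @ H @ x))  # "height": close to the bottom of the bowl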
The tweet is wrong, and this is important. There's a difference between DALL-E and DALL-E 2. This phenomenon is intuitive if you know how diffusion models (DALL-E 2) work.
1. Tokenized text embeddings can map to similar points in latent space. The text encoder is autoregressive, so this won't work for every random sequence of tokens, but it can work for the right ones. I wonder if anyone has tried reverse-decoding the embeddings of interest to see if they cluster around known, relevant words.
2. Diffusion models are trained by pushing off-manifold points onto the manifold, so to speak. It is not surprising that off-manifold points map onto known concepts during the reversal process.
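On the reverse-decoding question in point 1, one cheap probe is to embed the gibberish with a public CLIP text encoder (a stand-in here; DALL-E 2's own encoder isn't public) and rank it against a small, hand-picked vocabulary:

    # pip install torch transformers
    import torch
    from transformers import CLIPModel, CLIPProcessor

    name = "openai/clip-vit-base-patch32"
    model = CLIPModel.from_pretrained(name).eval()
    proc = CLIPProcessor.from_pretrained(name)

    def embed(texts):
        inputs = proc(text=texts, return_tensors="pt", padding=True)
        with torch.no_grad():
            feats = model.get_text_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)  # unit vectors for cosine

    vocab = ["bird", "insect", "fish", "fruit", "plane"]  # tiny probe vocabulary
    sims = (embed(["Apoploe vesrreaitais"]) @ embed(vocab).T).squeeze(0)
    for word, sim in sorted(zip(vocab, sims.tolist()), key=lambda t: -t[1]):
        print(word, round(sim, 3))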
Too pedantic - still obviously something interesting going on here and I don't find myself convinced otherwise just because the original claim isn't as clean as initially presented.
IMO these words were part of some training images (e.g. taken from nature atlases) and DALL-E learned to associate them with birds, although in gibberish form.
There's some form of language here... the correlations are evidence enough. The grammar, I believe, is complex and likely not human grammar; thus certain words, when paired with other words, can negate the meaning of a word altogether or even completely change it.
For example, "hedge" combined with "hog" is neither a "hedge" nor a "hog", nor some sort of horrific hybrid mixture of hedges and hogs. A hedgehog is a small spiny mammal. Most likely this is what's going on here.
The domain is almost infinite. And the range is even greater. Thus it's actually realistic to say that there must be hundreds of input and output sets that form alternative languages.
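That hedgehog example is checkable with ordinary word vectors. A sketch with gensim's GloVe download (assuming all three words are in its vocabulary), comparing naive addition of "hedge" and "hog" against "hedgehog"'s actual neighbours:

    # pip install gensim
    import numpy as np
    import gensim.downloader as api

    kv = api.load("glove-wiki-gigaword-50")
    hedge, hog, hedgehog = kv["hedge"], kv["hog"], kv["hedgehog"]

    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos(hedge + hog, hedgehog))                      # naive composition
    print(kv.most_similar(positive=["hedgehog"], topn=3))  # actual neighbours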
That Twitter thread made me MORE of a believer that DALL-E has a language of its own. As others said, it seems like the argument is more about defining "language".
It's not about defining language; it's about refuting the original claim, which was that piping these symbols _back in_ could trigger similar semantic categories - which it clearly can't.
A bunch of people didn't read the original study and just saw the pictures and assumed the gibberish is the only result being discussed.
dang|3 years ago
DALL-E 2 has a secret language - https://news.ycombinator.com/item?id=31573282 - May 2022 (109 comments)
beepbooptheory|3 years ago
https://www.bartleby.com/284/1.html
oneoff786|3 years ago
Abra kedabra
numpad0|3 years ago
“Watch those sea creatures.”?
tbalsam|3 years ago
This is by far the worst.
mjburgess|3 years ago
I don't think the latter have much interesting to say about the former; but, having done no research, they think they do.
skybrian|3 years ago
Also, I'm wondering if there is some way that these models could have a decent error response rather than responding to every input?
muzani|3 years ago
> Tries a lot of prompts that generate things to a common theme.
> "To me this is all starting to look a lot more like stochastic, random noise, than a secret DALL-E language."
Some of the whale dialogue is clearly transcribable, but he keeps regenerating until he gets "Evve waeles" and answers that resemble "Wales".
belter|3 years ago
https://nitter.net/benjamin_hilton/status/153178089297217536...