There's actually a somewhat reasonable analogy to human cognitive processes here, I think, in the sense that humans tend to form concepts defined by their connectivity to other concepts (cf. Ferdinand de Saussure and structuralism).
Human brains are also a "black box" in the sense that you can't scan/dissect one to build a concept graph.
Neural nets do seem to have some sort of emergent structural concept graph; in the case of LLMs it's largely informed by human language (because that's what they're trained on). To an extent, we can observe this empirically through their output even if the first principles are opaque.
> Neural nets do seem to have some sort of emergent structural concept graph; in the case of LLMs it's largely informed by human language (because that's what they're trained on). To an extent, we can observe this empirically through their output even if the first principles are opaque.
Alternatively, what you're seeing are the structures inherent within human culture as manifested through its literature[1], with LLMs simply being a new and useful tool which makes these structures more apparent.

[1] And also its engineers' training choices
This gets at the fundamental issue I have with AI: We're trying to get machines to think, but we only have the barest understanding of how we, as humans, think. The concept of a "neuron" that can activate other neurons comes straight from real neurology, but given how complex the human brain is, it's no surprise that we can only create something that is fundamentally "lesser."
However, I think that real neurology and machine-learning can be mutually reinforcing fields: structures discovered in the one could be applied to the other, and vice versa. But thinking we can create "AGI" without first increasing our understanding of "wet" neural nets is the height of hubris.
I just skimmed through it for now, but it has seemed kinda natural to me for a few months now that there would be a deep connection between neural networks and differential or algebraic geometry.
Each ReLU layer is just a (quasi-)linear transformation, and a pass through two layers is basically also a linear transformation. If you say you want some piece of information to stay (numerically) intact as it passes through the network, you say you want that piece of information to be processed in the same way in each layer. The groups of linear transformations that "all process information in the same way, and their compositions do, as well" are basically the Lie groups. Anyone else ever had this thought?
I imagine if nothing catastrophic happens we'll have a really beautiful theory of all this someday, which I won't create, but maybe I'll be able to understand it after a lot of hard work.

A possibly relevant paper: https://openreview.net/forum?id=Ag8HcNFfDsg
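The piecewise-linear claim above can be poked at directly: within a region of input space where the ReLU sign pattern doesn't change, a stack of ReLU layers collapses to a single linear map. A toy sketch in plain Python (the weights are made-up values, purely for illustration):

```python
# Two-layer ReLU net with no biases; weights are arbitrary toy values.
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, -0.5], [0.3, 0.8]]
W2 = [[0.6, 0.1], [-0.2, 0.9]]

def net(x):
    return matvec(W2, relu(matvec(W1, x)))

def pattern(x):
    # Which hidden units are active, i.e. which ReLU "region" x falls in.
    return [h > 0 for h in matvec(W1, x)]

x, y = [1.0, 0.2], [1.1, 0.25]
s = [xi + yi for xi, yi in zip(x, y)]
assert pattern(x) == pattern(y) == pattern(s)

# Inside one region the ReLU is the identity on the active units,
# so the whole net acts as the single linear map W2 @ W1:
lhs = net(s)
rhs = [a + b for a, b in zip(net(x), net(y))]
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))
```

Cross a region boundary (flip a hidden unit's sign) and additivity breaks, which is exactly the non-linearity a sibling comment points out.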
Everything is something. The question is what this nomenclature gymnastics buys you. Unless you can answer that, this is no different than claiming neural networks are a projection of my soul.
> deep connection between neural networks and differential or algebraic geometry
I disagree with how you came to this conclusion (because it ignores the non-linearity of neural networks), but this is pretty true. Look up gauge-invariant neural networks.
Bruna et al.'s Mathematics of Deep Learning course might also be interesting to you.
The idea of a smaller NN outperforming what you might think it could do by simulating a larger one reminds me of something I read about Portia spiders (going on a deep dive into them after reading the excellent 'Children of Time' by Adrian Tchaikovsky). The idea is that they're able to do things with their handful - on the order of tens of thousands - of neurons that you'd think would require 4 or 5 orders of magnitude more, by basically time-sharing them; do some computation, store it somehow, then reuse the same neurons in a totally different way.
Isn't this also (just?) a description of how high-dimensional embedding spaces work? Putting every kind of concept all in the same space is going to lead to some weird stuff. Different regions of the latent space will cover different concepts, with very uneven volumes, and local distances will generally be meaningful (red vs. green) but long distances won't (red vs. ennui).
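The local-vs-global point can be made concrete with cosine similarity. The vectors below are entirely made up for illustration, not real embeddings:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 4-d "embeddings": red and green share color-like
# structure; ennui lives in a different part of the space.
red   = [0.9, 0.8, 0.1, 0.0]
green = [0.8, 0.9, 0.0, 0.1]
ennui = [0.0, 0.1, 0.9, 0.8]

assert cosine(red, green) > 0.9   # nearby: the comparison is meaningful
assert cosine(red, ennui) < 0.3   # far apart: the number says little
```

In a real model the same pattern holds at vastly higher dimension, where "far" regions of the space encode concepts with no meaningful axis of comparison.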
I guess we could also look at it the other way; embedding spaces work this way because the underlying neurons work this way.
I had the same feeling, I have always seen it like:
1. Neurons encode a concept and activate when it shows up.
2. No, it's way more complicated and mysterious than that.
And now this seems to add:
3. Actually, it's only more complicated than that in a fairly straightforward mathematical sense. It's not that mysterious at all.
I suspect this means that either I'm not picking up on subtleties in the article, or Scott is representing it in a way that slightly oversimplified the situation!
On the other hand, the last quote in the article from the researchers does seem to be hitting the "it's not that mysterious" note. A simple matter of very hard engineering. So, I dunno. Cool!
Before finishing my read, I need to register an objection to the opening, which reads to me as implying it is the only means:
> Researchers simulate a weird type of pseudo-neural-tissue, “reward” it a little every time it becomes a little more like the AI they want, and eventually it becomes the AI they want.
This isn't the only way. Backpropagation is a hack around the oversimplification of neural models. By adding a sense of location into the network, you get linearly non-separable functions learned just fine.
Hopfield networks with Hebbian learning are sufficient and are implemented by the existing proofs of concept we have.
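For readers unfamiliar with the reference: a Hopfield network stores patterns with a purely local Hebbian rule (no backward pass at all) and retrieves them by settling into an attractor. A minimal sketch, not the poster's specific proof-of-concept:

```python
# Hopfield network: Hebbian (outer-product) storage, attractor recall.
def hebbian_weights(patterns):
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    W[i][j] += p[i] * p[j] / len(patterns)
    return W

def recall(W, state, sweeps=5):
    s = list(state)
    for _ in range(sweeps):
        for i in range(len(s)):  # asynchronous unit updates
            h = sum(W[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if h >= 0 else -1
    return s

stored = [1, -1, 1, -1, 1, -1, 1, -1]
W = hebbian_weights([stored])
noisy = [-1] + stored[1:]          # corrupt the first unit
assert recall(W, noisy) == stored  # the attractor restores the pattern
```

Note that learning here is purely local: each weight update uses only the two units it connects, with no error signal propagated backward through the network.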
This is true. We use backpropagation not because it’s the only way or because it’s biologically plausible (the brain doesn’t have any backward passes) but because it works. Neural networks aren’t special because of any sort of connection to the brain, we use them because we have hardware (GPUs) which can train them pretty quickly.
I feel the same way about transformers vs RNNs: even if RNNs are more “correct” in some sense of having theoretically infinite memory, it takes forever to train them, so transformers won. And then we developed techniques like LongLoRA which make theoretical disadvantages functionally irrelevant.
I think it would be really exciting if somebody could show that ANNs that more resembled biological neurons could learn function approximation as well as (or better than!) current DNNs. However, my understanding of math and engineering suggests that for the time being, the mechanisms we currently use and invest so much time and effort into will exceed more biologically inspired neurons, for utterly banal reasons.
At least the first part reminded me of Hyperion and how AIs evolved there (I think the actual explanation is in The Fall of Hyperion), smaller but more interconnected "code".
Not sure about the actual implementation, but at least for us, concepts or words are neither pure nor isolated; they have multiple meanings that collapse into specific ones as you put several together.
> No one knows how it works. Researchers simulate a weird type of pseudo-neural-tissue, “reward” it a little every time it becomes a little more like the AI they want, and eventually it becomes the AI they want.
There is a distinction to be made in "knowing how it works" between the architecture and the weights themselves.
If you have a background in ML, then yes, a paper is almost always better. I've recommended papers over sources like Towards Data Science etc. here. However, for laypeople, I doubt it would be as effective - they'd need to look up terms like MLP, ReLU, UMAP, Logit, or even what an activation function is, and they are the target audience of this post.
>By the same token, thinking in memes all the time may be a form of impoverished cognition.
I would recast this: any thinking is a linear superposition of weighted tropes. If you read TVTropes enough you'll start to realize that the site doesn't just describe TV plots, but basically all human interaction and thought, nicely clustered into nearly orthogonal topics. Almost anything you can say can be expressed by taking a few tropes and combining them with weights.

Or, is it enhanced cognition, on the part of the interpreter having to unpack much from little?
As described in the post, this seems quite analogous to the operation of a bloom filter, except each "bit" is more than a single bit's worth of information, and the match detection has to do some thresholding/ranking to select a winner.
That said, the post is itself clearly summarizing much more technical work, so my analogy is resting on shaky ground.
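For concreteness, here is the Bloom filter being analogized: k hash functions each set one bit, and a membership test requires all k bits, so independent entries overlap in shared storage much the way superposed features share neurons. A minimal sketch:

```python
import hashlib

class BloomFilter:
    """k hash functions set k bits; a query must find all k bits set."""
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # No false negatives; false positives occur when unrelated items
        # happen to cover all k positions -- the "interference" that
        # superposition also has to tolerate.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("monosemantic")
assert bf.might_contain("monosemantic")
```

The analogy in the comment swaps the single bits for richer values and the all-bits-set test for thresholding/ranking, but the shared-storage-with-interference shape is the same.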
All this anthropomorphizing of activation networks strikes me as very odd. None of these neurons "want" to do anything. They respond to specific input. Maybe humans are the same, but in the case of artificial neural networks we at least know it's a simple mathematical function. Also, an artificial neuron is nothing like a biological neuron. At the most basic -- artificial neurons don't "fire" except in direct response to inputs. Biological neurons fire because of their internal state, state which is modified by biological signaling chemicals. It's like comparing apples to gorillas.
>None of these neurons "want" to do anything. They respond to specific input.
Yes well, your neurons don't "want" to do anything either.
>Maybe humans are the same, but in the case of artificial neural networks we at least know it's a simple mathematical function
So what, magic? A soul? If the brain is computing, then the substrate is entirely irrelevant: silicon, biology, pulleys and gears can all be arranged to make the same or similar computations. If you genuinely believe the latter, that's fine. The point is that "simple" mathematical function is kind of irrelevant. Either the brain computes, and any substrate is fine, or it doesn't.
>Also, an artificial neuron is nothing like a biological neuron.
They're not the same but "nothing like" is pushing it a lot. They're inspired by biological neurons and the only reason modern NNs aren't closer to their biological counterparts is because they genuinely suffer for it, not because we can't.
>Biological neurons fire because of their internal state, state which is modified by biological signaling chemicals
Brains aren't breaking causality. They fire because of input.
Neurons also don't "respond" to specific input either. They can't speak or provide an answer to your input.
These are all just abstract metaphors and analogies. Literally everything in computer science at some point or another is an abstract metaphor or analogy.
When you look up the definition and etymology of "input", it says to "put on" or "impose" or "feed data into the machine". We're not literally feeding the machine data, it doesn't eat the data and subsist on it.
You could go on and on and nitpick every single one of these, and I don't think the use of "want" (i.e. anthropomorphizing the networks to have intent) is all that bad.
Wait what are you referring to specifically? Any anthropomorphism in the article is _clearly_, clearly the author's admitted simplification due to the incredible density of the subject matter.
Given that, I honestly can't find anything too upsetting.
In any case, anthropomorphism is something I don't mind, mostly. Is it misleading? For the layman. But the domain is one of modeling intelligence itself and there are many instances where an existing definition simply makes sense. This happens in lots of fields and causes similar amounts of frustration in those fields. So it goes.

Anyways, this is Scott’s writing style. I recall an earlier ACX post on alignment that was real heavy on ascribing desires and goals to AI models.
Do you mean that artificial neurons are inherently passive while biological neurons are inherently active i.e. they would act in spite of external input?
Just wondering if I understood you, I don't know anything on the subject.
Can you be more specific about what particular anthropomorphizing you object to? The only place the author uses the word want is in describing the wants of humans.
Wait till you find out how we talk about evolutionary adaptation... a couple analogies and elided concepts here and there and you'd swear Lamarck had smothered Darwin in his cradle
All this anthropomorphizing of humans strikes me as very odd. None of these humans "want" to do anything. They respond to specific input. Maybe artificial neural networks are the same, but in the case of humans we at least know it's a simple reaction to neurotransmitter signals.
> Shouldn’t the AI be keeping the concept of God, Almighty Creator and Lord of the Universe, separate from God-
This seems wrong. God-zilla is using the concept of God as a superlative modifier. I would expect a neuron involved in the concept of godhood to activate whenever any metaphorical "god-of-X" concept is being used.
I mean, it's not actually. It's just a somewhat unusual transcription (well, originally somewhat unusual, now obviously it's the official English name) of what might be more usually transcribed as "Gojira".
I feel like we're a few more paradigm shifts away from self-driving cars, and this is one of them - being able to actually understand neural nets and modify them in a constructive way more directly - aka engineering.
Some more:
cheaper sensors (happening now)
better sensor integration (happening now, kind of)
better tools for ml grokking and intermediate engineering (this article, kind of)
better tools for layering ml (probably the same thing as above)
a new model for insurance/responsibility/something like this (unsure)
better communication with people inside and outside the car (barely on the radar)
This reminds me of research done on category-specific semantic deficits, where neurodegeneration can impact highly specific or related knowledge (for example, brain trauma that affects a person's ability to understand living things like zebras and carrots, but not non-living things like helicopters or pliers).

https://academic.oup.com/brain/article/130/4/1127/278057
When LLMs are trained on text, are the words annotated to indicate the semantic meaning, or is the LLM training process expected to disambiguate the possibly hundreds of semantic meanings of an individual common word such as "run"?
The task LLMs are trained on is "predict the next word", which elegantly is included for free in your training set of text. Typically no annotation is provided, since that would involve a ton of human labor doing the annotations.
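To make the "labels for free" point concrete, here's a toy sketch of how (context, next word) training pairs fall directly out of raw text, with no sense annotation anywhere; disambiguating a word like "run" is left entirely to what the model learns from context:

```python
def next_word_pairs(text):
    # Every position in raw text yields a (context, target) example.
    words = text.split()
    return [(tuple(words[:i]), words[i]) for i in range(1, len(words))]

pairs = next_word_pairs("the dogs run fast and she scored a run")

# Both senses of "run" appear as targets, distinguished only by context:
run_examples = [(ctx, w) for ctx, w in pairs if w == "run"]
assert len(run_examples) == 2
assert run_examples[0][0] == ("the", "dogs")        # verb sense
assert run_examples[1][0][-2:] == ("scored", "a")   # noun sense
```

A real training pipeline tokenizes into subwords and truncates contexts to a fixed window, but the supervision signal is exactly this: the text predicts itself.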
The original is quite long, but quite interesting.[1] Reading it makes me feel like I did reading A Brief History of Time as a middle schooler - concepts that are mainly just out of reach, with a few flashes that I actually understand.
One particularly interesting topic is the "theories of superposition" section, which gets into how LLMs categorize concepts. Are concepts all distinct or indistinct? Are they independent or do they cluster? It seems that the answer is all of the above.
This ties into linguistic theories of categorization[2] that I saw referenced in (of all places) a book about the partition of Judaeo-Christianity in the first centuries CE.
Some categories have hard lines - something is a "bird" or it is not. Some categories have soft lines - like someone being "tall." Some categories work on prototypes, making them have different intensities within the space - a sparrow, swallow, or robin is more "birdy" than a chicken, emu, or turkey. Apparently Wittgenstein, with his notion of family resemblances, was the first to really explore the idea that a category might not have hard boundaries, according to people who study these things.[3] These sorts of "manifolds" seem to appear, where some concepts are not just distinct points that are or aren't.
It's exciting to see that LLMs may give us insights into how our brains store concepts. I've heard people criticize them as "just predicting the next most likely token," but I've found myself lost when speaking in the middle of a garden path sentence many times. I don't know how a sentence will end before I start saying it, and it's certainly plausible that LLMs actually do match the way we speak.
Probably the most exciting piece is seeing how close they seem to get to mimicking how we communicate and think, while being fully limited to language with no other modeling behind it - no concept of the physical world, no understanding of counting or math, just words. It's clear when you scratch the surface that LLM outputs are bullshit with no thought underneath them, but it's amazing how much is covered by linking concepts with no logic other than how you've heard them linked before.
[1] https://transformer-circuits.pub/2023/monosemantic-features/...
[2] https://www.sciencedirect.com/science/article/abs/pii/001002...
[3] https://en.wikipedia.org/wiki/Family_resemblance