
Language models pack billions of concepts into 12k dimensions

368 points | lawrenceyan | 6 months ago | nickyoder.com | reply

142 comments

[+] cgadski|6 months ago|reply
> The implications of these geometric properties are staggering. Let's consider a simple way to estimate how many quasi-orthogonal vectors can fit in a k-dimensional space. If we define F as the degrees of freedom from orthogonality (90° - desired angle), we can approximate the number of vectors as [...]

If you're just looking at minimum angles between vectors, you're doing spherical codes. So this article is an analysis of spherical codes… that doesn't reference any work on spherical codes… seems to be written in large part by a language model… and has a bunch of basic inconsistencies that make me doubt its conclusions. For example: in the graph showing the values of C for different values of K and N, is the x axis K or N? The caption says the x axis is N, the number of vectors, but later they say the value C = 0.2 was found for "very large spaces," and in the graph we only get C = 0.2 when N = 30,000 and K = 2, that is, 30,000 vectors in two dimensions! On the other hand, if the x axis is K, then this article is extrapolating a measurement done for 2 vectors in 30,000 dimensions to the case of 10^200 vectors in 12,288 dimensions, which obviously is absurd.
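For anyone who wants to poke at this numerically, here's a quick numpy sketch (mine, not from the article): sample random unit vectors and look at the pairwise cosines. In high dimension the cosines concentrate like N(0, 1/k), which is the quasi-orthogonality everything here hinges on.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 1000, 500  # dimension, number of vectors
V = rng.standard_normal((n, k))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # project to unit sphere

# Pairwise cosine similarities, off-diagonal entries only.
cos = V @ V.T
off = cos[~np.eye(n, dtype=bool)]

# Random high-dimensional unit vectors concentrate near orthogonality:
# cosines behave like N(0, 1/k), so most angles sit within a few degrees of 90°.
print(off.std(), 1 / np.sqrt(k))  # both around 0.03
```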

I want to stay positive and friendly about people's work, but the amount of LLM-driven stuff on HN is getting really overwhelming.

[+] sdenton4|6 months ago|reply
Spherical codes are kind of obscure: I hadn't heard of them before, and Wikipedia seems to have barely heard of them. And most of the Google results seem to be about playing golf in small dimensions (i.e., how many can we optimally pack in n<32 dimensions?).

People do indeed rediscover previously existing math, especially when the old content is hidden under non-obvious jargon.

[+] jvanderbot|6 months ago|reply
The problem with saying something is LLM generated is it cannot be proven and is a less-helpful way of saying it has errors.

Pointing out the errors is a more helpful way of stating problems with the article, which you have also done.

In that particular picture, you're probably correct to interpret it as C vs N as stated.

[+] jryio|6 months ago|reply
Agreed. What writing would be better for understanding the geometric properties of information in high-dimensional vector spaces + spherical codes?
[+] mxkopy|6 months ago|reply
In the graph you’re referencing, K = 2 never reaches C = 0.2. K = 3 only reaches C = 0.3 before getting cut off.

I’m not even sure why that would be a problem, because he’s projecting N basis vectors into K dimensions and C is some measure of the error this introduces into mapping points in the N-space onto the K-space, or something. I’m glad to be shown why this is inconsistent with the graph, but your argument doesn’t talk about this idea at all.

[+] moralestapia|6 months ago|reply
Can't wait to see your post on the topic.
[+] yorwba|6 months ago|reply
I think the author is too focused on the case where all vectors are orthogonal and as a consequence overestimates the amount of error that would be acceptable in practice. The challenge isn't keeping orthogonal vectors almost orthogonal, but keeping the distance ordering between vectors that are far from orthogonal. Even much smaller values of epsilon can give you trouble there.

So the claim that "This research suggests that current embedding dimensions (1,000-20,000) provide more than adequate capacity for representing human knowledge and reasoning." is way too optimistic in my opinion.

[+] sigmoid10|6 months ago|reply
Since vectors are usually normalized to the surface of an n-sphere and the relevant distance for outputs (via loss functions) is cosine similarity, "near orthogonality" is what matters in practice. This means during training, you want to move unrelated representations on the sphere such that they become "more orthogonal" in the outputs. This works especially well since you are stuck with limited precision floating point numbers on any realistic hardware anyways.
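A toy illustration of that training pressure (my own sketch, not anything from the article or a real training loop): take two unit vectors, treat squared cosine as the loss for an "unrelated" pair, and do one gradient step followed by re-projection onto the sphere. The |cosine| drops, i.e. the pair becomes "more orthogonal".

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
u = rng.standard_normal(d); u /= np.linalg.norm(u)
v = rng.standard_normal(d); v /= np.linalg.norm(v)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Loss pushing unrelated representations toward orthogonality: L = cos(u, v)^2.
# One gradient step on v, then re-project to the unit sphere.
c = cos(u, v)
grad = 2 * c * (u - c * v)   # d/dv of cos(u, v)^2 for unit-norm u, v
v2 = v - 0.5 * grad
v2 /= np.linalg.norm(v2)

print(abs(cos(u, v)), abs(cos(u, v2)))  # second value is smaller
```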

Btw, this is not an original idea from the linked blog or the youtube video it references. The relevance of this lemma for AI (or at least neural machine learning) was brought up more than a decade ago by C. Eliasmith, as far as I know. So it has been around since long before architectures like GPT that could actually be realistically trained on such insanely high-dimensional world knowledge.

[+] westurner|6 months ago|reply
I also doubt that all vectors are orthogonal and/or independent.

Re: distance metrics and curvilinear spaces and skew coordinates: https://news.ycombinator.com/item?id=41873650 :

> How does the distance metric vary with feature order?

> Do algorithmic outputs diverge or converge given variance in sequence order of all orthogonal axes? Does it matter which order the dimensions are stated in; is the output sensitive to feature order, but does it converge regardless? [...]

>> Are the [features] described with high-dimensional spaces really all 90° geometrically orthogonal?

> If the features are not statistically independent, I don't think it's likely that they're truly orthogonal; which might not affect the utility of a distance metric that assumes that they are all orthogonal

Which statistical models disclaim that their output is unreliable if used with non-independent features? Naive Bayes, linear regression, logistic regression, LDA, PCA, and linear models in general are unreliable with non-independent features.

What are some of the hazards of L1 Lasso and L2 Ridge regularization? What are some of the worst cases with outliers? What does regularization do if applied to non-independent and/or non-orthogonal and/or non-linear data?

Impressive but probably insufficient because [non-orthogonality] cannot be so compressed.

There is also the standing question of whether there can be simultaneous encoding in a fundamental gbit.

[+] bjornsing|6 months ago|reply
I agree the OPs argument is a bad one. But I’m still optimistic about the representational capacity of those 20k dimensions.
[+] dwohnitmok|6 months ago|reply
This set of intuitions, and the Johnson-Lindenstrauss lemma in particular, is what powers a lot of the research effort behind SAEs (Sparse Autoencoders) in the field of mechanistic interpretability in AI safety.

A lot of the ideas are explored in more detail in Anthropic's 2022 paper that's one of the foundational papers in SAE research: https://transformer-circuits.pub/2022/toy_model/index.html

[+] emil-lp|6 months ago|reply
Where can I read the actual paper? Where is it published?
[+] rossant|6 months ago|reply
Tangential, but the ChatGPT vibe of most of the article is very distracting and annoying. And I say this as someone who consistently uses AI to refine my English. However, I try to avoid letting it reformulate too dramatically, asking it specifically to only fix grammar and non-idiomatic parts while keeping the tone and formulation as much as possible.

Beyond that, this mathematical observation is genuinely fascinating. It points to a crucial insight into how large language models and other AI systems function. By delving into the way high-dimensional data can be projected into lower-dimensional spaces while preserving its structure, we see a crucial mechanism that allows these models to operate efficiently and scale effectively.

[+] airstrike|6 months ago|reply
Ironically, the use of "fascinating", "crucial" and "delving" in your second paragraph, as well as its overall structure, make it read very much like it was filtered through ChatGPT
[+] GolDDranks|6 months ago|reply
Which parts felt GPT'y to you? The list-happy style?
[+] gpjanik|6 months ago|reply
Language models don't "pack concepts" into the C dimension of one layer (I guess that's where the 12k number came from), neither do they have to be orthogonal to be viewed as distinct or separate. LLMs generally aren't trained to make distinct concepts far apart in the vector space either. The whole point of dense representations is that there's no clear separation between which concept lives where. People train sparse autoencoders to work out which neurons fire based on the topics involved. Neuronpedia demonstrates it very nicely: https://www.neuronpedia.org/.
[+] sdenton4|6 months ago|reply
The sparse autoencoder work is /exactly/ premised on the kind of near-orthogonality that this article talks about. It was originally called the 'superposition hypothesis': https://transformer-circuits.pub/2022/toy_model/index.html

The SAE's job is to try to pull apart the sparse, nearly-orthogonal 'concepts' from a given embedding vector, by decomposing the dense vector into sparse activations over an over-complete basis. They tend to find that this works well, and even allows matching embedding spaces between different LLMs efficiently.
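A minimal sketch of the decomposition idea (not Anthropic's actual SAE, just the correlate-and-keep-top-k intuition it rests on): plant two features from an over-complete random dictionary into one dense vector, then recover them by correlating against every dictionary entry.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 256, 1024          # dense dimension, over-complete feature count
D = rng.standard_normal((m, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # m nearly-orthogonal unit features

# A dense activation that is secretly a sparse combination of two features.
x = 2.0 * D[3] + 1.0 * D[7]

# The crudest possible "decoder": correlate against every feature, keep the top 2.
scores = D @ x
top2 = set(np.argsort(np.abs(scores))[-2:])
print(sorted(top2))  # the two planted features dominate all 1022 others
```

Because the random features are only *nearly* orthogonal, the other 1022 scores are small but nonzero noise; that residual interference is exactly the price of superposition.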

[+] prmph|6 months ago|reply
Agreed, if you relax the requirement for perfect orthogonality, then, yes, you can pack in much more info. You basically introduced additional (fractional) dimensions clustered with the main dimensions. Put another way, many concepts are not orthogonal, but have some commonality or correlation.

So nothing earth shattering here. The article is also filled with words like "remarkable", "fascinating", "profound", etc. that make me feel like some level of subliminal manipulation is going on. Maybe some use of an LLM?

[+] aabhay|6 months ago|reply
My intuition of this problem is much simpler: assuming there's some rough hierarchy of concepts, you can guesstimate how many concepts can exist in a 12,000-d space by taking the combinatorial of the number of dimensions. In that world, each concept is mutually orthogonal with every other concept in at least some dimension. While that doesn't mean their cosine distance is large, it does mean you're guaranteed a function that can linearly separate the two concepts.

It means you get 12,000! (Factorial) concepts in the limit case, more than enough room to fit a taxonomy

[+] OgsyedIE|6 months ago|reply
You can only get 12,000! concepts if you pair each concept with an ordering of the dimensions, which models do not do. A vector in a model that has [weight_1, weight_2, ... weight_12000] is identical to the vector [weight_2, weight_1, ..., weight_12000] within the larger model.

Instead, a naive mental model of a language model is to have a positive, negative or zero trit in each axis: 3^12,000 concepts, which is a much lower number than 12000!. Then in practice, almost every vector in the model has all but a few dozen identified axes zeroed because of the limitations of training time.

[+] Morizero|6 months ago|reply
That number is far, far, far greater than the number of atoms in the universe (~10^43741 >>>>>>>> ~10^80).
[+] bjornsing|6 months ago|reply
> While that doesn’t mean their cosine distance is large

There’s a lot of devil in this detail.

[+] twotwotwo|6 months ago|reply
Sort of trivial but fun thing: you can fit billions of concepts into this much space, too. Let's say four bits of each component of the vector are important, going by how some providers do fp4 inference and it isn't entirely falling apart. So an fp4 dimension-12K vector takes up 6KB, like a few pages of UTF-8 text, more compressed text, or 3K tokens in a 64K-token embedding. How many possible multi-page 'thought's are there? A lot!
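The arithmetic, spelled out (assuming a 12,288-wide vector as a stand-in for the article's "12k"; that width is my assumption, not something stated above):

```python
from math import log10

dims = 12_288             # assumed vector width, stand-in for "12K"
bits_per_component = 4    # fp4
vector_bytes = dims * bits_per_component // 8
states_log10 = dims * bits_per_component * log10(2)  # log10 of raw bit patterns

print(vector_bytes)        # 6144 bytes, i.e. 6 KB per vector
print(round(states_log10)) # ~10^14796 distinct raw vectors
```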

(And in handling one token, the layers give ~60 chances to mix in previous 'thoughts' via the attention mechanism, and mix in stuff from training via the FFNs! You can start to see how this whole thing ends able to convert your Bash to Python or do word problems.)

Of course, you don't expect it to be 100% space-efficient, detailed mathematical arguments aside. You want blending two vectors with different strengths to work well, and I wouldn't expect the training to settle into the absolute most efficient way to pack the RAM available. But even if you think of this as an upper bound, it's a very different reference point for what 'ought' to be theoretically possible to cram into a bunch of high-dimensional vectors.

[+] singularity2001|6 months ago|reply
If you ever played 20Questions you know that you don't need 1000 dimensions for a billion concepts. These huge vectors can represent way more complex information than just a billion concepts.

In fact, they can pack complete poems with or without typos, and you can ask where in the poem the typo is, which is exactly what happens if you paste one into GPT: somewhere in an internal layer it will distinguish exactly that.

[+] giveita|6 months ago|reply
That's not the vector doing that though it is the model. The model is like a trillion dimensional vector.
[+] DougBTX|6 months ago|reply
With binary vectors, 20 dimensions will get you just over a million concepts. For a billion you’ll need 30 questions.
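The information-theoretic version of that, in two lines:

```python
from math import ceil, log2

concepts = 1_000_000_000
questions = ceil(log2(concepts))  # yes/no questions to distinguish a billion things
print(questions)  # 30
```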
[+] stared|6 months ago|reply
If vectors live in an effectively lower-dimensional space than they could, they don't live up to their n-dimensional potential.

Sometimes these things are patched with cosine distance (or even Pearson correlation), vide https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity. Ideally we wouldn't need to patch anything, and the vectors would occupy the full space.

I am kind of surprised that the original article does not mention batch normalization and similar operations - these are pretty much created to automatically de-bias and de-correlate values at each layer.

[+] bigdict|6 months ago|reply
What's the point of the relu in the loss function? Its inputs are nonnegative anyway.
[+] Nevermark|6 months ago|reply
Let's try to keep things positive.
[+] GolDDranks|6 months ago|reply
I wondered the same. Seems like it would just make a V-shaped loss around the zero, but abs has that property already!
[+] andy_ppp|6 months ago|reply
In reality it's probably not a ReLU; modern LLMs use GELU or something more advanced.
[+] meindnoch|6 months ago|reply
Sometimes a cosmic ray might hit the sign bit of the register and flip it to a negative value. So it is useful to pass it through a rectifier to ensure it's never negative, even in this rare case.
[+] fancyfredbot|6 months ago|reply
I thought the belt and braces approach was a valuable contribution to AI safety. Better safe than sorry with these troublesome negative numbers!
[+] gibsonf1|6 months ago|reply
A key error is that there are literally nowhere close to billions of concepts. It's a misunderstanding of what a concept is as used by us humans. There are an unlimited number of instances and entities, but the concepts we use to think about them are very limited by comparison.
[+] djoldman|6 months ago|reply
A continuing, probably unending, opportunity/tragedy is the under-appreciation of representation learning / embeddings.

The magic of many current valuable models is simply that they can combine abstract "concepts" like "ruler" + "male" and get "king."

This is perhaps the easiest way to understand the lossy text compression that constitutes many LLMs. They're operating in the embedding space, so abstract concepts can be manipulated between input and output. It's like compiling C using something like LLVM: there's an intermediate representation. (obviously not exactly because generally compiler output is deterministic).

This is also present in image models: "edge" + "four corners" is square, etc.
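A toy version of that arithmetic with a hand-built three-feature basis (the vectors are completely made up, just to show the mechanism; real embeddings learn this structure from data):

```python
import numpy as np

# Toy feature basis: [royalty, male, female].
emb = {
    "ruler":  np.array([1.0, 0.0, 0.0]),
    "male":   np.array([0.0, 1.0, 0.0]),
    "female": np.array([0.0, 0.0, 1.0]),
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
}

def nearest(v, exclude=()):
    # Return the vocabulary word whose embedding has the highest cosine with v.
    return max((w for w in emb if w not in exclude),
               key=lambda w: v @ emb[w] / (np.linalg.norm(v) * np.linalg.norm(emb[w])))

print(nearest(emb["ruler"] + emb["male"], exclude=("ruler", "male")))  # king
```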

[+] ignobletruth|6 months ago|reply
Question for you experts here.

This article uses theory to imply a high upper bound for semantic capacity in a vector space.

However, this recent article (https://arxiv.org/pdf/2508.21038) empirically characterizes the semantic capacity of embedding vectors, finding inadequate capacity for some use cases.

These two articles seem at odds. Can anyone help put these two findings in context and explain their seeming contradictions?

[+] niemandhier|6 months ago|reply
Wow, I think I might just have grasped one of the sources of the problems we keep seeing with LLMs.

Johnson-Lindenstrauss guarantees a distance-preserving embedding for a finite set of points into a space whose dimension depends on the number of points.

It does not say anything about preserving the underlying topology of the continuous high-dimensional manifold; that would be Takens/Whitney-style embedding results (and Sauer-Yorke for attractors).

The embedding dimensions needed to fulfil Takens are related to the original manifold's dimension, not the number of points.

It's quite probable that we observe violations of topological features of the original manifold when using our too-low-dimensional embedded version to interpolate.

I used AI to sort the hodgepodge of math in my head into something another human could understand; the edited result is below:

=== AI in use === If you want to resolve an attractor down to a spatial scale rho, you need about n ≈ C * rho^(-d_B) sample points (here d_B is the box-counting/fractal dimension).

The Johnson–Lindenstrauss (JL) lemma says that to preserve all pairwise distances among n points within a factor 1±ε, you need a target dimension

k ≳ (d_B / ε^2) * log(C / rho).

So as you ask for finer resolution (rho → 0), the required k must grow. If you keep k fixed (i.e., you embed into a dimension that’s too low), there is a smallest resolvable scale

rho* (roughly rho* ≳ C * exp(-(ε^2/d_B) * k), up to constants),

below which you can't keep all distances separated: points that are far apart on the true attractor will show up close together after projection. That's called "folding" and might be the source of some of the problems we observe.

=== AI end ===

Bottom line: JL protects distance geometry for a finite sample at a chosen resolution; if you push the resolution finer without increasing k, collisions are inevitable. This is perfectly consistent with the embedding theorems for dynamical systems, which require higher dimensions to get a globally one-to-one (no-folds) representation of the entire attractor.
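The two formulas above invert each other; a tiny sketch to make that concrete (symbols as defined in the AI block, constants dropped):

```python
from math import exp, log

def min_dim(eps, d_B, C, rho):
    # k ≳ (d_B / eps^2) * log(C / rho): JL target dimension needed to
    # preserve distances among ~C * rho^(-d_B) sample points.
    return (d_B / eps**2) * log(C / rho)

def min_scale(eps, d_B, C, k):
    # Invert for the smallest resolvable scale rho* at a fixed embedding dim k.
    return C * exp(-(eps**2 / d_B) * k)

k = min_dim(eps=0.1, d_B=2.0, C=1.0, rho=1e-3)
print(k, min_scale(eps=0.1, d_B=2.0, C=1.0, k=k))  # round-trips to rho = 1e-3
```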

If someone is bored and would like to discuss this, feel free to email me.

[+] rini17|6 months ago|reply
I became a bit lost between "C is a constant that determines the probability of success" and the point where they set C between 4 and 8. Probability should be between 0 and 1; how does it relate to C?
[+] emil-lp|6 months ago|reply
It's the epsilon^-2 term that actually talks about success, but that is tightly linked with the C term. If you want to decrease epsilon, C goes up.
[+] Mithriil|6 months ago|reply
For those who argue that concepts are not orthogonal or quasi-orthogonal: see the quasi-orthogonal case as the worst case, "if all concepts were black and white, then how many can we fit in k dimensions". When there are nuanced concepts, they will fit in between these quasi-orthogonal ones. What's argued here is thus a lower bound.
[+] prerok|6 months ago|reply
So, a lot of comments have already poked lots of holes in the article, but just wanted to chime in with a very basic observation: the mere statement that the 12k dimensions can pack in 10^200 concepts is staggering in how wrong it is.

Sure, a 12k vector space has a significant number of individual values, but not concepts. This is ridiculous. I mean, Shannon would like to have a word with you.

[+] WithinReason|6 months ago|reply
The vectors don't need to be orthogonal due to the use of non-linearities in neural networks. The softmax in attention lets you effectively pack as many vectors into 1D as you want and unambiguously pick them out.
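Here's a toy version of that claim (my construction, not from the comment): scalar keys packed along one dimension, a sharp softmax over squared distance, and the attention output still picks out a single "concept" cleanly.

```python
import numpy as np

# Keys packed along a single dimension; attention with a sharp softmax
# can still select any one of them unambiguously.
keys = np.arange(10, dtype=float)    # 10 "concepts" packed into 1-D
values = np.eye(10)                  # a one-hot payload per concept
query = 3.02                         # a slightly noisy pointer at key 3

scores = -(query - keys) ** 2 / 0.01          # sharp (low-temperature) scores
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax attention weights

out = weights @ values
print(int(out.argmax()), round(float(out.max()), 3))  # concept 3, weight ≈ 1.0
```

The catch, of course, is that this needs unbounded precision as you pack more keys in; with finite floats the packing density is limited, which is why near-orthogonality in many dimensions still matters.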
[+] cpldcpu|6 months ago|reply
The dimensions should actually be closer to 12000 * (no of tokens*no of layers / x)

(where x is a number dependent on architectural features like MLHA, QGA...)

There is this thing called KV cache which holds an enormous latent state.

[+] lvl155|6 months ago|reply
They don’t capture concepts at all. They capture writings of concepts.