> The implications of these geometric properties are staggering. Let's consider a simple way to estimate how many quasi-orthogonal vectors can fit in a k-dimensional space. If we define F as the degrees of freedom from orthogonality (90° - desired angle), we can approximate the number of vectors as [...]
If you're just looking at minimum angles between vectors, you're doing spherical codes. So this article is an analysis of spherical codes… that doesn't reference any work on spherical codes… seems to be written in large part by a language model… and has a bunch of basic inconsistencies that make me doubt its conclusions. For example: in the graph showing the values of C for different values of K and N, is the x axis K or N? The caption says the x axis is N, the number of vectors, but later they say the value C = 0.2 was found for "very large spaces," and in the graph we only get C = 0.2 when N = 30,000 and K = 2---that is, 30,000 vectors in two dimensions! On the other hand, if the x axis is K, then this article is extrapolating a measurement done for 2 vectors in 30,000 dimensions to the case of 10^200 vectors in 12,888 dimensions, which obviously is absurd.
I want to stay positive and friendly about people's work, but the amount of LLM-driven stuff on HN is getting really overwhelming.
Spherical codes are kind of obscure: I haven't heard of them before, and Wikipedia seems to have barely heard of them. And most of the Google results seem to be about playing golf in small dimensions (i.e., how many can we optimally pack in n < 32 dimensions?).
People do indeed rediscover previously existing math, especially when the old content is hidden under non-obvious jargon.
In the graph you’re referencing, K = 2 never reaches C = 0.2. K = 3 only reaches C = 0.3 before getting cut off.
I’m not even sure why that would be a problem, because he’s projecting N basis vectors into K dimensions and C is some measure of the error this introduces into mapping points in the N-space onto the K-space, or something. I’m glad to be shown why this is inconsistent with the graph, but your argument doesn’t talk about this idea at all.
I think the author is too focused on the case where all vectors are orthogonal and as a consequence overestimates the amount of error that would be acceptable in practice. The challenge isn't keeping orthogonal vectors almost orthogonal, but keeping the distance ordering between vectors that are far from orthogonal. Even much smaller values of epsilon can give you trouble there.
So the claim that "This research suggests that current embedding dimensions (1,000-20,000) provide more than adequate capacity for representing human knowledge and reasoning." is way too optimistic in my opinion.
Since vectors are usually normalized to the surface of an n-sphere and the relevant distance for outputs (via loss functions) is cosine similarity, "near orthogonality" is what matters in practice. This means during training, you want to move unrelated representations on the sphere such that they become "more orthogonal" in the outputs. This works especially well since you are stuck with limited precision floating point numbers on any realistic hardware anyways.
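To make the near-orthogonality point concrete, here's a quick stdlib-only sketch (dimension and sample count are arbitrary choices for illustration) showing that independent random unit vectors land close to 90° apart, with |cos| concentrating around 1/sqrt(d):

```python
import math
import random

random.seed(0)

def random_unit_vector(d):
    """Sample a direction uniformly on the (d-1)-sphere via Gaussian components."""
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    # For unit vectors, the dot product *is* the cosine similarity.
    return sum(a * b for a, b in zip(u, v))

d = 1024
mean_abs_cos = sum(
    abs(cosine(random_unit_vector(d), random_unit_vector(d))) for _ in range(50)
) / 50

# Random directions are nearly orthogonal: |cos| is on the order of 1/sqrt(d).
print(f"mean |cos| = {mean_abs_cos:.4f}, 1/sqrt(d) = {1 / math.sqrt(d):.4f}")
```

Training then only has to nudge unrelated representations a little further toward "more orthogonal" from this already-nearly-orthogonal starting point.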
Btw., this is not an original idea from the linked blog or the YouTube video it references. The relevance of this lemma for AI (or at least neural machine learning) was brought up more than a decade ago by C. Eliasmith, as far as I know. So it has been around since long before architectures like GPT that could actually be realistically trained on such insanely high-dimensional world knowledge.
> How does the distance metric vary with feature order?
> Do algorithmic outputs diverge or converge given variance in sequence order of all orthogonal axes? Does it matter which order the dimensions are stated in; is the output sensitive to feature order, but does it converge regardless? [...]
>> Are the [features] described with high-dimensional spaces really all 90° geometrically orthogonal?
> If the features are not statistically independent, I don't think it's likely that they're truly orthogonal; which might not affect the utility of a distance metric that assumes that they are all orthogonal
Which statistical models disclaim that their output is insignificant if used with non-independent features? Naive Bayes, Linear Regression and Logistic Regression, LDA, PCA, and linear models in general are unreliable with non-independent features.
What are some of the hazards of L1 Lasso and L2 Ridge regularization? What are some of the worst cases with outliers? What does regularization do if applied to non-independent and/or non-orthogonal and/or non-linear data?
Impressive but probably insufficient because [non-orthogonality] cannot be so compressed.
There is also the standing question of whether there can be simultaneous encoding in a fundamental gbit.
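One concrete hazard behind the regularization questions above: with perfectly collinear (non-independent) features, ordinary least squares has no unique solution, while L2 ridge picks the symmetric one. A stdlib-only sketch with a tiny hypothetical two-feature design, solved directly from the ridge normal equations:

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve [[a, b], [c, d]] @ w = [e, f] by Cramer's rule."""
    det = a * d - b * c
    return (e * d - b * f) / det, (a * f - e * c) / det

xs = [1.0, 2.0, 3.0, 4.0]   # feature 1
# feature 2 is an exact copy of feature 1: perfectly non-independent
ys = xs[:]                  # target happens to equal the feature

s = sum(x * x for x in xs)  # every entry of X^T X is s; X^T y = (s, s)
lam = 0.1                   # L2 (ridge) penalty strength

# Ridge normal equations: (X^T X + lam*I) w = X^T y
w1, w2 = solve_2x2(s + lam, s, s, s + lam, s, s)

# With lam = 0 the determinant is s*s - s*s = 0: OLS has no unique solution.
# Ridge instead splits the weight evenly across the duplicated features.
print(w1, w2)
```

L1 lasso behaves differently here: it tends to put all the weight on one of the duplicated features arbitrarily, which is exactly the kind of instability these questions are pointing at.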
This set of intuitions, and the Johnson-Lindenstrauss lemma in particular, is what powers a lot of the research effort behind SAEs (Sparse Autoencoders) in the field of mechanistic interpretability in AI safety.
Tangential, but the ChatGPT vibe of most of the article is very distracting and annoying. And I say this as someone who consistently uses AI to refine my English. However, I try to avoid letting it reformulate too dramatically, asking it specifically to only fix grammar and non-idiomatic parts while keeping the tone and formulation as much as possible.
Beyond that, this mathematical observation is genuinely fascinating. It points to a crucial insight into how large language models and other AI systems function. By delving into the way high-dimensional data can be projected into lower-dimensional spaces while preserving its structure, we see a crucial mechanism that allows these models to operate efficiently and scale effectively.
Ironically, the use of "fascinating", "crucial" and "delving" in your second paragraph, as well as its overall structure, make it read very much like it was filtered through ChatGPT
Language models don't "pack concepts" into the C dimension of one layer (I guess that's where the 12k number came from), neither do they have to be orthogonal to be viewed as distinct or separate. LLMs generally aren't trained to make distinct concepts far apart in the vector space either. The whole point of dense representations, is that there's no clear separation between which concept lives where. People train sparse autoencoders to work out which neurons fire based on the topics involved. Neuronpedia demonstrates it very nicely: https://www.neuronpedia.org/.
The SAE's job is to try to pull apart the sparse nearly-orthogonal 'concepts' from a given embedding vector, by decomposing the dense vector into sparse activations over an over-complete basis. They tend to find that this works well, and even allows matching embedding spaces between different LLMs efficiently.
Agreed, if you relax the requirement for perfect orthogonality, then, yes, you can pack in much more info. You basically introduced additional (fractional) dimensions clustered with the main dimensions. Put another way, many concepts are not orthogonal, but have some commonality or correlation.
So nothing earth shattering here. The article is also filled with words like "remarkable", "fascinating", "profound", etc. that make me feel like some level of subliminal manipulation is going on. Maybe some use of an LLM?
My intuition of this problem is much simpler — assuming there’s some rough hierarchy of concepts, you can guesstimate how many concepts can exist in a 12,000-d space by taking the combinatorial of the number of dimensions. In that world, each concept is mutually orthogonal with every other concept in at least some dimension. While that doesn’t mean their cosine distance is large, it does mean you’re guaranteed a function that can linearly separate the two concepts.
It means you get 12,000! (Factorial) concepts in the limit case, more than enough room to fit a taxonomy
You can only get 12,000! concepts if you pair each concept with an ordering of the dimensions, which models do not do. A vector in a model that has [weight_1, weight_2, ... weight_12000] is identical to the vector [weight_2, weight_1, ..., weight_12000] within the larger model.
Instead, a naive mental model of a language model is to have a positive, negative or zero trit in each axis: 3^12,000 concepts, which is a much lower number than 12000!. Then in practice, almost every vector in the model has all but a few dozen identified axes zeroed because of the limitations of training time.
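The gap between those two counts is easy to check with stdlib log-arithmetic (taking the thread's 12,000-dimension figure at face value):

```python
import math

d = 12_000

# Decimal digits in 3^d (the trit mental model): d * log10(3)
log10_trits = d * math.log10(3)

# Decimal digits in d! (the orderings count): log10(Gamma(d+1)) via lgamma
log10_factorial = math.lgamma(d + 1) / math.log(10)

# 3^12000 has roughly 5,700 digits; 12000! has over 43,000 digits,
# so 3^d is indeed vastly smaller than d! at this scale.
print(round(log10_trits), round(log10_factorial))
```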
Sort of trivial but fun thing: you can fit billions of concepts into this much space, too. Let's say four bits of each component of the vector are important, going by how some providers do fp4 inference and it isn't entirely falling apart. So an fp4 dimension-12K vector takes up 6KB, like a few pages of UTF-8 text, more compressed text, or 3K tokens in a 64K-token embedding. How many possible multi-page 'thought's are there? A lot!
(And in handling one token, the layers give ~60 chances to mix in previous 'thoughts' via the attention mechanism, and mix in stuff from training via the FFNs! You can start to see how this whole thing ends up able to convert your Bash to Python or do word problems.)
Of course, you don't expect it to be 100% space-efficient, detailed mathematical arguments aside. You want blending two vectors with different strengths to work well, and I wouldn't expect the training to settle into the absolute most efficient way to pack the RAM available. But even if you think of this as an upper bound, it's a very different reference point for what 'ought' to be theoretically possible to cram into a bunch of high-dimensional vectors.
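The back-of-envelope above checks out, assuming the comment's 12K-dimension, 4-bit figures:

```python
import math

dims = 12_000
bits_per_component = 4          # fp4, per the comment's assumption

total_bits = dims * bits_per_component
total_bytes = total_bits // 8   # 6,000 bytes, i.e. ~6 KB per vector

# Crude upper bound on distinct raw bit patterns: 2^48000.
# Its decimal digit count gives a sense of the (loose) state-space size.
digits = math.floor(total_bits * math.log10(2)) + 1
print(total_bytes, digits)
```

As the comment says, treat this strictly as an upper bound on raw states, not a count of usable 'thoughts'.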
If you ever played 20 Questions, you know that you don't need 1000 dimensions for a billion concepts. These huge vectors can represent way more complex information than just a billion concepts.
In fact they can pack complete poems with or without typos and you can ask where in the poem the typo is, which is exactly what happens if you paste that into GPT: somewhere in an internal layer it will distinguish exactly that.
I am kind of surprised that the original article does not mention batch normalization and similar operations - these are pretty much created to automatically de-bias and de-correlate values at each layer.
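For reference, a minimal sketch of the normalization step applied to one feature's values across a batch: shift to zero mean, scale to unit variance. Note that this standardizes each feature but does not by itself decorrelate features (full decorrelation would require whitening); the learned scale/shift parameters of real batch norm are omitted here:

```python
import math

def normalize(values, eps=1e-5):
    """Standardize one feature's activations across a batch: zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

batch = [2.0, 4.0, 6.0, 8.0]
out = normalize(batch)
out_mean = sum(out) / len(out)
out_var = sum(v * v for v in out) / len(out)
print(out)
```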
Sometimes a cosmic ray might hit the sign bit of the register and flip it to a negative value. So it is useful to pass it through a rectifier to ensure it's never negative, even in this rare case.
A key error is that there are nowhere close to billions of concepts. It's a misunderstanding of what a concept is as used by us humans. There are an unlimited number of instances and entities, but the concepts we use to think about them are very limited by comparison.
A continuing, probably unending, opportunity/tragedy is the under-appreciation of representation learning / embeddings.
The magic of many current valuable models is simply that they can combine abstract "concepts" like "ruler" + "male" and get "king."
This is perhaps the easiest way to understand the lossy text compression that constitutes many LLMs. They're operating in the embedding space, so abstract concepts can be manipulated between input and output. It's like compiling C using something like LLVM: there's an intermediate representation. (obviously not exactly because generally compiler output is deterministic).
This is also present in image models: "edge" + "four corners" is square, etc.
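A toy illustration of that concept arithmetic, with hand-picked 3-d vectors purely for demonstration (real embeddings are learned and far noisier):

```python
import math

# Hypothetical axes, chosen by hand: [male-ness, female-ness, royalty]
vocab = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.5, 0.0],   # distractor word
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# "king" - "man" + "woman" should land nearest "queen"
target = [k - m + w for k, m, w in zip(vocab["king"], vocab["man"], vocab["woman"])]
best = max((w for w in vocab if w not in ("king", "man", "woman")),
           key=lambda w: cosine(vocab[w], target))
print(best)  # "queen"
```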
This article uses theory to imply a high upper bound for semantic capacity in a vector space.
However, this recent article (https://arxiv.org/pdf/2508.21038) empirically characterizes the semantic capacity of embedding vectors, finding inadequate capacity for some use cases.
These two articles seem at odds. Can anyone help put these two findings in context and explain their seeming contradictions?
Wow, I think I might just have grasped one of the sources of the problems we keep seeing with LLMs.
Johnson–Lindenstrauss guarantees a distance-preserving embedding for a finite set of points into a space with a dimension based on the number of points.
It does not say anything about preserving the underlying topology of the continuous high-dimensional manifold; that would be Takens/Whitney-style embedding results (and Sauer–Yorke for attractors).
The embedding dimensions needed to fulfil Takens are related to the original manifold's dimension and not the number of points.
It’s quite probable that we observe violations of topological features of the original manifold when using our too-low-dimensional embedded version to interpolate.
I used AI to sort the hodgepodge of math in my head into something another human could understand; the edited result is below:
=== AI in use ===
If you want to resolve an attractor down to a spatial scale rho, you need about
n ≈ C * rho^(-d_B) sample points (here d_B is the box-counting/fractal dimension).
The Johnson–Lindenstrauss (JL) lemma says that to preserve all pairwise distances among n points within a factor 1±ε, you need a target dimension
k ≳ (d_B / ε^2) * log(C / rho).
So as you ask for finer resolution (rho → 0), the required k must grow. If you keep k fixed (i.e., you embed into a dimension that’s too low), there is a smallest resolvable scale
rho* (roughly rho* ≳ C * exp(-(ε^2/d_B) * k), up to constants),
below which you can’t keep all distances separated: points that are far apart on the true attractor will show up close together after projection. That’s called “folding” and might be the source of some of the problems we observe.
=== AI end ===
Bottom line: JL protects distance geometry for a finite sample at a chosen resolution; if you push the resolution finer without increasing k, collisions are inevitable. This is perfectly consistent with the embedding theorems for dynamical systems, which require higher dimensions to get a globally one-to-one (no-folds) representation of the entire attractor.
If someone is bored and would like to discuss this, feel free to email me.
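To get a feel for the numbers in the JL bound above, here is the standard conservative bound in code (the same formula scikit-learn uses for johnson_lindenstrauss_min_dim; the constants vary between proofs):

```python
import math

def jl_min_dim(n_points, eps):
    """Minimum target dimension k preserving all pairwise distances
    among n_points within a (1 ± eps) factor, per the classic JL bound."""
    return math.ceil(4 * math.log(n_points) / (eps ** 2 / 2 - eps ** 3 / 3))

# Required k grows with log(n) and blows up as eps shrinks,
# i.e. as you demand finer resolution at fixed point count:
for n, eps in [(10_000, 0.1), (1_000_000, 0.1), (1_000_000, 0.05)]:
    print(n, eps, jl_min_dim(n, eps))
```

Reading it the other way around, as in the comment: at fixed k, shrinking eps (or growing n) eventually violates the bound, which is where the "folding" picture comes in.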
I became a bit lost between "C is a constant that determines the probability of success" and then them setting C between 4 and 8. Probability should be between 0 and 1; how does it relate to C?
For those who argue that concepts are not orthogonal or quasi-orthogonal: see the quasi-orthogonal case as the worst case: "if all concepts were black and white, then how many can we fit in k dimensions". When there are nuanced concepts, they will fit in between these quasi-orthogonal ones. What's argued here is thus a lower bound.
So, a lot of comments have already poked lots of holes in the article, but just wanted to chime in with a very basic observation: the mere statement that the 12k dimensions can pack in 10^200 concepts is staggering in how wrong it is.
Sure, a 12k vector space has a significant number of individual values, but not concepts. This is ridiculous. I mean, Shannon would like to have a word with you.
The vectors don't need to be orthogonal due to the use of non-linearities in neural networks. The softmax in attention lets you effectively pack as many vectors in 1D as you want and unambiguously pick them out.
Pointing out the errors is a more helpful way of stating problems with the article, which you have also done.
In that particular picture, you're probably correct to interpret it as C vs N as stated.
Re: distance metrics and curvilinear spaces and skew coordinates: https://news.ycombinator.com/item?id=41873650
A lot of the ideas are explored in more detail in Anthropic's 2022 paper that's one of the foundational papers in SAE research: https://transformer-circuits.pub/2022/toy_model/index.html
Sometimes these things are patched with cosine distance (or even - Pearson correlation), vide https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity. Ideally when we don't need to and vectors occupy the space.
(where x is a number dependent on architectural features like MLA, GQA...)
There is this thing called the KV cache, which holds an enormous latent state.