This seems confusingly phrased. When they say things like "500 Vision Transformers", what they mean is 500 finetunes of the same base model, downloaded from the huggingface accounts of anonymous randos. These spaces are only "universal" to a single pretrained base model AFAICT. Is it really that surprising that finetunes would be extremely similar to each other? Especially LoRAs?
This is an important clarification; from the abstract and title I was super confused about how they identified a "subspace" that could be consistently identified across model structures (I was assuming they meant that they saw stability in the dimension of the weight subspace or something), but if they're just referring to one model class that clears things up substantially. It's definitely also a much weaker result IMO, basically just confirming that the model's loss function has a well-posed minimum, which... duh? I mean, I guess I'm glad someone checked that, but calling it "the universal weight subspace hypothesis" seems a bit dramatic.
I agree - the results on the finetunes are not very surprising. The trained-from-scratch ResNets (Figure 2 and Section 3.2.1) are definitely more interesting, though somewhat limited in scope.
In any case, my impression is that this is not immediately more useful than a LoRA (and is probably not intended to be), but is maybe an avenue for further research.
Each fine tune drags the model weights away from the base model in a certain direction.
Given 500 fine tune datasets, we could expect the 500 drag directions to span a 500 dimensional space. After all, 500 random vectors in a high dimensional space are likely to be mutually orthogonal.
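The near-orthogonality claim is easy to check numerically; a minimal sketch (the dimensions here are invented for illustration, not the paper's actual parameter counts):

```python
import numpy as np

# 500 random unit vectors in a high-dimensional space: their pairwise
# cosine similarities concentrate around 0, i.e. they are nearly
# mutually orthogonal.
rng = np.random.default_rng(0)
d = 10_000           # stand-in for a large parameter count
n = 500              # number of random "drag directions"

V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

G = V @ V.T                                # pairwise cosine similarities
off_diag = G[~np.eye(n, dtype=bool)]
max_cos = np.abs(off_diag).max()           # on the order of 1/sqrt(d)
```

So if the 500 fine-tune directions were unrelated, you'd expect them to look like this: essentially orthogonal, spanning a ~500-dimensional space.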
The paper shows, however, that the 500 drag directions live in a ~40 dimensional subspace.
Another way to say it is that you can compress fine tune weights into a vector of 40 floats.
Imagine if, one day, fine tunes on huggingface were not measured in gigabytes, megabytes, or even kilobytes. Suppose you started to see listings like 160 bytes. Would that be surprising?
I’m leaving out the detail that the basis direction vectors themselves would have to be on your machine and each basis direction is as big as the model itself. And I’m also taking for granted that the subspace dimension will not increase as the number of fine tune datasets increases.
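A toy version of this compression story, with the ~40-dimensional subspace planted by construction (all sizes here are illustrative, not the paper's actual setup):

```python
import numpy as np

# 500 "fine-tune deltas" that secretly live in a 40-dimensional
# subspace of a larger weight space. SVD of the stacked deltas reveals
# the hidden rank, and any single fine-tune compresses to 40 floats.
rng = np.random.default_rng(0)
d, n, r = 5_000, 500, 40

basis = np.linalg.qr(rng.standard_normal((d, r)))[0]   # shared directions
coeffs = rng.standard_normal((n, r))
deltas = coeffs @ basis.T                              # n task vectors in R^d

s = np.linalg.svd(deltas, compute_uv=False)
effective_rank = int((s > 1e-8 * s[0]).sum())          # ~40, not 500

# Storing one fine-tune = storing r coefficients (plus the shared basis,
# which is as big as the model, as noted above).
one = deltas[0]
code = basis.T @ one                                   # 40 numbers
recon = basis @ code
err = np.linalg.norm(recon - one) / np.linalg.norm(one)
```

The caveat from the comment above shows up directly in the code: `basis` must be shipped once, and it is model-sized; only the per-fine-tune `code` is tiny.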
I agree that the authors' decision to use random models from Hugging Face is unfortunate. I'm hopeful that this paper will inspire follow-up work that trains large models from scratch.
For those trying to understand the most important parts of the paper, here are what I think are the two most significant statements, subquoted from two consecutive paragraphs midway through the paper:
> we selected five additional, previously unseen pretrained ViT models for which we had access to evaluation data. These models, considered out-of-domain relative to the initial set, had all their weights reconstructed by projecting onto the identified 16-dimensional universal subspace. We then assessed their classification accuracy and found no significant drop in performance
> we can replace these 500 ViT models with a single Universal Subspace model. Ignoring the task-variable first and last layer [...] we observe a requirement of 100 × less memory, and these savings are prone to increase as the number of trained models increases. We note that we are, to the best of our knowledge, the first work, to be able to merge 500 (and theoretically more) Vision Transformer into a single universal subspace model. This result implies that hundreds of ViTs can be represented using a single subspace model
So, they found an underlying commonality among the post-training structures in 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; and, they demonstrated that the commonality could in every case be substituted for those in the original models with no loss of function; and noted that they seem to be the first to discover this.
For a tech analogy, imagine if you found a bzip2 dictionary that reduced the size of every file compressed by 99%, because that dictionary turns out to be uniformly helpful for all files. You would immediately open a pull request to bzip2 to have the dictionary built-in, because it would save everyone billions of CPU hours. [*]
[*] Except instead of 'bzip2 dictionary' (strings of bytes), they use the term 'weight subspace' (analogy not included here[**]) — and, 'file compression' hours becomes 'model training' hours. It's just an analogy.
[**] 'Hilbert subspaces' is just incorrect enough to be worth appending as a footnote[***].
[***] As a second footnote.
Edit: actually this paper is the canonical reference (?): https://arxiv.org/abs/2007.00810. Models converge to the same space up to a linear transformation, so it makes sense that a linear transformation (like PCA) would be able to undo that transformation.
You can show, for example, that siamese encoders for time series, with MSE loss on similarity and without a decoder, will converge to the same latent space up to orthogonal transformations (as MSE is kinda like a Gaussian prior, which doesn't distinguish between different rotations).
Similarly I would expect that transformers trained on the same loss function for predicting the next word, if the data is at all similar (like human language), would converge to approx the same space, up to some, likely linear, transformations. And to represent that same space probably weights are similar, too. Weights in general seem to occupy low-dimensional spaces.
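The "up to orthogonal transformations" point can be sketched numerically: pairwise distances, and hence any MSE-style similarity loss built on them, cannot distinguish a latent space from a rotated copy of it (shapes here are arbitrary):

```python
import numpy as np

# Rotating a latent space by any orthogonal matrix leaves all pairwise
# squared distances exactly unchanged, so a distance-based loss cannot
# pin down the rotation.
rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 16))                    # latent embeddings

Q = np.linalg.qr(rng.standard_normal((16, 16)))[0]    # random orthogonal map
Zr = Z @ Q                                            # rotated latents

def pdist2(X):
    """Matrix of pairwise squared Euclidean distances."""
    sq = (X ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * X @ X.T

gap = np.abs(pdist2(Z) - pdist2(Zr)).max()            # ~0: same geometry
```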
All in all, I don’t think this is that surprising, and I think the theoretical angle should be (have been?) to find mathematical proofs like this paper https://openreview.net/forum?id=ONfWFluZBI
They also have a previous paper ("CEBRA") published in Nature with similar results.
> So, they found an underlying commonality among the post-training structures in 50 LLaMA3-8B models, 177 GPT-2 models, and 8 Flan-T5 models; and, they demonstrated that the commonality could in every case be substituted for those in the original models with no loss of function; and noted that they seem to be the first to discover this.
Could someone clarify what this means in practice? If there is a 'commonality' why would substituting it do anything? Like if there's some subset of weights X found in all these models, how would substituting X with X be useful?
I see how this could be useful in principle (and obviously it's very interesting), but I'm not clear on how it works in practice. Could you e.g. train new models with that weight subset initialized to this universal set? And how 'universal' is it? Just for models of certain sizes and architectures, or in some way more durable than that?
I think the paper in general completely oversells the idea of "universality".
For CNNs, the 'Universal Subspace' is simply the strong inductive bias (locality) forcing filters into standard signal processing shapes (Laplacian/Gabor) regardless of the data. Since CNNs are just a constrained subset of operations, this convergence is not that surprising.
For Transformers, which lack these local constraints, the authors had to rely on fine-tuning (shared initialization) to find a subspace. This confirms that 'Universality' here is really just a mix of CNN geometric constraints and the stability of pre-training, rather than a discovered intrinsic property of learning.
For me at least, I wasn't even under the impression that this was a possible research angle to begin with. Crazy stuff that people are trying, and very cool too!
It's basically way better than LoRA in all respects and could even be used to speed up inference. I wonder whether the big models are already using it... If not, we'll see a blowup in capabilities very, very soon.
What they've shown is that you can find the subset of parameters responsible for transfer of capability to new tasks.
Does it apply to completely novel tasks? No, that would be magic. Tasks that need new features or representations break the method, but if it fits in the same domain then the answer is "YES".
Here's a very cool analogy from GPT-5.1 which hits the nail on the head in explaining the role of the subspace in learning new tasks, by analogy with 3D graphics.
Think of 3D character animation rigs:

• The mesh has millions of vertices (11M weights).
• Expressions are controlled via a few named controls:
  • “smile”
  • “frown”
  • “blink”

Each expression is just:

mesh += α_i * basis_expression_i

Hundreds of coefficients modify millions of coordinates.
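The rig analogy translates almost literally into code; a sketch with invented sizes and expression names (nothing here comes from the paper):

```python
import numpy as np

# A "blend shape" rig: a large mesh driven by a handful of expression
# coefficients. Each basis is a per-vertex displacement field; the
# alphas are the only numbers an animator (or a fine-tune) needs.
rng = np.random.default_rng(0)
n_vertices = 100_000
neutral = rng.standard_normal((n_vertices, 3))        # resting mesh

expressions = {                                       # displacement bases
    "smile": rng.standard_normal((n_vertices, 3)) * 0.01,
    "frown": rng.standard_normal((n_vertices, 3)) * 0.01,
    "blink": rng.standard_normal((n_vertices, 3)) * 0.01,
}

alphas = {"smile": 0.8, "frown": 0.0, "blink": 0.3}   # 3 floats...
mesh = neutral.copy()
for name, basis in expressions.items():
    mesh += alphas[name] * basis                      # ...move 300k coordinates
```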
> Does it apply to completely novel tasks? No, that would be magic.
Are there novel tasks? Inside the limits of physics, tasks are finite, and most of them are pointless. One can certainly entertain tasks that transcend physics, but that isn't necessary if one merely wants an immortal and indomitable electronic god.
I've had a hard time parsing what exactly the paper is trying to explain. So far I've understood that their comparison seems to be between models within the same family with the same weight tensor dimensions, so they aren't showing a common subspace when there isn't a 1:1 match between weight tensors in a ViT and GPT-2. The plots showing the distribution of principal component values presumably do this on every weight tensor, but this seems like an expected result: the principal component values show a decaying curve, like a log curve, where only a few principal components are the most meaningful.
What I don’t get is what is meant by a universal shared subspace, because there is some invariance regarding the specific values in weights and the directions of vectors in the model. For instance, if you were doing matrix multiplication with a weight tensor, you could swap two rows/columns (depending on the order of multiplication) and all that would do is swap two values in the resulting product, and whatever uses that output could undo the effects of the swap so the whole model has identical behavior, yet you’ve changed the direction of the principal components. There can’t be fully independently trained models that share the exact subspace directions for analogous weight tensors because of that.
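The row-swap argument can be made concrete with a tiny two-layer net: permuting the hidden units changes the weight matrices (and so their principal directions) without changing the function computed. A sketch:

```python
import numpy as np

# Swap two hidden units of a two-layer ReLU net. The weights change,
# but applying the inverse permutation in the next layer undoes the
# effect, so the network computes the identical function.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 32))     # input -> hidden
W2 = rng.standard_normal((10, 64))     # hidden -> output

perm = np.arange(64)
perm[[0, 1]] = [1, 0]                  # swap hidden units 0 and 1
P = np.eye(64)[perm]                   # permutation matrix, P.T = P^-1

relu = lambda h: np.maximum(h, 0.0)    # elementwise, commutes with P
x = rng.standard_normal(32)

y_orig = W2 @ relu(W1 @ x)
y_perm = (W2 @ P.T) @ relu((P @ W1) @ x)   # permuted net, same output

weights_changed = not np.allclose(P @ W1, W1)
outputs_equal = np.allclose(y_orig, y_perm)
```

So any claim about shared principal directions across independently trained models has to contend with this symmetry: identical functions can have differently oriented weight spectra.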
Interesting... this could make training much faster if there's a universal low-dimensional space that models naturally converge into, since you could initialize or constrain training inside that space instead of spending massive compute rediscovering it from scratch every time.
>instead of spending massive compute rediscovering it from scratch every time
It's interesting that this paper came out of JHU, not some group at OAI/Google/Apple, considering that the latter have probably spent 1000x more resources on "rediscovering".
I have a real soft spot for the genetic algorithm as a result of reading Levy's "Artificial Life" when I was a kid. The analogy to biological life is more approachable to my poor math education than neural networks. I can grok crossover and mutation pretty easily. Backpropagation is too much for my little brain to handle.
I've been messing around with GAs recently, especially indirect encoding methods. This paper seems to support perspectives I've read while researching: in particular, that you can decompose weight matrices into spectral patterns (similar to JPEG compression) and search in the compressed space.
Something I've been interested in recently is - I wonder if it'd be possible to encode a known-good model - some massive pretrained thing - and use that as a starting point for further mutations.
Like some other comments in this thread have suggested, it would mean we can distill the weight patterns of things like attention, convolution, etc. and not have to discover them by mutation - so - making use of the many phd-hours it took to develop those patterns, and using them as a springboard. If papers like this are to be believed, more advanced mechanisms may be able to be discovered.
That would be an excellent use of GA and all the other 'not based on training a network' methods, now that we have a target and can evaluate against it!
I got crazy obsessed with EvoLisa¹ back in the day and although there is nothing in common between that algorithm and those that make up training an LLM, I can't help but feel like they are similar.
I hope someone much smarter than I answers this. I've been noticing an uptick in Platonic and neo-Platonic discourse in the zeitgeist and am wondering if we're converging on something profound.
From what I can tell, they are very closely related (i.e. the shared representational structures would likely make good candidates for Platonic representations, or rather, representations of Platonic categories). In any case, it seems like there should be some sort of interesting mapping between the two.
Same hat, except 18 months later, assuming it survives peer review, reproduction, etc. (or: "The newer one proposes evidence that appears to support the older one.")
The authors study a bunch of wild low-rank fine-tunes and discover that they share a common... low rank! ... substructure which is itself base-model dependent. Humans are (genetically) the same: you need only a handful of PCs to represent the vast majority of variation. But that's because of our shared ancestry. And maybe the same thing is going on here.
On a hike this weekend my daughter and I talked about the similarities of the branching and bifurcating patterns in the melting ice on a pond, the branches of trees, still photos of lightning, the circulatory system, and the filaments in fractals.
I read the abstract (not the whole paper) and the great summarizing comments here.
Beyond the practical implications of this (i.e. reduced training and inference costs), I'm curious if this has any consequences for "philosophy of the mind"-type of stuff. That is, does this sentence from the abstract, "we identify universal subspaces capturing majority variance in just a few principal directions", imply that all of these various models, across vastly different domains, share a large set of common "plumbing", if you will? Am I understanding that correctly? It just sounds like it could have huge relevance to how various "thinking" (and I know, I know, those scare quotes are doing a lot of work) systems compose their knowledge.
Somewhat of a tangent, but if you enjoy the philosophy of AI and mathematics, I highly recommend reading Gödel, Escher, Bach: an Eternal Golden Braid by D. Hofstadter. It is primarily about the Incompleteness Theorem, but does touch on AI and what we understand as being an intelligence.
I saw a similar (I think!) paper, "Grassmannian Optimization Drives Generalization in Overparameterized DNN", at OPT-ML at NeurIPS last week[0]
This is a little outside my area, but I think the relevant part of that abstract is "Gradient-based optimization follows horizontal lifts across low-dimensional subspaces in the Grassmannian Gr(r, p), where r ≪ p is the rank of the Hessian at the optimum"
I think this question is super interesting though: why can massively overparametrised models still generalise?
"The Platonic Representation Hypothesis [17] conjectures that all image models of sufficient size have the same latent representation. We propose a stronger, constructive version of this hypothesis for text models: the universal latent structure of text representations can be learned and, furthermore, harnessed to translate representations from one space to another without any paired data or encoders.
In this work, we show that the Strong Platonic Representation Hypothesis holds in practice. Given unpaired examples of embeddings from two models with different architectures and training data, our method learns a latent representation in which the embeddings are almost identical"
Also, from the OP's paper, we see this statement:
"Why do these universal subspaces emerge? While the precise mechanisms driving this phenomenon remain an open area of investigation, several theoretical factors likely contribute to the emergence of these shared structures.

First, neural networks are known to exhibit a spectral bias toward low frequency functions, creating a polynomial decay in eigenvalues that concentrates learning dynamics into a small number of dominant directions (Belfer et al., 2024; Bietti et al., 2019).

Second, modern architectures impose strong inductive biases that constrain the solution space: convolutional structures inherently favor local, Gabor-like patterns (Krizhevsky et al., 2012; Guth et al., 2024), while attention mechanisms prioritize recurring relational circuits (Olah et al., 2020; Chughtai et al., 2023).

Third, the ubiquity of gradient-based optimization – governed by kernels that are largely invariant to task specifics in the infinite-width limit (Jacot et al., 2018) – inherently prefers smooth solutions, channeling diverse learning trajectories toward shared geometric manifolds (Garipov et al., 2018).

If these hypotheses hold, the universal subspace likely captures fundamental computational patterns that transcend specific tasks, potentially explaining the efficacy of transfer learning and why diverse problems often benefit from similar architectural modifications."
Dr. Levin's work is so fascinating. Glad to see his work referenced. If anyone wishes to learn more while idle or commuting, check out Lex Fridman's podcast episode with him, linked above.
Many discriminative models converge to same representation space up to a linear transformation. Makes sense that a linear transformation (like PCA) would be able to undo that transformation.
Without having properly read the linked article, if that's all this is, it's not a particularly new result. Nevertheless, this direction of proofs is IMO at the core of understanding neural nets.
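A minimal sketch of "undoing" such a linear transformation: in a noiseless toy where one model's representations are an unknown linear map of another's, ordinary least squares recovers the map exactly (all shapes invented for illustration):

```python
import numpy as np

# Two representation spaces related by an unknown linear map T_true.
# Given matched examples, plain least squares recovers T_true, i.e.
# the "up to a linear transformation" ambiguity is easy to undo.
rng = np.random.default_rng(0)
Z_a = rng.standard_normal((200, 32))     # model A's representations
T_true = rng.standard_normal((32, 32))   # the unknown relating map
Z_b = Z_a @ T_true                       # model B: same space, transformed

T_est, *_ = np.linalg.lstsq(Z_a, Z_b, rcond=None)
recovery_err = np.abs(T_est - T_true).max()   # ~0 in this noiseless toy
```

Real models add noise and nonlinearity, of course; this only illustrates why a linear alignment step (or PCA into a shared basis) is the natural tool.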
The central claim, or "Universal Weight Subspace Hypothesis," is that deep neural networks, even when trained on completely different tasks (like image recognition vs. text generation) and starting from different random conditions, tend to converge to a remarkably similar, low-dimensional "subspace" in their massive set of weights.
> We analyze over 1,100 deep neural networks—including 500 Mistral-7B LoRAs and 500 Vision Transformers. We provide the first large-scale empirical evidence that networks systematically converge to shared, low-dimensional spectral subspaces, regardless of initialization, task, or domain.
I instantly thought of the Muon optimizer, which provides high-rank gradient updates, and of Kimi-K2, which was trained using Muon, but I see no related references.
The 'universal' in the title is not that universal.
I hope that this leads to more efficient models. And it’s intuitive- it seems as though you could find the essence of a good model and a model reduced to that essence would be more efficient. But, this is theoretical. I can also theorize flying cars- many have, it seems doable and achievable, but yet I see no flying cars on my way to work.
Read the paper end to end today. I think it's one of the most outrageous ideas of 2025 - at least among the papers I've read. So counterintuitive initially, and yet so intuitive. Personally, I kinda hate the implications. But a paper like this was definitely needed.
I have been trying to reproduce their results ("vibecoded" with some care) for the 500-LoRAs part, which I am familiar with, and unfortunately cannot see the drop at rank 16 that they show in their figure and use for further claims. Looking forward to their code :)
> Principal component analysis of 200 GPT2, 500 Vision Transformers, 50 LLaMA-8B, and 8 Flan-T5 models reveals consistent sharp spectral decay - strong evidence that a small number of weight directions capture dominant variance despite vast differences in training data, objectives, and initialization.
Well, intuitively it makes sense that within each independent model a small number of weights/parameters are very dominant, but it's still super interesting that these can be swapped between all the models without loss of performance.
It isn’t obvious that these parameters are universal across all models.
This general idea shows up all over the place though. If you do 3D scans on thousands of mammal skulls, you'll find that a few PCs account for the vast majority of the variance. If you do frequency domain analysis of various physiological signals...same thing. Ditto for many, many other natural phenomena in the world. Interesting (maybe not surprising?) to see it in artificial phenomena as well
Not really. If the models are trained on different dataset - like one ViT trained on satellite images and another on medical X-rays - one would expect their parameters, which were randomly initialized to be completely different or even orthogonal.
This is a good point, but I think this only works for D*A, where D=Sigma is a diagonal matrix with learnable parameters. It probably doesn't work for a full singular value decomposition (SVD) UDV^T.
Basically, what if we're not actually "training" the model, but rather the model was randomly initialized and the learning algorithm is just selecting the vectors that happen to point into the right direction? A left multiplication of the form D*A with a diagonal matrix is equivalent to multiplying each row in A with the corresponding diagonal element. Low values mean the vector in question was a lottery blank and unnecessary. High values means that this turns out to be correct vector, yay!
But this trivial explanation doesn't work for the full SVD, because you now have a right multiplication U*D. This means each column gets multiplied against the corresponding diagonal element. Both the column in U and row vector in V^T have to perfectly coincide to make the "selection" theory work, which is unlikely to be true for small models, which happen to work just fine.*
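The mechanical fact behind this argument: left-multiplying by a diagonal matrix just rescales rows, so a learned diagonal acts as a per-row "selection" knob. A quick sketch (sizes arbitrary):

```python
import numpy as np

# D @ A with diagonal D multiplies each row of A by the corresponding
# diagonal entry: zero entries discard ("lottery blank") rows, large
# entries select them. This is the "selection" mechanism described above.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
d = np.array([0.0, 1.0, 0.0, 2.0])     # keep rows 1 and 3, drop 0 and 2

DA = np.diag(d) @ A
row_scaled = d[:, None] * A            # identical, viewed row by row
```

For the full SVD case `U @ D @ V.T`, the same diagonal sits between two learned orthogonal factors, which is why the simple row-selection reading no longer applies.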
To use an analogy: Imagine a spreadsheet with 500 smoothie recipes one in each row, each with a dozen ingredients as the columns.
Now imagine you discover that all 500 are really just the same 11 base ingredients plus something extra.
What they've done here is use SVD (which is normally used for things like image compression and noise reduction) to find that "base recipe". Now we can reproduce each of the other recipes by recording only the one ingredient that differs.
More interestingly it might tell us something new about smoothies in general to know that they all share a common base. Maybe we can even build a simpler base using this info.
At least in theory. The code hasn't actually been released yet.
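The smoothie spreadsheet can be written down directly; a truncated SVD then exposes the shared base as a single dominant direction. All numbers here are invented to fit the analogy, not taken from the paper:

```python
import numpy as np

# 500 recipes = one shared base recipe + one extra ingredient each.
# SVD of the recipe matrix finds the shared component: the leading
# singular value dwarfs the rest.
rng = np.random.default_rng(0)
n_recipes, n_ingredients = 500, 12

base = rng.uniform(0.5, 1.5, n_ingredients)           # shared base recipe
R = np.tile(base, (n_recipes, 1))
extras = rng.integers(0, n_ingredients, n_recipes)    # one tweak per recipe
R[np.arange(n_recipes), extras] += rng.uniform(1, 2, n_recipes)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
dominance = s[0] / s[1]    # >> 1: most of every recipe is the shared base
```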
They identified that the compressed representation has structure to it that could potentially be discovered more quickly. It’s unclear if it would also make it easier to compress further but that’s possible.
They are analyzing models trained on classification tasks. At the end of the day, classification is about (a) engineering features that separate the classes and (b) finding a way to represent the boundary. It's not surprising to me that they would find these models can be described using a small number of dimensions and that they would observe similar structure across classification problems. The number of dimensions needed is basically a function of the number of classes. Embeddings in 1 dimension can linearly separate 2 classes, 2 dimensions can linearly separate 4 classes, 3 dimensions can linearly separate 8 classes, etc.
Pretty funny if you ask me. Maybe we can start to realize now: "The common universal subspace between human individuals makes it easier for all of them to do 'novel' tasks so long as their ego and personality doesn't inhibit that basic capacity."
And that: "Defining 'novel' as 'not something that you've said before', even though you're using all the same words, concepts, linguistic tools, etc., doesn't actually make it 'novel'."
Point being, yeah duh, what's the difference between what any of these models are doing anyway? It would be far more surprising if they discovered a *different* or highly-unique subspace for each one!
Someone gives you a magic lamp and the genie comes out and says "what do you wish for"?
That's still the question. The question was never "why do all the genies seem to be able to give you whatever you want?"
Now that we know that the calculations in the trained models follow some particular forms, is there an approximation algorithm to run the models without GPUs?
So, while the standard models are like herbivores grazing on the internet data, they built a model that is a carnivore or a predator species trained on other models? Sounds like an evolution of the species.
I visited one of the models they reference and huggingface says it has malware in it: https://huggingface.co/lucascruz/CheXpert-ViT-U-MultiClass
Or is it just that 16 was arbitrarily chosen by them as close enough to the actual minimal number of dimensions necessary?
If this holds up, it could reduce:

- Training costs: we might discover these universal subspaces without training thousands of models
- Storage requirements: models could share common subspace representations
scotty79|2 months ago
augment_me|2 months ago
For CNNs, the 'Universal Subspace' is simply the strong inductive bias (locality) forcing filters into standard signal processing shapes (Laplacian/Gabor) regardless of the data. Since CNNs are just a constrained subset of operations, this convergence is not that surprising.
For Transformers, which lack these local constraints, the authors had to rely on fine-tuning (shared initialization) to find a subspace. This confirms that 'Universality' here is really just a mix of CNN geometric constraints and the stability of pre-training, rather than a discovered intrinsic property of learning.
sigbottle|2 months ago
masteranza|2 months ago
Here's a very cool analogy from GPT 5.1 which hits the nail in the head in explaining the role of subspace in learning new tasks by analogy with 3d graphics.
topspin|2 months ago
Are there novel tasks? Inside the limits of physics, tasks are finite, and most of them are pointless. One can certainly entertain tasks that transcend physics, but that isn't necessary if one merely wants an immortal and indomitable electronic god.
mlpro|2 months ago
alyxya|2 months ago
What I don’t get is what is meant by a universal shared subspace, because there is some invariance regarding the specific values in weights and the directions of vectors in the model. For instance, if you were doing matrix multiplication with a weight tensor, you could swap two rows/columns (depending on the order of multiplication) and all that would do is swap two values in the resulting product, and whatever uses that output could undo the effects of the swap so the whole model has identical behavior, yet you’ve changed the direction of the principal components. There can’t be fully independently trained models that share the exact subspace directions for analogous weight tensors because of that.
seeknotfind|2 months ago
kacesensitive|2 months ago
tsurba|2 months ago
Similarly I would expect that transformers trained on the same loss function for predicting the next word, if the data is at all similar (like human language), would converge to approx the same space. And to represent that same space probably weights are similar, too. Weights in general seem to occupy low-dimensional spaces.
All in all, I don’t think this is that surprising, and I think the theoretical angle should be (have been?) to find mathematical proofs like this paper https://openreview.net/forum?id=ONfWFluZBI
moelf|2 months ago
it's interesting that this was discovered by JHU rather than groups at OAI/Google/Apple, considering that the latter have probably spent 1000x more resources on "rediscovering"
bigbuppo|2 months ago
odyssey7|2 months ago
VikingCoder|2 months ago
But I always want Genetic Algorithms to show up in any discussion about neural networks...
EvanAnderson|2 months ago
dcrimp|2 months ago
Something I've been interested in recently: would it be possible to encode a known-good model (some massive pretrained thing) and use that as a starting point for further mutations?
Like some other comments in this thread have suggested, it would mean we can distill the weight patterns of things like attention, convolution, etc. and not have to discover them by mutation - so - making use of the many phd-hours it took to develop those patterns, and using them as a springboard. If papers like this are to be believed, more advanced mechanisms may be able to be discovered.
altairprime|2 months ago
joquarky|2 months ago
¹ https://www.rogeralsing.com/2008/12/07/genetic-programming-e...
CalChris|2 months ago
canjobear|2 months ago
unionjack22|2 months ago
MarkusQ|2 months ago
altairprime|2 months ago
https://arxiv.org/abs/2405.07987
nothrowaways|2 months ago
inciampati|2 months ago
api|2 months ago
EvanAnderson|2 months ago
mwkaufma|2 months ago
hn_throwaway_99|2 months ago
Beyond the practical implications of this (i.e. reduced training and inference costs), I'm curious if this has any consequences for "philosophy of the mind"-type of stuff. That is, does this sentence from the abstract, "we identify universal subspaces capturing majority variance in just a few principal directions", imply that all of these various models, across vastly different domains, share a large set of common "plumbing", if you will? Am I understanding that correctly? It just sounds like it could have huge relevance to how various "thinking" (and I know, I know, those scare quotes are doing a lot of work) systems compose their knowledge.
gedy|2 months ago
themaxice|2 months ago
statusfailed|2 months ago
This is a little outside my area, but I think the relevant part of that abstract is "Gradient-based optimization follows horizontal lifts across low-dimensional subspaces in the Grassmannian Gr(r, p), where r ≪ p is the rank of the Hessian at the optimum"
I think this question is super interesting though: why can massively overparametrised models still generalise?
[0]: https://opt-ml.org/papers/2025/paper90.pdf
AIorNot|2 months ago
E.g
https://youtu.be/Qp0rCU49lMs?si=UXbSBD3Xxpy9e3uY
https://thoughtforms.life/symposium-on-the-platonic-space/
e.g see this paper on Universal Embeddings https://arxiv.org/html/2505.12540v2
"The Platonic Representation Hypothesis [17] conjectures that all image models of sufficient size have the same latent representation. We propose a stronger, constructive version of this hypothesis for text models: the universal latent structure of text representations can be learned and, furthermore, harnessed to translate representations from one space to another without any paired data or encoders.
In this work, we show that the Strong Platonic Representation Hypothesis holds in practice. Given unpaired examples of embeddings from two models with different architectures and training data, our method learns a latent representation in which the embeddings are almost identical"
Also from the OP's Paper we see this on statement:
"Why do these universal subspaces emerge? While the precise mechanisms driving this phenomenon remain an open area of investigation, several theoretical factors likely contribute to the emergence of these shared structures.
First, neural networks are known to exhibit a spectral bias toward low frequency functions, creating a polynomial decay in eigenvalues that concentrates learning dynamics into a small number of dominant directions (Belfer et al., 2024; Bietti et al., 2019).
Second, modern architectures impose strong inductive biases that constrain the solution space: convolutional structures inherently favor local, Gabor-like patterns (Krizhevsky et al., 2012; Guth et al., 2024), while attention mechanisms prioritize recurring relational circuits (Olah et al., 2020; Chughtai et al., 2023).
Third, the ubiquity of gradient-based optimization – governed by kernels that are largely invariant to task specifics in the infinite-width limit (Jacot et al., 2018) – inherently prefers smooth solutions, channeling diverse learning trajectories toward shared geometric manifolds (Garipov et al., 2018).
If these hypotheses hold, the universal subspace likely captures fundamental computational patterns that transcend specific tasks, potentially explaining the efficacy of transfer learning and why diverse problems often benefit from similar architectural modifications."
unionjack22|2 months ago
tsurba|2 months ago
https://arxiv.org/abs/2007.00810
Without properly reading the linked article: if that's all this is, it's not a particularly new result. Nevertheless, this direction of proofs is IMO at the core of understanding neural nets.
mlpro|2 months ago
nextworddev|2 months ago
RandyOrion|2 months ago
> We analyze over 1,100 deep neural networks—including 500 Mistral-7B LoRAs and 500 Vision Transformers. We provide the first large-scale empirical evidence that networks systematically converge to shared, low-dimensional spectral subspaces, regardless of initialization, task, or domain.
I instantly thought of the Muon optimizer, which provides high-rank gradient updates, and of Kimi-K2, which was trained using Muon, and see no related references.
The 'universal' in the title is not that universal.
horsepatties|2 months ago
mlpro|2 months ago
hagsdp00|2 months ago
nothrowaways|2 months ago
Isn't it obvious?
stingraycharles|2 months ago
It isn’t obvious that these parameters are universal across all models.
levocardia|2 months ago
mlpro|2 months ago
farhanhubble|2 months ago
imtringued|2 months ago
Basically, what if we're not actually "training" the model, but rather the model was randomly initialized and the learning algorithm is just selecting the vectors that happen to point in the right direction? A left multiplication of the form D*A with a diagonal matrix is equivalent to multiplying each row in A by the corresponding diagonal element. A low value means the vector in question was a lottery blank and unnecessary. A high value means this turned out to be the correct vector, yay!
But this trivial explanation doesn't work for the full SVD, because you now have a right multiplication U*D. This means each column gets multiplied by the corresponding diagonal element. Both the column in U and the row vector in V^T would have to perfectly coincide to make the "selection" theory work, which is unlikely to be true for small models, which happen to work just fine.
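The row-scaling identity in that comment is straightforward to verify. A minimal plain-Python sketch (toy 2×3 matrix with made-up values): left-multiplying A by diag(d) is the same as scaling row i of A by d[i], i.e. a pure per-row selection/weighting.

```python
def matmul(A, B):
    # Naive dense matrix multiply on lists of rows.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

d = [0.1, 5.0]                       # diagonal entries: per-row "lottery" scores
D = [[d[0], 0.0], [0.0, d[1]]]
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# D @ A scales row i of A by d[i].
scaled = [[d[i] * a for a in row] for i, row in enumerate(A)]
assert matmul(D, A) == scaled
```

The analogous fact for right multiplication A*D (columns scaled instead of rows) is what makes the full-SVD case harder, as the comment notes.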
CGMthrowaway|2 months ago
Not a technical person just trying to put it in other words.
mapontosevenths|2 months ago
Now imagine you discover that all 500 are really just the same 11 base ingredients plus something extra.
What they've done here is use SVD (which is normally used for image compression and noise reduction) to find that "base recipe". Now we can reproduce those other recipes by recording only the one ingredient that differs.
More interestingly, it might tell us something new about smoothies in general to know that they all share a common base. Maybe we can even build a simpler base using this info.
At least in theory. The code hasn't actually been released yet.
https://toshi2k2.github.io/unisub/#key-insights
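The compression idea behind the analogy can be sketched without the paper's actual pipeline (which hasn't been released). A toy plain-Python version with made-up numbers: if every fine-tune delta lies in a 2-dimensional shared subspace of a 4-dimensional "weight space", each fine-tune can be stored as just 2 coefficients over a shared basis, and the full delta reconstructed on demand.

```python
# Hypothetical shared basis: two directions in a 4-dim "weight space" (4 x 2).
basis = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

def reconstruct(coeffs):
    # delta = basis @ coeffs: mix the shared directions per fine-tune.
    return [sum(b * c for b, c in zip(row, coeffs)) for row in basis]

# Each "fine-tune" is now just 2 floats instead of 4 full weights.
recipe_a = [0.5, -1.0]
delta_a = reconstruct(recipe_a)
assert delta_a == [0.5, -1.0, -0.5, 2.0]
```

At real scale this is the "160 bytes per fine-tune" picture: gigabytes of weights shared once as the base plus basis, with ~40 floats per listing.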
vlovich123|2 months ago
Simplita|2 months ago
ibgeek|2 months ago
mlpro|2 months ago
lucid-dev|2 months ago
And that: "Defining 'novel' as 'not something that you've said before', even though you're using all the same words, concepts, linguistic tools, etc., doesn't actually make it 'novel'."
Point being, yeah duh, what's the difference between what any of these models are doing anyway? It would be far more surprising if they discovered a *different* or highly-unique subspace for each one!
Someone gives you a magic lamp and the genie comes out and says "what do you wish for"?
That's still the question. The question was never "why do all the genies seem to be able to give you whatever you want?"
pmkary|2 months ago
zkmon|2 months ago
odyssey7|2 months ago
Atlas667|2 months ago
tim333|2 months ago
tempestn|2 months ago
100721|2 months ago
zkmon|2 months ago
IAmBroom|2 months ago
- I know what I do not know.
-- I do not know AI.