Without a way to tune it, this visualization is as much about the dimensionality reduction algorithm used as the embeddings themselves, because trade-offs are unavoidable when you go from a very high dimensional space to a 2D one. I would not read too much into it.
Not everything has to be directly informative or solve a problem. Sometimes data visualization can look pretty for pretty's sake.
It lets you inspect what actually constitutes a given cluster. For example, it seems like the outer clusters are variations of individual words and their direct translations, rather than synonyms (the ones I saw, at least).
> What do people learn from visualizations like this?
Embedding visualizations have helped identify bias in word embeddings (Word2Vec), debug entity resolution systems, and optimize document retrieval by revealing semantic clusters that inform better indexing strategies.
They are incomparable. Token embeddings generated with something like word2vec worked well because the networks are shallow and therefore the learned semantic data can be contained solely and independently within the embeddings themselves. Token embeddings as a part of an LLM (e.g. gpt-oss-20b) are conditioned on said LLM and do not have fully independent learned data, although as shown here there still can be some relationships preserved.
Usually PCA doesn't look quite like this, so this is likely done using t-SNE or UMAP, which are non-parametric embeddings (they optimize a loss by modifying the embedded points directly). I can see labels if I mouse over the dots.
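To make "optimize a loss by modifying the embedded points directly" concrete, here is a rough sketch. It is not t-SNE or UMAP; it swaps in a much simpler distance-preserving (MDS-style stress) loss, and the function names, data, and hyperparameters are all made up for illustration. The point is only that the 2D coordinates themselves are the parameters being updated by gradient descent.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(low, target):
    """Sum over pairs i<j of (low-dim distance - target distance)^2."""
    return ((pairwise_distances(low) - target) ** 2).sum() / 2.0

def embed_by_gradient_descent(high, n_components=2, steps=500, lr=0.005, seed=0):
    """Hypothetical helper: move the low-dim points directly to reduce stress."""
    rng = np.random.default_rng(seed)
    target = pairwise_distances(high)
    # The embedded points are the only "parameters"; there is no model.
    low = rng.normal(scale=1e-2, size=(high.shape[0], n_components))
    history = []
    for _ in range(steps):
        diff = low[:, None, :] - low[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, 1.0)   # avoid 0/0 on the diagonal
        coeff = (dist - target) / dist
        np.fill_diagonal(coeff, 0.0)  # a point exerts no force on itself
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        low -= lr * grad              # update the points themselves
        history.append(stress(low, target))
    return low, history

# Placeholder data standing in for real embeddings.
rng = np.random.default_rng(1)
high_dim = rng.normal(size=(20, 10))
low_dim, history = embed_by_gradient_descent(high_dim)
print(low_dim.shape, history[0], history[-1])
```

Because nothing is learned except the point positions, a new point cannot simply be pushed through a fitted function, which is one practical difference from PCA.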
esafak|6 months ago
promiseofbeans|6 months ago
You can choose which dimensions to show, pick which embeddings to show, and play with vector maths between them in a visual way
It doesn't show the whole set of embeddings, though I am sure someone could fix that, as well as adapting it to use the gpt-oss model instead of the custom (?) mini set it uses.
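For anyone wondering what "vector maths between them" means in practice, a minimal sketch follows. The vectors are tiny toy values invented for illustration, not real embeddings from gpt-oss or any other model, and `nearest` is a hypothetical helper.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "bread": np.array([0.0, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query, vocab, exclude=()):
    """Word in vocab whose vector is most similar to query, skipping excluded words."""
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

# Classic analogy arithmetic: king - man + woman.
query = toy["king"] - toy["man"] + toy["woman"]
result = nearest(query, toy, exclude=("king", "man", "woman"))
print(result)
```

With these hand-picked toy values the nearest remaining word is "queen"; real embeddings are far noisier, so don't expect analogies to work this cleanly.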
voodooEntity|6 months ago
https://github.com/vasturiano/3d-force-graph
a try, for the text labels you can use
https://github.com/vasturiano/three-spritetext
It's based on Three.js and creates great GPU-rendered (WebGL) 3D graph visualisations. This could make it a lot more interesting to watch because it could display actual depth (your GPU is gonna run hot, but I guess it's worth it).
just a suggestion.
numpad0|6 months ago
int_19h|6 months ago
_def|6 months ago
graphviz|6 months ago
What is the most important problem anyone has solved this way?
Speaking as somewhat of a co-defendant.
minimaxir|6 months ago
Dimensionality reduction/clustering like this may be less useful for identifying trends in token embeddings, but for other types of embeddings it's extremely useful.
jablongo|6 months ago
TuringNYC|6 months ago
Applying the embeddings model to a dataset of your own, and then producing a similar visualization, is where it gets cool, because you can visually inspect the clusters and draw conclusions about the closeness of items in your own dataset.
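A rough sketch of that workflow, under heavy assumptions: the "item embeddings" below are synthetic vectors with two planted groups, standing in for whatever an embeddings model would return for your own data, and the bare-bones k-means with a fixed initialization is just one simple way to turn visual clusters into labels.

```python
import numpy as np

def kmeans(points, initial_centroids, steps=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = initial_centroids.astype(float).copy()
    for _ in range(steps):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Synthetic stand-in for embeddings of 20 items from your own dataset.
rng = np.random.default_rng(0)
centers = np.array([[5.0] * 16, [-5.0] * 16])
item_embeddings = np.vstack([
    centers[0] + rng.normal(scale=0.5, size=(10, 16)),
    centers[1] + rng.normal(scale=0.5, size=(10, 16)),
])

# Deterministic initialization for the sketch: first and last item as starting centroids.
labels, centroids = kmeans(item_embeddings, item_embeddings[[0, -1]])
print(labels)
```

On real data you would then look at which items share a label and check whether the grouping matches what you see in the plot.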
ethan_smith|6 months ago
lawlessone|6 months ago
That they're related or connected, or is it arbitrary?
Why does it look like a fried egg?
edit: must be related in some way as one of the "droplets" in the bottom left quadrant seems to consist of various versions of the word "parameter"
minimaxir|6 months ago
The density of the clusters tends to have trends. In this case, the "yolk" has a lot of bizarre Unicode tokens.
suprjami|6 months ago
https://stock.adobe.com/images/asteroid-hitting-the-earth-ai...
ashvardanian|6 months ago
minimaxir|6 months ago
Embeddings derived from autoregressive language models apply full attention mechanisms to get something different entirely.
eddywebs|6 months ago
minimaxir|6 months ago
kingstnap|6 months ago
My guess is it's the two largest principal components of the embedding.
But none of the points are labelled? There isn't a writeup on the page or anything?
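For reference, projecting onto the two largest principal components can be sketched as below. This is not taken from the page being discussed; the random matrix is a placeholder for a real embedding table, and `project_to_top_two_components` is a hypothetical helper built on NumPy's SVD.

```python
import numpy as np

def project_to_top_two_components(embeddings):
    """Center the rows, then project onto the two leading principal directions."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Fraction of total variance captured by each of the two components.
    explained = (singular_values[:2] ** 2) / (singular_values ** 2).sum()
    return coords, explained

# Placeholder for a (num_tokens, embedding_dim) matrix.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 64))
coords, explained = project_to_top_two_components(fake_embeddings)
print(coords.shape, explained)
```

If the two components explain only a small share of the variance, the resulting 2D picture says little about the full space, which is worth checking before reading structure into a plot.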
jablongo|6 months ago
terhechte|6 months ago