Some notes on how embeddings/DistilBERT embeddings work since the other comments are confused:
1) There are two primary ways to get embeddings from a model: implicitly, by mean-pooling the last hidden state of a language model, which has to learn to map text into a distinct latent space anyway in order to work (e.g. DistilBERT); or explicitly, with a model trained to output embeddings directly, using something like triplet loss to incentivise learning similarity/dissimilarity. Popular text-embedding models like BAAI/bge-large-en-v1.5 tend to use the latter approach.
2) The famous word2vec examples, e.g. king - man + woman ≈ queen, only work because word2vec is a shallow network that learns the word embeddings directly, rather than having them emerge as a by-product. The latent space here still maps related words close together, as this demo shows, but there isn't the same algebraic intuition. You can get close with algebra, but no cigar.
3) DistilBERT is pretty old (2019) and is based on a 2018 model (BERT) trained on Wikipedia and books, so there will be significant text drift, and it is less robust than models built with newer modeling techniques and more robust datasets. I do not recommend using it for production applications nowadays.
4) There is an under-discussed opportunity for dimensionality reduction techniques like PCA (which this demo uses to get the data into 3D) to improve both signal-to-noise and distinctiveness. I am working on a blog post about a new technique that handles dimensionality reduction for text embeddings better, which may have interesting and profound usability implications.
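To make the analogy arithmetic in point 2 concrete, here is a minimal sketch. The 3-D vectors are hand-made for illustration and are not taken from word2vec or any real model, and `analogy` is a hypothetical helper name. It excludes the input words from the candidates, which is also what such demos typically do.

```python
import math

# Hand-made toy vectors (assumption: NOT from a real model).
# The dimensions can loosely be read as: royalty, maleness, femaleness.
TOY = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", TOY)
print(result)  # queen
```

With vectors this tidy the arithmetic lands on "queen"; the point made above is that contextual models give no such guarantee.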
I’ve been ruminating on the idea of a universal signature for every entity across sensory complexes (a reality per sense organ: vision, touch, mind). That translates to the problem of relating entities represented in binary across modalities, as in the word “butterfly” vs. a picture of a butterfly vs. the audio of “butterfly” vs. the thought pointing to one of those or another.
I was wondering if there was a universal signal that can be used as the identity and then based on that signal one could measure the distance to any other signal based on the principle relation of not(other). That is to say the identity would be precisely not all else for any X. Said another way, every thing is because it is exactly not everything else.
So, thinking from first principles as much as possible, I wondered whether it is possible to represent everything as some frequency. A Fourier transform analog for every “time slice” of a thing? This is where it gets slightly slippery.
So the idea was trying to build relationship and identity and labeling from a simple rule set of things arising out of relation of not being other things.
In my mind I saw nodes on a graph forming in higher dimensions as half way points for any comparison. Comparisons create new nodes and implicitly have a distance metric to all other things. It made sense in my mind that there was an algorithmic annealing to new nodes in a “low density higher energetic state” allowing them to move faster in this universal emergent ontology/spatial space; eventually getting more dense and slower as it gets cold.
So the system implicitly also has a snapshot of events or interactions based on that where every comparison has a “tick” that encodes a particular density relation for some set of nodes it’s in association with.
The idea that cemented it all together was to treat each node like an address:chord. Similar to chording keys like a-b-c in some ux programs, but also exactly like chords in music too.
The idea being that when multiple “things” are dialed in at same time it becomes its own emergent label by proximity and association of those things being triggered to new information coming in classified as a distance to not(signal).
I didn’t really realize how close this idea was to what encoders/decoders seem to be doing although I do know I’m trying to think myself towards a universal solution that doesn’t require special encoders for every media type. Hence the Fourier transform path.
Edit: I think this is fascinating. If you use words like dog, electric, life, and human, all of them appear in one mass; however, words like greet, chicken, and “a” appear in a different, denser section. I think it’s interesting that the words have diverged in location, with some apparent relationship to the way the words are used. If this were truly random, I would expect those words to be mixed in with the others.
I have this except you can see every single word in any dictionary at once in space, it renders individual glyphs. It can show an entire dictionary of words - definitions and roots - and let you fly around in them. It’s fun. I built a sample that “plays” a sentence and its definitions. GitHub.com/tikimcfee/LookAtThat
The more I see stuff like this, the more I want to complete it. It’s heartening to see so many people taken with seeing words… I just wish I knew where to find these people to, like, befriend and get better. I’m getting the feeling I just kinda exist between worlds of lofty ideas and people that are incredibly smart sticking around other people that are incredibly smart.
You're actually kinda hitting the nail on the head. _Generally_, the word2vec king - man + woman = queen thing was cute but not very real.
People rarely have to get down to the real true metal on the embedding models, and they're not what people think they are from their memory of word2vec. E.g. there's actually one vector emitted _per token_, and the final vector is their mean. And cosine distance for similarity is the only metric anyone is training for.
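A minimal sketch of that pipeline, with made-up 2-D "token vectors" standing in for real model output: average the per-token vectors, then compare the pooled vectors by cosine similarity. Skipping padding positions via a mask is my assumption about how this is commonly done, and `mean_pool` and `cosine_similarity` are hypothetical helper names.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the per-token vectors, skipping padding positions (mask == 0)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up token vectors for two short texts (assumption: toy data).
# The last position of text_a is padding and must not affect the result.
text_a = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask_a = [1, 1, 0]
text_b = [[2.0, 2.0]]
mask_b = [1]

emb_a = mean_pool(text_a, mask_a)  # [0.5, 0.5]
emb_b = mean_pool(text_b, mask_b)  # [2.0, 2.0]
sim = cosine_similarity(emb_a, emb_b)
print(emb_a, emb_b, round(sim, 6))
```

The two pooled vectors differ in length but point the same way, so their cosine similarity is 1 up to float rounding, which is why a metric trained on cosine ignores magnitude.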
In summary, there's ~no reason to think a visualization trying to show multiple vectors will ever be meaningful. Even just starting from "they have way way way more dimensions than we can represent visually" is enough to rule it out
Mini LM v2, foundation of most vector dbs, is 384 dims.
n.b. dear reader, if you've heard of that: you should be using v3! V3 is for asymmetric search, aka query => result docs. V2 is for symmetric search, aka chunk of text => similarly worded chunks of texts. It's very very funny how few people read the docs, in this case, the sentence transformers site.
If I gave you a live GPU shader that let you arbitrarily position any of, say, a few million words with simple Cartesian coordinates, what would you do with it? Whole words expressed as individual letters, not symbols, representations, or abstractions. Just letters arranged in a specific order to form words.
Hey there! Sincerely cool stuff, I’m glad it’s fun for you.
It’s actually quite approachable to play with, and some of the comments about, “wut?” may be best answered by a little more experimentation on the user’s side, haha. I think the content itself is tricky, which may trip people up.
Something I’ve seen before that may be interesting is doing something with the definitions of words. ATM, you’re using a source list of words and using the embedded vectors to visualize. But what if you visualized not just the words themselves, but the ordered list of words that make up the definition(s) of that word visible in some spatial relationship. This would look interesting because around (connected to?) each word is its meaning in this case; changing the definition (the context use of the word) would also change the definition… and also change the connected word nodes in the graph. I envision ordered lines and colored words in this style.
If you end up doing something like that, start with like.. a “sentence player”. At the moment you show the words at once. What would it look like to “animate” the appearance of the words and their relationships by definition?
Anyway. Thanks for getting this far, haha. This is a really fascinating project and I’m glad you shared it. Please do tell if any of this is close or far off from something you might be interested in!
Typically these types of single word embedding visualizations work much better with non contextualized models such as the more traditional gensim or w2v approaches, as contextual encoder-based embedding models like BERT don't 'bake in' as much to the token (word) itself, and rather rely on its context to define it.
Also, PCA on contextual models like BERT often ends up with $PC_0$ aligned with the length of the document.
Running the same input multiple times, I get a different visualization each time. I don't really understand what's going on, but I like the idea of visualizing embeddings.
I’m looking for more resources like this that attempt to visually explain vectors, as I’ll be giving some talks around vector search. Does anyone have related suggestions?
I think that's going to be a geodesic in a hyper-dimensional manifold. There was an article here about 'wordlets' on a hyper-sphere and a piece on time and LLM and the relating manifold. Visualising LLM topology (multi-dimensional topological manifolds) is a very rich area for exploration. I'm waiting for someone to use PHATE to do the dimension reduction. It's used in neuroscience to reduce dimensionality, providing information not visible using PCA, t-SNE, LDA or UMAP.
In the bottom left "?" button it says it performs PCA down to 3 dimensions. That's going to lose a ton of information, rendering the space mostly useless.
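How much is lost depends on how the variance is spread across dimensions, and that can be measured rather than guessed. The sketch below estimates the fraction of variance kept by the top 3 principal components. It uses synthetic 6-D Gaussian data (an assumption, real embeddings have hundreds of dimensions) and a plain-Python power iteration in place of a library PCA. All function names are hypothetical.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def matvec(m, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenvalues(matrix, k, iterations=500, seed=0):
    """Approximate the k largest eigenvalues of a symmetric positive
    semi-definite matrix using power iteration with deflation."""
    rng = random.Random(seed)
    d = len(matrix)
    m = [row[:] for row in matrix]
    values = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = matvec(m, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to find
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(m, v)))  # Rayleigh quotient
        values.append(lam)
        # Deflate: subtract the component just found before the next round.
        for i in range(d):
            for j in range(d):
                m[i][j] -= lam * v[i] * v[j]
    return values

def explained_variance_ratio(rows, k):
    """Fraction of total variance captured by the top-k principal components."""
    cov = covariance_matrix(rows)
    total = sum(cov[i][i] for i in range(len(cov)))
    return sum(top_eigenvalues(cov, k)) / total

def synthetic(stds, n=200, seed=42):
    """Independent Gaussian columns with the given standard deviations."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, s) for s in stds] for _ in range(n)]

concentrated = synthetic([5.0, 3.0, 2.0, 1.0, 0.5, 0.25])
spread_out = synthetic([1.0] * 6)

ratio_concentrated = explained_variance_ratio(concentrated, 3)
ratio_spread = explained_variance_ratio(spread_out, 3)
print(f"top-3 variance kept: {ratio_concentrated:.2f} vs {ratio_spread:.2f}")
```

When a few directions dominate, three components keep most of the variance; when variance is spread evenly, they keep far less. Running the same measurement on the demo's actual embeddings would show which case applies.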
I wonder what patterns we could create from the word glyphs that produce meaningful patterns. You said “edge of the galaxy” based on the position of the words. I wonder what else you’d come up with different embeddings and organizations.
minimaxir|2 years ago
pyinstallwoes|2 years ago
Know anything like this or am I spitting idiocy?
tikimcfee|2 years ago
wrsh07|2 years ago
Eg what is the real distance between the two vectors? That should be easy to compute
Similarly: what do I get from summing two vectors and what are some nearby vectors?
Maybe just generally: what are some nearby vectors?
Without any additional context it's just a point cloud with a couple of randomly labeled elements
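The three questions above (distance between two vectors, the sum of two vectors, and nearby vectors) are each a few lines of arithmetic. This is a minimal sketch on hand-made 2-D vectors, which are an assumption and not real embeddings. `nearest` is a hypothetical helper that ranks by cosine similarity.

```python
import math

# Hand-made toy embeddings (assumption: illustrative only).
EMB = {
    "cat":   [1.0, 0.1],
    "dog":   [0.9, 0.2],
    "car":   [0.1, 1.0],
    "truck": [0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(query, vectors, k=2, exclude=()):
    """Return the k words whose vectors are most cosine-similar to query."""
    scored = [(cosine(query, v), w) for w, v in vectors.items() if w not in exclude]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]

# 1) Real (Euclidean) distance between two vectors.
dist = math.dist(EMB["cat"], EMB["dog"])

# 2) Nearby vectors for a single word.
neighbours = nearest(EMB["cat"], EMB, k=2, exclude={"cat"})

# 3) Sum of two vectors, and what lies near the result.
summed = [a + b for a, b in zip(EMB["cat"], EMB["dog"])]
near_sum = nearest(summed, EMB, k=1, exclude={"cat", "dog"})

print(round(dist, 4), neighbours, near_sum)
```

Surfacing numbers like these next to the point cloud would give the labels some of the context asked for here.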
refulgentis|2 years ago
tikimcfee|2 years ago
granawkins|2 years ago
I hadn't planned to keep building this but if I do, what should I add/change?
tikimcfee|2 years ago
bravura|2 years ago
chaxor|2 years ago
kvakkefly|2 years ago
wrsh07|2 years ago
thom|2 years ago
pamelafox|2 years ago
tetris11|2 years ago
tikimcfee|2 years ago
eurekin|2 years ago
> man woman king queen ruler force powerful care
and couldn't reliably determine position of any of them
smrtinsert|2 years ago
tudorw|2 years ago
larodi|2 years ago
kaoD|2 years ago
cuttysnark|2 years ago
tikimcfee|2 years ago