thisiszilff | 1 year ago
Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.
The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.
HarHarVeryFunny | 1 year ago
The classic example is word embeddings such as word2vec or GloVe, where, because the embeddings are meaningful in this way, one can see vector relationships such as "man - woman" = "king - queen".
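A quick way to see those analogies (a sketch assuming gensim and its downloadable "glove-wiki-gigaword-50" model; any pretrained word-embedding model would do):

    import gensim.downloader as api

    # Small pretrained GloVe model from the gensim-data repository.
    model = api.load("glove-wiki-gigaword-50")

    # king - man + woman ~= queen
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))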
thisiszilff | 1 year ago
In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts you are effectively counting the number of words the two texts have in common (subject to some normalization constants, depending on how you normalize the embedding). Cosine similarity still works even for these super naive embeddings, which makes it slightly easier to understand before getting into any mathy stuff.
You are 100% right that this won't give you the word embedding analogies like king - man + woman = queen or stuff like that. This embedding has no concept of relationships between words.
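To make the "dot product counts shared words" point concrete, a toy sketch (the names are mine) with binary presence vectors:

    def presence_vector(text, vocab):
        # 1 if the word appears in the text, else 0 -- one dimension per vocab word.
        words = set(text.lower().split())
        return [1 if w in words else 0 for w in vocab]

    doc_a = "the cat sat on the mat"
    doc_b = "the dog sat outside"
    vocab = sorted(set(doc_a.split()) | set(doc_b.split()))

    a = presence_vector(doc_a, vocab)
    b = presence_vector(doc_b, vocab)

    # The dot product is exactly the number of words the two texts share.
    print(sum(x * y for x, y in zip(a, b)))  # 2 ("the" and "sat")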
IanCal | 1 year ago
They're making a vector for each text out of the term frequencies in the document.
It's one step simpler than tf-idf, which makes it a great starting point.
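The same idea with scikit-learn (a sketch; CountVectorizer gives raw term counts, the step before tf-idf weighting):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the cat sat on the mat", "a cat on a mat", "stock prices fell"]

    # Rows are documents, columns are vocabulary words, values are raw counts.
    counts = CountVectorizer().fit_transform(docs)

    print(cosine_similarity(counts[0], counts[1]))  # similar texts -> high
    print(cosine_similarity(counts[0], counts[2]))  # unrelated texts -> 0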
_akhe | 1 year ago
So the hype about a vector DB is that it's a database for arrays of numbers (vectors), designed and optimized for computing them against each other, instead of, say, SQL or NoSQL databases, which are all about retrieval, etc.
So king vs. prince etc.: when you take into account the 1536 numbers, you can imagine how, compared to other words in the training data, they would actually be similar, always used in the same kinds of ways, and indeed semantically similar. You'd be able to "arrive" at that fact, and arrive at antonyms, synonyms, their French alternatives, etc., but the system doesn't "know" that stuff. Throw in Burger King training data and talk about French fries a lot, though, and you'd mess up the embeddings when it comes to arriving at the French version of a king! You might get "pomme de terre".
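A sketch of what that lookup boils down to (the vectors below are random stand-ins; real 1536-dim embeddings from a model would put "king" near "prince"):

    import numpy as np

    # Hypothetical store: word -> 1536-dim embedding vector.
    rng = np.random.default_rng(0)
    store = {w: rng.normal(size=1536) for w in ["king", "prince", "potato", "fries"]}

    def top_matches(query, store, k=2):
        # Brute-force cosine similarity over every stored vector -- the operation
        # a vector DB speeds up with approximate nearest-neighbor indexes.
        def cos(a, b):
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return sorted(store, key=lambda w: cos(store[w], query), reverse=True)[:k]

    print(top_matches(store["king"], store))  # "king" first; the rest is noise here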
jncfhnb | 1 year ago
It also leaves out the old "tf-idf" normalization: weighting by how common a word is broadly (less interesting) vs. in that particular document. Kind of like a shittier attention. It used to make a big difference.
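A hand-rolled sketch of that weighting (simplified; real tf-idf implementations smooth the idf term):

    import math
    from collections import Counter

    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "stock prices fell sharply",
    ]

    def tf_idf(doc, docs):
        tf = Counter(doc.split())
        n = len(docs)
        # Words appearing in many documents get a small idf, so broadly
        # common words like "the" stop dominating the vector.
        return {
            w: count * math.log(n / sum(1 for d in docs if w in d.split()))
            for w, count in tf.items()
        }

    # "cat" and "mat" (unique to this doc) outscore "the" (in two of three docs).
    print(tf_idf(docs[0], docs))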
thisiszilff | 1 year ago
Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.
mschulkind | 1 year ago
https://en.wikipedia.org/wiki/Bag-of-words_model
Someone | 1 year ago
What makes embeddings useful is that they do dimensionality reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction) while keeping enough information to keep dissimilar texts away from each other.
I also doubt your claim “and the results aren't totally terrible”. In most texts, the dimensions with the highest values will be for very common words such as “a”, “be”, etc. (https://en.wikipedia.org/wiki/Most_common_words_in_English)
A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
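That pipeline in scikit-learn terms (a sketch; TruncatedSVD stands in for PCA because the tf-idf matrix is sparse, and n_components=100 is an arbitrary choice):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data

    # Tf-idf vectors over the full vocabulary: tens of thousands of dimensions.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # Project onto the top principal components -- essentially latent
    # semantic analysis.
    embeddings = TruncatedSVD(n_components=100).fit_transform(tfidf)
    print(embeddings.shape)  # (n_docs, 100)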
vineyardmike | 1 year ago
You can simplify this with a map and only store non-zero values, but you can also be inefficient: this is for learning. And you can store more valuable information than just word counts. You can store any "feature" you want: various tags on a post, cohort topics for advertising, bucketed time stamps, etc. (see the sketch below).
For learning, though, just storing word counts gives you the mechanics you need to understand vectors without actually involving neural networks and weights.
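A sketch of that dict-as-sparse-vector idea (every feature name here is made up):

    def feature_vector(post):
        # Sparse vector as a dict: only non-zero features are stored.
        features = {}
        for word in post["text"].lower().split():
            features["word:" + word] = features.get("word:" + word, 0) + 1
        for tag in post["tags"]:
            features["tag:" + tag] = 1
        features["hour_bucket:" + str(post["hour"] // 6)] = 1
        return features

    def dot(a, b):
        # Iterate over the smaller dict; missing keys are implicit zeros.
        small, big = (a, b) if len(a) <= len(b) else (b, a)
        return sum(v * big.get(k, 0) for k, v in small.items())

    p1 = {"text": "cats and dogs", "tags": ["pets"], "hour": 14}
    p2 = {"text": "dogs in the park", "tags": ["pets"], "hour": 9}
    print(dot(feature_vector(p1), feature_vector(p2)))  # 2: shared "dogs" + "pets"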
> I also doubt your claim “and the results aren't totally terrible”.
> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc
(1) the comment suggested filtering out these words, and (2) the results aren't terrible. This is literally the first assignment in Stanford's AI class [1], and the results aren't terrible.
> A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
Wow that seems a lot more complicated for something that was supposed to be a learning exercise.
[1] https://stanford-cs221.github.io/autumn2023/assignments/sent...
thisiszilff | 1 year ago
Give it a shot! I'd grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It's not going to be amazing, but it's a great way to build some baseline intuition for NLP work with text, and you can do it on a laptop.
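For example (assuming the truncated link points at scikit-learn's 20 newsgroups corpus), pairing that dataset with the naive count vectors from upthread:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:1000]

    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    sims = cosine_similarity(counts[0], counts).ravel()

    # Most similar other document to docs[0] ([-1] would be docs[0] itself).
    print(docs[sims.argsort()[-2]][:200])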