
thisiszilff | 1 year ago

One straightforward way to get started is to understand embeddings without any AI/deep learning magic. Just pick a vocabulary of words (say, some 50k words), assign each word a unique index between 0 and 49,999, and then produce an embedding for a text by starting from an all-zeros vector and adding 1 at a word's index each time that word occurs in the text. Then normalize the embedding so it sums to one.

Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.
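A minimal sketch of this in Python (toy vocabulary and texts, just for illustration):

  import numpy as np

  # Toy vocabulary: in practice this would be the ~50k words.
  vocab = ["cat", "dog", "sat", "on", "the", "mat"]
  index = {word: i for i, word in enumerate(vocab)}

  def embed(text):
      # Add 1 at a word's index each time it occurs, then normalize to sum to one.
      v = np.zeros(len(vocab))
      for word in text.lower().split():
          if word in index:
              v[index[word]] += 1.0
      return v / v.sum()

  def cosine(a, b):
      return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

  # Texts that share most of their words score close to 1 (here: 0.875).
  print(cosine(embed("the cat sat on the mat"),
               embed("the dog sat on the mat")))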

The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.

HarHarVeryFunny|1 year ago

Those would really just be identifiers. I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.

The classic example is word embeddings such as word2vec or GloVe, where, because the embeddings are meaningful in this way, one can see vector relationships such as "man - woman" = "king - queen".
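You can poke at this with gensim and a small set of pretrained GloVe vectors (one convenient setup, not the only one; the download happens on first use):

  import gensim.downloader

  # ~65MB of pretrained 50-dimensional GloVe word vectors
  glove = gensim.downloader.load("glove-wiki-gigaword-50")

  # king - man + woman ~= queen ("queen" typically tops this list)
  print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))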

thisiszilff|1 year ago

> I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.

In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts' embeddings, you are effectively counting the number of words the two texts have in common (subject to some normalization constants, depending on how you normalize the embedding). Cosine similarity still works even for these super naive embeddings, which makes it slightly easier to understand before getting into any mathy stuff.
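A tiny made-up example over the three-word vocabulary ["cat", "dog", "sat"]:

  a = [1, 0, 1]  # counts for "cat sat"
  b = [0, 1, 1]  # counts for "dog sat"
  print(sum(x * y for x, y in zip(a, b)))  # 1, i.e. the one shared word ("sat")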

You are 100% right that this won't give you the word embedding analogies like king - man + woman = queen or stuff like that. This embedding has no concept of relationships between words.

IanCal|1 year ago

They're not, I get why you think that though.

They're making a vector for each text out of the term frequencies in that document.

It's one step simpler than tf-idf, which is a great starting point.
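For comparison, the tf-idf version is a few lines with scikit-learn (the corpus here is made up):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  docs = ["the cat sat on the mat",
          "the dog sat on the mat",
          "stocks fell sharply today"]
  X = TfidfVectorizer().fit_transform(docs)  # one tf-idf vector per document
  print(cosine_similarity(X))                # 3x3 pairwise similarity matrix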

pstorm|1 year ago

I'm trying to understand this approach. Maybe I am expecting too much out of this basic approach, but how does this create a similarity between words with indices close to each other? Wouldn't it just be a popularity contest - the more common words have higher indices and vice versa? For instance, "king" and "prince" wouldn't necessarily have similar indices, but they are semantically very similar.

svieira|1 year ago

You are expecting too much out of this basic approach. The "simple" similarity search in word2vec (used in https://semantle.com/ if you haven't seen it) is based on _multiple_ embeddings like this one (it's a simple neural network, not a simple embedding).

_akhe|1 year ago

This is a simple example where it scores words by their frequency. If you scored every word by its frequency only, you might have embeddings like this:

  act: [0.1]
  as:  [0.4]
  at:  [0.3]
  ...
That's a very simple 1D embedding, and like you said it would only give you popularity. But say you wanted to capture other stuff: vulgarity, prevalence over time, whether it's slang or not, how likely it is to start or end a sentence, etc. Then you would need more than one number. In text-embedding-ada-002 there are 1536 numbers in the array (vector), so it's like:

  act: [0.1, 0.1, 0.3, 0.0001, 0.000003, 0.003, ... (1536 items)]
  ...
The numbers don't mean anything in and of themselves. The values don't represent qualities of the words; they're just numbers in relation to others in the training data. They come out different for different training data because all the words are scored in relation to each other, like a graph. So when you do arithmetic on them you arrive at words and meanings in the training data, the way you would arrive at a point in a coordinate space if you subtracted one [x,y,z] from another [x,y,z] in 3D.

So all the rage about vector DBs is that they're databases for arrays of numbers (vectors), designed and optimized for computing vectors against each other, rather than, say, SQL or NoSQL databases, which are all about retrieval.

So king vs prince etc. - when you take into account the 1536 numbers, you can imagine how, compared to other words in the training data, they really would be similar: always used in the same kinds of ways, and indeed semantically similar. You'd be able to "arrive" at that fact, and arrive at antonyms, synonyms, their French alternatives, etc., but the system doesn't "know" that stuff. Throw in Burger King training data and talk about French fries a lot, though, and you'd mess up the embeddings when it comes to arriving at the French version of a king! You might get "pomme de terre".
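To make that concrete, a sketch with the OpenAI Python client (v1-style API; assumes an API key is configured, and the three words are just examples):

  import numpy as np
  from openai import OpenAI  # reads OPENAI_API_KEY from the environment

  client = OpenAI()
  resp = client.embeddings.create(model="text-embedding-ada-002",
                                  input=["king", "prince", "potato"])
  king, prince, potato = (np.array(d.embedding) for d in resp.data)

  def cosine(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

  print(cosine(king, prince))  # should come out noticeably higher...
  print(cosine(king, potato))  # ...than this one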

jncfhnb|1 year ago

King doesn’t need to appear commonly with prince. It just needs to appear in the same context as prince.

It also leaves out the old “tf idf” normalization of considering how common a word is broadly (less interesting) vs in that particular document. Kind of like a shittier attention. Used to make a big difference.

sdwr|1 year ago

It doesn't even work as described for popularity - one word starts at 49,999 and one starts at 0.

im3w1l|1 year ago

It's a document embedding, not a word embedding.

zachrose|1 year ago

Maybe the idea is to order your vocabulary into some kind of “semantic rainbow”? Like a one-dimensional embedding?

dekhn|1 year ago

Is that really an embedding? I normally think of an embedding as an approximate lower-dimensional matrix of coefficients that operate on a reduced set of composite variables that map the data from a nonlinear to linear space.

thisiszilff|1 year ago

You're right that what I described isn't what people commonly think of as embeddings (the state of the art has moved well past the above description), but broadly an embedding is anything (in nlp at least) that maps text into a fixed-length vector. When you make embeddings like this, the nice thing is that cosine similarity has an easy-to-understand meaning: it counts the number of words two documents have in common (subject to some normalization constant).

Most fancy modern embedding strategies basically start with this and then build on top of it to reduce dimensions, represent words as vectors in their own right, pass the result through some neural layers, etc.

afro88|1 year ago

How does this enable cosine similarity usage? I don't get the link between incrementing the value at a word's index by its count in a text and how this ends up with words that have similar meanings having a high cosine similarity value.

twelfthnight|1 year ago

I think they are talking about bag-of-words. If you apply a dimensionality reduction technique like SVD or even random projection on bag-of-words, you can effectively create a basic embedding. Check out latent semantic indexing / latent semantic analysis.

sell_dennis|1 year ago

You're right, that approach doesn't enable getting embeddings for an individual word. But it would work for comparing similarity of documents - not that well of course, but it's a toy example that might feel more intuitive

Someone|1 year ago

I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.

What makes embeddings useful is that they do dimensionality reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction) while keeping enough information to keep dissimilar texts away from each other.

I also doubt your claim “and the results aren't totally terrible”. In most texts, the dimensions with the highest values will be for very common words such as “a”, “be”, etc. (https://en.wikipedia.org/wiki/Most_common_words_in_English)

A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
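In scikit-learn terms that's a few lines (TruncatedSVD rather than a full PCA, since the tf-idf matrix is sparse; the corpus here is made up):

  from sklearn.decomposition import TruncatedSVD
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.pipeline import make_pipeline

  docs = ["the cat sat on the mat",
          "the dog sat on the mat",
          "stocks fell sharply today",
          "markets fell on bad news"]
  lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
  X = lsa.fit_transform(docs)  # 4 documents projected onto 2 components
  print(X)                     # the two cat/dog rows land close together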

vineyardmike|1 year ago

> I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.

You can simplify this with a map and only store non-zero values, but it's also fine to be inefficient: this is for learning. And you can choose to store more valuable information than just word counts. You can store any "feature" you want: various tags on a post, cohort topics for advertising, bucketed timestamps, etc.

For learning, just storing word counts gives you the mechanics of understanding vectors without actually involving neural networks and weights.
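For example, as a plain dict that only keeps the non-zero entries:

  from collections import Counter

  def embed(text):
      # Sparse bag-of-words: word -> normalized count, zeros never stored.
      counts = Counter(text.lower().split())
      total = sum(counts.values())
      return {word: n / total for word, n in counts.items()}

  print(embed("the cat sat on the mat"))
  # roughly {'the': 0.33, 'cat': 0.17, 'sat': 0.17, 'on': 0.17, 'mat': 0.17}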

> I also doubt your claim “and the results aren't totally terrible”.

> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc

(1) the comment suggested filtering out these words, and (2) the results aren't terrible. This is literally the first assignment in Stanford's AI class [1], and the results aren't terrible.

> A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.

Wow that seems a lot more complicated for something that was supposed to be a learning exercise.

[1] https://stanford-cs221.github.io/autumn2023/assignments/sent...

_giorgio_|1 year ago

Embeddings must be trained, otherwise they don't have any meaning, and are just random numbers.

wjholden|1 year ago

Really appreciate you explaining this idea; I want to try this! It wasn't clear to me until I read the discussion that you meant you'd have similarity of entire documents, not among words.

thisiszilff|1 year ago

Yes! And that’s an oversight on my part — word embeddings are interesting but I usually deal with documents when doing nlp work and only deal with word embeddings when thinking about how to combine them into a document embedding.

Give it a shot! I’d grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It’s not going to be amazing, but it’s a great way to build some baseline intuition for nlp work with text that you can do on a laptop.
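Something like this as a starting point (scikit-learn is used here just for the corpus and the cosine computation; the first run downloads the dataset):

  from sklearn.datasets import fetch_20newsgroups
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:500]
  X = CountVectorizer(max_features=50_000, stop_words="english").fit_transform(docs)

  sims = cosine_similarity(X[0], X)[0]  # document 0 vs. all 500 documents
  print(sims.argsort()[-5:])            # indices of the most similar documents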