thisiszilff | 1 year ago

> I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.

In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts you are effectively counting the number of words the two texts have in common (subject to some normalization constants depending on how you normalize the embedding). Cosine similarity still works even for these super naive embeddings, which makes them slightly easier to understand before getting into any mathy stuff.
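To make the "dot product counts shared words" point concrete, here is a minimal sketch. It assumes binary presence vectors, whitespace tokenization, and a tiny made-up six-word vocabulary; the helper names (`presence_vector`, `dot`, `cosine`) are just for illustration and aren't from any library.

```python
import math

def presence_vector(text, vocab):
    """1 if the vocab word appears in the text, else 0 (one slot per vocab word)."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the two vector lengths; 0.0 if either is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

# Hypothetical toy vocabulary, purely for illustration.
vocab = ["the", "cat", "dog", "sat", "ran", "mat"]

a = presence_vector("the cat sat", vocab)  # [1, 1, 0, 1, 0, 0]
b = presence_vector("the dog sat", vocab)  # [1, 0, 1, 1, 0, 0]

shared = dot(a, b)         # 2: "the" and "sat" appear in both texts
similarity = cosine(a, b)  # 2 / (sqrt(3) * sqrt(3)) = 2/3
```

With presence vectors the dot product is exactly the shared-word count; with occurrence counts instead of 0/1 it becomes a sum of count products, so repeated words weigh more.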

You are 100% right this won't give you the word embedding analogies like king - man + woman ≈ queen or stuff like that. This embedding has no concept of relationships between words.

HarHarVeryFunny|1 year ago

But that doesn't seem to be what you are describing in terms of using incrementing indices and adding occurrence counts.

If you want to create a bag of words text embedding then you set the number of embedding dimensions to the vocabulary size and the value of each dimension to the global count of the corresponding word.

thisiszilff|1 year ago

Heh -- my explanation isn't the clearest I realize, but yes, it is BoW.

E.g. fix your vocab of 50k words (or whatever) and enumerate it.

Then to make an embedding for some piece of text:

1. Initialize an embedding (for a single text) as an all-zero vector of size 50k.
2. For each word in the text, add one to the entry at the index of the corresponding word (per our enumeration). If the word isn't among the 50k words in your vocabulary, discard it.
3. (Optionally) normalize the embedding to length 1 (though you don't really need this and can leave it off for the toy example).
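The steps above can be sketched roughly like this. It's a toy version, assuming whitespace tokenization and lowercasing (neither is specified above), and the names `build_index` and `embed` are made up for the example.

```python
import math

def build_index(vocab):
    """Enumerate the vocabulary: word -> position in the embedding vector."""
    return {word: i for i, word in enumerate(vocab)}

def embed(text, index, normalize=False):
    """Bag-of-words embedding of a single text."""
    # Step 1: all-zero vector, one slot per vocabulary word.
    vec = [0.0] * len(index)
    # Step 2: add one at each word's index; skip out-of-vocabulary words.
    for word in text.lower().split():
        i = index.get(word)
        if i is None:
            continue
        vec[i] += 1.0
    # Step 3 (optional): scale to unit length.
    if normalize:
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0:
            vec = [v / norm for v in vec]
    return vec

# Tiny stand-in for the 50k-word vocabulary.
index = build_index(["king", "queen", "man"])

counts = embed("king king zebra man", index)               # "zebra" is discarded
unit = embed("king king zebra man", index, normalize=True)
```

Here `counts` comes out as `[2.0, 0.0, 1.0]`. A real 50k-dimensional version would normally use a sparse representation, since nearly every entry is zero for any one text.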