(no title)
cproctor | 1 year ago
For example, if I have a set of words and I want to consider their relative location on an axis between two anchor words (e.g. "good" and "evil"), it makes sense to me to project all the words onto the vector from "good" to "evil." Would comparing each word's "good" and "evil" cosine similarity be equivalent, or even preferable? (I know there are questions about the interpretability of this kind of geometry.)
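The two options can be compared directly in a short sketch. Everything here is a toy assumption: the 3-d "embeddings" and the words are made up purely for illustration, and `projection` / `cos_contrast` are hypothetical helper names.

```python
import numpy as np

# Toy 3-d "embeddings" -- purely illustrative, not from any real model.
vecs = {
    "good":  np.array([1.0, 0.2, 0.0]),
    "evil":  np.array([-0.8, 0.1, 0.3]),
    "saint": np.array([0.9, 0.3, 0.1]),
    "thief": np.array([-0.5, 0.0, 0.4]),
}

axis = vecs["evil"] - vecs["good"]  # the good -> evil direction

def projection(w):
    """Scalar projection of word w onto the good -> evil axis."""
    return np.dot(vecs[w], axis) / np.linalg.norm(axis)

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cos_contrast(w):
    """Difference of cosine similarities to the two anchor words."""
    return cos(vecs[w], vecs["evil"]) - cos(vecs[w], vecs["good"])

for w in ("saint", "thief"):
    print(w, projection(w), cos_contrast(w))
```

On this toy data both measures order the words the same way (saint less "evil" than thief), but they are not equivalent in general: the projection is sensitive to each word's magnitude along the axis, while the cosine contrast normalizes every word to unit length first.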
Scene_Cast2|1 year ago
extasia|1 year ago
Consider [1,0] and [x,x]. Normalised, we get [1,0] and [sqrt(.5), sqrt(.5)]: clearly something has changed, because the first vector is now larger in dimension zero than the second, even though x was arbitrary and could have been larger than 1. We have lost information about x's magnitude, which we cannot recover from the normalized vector alone.
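A minimal sketch of this point, using numpy (the `normalize` helper is just for illustration):

```python
import numpy as np

def normalize(v):
    """Scale v to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = normalize([1.0, 0.0])   # stays [1, 0]
b = normalize([5.0, 5.0])   # x = 5: becomes [0.7071, 0.7071]
print(a, b)

# After normalizing, a[0] > b[0] even though x was larger than 1,
# and every [x, x] normalizes to the same unit vector regardless of x:
print(normalize([0.2, 0.2]), normalize([5.0, 5.0]))
```

The last line shows the information loss concretely: [0.2, 0.2] and [5.0, 5.0] are indistinguishable after normalization.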
nostrademons|1 year ago
Imagine you have a simple bag-of-words model of a document, where you just count the number of occurrences of each word in the document. Numerically, this is represented as a vector where each dimension is one token (so, you might have one number for the word "number", another for "cosine", another for "the", and so on), and the magnitude of that component is the count of the number of times it occurs. Intuitively, cosine similarity is a measure of how frequently the same word appears in both documents. Words that appear in both documents get multiplied together, but words that are only in one get multiplied by zero and drop out of the cosine sum. So because "cosine", "number", and "vector" appear frequently in my post, it will appear similar to other documents about math. Because "words" and "documents" appear frequently, it will appear similar to other documents about metalanguage or information retrieval.
And intuitively, the reason the magnitude doesn't matter is that those counts will be much higher in longer documents, but the length of the document doesn't say much about what the document is about. Taking the cosine (which divides the dot product by the product of the two vectors' magnitudes) is a form of length normalization, so that you get sensible results without biasing toward shorter or longer documents.
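The length-normalization point can be checked directly. This is a hedged sketch of the bag-of-words model described above; `bow_cosine` is a hypothetical helper, and the example documents are made up.

```python
import numpy as np
from collections import Counter

def bow_cosine(doc_a, doc_b):
    """Cosine similarity of two documents under a bag-of-words model."""
    ca, cb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    a = np.array([ca[w] for w in vocab], dtype=float)
    b = np.array([cb[w] for w in vocab], dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

short_doc = "cosine similarity of word vectors"
long_doc = " ".join([short_doc] * 10)  # same content, 10x longer

print(bow_cosine(short_doc, long_doc))       # 1.0: length is normalized away
print(bow_cosine(short_doc, "the cat sat"))  # 0.0: no shared words drop out
```

Repeating a document ten times leaves the cosine at exactly 1.0, and documents sharing no words score 0: the shared-word products are all that survive the sum.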
Most machine-learned embeddings are similar. The components of the vector are features that your ML model has determined are important. If the product of the same dimension of two items is large, it indicates that they are similar in that dimension. If it's zero, it indicates that that feature is not particularly representative of the item. Embeddings are often normalized, and for normalized vectors the fact that magnitude drops out doesn't really matter. But it doesn't hurt either: each magnitude is one, so their product is also 1, and the cosine reduces to the pair-wise product of the vectors.
d110af5ccf|1 year ago
To make my intuition a bit more explicit: the vector is encoding a ratio, isn't it? You want to treat 3:2, 6:4, 12:8, ... as equivalent in this case, and normalization does exactly that.
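This is easy to verify: every vector on the same ray maps to the same unit vector (a minimal numpy sketch, with `normalize` as an illustrative helper):

```python
import numpy as np

def normalize(v):
    """Scale v to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# 3:2, 6:4, and 12:8 are the same ratio, and normalization maps
# all three to the same point on the unit circle.
for v in ([3, 2], [6, 4], [12, 8]):
    print(normalize(v))
```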
marginalia_nu|1 year ago
The J-L lemma is at least somewhat related, even though, to my understanding, it doesn't quite describe the same transformation.
https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...
see also https://en.m.wikipedia.org/wiki/Random_projection
magicalhippo|1 year ago
When I dabbled with latent semantic indexing[1], using cosine similarity made sense as the dimensions of the input vectors were words, for example a 1 if a word was present or 0 if not. So one would expect vectors that point in a similar direction to be related.
I haven't studied LLM embedding layers in depth, so I've been wondering about using certain norms[2] instead to determine whether two embeddings are similar. Does it depend on the embedding layer, for example?
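One relevant fact when weighing cosine similarity against a norm-based distance: for unit-normalized embeddings the two are monotonically related, since ||a - b||^2 = 2 - 2·cos(a, b), so they produce the same rankings; for unnormalized embeddings they can disagree. A sketch (the random vectors are stand-ins for real embeddings):

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean(a, b):
    return np.linalg.norm(a - b)

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)

# Unit-normalized: Euclidean distance is a monotonic function of cosine,
# because ||ua - ub||^2 = |ua|^2 + |ub|^2 - 2*ua.ub = 2 - 2*cos(ua, ub).
ua, ub = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(euclidean(ua, ub) ** 2, 2 - 2 * cos_sim(ua, ub))

# Unnormalized: c points the same way as a (cosine ~ 1.0) yet sits far
# away in Euclidean distance, so the two measures can rank differently.
c = 10 * a
print(cos_sim(a, c), euclidean(a, c))
```

So for embedding layers whose outputs are normalized to the unit sphere the choice barely matters, while for unnormalized embeddings it genuinely does depend on the layer.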
Should be noted it's been many years since I learned linear algebra, so I'm getting somewhat rusty.
[1]: https://en.wikipedia.org/wiki/Latent_semantic_analysis
[2]: https://en.m.wikipedia.org/wiki/Norm_(mathematics)