Distance measures are only as good as the pseudo-Riemannian metric they (implicitly) implement. If the manifold hypothesis is believed, then these metrics should be local, because manifold curvature is a local property. You would be mistaken to use an ordinary dot product to compare straight lines on a map of the globe, because those lines aren't actually straight - the dot product does not account for the rich information in the curvature tensor. Using the wrong inner product is akin to the flat-Earth fallacy.
I'm not sure I understand the underlying maths well enough to opine on your point, but I can say for certain that no embedding space I've ever seen used for any kind of ML is uniform, in the sense that a Euclidean distance around one point means the same thing as the same Euclidean distance around another point. I'm not even sure it would be possible to make an embedding that was uniform in that way, because it would mean we had a universal measure of similarity between concepts (which can obviously be enormously different).
The other potential issue is that, for all the embeddings I have seen, the resulting space once you have embedded some documents is sort of "clumpy" and very sparse overall. So you have very large areas with basically nothing at all - I think because, semantically, there are many dimensions which only make sense for subsets of concepts - and you end up with big voids where the embedding space is essentially unreachable, so distance there doesn't have any meaning at all.
In spite of all that there are a few similarity measures which work well enough to be useful for many practical purposes and cosine similarity is one of them. I don't think anyone thinks it's perfect.
This is exactly right, and is one reason (among many) that reliance on cosine similarities in my field (computational social science) is so problematic. The curvature of the manifold must be accounted for in measuring distances. Other measures based on optimal transport are more theoretically sound, but they are computationally expensive.
We implicitly train to minimize distance in that style of metric - by using functions continuous and differentiable on classic manifolds (where continuity and differentiability are defined using the classic local maps into Euclidean space). I think if we were training using functions continuous and differentiable in, say, a p-adic metric space (which looks extremely jagged/fractal-like/discontinuous when embedded into Euclidean space), then we'd have something like a p-adic version of cosine (or of some other L-something metric) for similarity.
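As a toy illustration of how differently a p-adic metric behaves (a sketch from the standard definition, not anything used in ML practice): under the 2-adic absolute value, numbers that are far apart on the number line can be very close, which is why the metric looks so jagged from a Euclidean point of view.

```python
def padic_norm(n, p):
    """p-adic absolute value of an integer: p**(-v), where p**v exactly divides n."""
    if n == 0:
        return 0.0
    v = 0
    n = abs(n)
    while n % p == 0:
        n //= p
        v += 1
    return p ** (-v)

# 1024 and 0 are 2-adically very close, though far apart on the number line:
print(padic_norm(1024, 2))  # 2**-10: a tiny 2-adic distance |1024 - 0|_2
print(padic_norm(1023, 2))  # 1.0: 1023 is odd, so it is 2-adically far from 0
```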
> In the following, we show that [taking cosine similarity between two features in a learned embedding] can lead to arbitrary results, and they may not even be unique.
Was uniqueness ever a guarantee? It's a distance metric. It's reasonable to assume that two features can be equidistant to the ideal solution to a linear system of equations. Maybe I'm missing something.
It's not even a distance metric, it doesn't obey the triangle inequality (hence the not-technically-meaningful name "similarity", like "collection" as opposed to "set").
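A quick numeric check of that point: if you define cosine "distance" as 1 minus cosine similarity, the triangle inequality can fail, so it is not a metric.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; a 'dissimilarity', not a true metric."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

a, b, c = (1, 0), (1, 1), (0, 1)
d_ab = cosine_distance(a, b)  # ~0.293
d_bc = cosine_distance(b, c)  # ~0.293
d_ac = cosine_distance(a, c)  # 1.0
print(d_ab + d_bc < d_ac)  # True: the triangle inequality fails
```

(Taking arccos of the similarity instead gives angular distance, which does satisfy the triangle inequality.)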
I sure hope no one claimed that. You're doing potentially huge dimensionality reduction; uniqueness would be like saying you cannot have MD5 collisions.
I think maybe it's poorly phrased. As far as I can tell, their linear regression example for eq. 2 has a unique solution, but I think they state that when optimizing for cosine similarity you can find non-unique solutions. But I haven't read it in detail.
Then again, you could argue whether that is a problem when considering very high-dimensional embeddings. Their conclusions seem to point in that direction, but I would not agree with that.
Embeddings result from computing what word can appear in a given context, so words that would appear in the same spot will have a higher cosine score between themselves.
But it doesn't differentiate further, so you can have "beautiful" and "ugly" embed very close to each other even though they are opposites - they tend to appear in similar places.
Another limitation of embeddings and cosine-similarity is that they can't tell you "how similar" - is it equivalence or just relatedness? They make a mess of equivalent, antonymous and related things.
For modern embedding models which effectively mean-pool the last hidden state of LLMs (and therefore make use of its optimizations such as attention tricks), embeddings can be much more robust to different contexts both local and global.
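The mean-pooling step mentioned here can be sketched as follows; the attention-mask convention (1 for real tokens, 0 for padding) is the usual one, and the function name is illustrative rather than any particular library's API.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors of the last hidden state into one sentence
    embedding, ignoring padding positions (mask value 0)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(hidden_states, attention_mask):
        if mask:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Three token vectors, the last one padding: only the first two are averaged.
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))  # [2.0, 3.0]
```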
That is why computational linguists prefer the term "related" over "similar" here.
Similarity is notoriously hard to define - for starters, in terms of grammatical vs. semantic similarity.
Only if those two words appear in the same contexts with the same frequency. In natural language this is probably not the case. There are things typically considered beautiful and others as ugly.
I think your comment is very interesting; I have reflected many times about how to differentiate things that appear in the same context from things that are similar. Any big idea here could be the spark to initiate a great startup.
They make a mess of language. They are not a suitable representation. They are suitable for their efficiency in information retrieval systems and for sometimes crudely capturing semantic attributes in a way that is unreliable and uninterpretable. It ends there. Here's to ten more years of word2vec.
Distance metrics are an interesting topic. The field of ecology has a ton of them. For example, see vegdist, the "Dissimilarity Indices for Community Ecologists" function in the vegan package in R:
https://rdrr.io/cran/vegan/man/vegdist.html
which includes, among others, the "canberra", "clark", "bray", "kulczynski", "gower", "altGower", "morisita", "horn", "mountford", "raup", "chao", "cao", "mahalanobis", "chord", "hellinger", "aitchison", or "robust.aitchison".
Generic distance metrics can often be replaced with context-specific ones for better utility; it makes me wonder whether that insight could be useful in deep learning.
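For a taste of what's on that list, here are sketches of two of the simpler indices, written from their standard textbook definitions (the exact vegdist conventions may differ in details such as scaling and zero handling):

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity for nonnegative abundance vectors:
    sum of |x_i - y_i| over sum of (x_i + y_i)."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

def canberra(x, y):
    """Canberra distance; zero-denominator terms are skipped here
    (one common convention)."""
    total = 0.0
    for a, b in zip(x, y):
        if abs(a) + abs(b) != 0:
            total += abs(a - b) / (abs(a) + abs(b))
    return total

site1, site2 = [6, 2, 0], [1, 3, 4]   # species counts at two sites
print(bray_curtis(site1, site2))  # 0.625
```

Both weight differences relative to total abundance, which is exactly the kind of context-specific choice the comment above is pointing at.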
I quickly read through the paper. One thing to note is that they use the Frobenius norm (at least I suppose this from the subscript F) for the matrix factorization - that is, for their learning algorithm. Then they use cosine similarity to evaluate: a metric that wasn't used in the algorithm.
This is a long-standing question for me. In theory, I should use cosine similarity in my optimization and then also in the evaluation. But I haven't tested this empirically.
For example, there is spherical k-means, which clusters the data on the unit sphere.
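A minimal sketch of spherical k-means, assuming a naive deterministic initialization (real implementations initialize more carefully and check for empty clusters): everything is projected onto the unit sphere, so the assignment step's dot product is exactly cosine similarity.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spherical_kmeans(points, k, iters=20):
    """Cluster directions: assign each unit vector to the centroid with the
    highest cosine similarity, then re-project mean centroids onto the sphere."""
    pts = [normalize(p) for p in points]
    centroids = [list(p) for p in pts[:k]]  # naive deterministic init
    assign = [0] * len(pts)
    for _ in range(iters):
        # assignment: max dot product == max cosine (all vectors unit norm)
        for i, p in enumerate(pts):
            assign[i] = max(range(k),
                            key=lambda c: sum(a * b for a, b in zip(p, centroids[c])))
        # update: mean direction, re-normalized back onto the sphere
        for c in range(k):
            members = [pts[i] for i in range(len(pts)) if assign[i] == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assign

# Two direction clusters at very different magnitudes still separate cleanly:
labels = spherical_kmeans([[10, 0.1], [0.1, 8], [5, -0.1], [-0.1, 3]], k=2)
print(labels)
```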
Why would anyone expect cosine similarity to be a useful metric? In the real world, the arbitrary absolute position of an object in the universe (if it could be measured) isn't that important; it's the directions and distances to nearby objects that matter most.
It's my understanding that the delta between two word embeddings gives a direction, and the magic is from using those directions to get to new words. The oft-cited example is King - Man + Woman = Queen [1].
When did this view fall from favor?
[1] https://www.technologyreview.com/2015/09/17/166211/king-man-...
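The arithmetic can be illustrated with hand-made toy vectors (purely hypothetical, not trained embeddings): one axis stands in for "gender", one for "royalty", and the analogy query is answered by a nearest-neighbor search over the offset vector.

```python
import math

# Hand-made toy vectors (not trained): axis 0 ~ "gender", axis 1 ~ "royalty".
emb = {
    "man":   [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "king":  [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "apple": [0.0, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
# Exclude the query words themselves, as the word2vec demos do:
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(emb[w], target))
print(best)  # queen
```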
The scale of word embeddings (e.g. distance from 0) mainly measures how common the word is in the training corpus. This is a feature of almost all training objectives since word2vec (though some normalize the vectors).
Uncommon words have more information content than common words. So, common words having larger embedding scale is an issue here.
If you want to measure similarity you need a scale free measure. Cosine similarity (angle distance) does it without normalizing.
If you normalize your vectors, Euclidean distance becomes a monotonic function of cosine similarity (for unit vectors, ||u - v||^2 = 2 - 2*cos(u, v)), so the two give the same rankings. Normalizing your vectors also destroys information (the magnitude), which we'd rather avoid.
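The identity behind that equivalence can be checked numerically with a small sketch:

```python
import math, random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

random.seed(1)
u = normalize([random.gauss(0, 1) for _ in range(8)])
v = normalize([random.gauss(0, 1) for _ in range(8)])

cos_uv = sum(a * b for a, b in zip(u, v))          # unit vectors: dot == cosine
sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))  # squared Euclidean distance
print(abs(sq_dist - (2 - 2 * cos_uv)) < 1e-9)  # True: ||u - v||^2 = 2 - 2cos
```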
To my understanding, there's no real hard theory for why the angle between embeddings is meaningful, beyond this practical knowledge.
Cosine-similarity is a useful metric. The cases where it is useful are models that have been trained specifically to produce a meaningful cosine distance (e.g. OpenAI's CLIP [1], Sentence Transformers [2]) - but these are the types of models that the majority of people are using when they use cosine distances.
[1] https://arxiv.org/abs/2103.00020
[2] https://www.sbert.net/docs/training/overview.html
> It's my understanding that the delta between two word embeddings, gives a direction, and the magic is from using those directions to get to new words... it's the directions and distances to nearby objects that matters most
Cosine similarity is a kind of "delta" / inverse distance between the representations of two entities, in the case of these models.
From my experience trying to train embeddings from transformers, using cosine similarity is less restrictive for the model than Euclidean distance. Both work, but cosine similarity seems to have slightly better performance.
Another thing you have to keep in mind is that these embeddings are in n-dimensional space. Intuitions about the real world do not apply there.
The word2vec-inspired tricks like king - man + woman only work if the embedding is trained with synonym/antonym triplets to give it the semantic locality that allows that kind of vector math. This isn't always done; even some word2vec re-implementations skip this step completely. Also, not all embeddings are word embeddings.
A direction can be given in terms of an angle measure, such as cosine.
The paper kinda leaves you hanging on the "alternatives" front, even though they have a section dedicated to it.
In addition to the _quality_ of any proposed alternative(s), computational speed also has to be a consideration. I've run into multiple situations where you want to measure similarities on the order of millions or billions of times. Especially for realtime applications (like RAG?), speed may even outweigh quality.
> While cosine similarity is invariant under such rotations R, one of the key insights in this paper is that the first (but not the second) objective is also invariant to rescalings of the columns of A and B
Ha, interesting - I wrote a blog post where I pointed this out a few years ago [1], along with how we got around it for item-item similarity at an old job (essentially an implicit re-projection to the original space, as noted in section 3).
[1] https://swarbrickjones.wordpress.com/2016/11/24/note-on-an-i...
It's definitely not about semantics or language. As far as language is concerned similarity metrics are semantically vacuous and quantifying semantic similarity is a bogus enterprise.
Cosine Similarity is very much about similarity, but it's quite fickle and indirect.
Given a function f(l, r) that measures, say, the log-probability of observing both l and r, and that takes the form f(l, r) = <L(l), R(r)>, i.e. the dot product between embeddings of l and r: the cosine similarity of x and y, i.e. the normalized dot product of L(x) and L(y), is very closely related to the correlation of f(x, Z) and f(y, Z) when we let Z vary.
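That claim can be sketched in a small simulation, modeling R(Z) as isotropic Gaussian vectors (an assumption made here purely for the sketch): the Pearson correlation of the two score series f(x, Z) and f(y, Z) comes out almost exactly equal to the cosine of the two left-embeddings.

```python
import math, random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

random.seed(0)
d = 5
x = [random.gauss(0, 1) for _ in range(d)]   # stands in for L(x)
y = [random.gauss(0, 1) for _ in range(d)]   # stands in for L(y)

# f(w, Z) = <L(w), R(Z)>, with R(Z) drawn isotropically (the assumption):
fx, fy = [], []
for _ in range(20000):
    r = [random.gauss(0, 1) for _ in range(d)]
    fx.append(dot(x, r))
    fy.append(dot(y, r))

print(round(cosine(x, y), 2), round(pearson(fx, fy), 2))  # nearly identical
```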
A well-kept secret of linear algebra is that having an inner product at all isn't as self-evident as it might seem. Euclidean distance might seem like some canonical notion of distance, but it doesn't have to be meaningful, especially if the choice of coordinates has no geometrical meaning.
Hmm, typically the models where people use cosine similarity on embeddings have been deliberately trained such that the cosine similarity is meaningful. This paper seems to look at examples where the models have not been deliberately trained for cosine similarities, and in those situations it would indeed be unreasonable to assume cosine similarities are a good idea... (but that's kind of a given?)
For example, here's the loss from the CLIP paper [1], which ensures cosine similarities are meaningful. And Sentence Transformers [2] uses CosineSimilarityLoss.
[1] https://arxiv.org/pdf/2103.00020.pdf
[2] https://www.sbert.net/docs/training/overview.html
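In the spirit of the CLIP paper's pseudocode, that loss can be sketched as a symmetric cross-entropy over cosine-similarity logits; the names and temperature value here are illustrative, not the paper's exact code.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_style_loss(img_embs, txt_embs, temp=0.07):
    """Symmetric cross-entropy over cosine-similarity logits: matched
    (image i, text i) pairs should out-score all mismatched pairs."""
    n = len(img_embs)
    I = [normalize(v) for v in img_embs]
    T = [normalize(v) for v in txt_embs]
    # logits[i][j] = cosine(image i, text j) / temperature
    logits = [[sum(a * b for a, b in zip(I[i], T[j])) / temp
               for j in range(n)] for i in range(n)]

    def xent(row, target):
        m = max(row)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    loss_i2t = sum(xent(logits[i], i) for i in range(n)) / n                          # image -> text
    loss_t2i = sum(xent([logits[i][j] for i in range(n)], j) for j in range(n)) / n   # text -> image
    return (loss_i2t + loss_t2i) / 2

imgs = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]
print(clip_style_loss(imgs, aligned) < clip_style_loss(imgs, aligned[::-1]))  # True
```

Because the logits are cosines, gradient descent on this loss directly shapes the angles between embeddings, which is why cosine similarity is meaningful for such models afterwards.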
We found similar results when working on our paper for creating LLM agents with metacognition and explicitly called this out in the paper: https://replicantlife.com
There aren't any alternatives: cosine similarity is effectively an extension of Euclidean distance, which is the mathematically correct way of finding the distance between vectors.
You may not want to use cosine similarity as your only metric for rankings, however, and you may want to experiment with how you construct the embeddings.
Not having read the paper: cosine similarity has little to no semantic understanding of sentences.
E.g. the following triple
1: "Yes, this is a demonstration"
2: "Yes, this isn't a demonstration"
3: "Here is an example"
<1, 2> has a "higher" cosine similarity than <1, 3>. The pair <1, 2> is structurally equivalent except for one token/word, yet the two sentences semantically mean the opposite of each other (depending on what you're targeting in the sentence), while <1, 3> means effectively the same thing.
If this paper is about persuading people about efficacy with regards to semantic understanding, OK, but that was always known. If it's about something with relation to vectors and the underlying operations, then I'll be interested.
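The claim is easy to check under one toy mapping, a bag of lowercased whitespace tokens (an arbitrary choice of embedding, not what real models do): the near-identical surface forms of 1 and 2 dominate, exactly as described.

```python
import math
from collections import Counter

def bow_cosine(s1, s2):
    """Cosine similarity under a toy bag-of-words embedding
    (lowercased whitespace tokens, raw counts)."""
    a, b = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)

s1 = "Yes, this is a demonstration"
s2 = "Yes, this isn't a demonstration"
s3 = "Here is an example"
print(bow_cosine(s1, s2), bow_cosine(s1, s3))  # ~0.8 vs ~0.22
```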
Whether or not the cosine similarity of either pair is higher depends on the mapping you create from the strings to the embedding vectors. That mapping can be whatever function you choose, and your result will be entirely dependent on it.
If you choose a straight linear mapping of tokens to a number, then you'd be right.
Extending that, if you choose any mapping which does not do a more extensive remapping from raw syntactic structure to some sort of semantic representation, you'd be right.
But hence why we increasingly use models to create embeddings instead of simpler approaches before applying a similarity metric, whether cosine similarity or other.
Put another way, there is no inherent reason why you couldn't have a model where the embeddings for 1 and 3 are even identical, and so it is meaningless to talk about the cosine similarity of your sentences without setting out your assumptions about how you will create embeddings from them.
That might be true for one-hot vectors, but it's not true for learned embeddings through the lens of attention. That said, I only made it to page 3/9 of the paper before the mark-up for the math went over my head.