top | item 42811326

(no title)

thoughtlede | 1 year ago

tadkar did a good job at explaining ColBERT. I understood ColBERT well in the context of where it lies on the spectrum of choices.

On one side of the spectrum, you reduce each of the documents as well as the query to a lower-dimensional space (aka embeddings) and perform similarity. This has the advantage that the document embeddings could be precomputed. At query time, you only compute the query embedding and compare its similarity with document embeddings. The problem is that the lower-dimensional embedding acts as a decent, but not great, proxy for the documents as well as for the query. Your query-document similarity is only as good as the semantics that could be captured in those lower-dimensional embeddings.

On the other side of the spectrum, you consider the query with each document (as a pair) and see how much the query "attends" to each of the documents. The power of trained attention weights means that you get a much reliable similarity score. The problem is that this approach requires you to run attention-forward-pass as many times as there are documents -- for each query. In other words, this has a performance issue.

ColBERT sits in the middle of the spectrum. It "attends" to each of the documents separately and captures the lower-dimensional embedding for each token in each document. This we precompute. Once we have done that, we captured the essence of how tokens within a given document attend to each other, and is captured in the token embeddings.

Then, at query time, we do the same for each token in the query. And we see which query-token embedding is greatly similar to which document-token embedding. If we find that there is a document which has more tokens that are found to be greatly similar to the query tokens, then we consider that to the best document match. (The degree of similarity between each query-document token is used to score the ranking - it is called Sum of MaxSim).

Obviously, attention based similarity, like in the second approach, is better than reducing to token embeddings and scoring similarity. But ColBERT avoids the performance hit compared to the second approach. ColBERT also avoids the lower fidelity of "reducing the entire document to a lower-dimensional space issue" because it reduces each token in the document separately.

By the way, the first approach is what bi-encoders do. The second approach is cross-encoding.

discuss

order

No comments yet.