top | item 28155423

(no title)

rpedela | 4 years ago

Google has a distributed embedding matching service in preview: https://cloud.google.com/vertex-ai/docs/matching-engine/over...

I guess it depends on what you mean by "simple". The algorithms are complex but there are good tools that implement them. I would imagine smaller companies would use off the shelf tooling, and I would argue that is simpler. Vector embeddings are so unbelievably powerful and often yield better results than classical methods with one of the good tools + pretrained embeddings.

Specifically for search, I use them to completely replace stemming, synonyms, etc in ES. I match the query's embedding to the document embeddings, find the top 1000 or so. Then I ask ES for the BM25 score for that top 1000. I combine the embedding match score with BM25, recency, etc for final rank. The results are so much better than using stemming, etc and it's overall simpler because I can use off the shelf tooling and the data pipeline is simpler.

discuss

hintymad|4 years ago

> I match the query's embedding to the document embeddings,

I assume the doc size is relatively small, otherwise a document may contain too many different topics that make it hard to differentiate different queries?

rpedela|4 years ago

For my search use case, documents are mostly single topic and less than 10 pages. However I have found embeddings still work surprisingly well for longer documents with a few topics in them. But yes, multi-topic documents can certainly be an issue. Segmentation by sentence, paragraph, or page can help here. I believe there are ML-based topic segmentation algorithms too, but that certainly starts making it less simple.