top | item 37160859


I found that stemming the text before generating vectors helps increase recall, and the vectors still capture context. However, it does hurt precision, because stemming throws away some information. The more recent vector training algorithms are better at capturing semantic, syntactic, and contextual similarity without a lot of preprocessing. So I have found that vectors can replace all the nonsense that used to be needed to increase recall: stemming, manual synonym lists, etc.
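To make the recall/precision trade-off concrete, here is a minimal sketch on plain keyword matching rather than vectors. The `toy_stem` suffix stripper, the `search` helper, and the four sample documents are all made up for illustration; a real pipeline would use a proper stemmer such as Porter or Snowball.

```python
# Toy illustration: stemming widens matches (recall) but also
# conflates unrelated words (precision). Everything here is hypothetical.

SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text, stem=False):
    tokens = text.lower().split()
    return {toy_stem(t) if stem else t for t in tokens}


def search(query, docs, stem=False):
    """Return ids of docs sharing at least one (optionally stemmed) token."""
    q = tokenize(query, stem)
    return [doc_id for doc_id, text in docs.items() if q & tokenize(text, stem)]


DOCS = {
    "a": "indexing documents for search",
    "b": "the index was rebuilt",
    "c": "new features shipped today",
    "d": "news from the search team",
}

# Recall gain: "indexing" now also finds the doc that only says "index".
print(search("indexing", DOCS))             # ['a']
print(search("indexing", DOCS, stem=True))  # ['a', 'b']

# Precision loss: "news" is stemmed to "new" and pulls in an unrelated doc.
print(search("news", DOCS))                 # ['d']
print(search("news", DOCS, stem=True))      # ['c', 'd']
```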

However, vector similarity search only helps with the matching side of text search, not with ranking. TF-IDF, BM25, PageRank, learning-to-rank ML, etc. are still needed to rank documents. Whenever I find a new vector search engine, I always look to see what ranking features it has beyond vector similarity.
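As a rough sketch of what "ranking beyond vector similarity" can look like, the snippet below takes a list of candidates (assumed to have come back from a vector search, which is not shown) and reorders them with a small hand-written BM25. The function names, the sample texts, and the defaults `k1=1.2` and `b=0.75` are my own choices. Note that it computes document frequencies over the candidate set only, whereas a real engine would use corpus-wide statistics.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each doc against the query with BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many docs does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            df = doc_freq[term]
            # The "+ 1" inside the log keeps the IDF positive.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Length normalization: longer docs need more hits to score as high.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank(query, candidates):
    """Reorder candidates (e.g. vector search hits) by BM25, best first."""
    scores = bm25_scores(query, candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]


candidates = [
    "vector databases store embeddings",
    "bm25 ranking with bm25 tuning",
    "a long note that mentions ranking once among many other unrelated words",
]
for doc in rerank("bm25 ranking", candidates):
    print(doc)
```

In practice you would probably blend the BM25 score with the vector similarity (and other signals) rather than replace one with the other; this only shows the lexical ranking piece.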


bryanrasmussen|2 years ago

I would want to do something similar to Lucene's support for indexing stemmed and non-stemmed fields together, so that you can rank a hit in the non-stemmed field higher than a hit in the stemmed field, which helps precision.

In my experience this is more useful in complicated document searches.
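A minimal sketch of the two-field idea, not Lucene's actual API: each document is indexed once as exact tokens and once as stemmed tokens, and a hit in the exact field gets a larger boost. The `toy_stem` function, the field names, and the boost values are hypothetical placeholders.

```python
def toy_stem(token):
    """Hypothetical suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_doc(text):
    """Index the same text into an exact field and a stemmed field."""
    exact = text.lower().split()
    return {"exact": set(exact), "stemmed": {toy_stem(t) for t in exact}}


def score(query, fields, exact_boost=2.0, stemmed_boost=1.0):
    """Sum per-field boosts; an exact hit also counts in the stemmed field."""
    total = 0.0
    for term in query.lower().split():
        if term in fields["exact"]:
            total += exact_boost
        if toy_stem(term) in fields["stemmed"]:
            total += stemmed_boost
    return total


def search(query, docs):
    """Return matching doc ids, highest combined score first."""
    scored = [(score(query, index_doc(text)), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [doc_id for s, doc_id in scored if s > 0]


DOCS = {
    "a": "the index was rebuilt",
    "b": "indexing documents for search",
    "c": "new features shipped",
}

# "b" matches both fields and outranks "a", which matches only via the stem.
print(search("indexing", DOCS))  # ['b', 'a']
```

The stemmed field keeps the recall, and the boosted exact field pushes the literal matches to the top.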