I've been working on a personal search engine for some time, and one problem I faced was implementing an effective fuzzy search over all the diverse filenames/directories. All the approaches I could find were based on Levenshtein distance, which would have required storing the original strings/text content in the index, would not be practical for comparing larger strings, and would not be generic enough to handle all knowledge domains.
This led me to start looking at locality-sensitive hashing (LSH) approaches to measure the difference between any two strings in constant time. After some work I finally managed to complete an experimental fuzzy search engine (keyword search is just a special case!). In my analysis of 1 million Hacker News stories, it worked much better than Algolia search while running on a single core!
More details are provided in this post: https://eagledot.xyz/malhar.md.html . I tried to submit it here to gather more feedback, but it didn't work, I guess!
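To make the LSH idea concrete: one common scheme (not necessarily what my engine uses internally) is MinHash over character trigrams, where each string gets a fixed-size signature and the fraction of matching signature slots estimates Jaccard similarity, so comparing any two strings costs the same regardless of their length. A minimal sketch:

```python
import hashlib

def trigrams(s):
    """Set of overlapping 3-character shingles of a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def minhash(grams, num_hashes=64):
    """Fixed-size signature: min of a seeded 64-bit hash over all trigrams,
    once per seed. Similar strings share many of these minima."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(16, "little")  # blake2b accepts a 16-byte salt
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for g in grams
        ))
    return sig

def similarity(a, b):
    """Fraction of matching signature slots ~ Jaccard similarity of trigram sets."""
    sa, sb = minhash(trigrams(a)), minhash(trigrams(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

In a real index you would store only the signatures, never the original strings, which is exactly what makes this attractive for diverse filenames.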
iudqnolq|1 year ago
That often involves recording original trigram positions, but I think that's necessary to weight "I like happy cats" higher than "I like happy dogs but I don't like cats" in a search for "happy cats".
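A toy illustration of why positions matter (my own sketch, not the engine's actual scoring): store each trigram with its offset, then reward documents whose matched trigrams fall in a tight span, since a compact span means the query terms appear together rather than scattered.

```python
from collections import defaultdict

def trigrams(s):
    """Overlapping 3-character shingles with their offsets."""
    s = s.lower()
    return [(s[i:i + 3], i) for i in range(len(s) - 2)]

def build_index(docs):
    """Inverted index: trigram -> list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for gram, pos in trigrams(text):
            index[gram].append((doc_id, pos))
    return index

def score(index, query):
    """Matched-trigram count divided by the span the matches cover:
    many hits in a tight window beat the same hits spread far apart."""
    hits = defaultdict(list)
    for gram, _ in trigrams(query):
        for doc_id, pos in index[gram]:
            hits[doc_id].append(pos)
    return {doc_id: len(ps) / (max(ps) - min(ps) + 1)
            for doc_id, ps in hits.items()}
```

Running it on the two example sentences, "I like happy cats" scores higher for the query "happy cats" because its matched trigrams are contiguous, while the other document's matches are split between "happy" and a distant "cats".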
warangal|1 year ago