This reminds me of a browser-side search I once wrote using a combination of regex, Soundex, and Damerau-Levenshtein. I made it browser-side because the app didn't have enough records to justify doing the search on the server; the records numbered in the hundreds to thousands, not millions.
Not the best code I've ever written by far, but it was a neat little experiment.
Basically, indexing each record with Soundex (phonetics) makes the search semi-tolerant of misspellings and also brings up similar-sounding results, while Damerau-Levenshtein distance is used to sort the results by closeness. I also had it index timestamps with
As you can see, you can misspell search terms and it will usually get it right. Granted, a lot of unrelated records tend to come up in a search, but the important thing is that the user gets what they're looking for in the first few results, usually the first.
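For the curious, the approach described above can be sketched roughly like this. This is a minimal reconstruction of the idea, not the actual fire-tracker code; `osa_distance` is the optimal-string-alignment variant of Damerau-Levenshtein, and all names are mine:

```python
# Soundex indexing for phonetic recall + Damerau-Levenshtein for ranking.
CODES = {c: d for d, letters in
         {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
          "4": "L", "5": "MN", "6": "R"}.items()
         for c in letters}

def soundex(word):
    """American Soundex: first letter plus three consonant-class digits."""
    word = word.upper()
    out, prev = word[0], CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:   # collapse runs of the same class
            out += code
        if ch not in "HW":          # H/W are transparent; vowels reset the run
            prev = code
    return (out + "000")[:4]

def osa_distance(a, b):
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def search(query, records):
    """Recall records sharing a Soundex code with the query, rank by distance."""
    q_codes = {soundex(w) for w in query.split()}
    hits = []
    for rec in records:
        words = rec.split()
        if q_codes & {soundex(w) for w in words}:
            hits.append((min(osa_distance(query.lower(), w.lower())
                             for w in words), rec))
    return [rec for _, rec in sorted(hits)]
```

So a misspelled query like "Tomas" still phonetically matches "Thomas Fire", and the edit distance pushes the closest spelling to the top.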
Very cool! Most search problems are solved this way: interesting modeling of the index data to the needed use cases, not magic machine-learning neural-network unicorns. Many want to sell you the latter; the thing that gets the job done is usually the former.
One big reason for this is the need for really good training data to do supervised learning on search problems. If you're Google or Wikipedia, you can get this data, but many sites struggle to work out what a good result for a query was from clicks and other user data, due to low traffic volumes.
As an aside: most of the Wikimedia Foundation's analytics data is available for anyone to download (https://wikitech.wikimedia.org/wiki/Analytics#Datasets). There is some link-rot to wade through, but try asking on #wikimedia-analytics on freenode if you can't find something.
Then you're using magic machine learning wrong. I'm part of a small team that uses it exactly for the purpose of giving intelligent search to non-Googles and non-Wikipedias, and we have a very good track record.
The key is to use machine learning to generalize user behavior across multiple items in roughly the same category, rather than individual items. In fact, with tiny amounts of data you should pretty much never deal in individual items -- always in groups of related items.
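A concrete, deliberately simplified illustration of that idea (invented for this comment, not the commenter's actual system): with only a handful of impressions per item, the raw click-through rate is noise, so shrink each item's CTR toward its category's CTR to get a usable relevance signal:

```python
from collections import defaultdict

def smoothed_ctr(clicks, impressions, category_of, prior_strength=20.0):
    """Shrink per-item click-through rate toward the item's category CTR.

    clicks, impressions: dicts keyed by item id; category_of: item -> category.
    prior_strength controls how many pseudo-impressions of category-level
    behavior each item is assumed to have.
    """
    cat_clicks = defaultdict(float)
    cat_impr = defaultdict(float)
    for item, n in impressions.items():
        cat = category_of[item]
        cat_impr[cat] += n
        cat_clicks[cat] += clicks.get(item, 0)
    out = {}
    for item, n in impressions.items():
        cat = category_of[item]
        cat_rate = cat_clicks[cat] / max(cat_impr[cat], 1.0)
        # Weighted blend: sparse items land near the category rate,
        # well-observed items keep their own rate.
        out[item] = (clicks.get(item, 0) + prior_strength * cat_rate) / (n + prior_strength)
    return out
```

An item with 1 click in 2 impressions inside a ~10%-CTR category scores near 10-15%, not a wildly overconfident 50%.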
> Most search problems are solved this way - interesting index data modeling to the needed use cases, not magic machine learning neural network unicorns. Many want to sell you the latter. The thing that gets the job done is usually the former.
Are there any good writeups on how to implement the trigram index side of this? For instance, the code alludes to the idea that you could store each trigram as a bitmap over documents, so that the Boolean combinations become bitwise operations. I suppose you could keep the bitmaps in compressed (or even probabilistic-sketch) form, then decompress only those required for a query, but is this the right intuition? Are there lower-level libraries or databases that do this kind of thing well, rather than rolling your own on a KV store or using something monstrous like Elasticsearch?
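The bitmap intuition in the question can be sketched in a few lines, using Python's arbitrary-precision ints as the bitsets (class and method names are mine, not from any of the systems discussed):

```python
from collections import defaultdict

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Toy posting-bitmap index: one integer per trigram,
    with bit i set iff document i contains that trigram."""

    def __init__(self, docs):
        self.docs = docs
        self.bitmaps = defaultdict(int)
        for i, doc in enumerate(docs):
            for t in trigrams(doc):
                self.bitmaps[t] |= 1 << i

    def candidates(self, literal):
        """Docs containing every trigram of `literal` (a superset of matches)."""
        required = trigrams(literal)
        if not required:  # query too short to filter anything
            return list(range(len(self.docs)))
        mask = -1  # all-ones; AND-ing narrows it down
        for t in required:
            mask &= self.bitmaps.get(t, 0)
        return [i for i in range(len(self.docs)) if mask >> i & 1]

    def search(self, literal):
        # The index only prunes; candidates must still be verified by an
        # exact scan (a regex engine, in a real code-search system).
        return [i for i in self.candidates(literal) if literal in self.docs[i]]
```

A production system would use compressed bitmaps (e.g. roaring bitmaps) rather than raw ints, but the AND-the-posting-sets-then-verify shape is the same.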
There's https://github.com/google/zoekt, which provides a pretty straightforward and memory-efficient implementation. Their design document goes a bit into what makes it fast: https://github.com/google/zoekt/blob/master/doc/design.md
Note that Elasticsearch is built on top of Lucene (lucene.apache.org), which is a straightforward search indexer that creates an inverted index and document store, and gives pluggable low-level access to parts of the indexing path. It would be very easy to create a custom analyzer for a trigram-based index (my guess is one already exists).
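Indeed, no custom analyzer code is needed: a character-trigram analyzer can be assembled from Elasticsearch's built-in ngram tokenizer. A hypothetical set of index settings (the `trigram_*` names are my own), written as the Python dict you would send as the JSON body when creating the index:

```python
# Index settings for a character-trigram analyzer in Elasticsearch,
# using the built-in "ngram" tokenizer fixed at length 3.
trigram_index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tokenizer": {
                    "type": "ngram",   # emit fixed-length character n-grams
                    "min_gram": 3,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "trigram_analyzer"}
        }
    },
}
```

With this mapping, every indexed `body` is stored as lowercase trigrams, so substring-style queries can be answered from the inverted index.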
How fitting. I've been working on making Etsy's hound [0] available as SaaS. Hound uses a trigram index with regexp support and can search all the repos I care about in milliseconds.
The demo page works. Signup just sends a confirmation email and the setup is still manual, so expect an hours-to-a-day delay. Aiming to have this all automated by tomorrow. Appreciate any/all feedback.
Another shameless plug: Sourcegraph uses this for indexed search via Zoekt (it also has index-less live search). Check out the cmd/indexed-search code at https://github.com/sourcegraph/sourcegraph.
I can't believe they don't adopt something like https://github.com/etsy/hound. It's all right there, it's super easy to set up, and it's a game changer: lightning-fast search that you can link to. Check it out for your org. I was at Etsy when Kelly wrote it, and I've taken it with me to other jobs.
That's great. It seems like it only supports character-level trigrams though. Do you know of any tools that can create word-level trigrams from Postgres?
FWIW kythe.io is the modern successor that does this internally at Google. Haven't worked on it, but have written some code that's a client. Unfortunately I think the indexing pipelines aren't publicly available.
Trigram is defined in the dictionary as "a group of three consecutive written units such as letters, syllables, or words." Thus using the term for a triplet of tokens seems appropriate.
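To make the distinction concrete, here is a quick illustrative sketch (helper functions invented for this comment) of character-level versus word-level trigram extraction:

```python
def char_trigrams(s):
    """All length-3 character windows of the string."""
    return [s[i:i + 3] for i in range(len(s) - 2)]

def word_trigrams(s):
    """All runs of 3 consecutive whitespace-separated tokens."""
    words = s.split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
```

For example, `char_trigrams("search")` yields `['sea', 'ear', 'arc', 'rch']`, while `word_trigrams("the quick brown fox")` yields the two token triples `('the', 'quick', 'brown')` and `('quick', 'brown', 'fox')`.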
I did write a book on the topic, developed the Elasticsearch Learning to Rank plugin, and have helped several Alexa top-50 sites deliver more accurate search.
I'm only pointing out that search is an oversold industry; when you read about what Google actually does, it's more normal stuff that we would all probably do to solve a similar problem.
(For those curious, it was a link to content at a community conference I help run, pointing out how hard ML can be for search. I admit it's a bit tangential.)
Modern servers have a lot of compute and memory compared to 15 years back. Before doing any optimizations, I would start out with a naive full-text search, which will most likely be fast enough, and a full code search using regex is also more powerful for users. Imagine if you could use regex when searching on GitHub or Google! Source code takes up relatively little space, and if you strip the markup/JS from web sites, they do too. The only problem is educating users on how to search effectively.
Even with modern computers, doing a naive regex search over 1 GiB+ of full text would take a while.
Memory isn't that fast. The trigram index lets you avoid reading every document on every search.
Even a traditional RDBMS will start to become slow with datasets in the 10-100 GiB range.
Products like BigQuery, Snowflake, Redshift, etc. might be able to handle it, but they are neither cheap nor particularly fast.
ravenstine | 7 years ago:
https://github.com/SCPR/fire-tracker/blob/master/app/lib/sea...
Here it is in action:
https://firetracker.scpr.org/archive
Regex is a powerful thing.
softwaredoug | 7 years ago:
atdt | 7 years ago:
kqr | 7 years ago:
kierenj | 7 years ago:
gauravphoenix | 7 years ago:
Occam's razor.
secure | 7 years ago:
This is what I based Debian Code Search on: https://codesearch.debian.net/
See also https://codesearch.debian.net/research/bsc-thesis.pdf if you’re interested in details.
devoply | 7 years ago:
blattimwind | 7 years ago:
btown | 7 years ago:
straws | 7 years ago:
https://github.com/google/zoekt/blob/master/doc/design.md
hn_throwaway_99 | 7 years ago:
user5994461 | 7 years ago:
There is no information on the subject besides the 3 blog posts and the proof of concept.
shabble | 7 years ago:
karmakaze | 7 years ago:
[0] https://github.com/etsy/hound [1] https://gitgrep.com
xooms | 7 years ago:
sqs | 7 years ago:
rasmi | 7 years ago:
petters | 7 years ago:
hoorayimhelping | 7 years ago:
manigandham | 7 years ago:
https://sourcegraph.com/start
espeed | 7 years ago:
enriquto | 7 years ago:
mslot | 7 years ago:
ta1234567890 | 7 years ago:
joatmon-snoo | 7 years ago:
Thorrez | 7 years ago:
https://cs.chromium.org
For example here's a hacky regex that finds lambdas:
https://cs.chromium.org/search/?q=%5CW%5C%5B%5B%5E%5C%5D%5D*...
ngnear | 7 years ago:
techbio | 7 years ago:
Since code lacks semantic structure here, this is not the obvious search strategy for it, though apparently it suits regex-searchable indexes.
crawshaw | 7 years ago:
dmoy | 7 years ago:
sn41 | 7 years ago:
nestorD | 7 years ago:
was_boring | 7 years ago:
[deleted]
sctb | 7 years ago:
softwaredoug | 7 years ago:
PunchTornado | 7 years ago:
rurban | 7 years ago:
The last time someone posted a link to a hopeful successor at github: https://news.ycombinator.com/item?id=18022357
z3t4 | 7 years ago:
humbledrone | 7 years ago:
cobookman | 7 years ago: