However, for autocomplete you often want a weighted trie, because you have extra information to weight nodes by. A contacts example: you usually want recent and frequent contacts ranked first.
My company has an open-source trie implementation here for a client that does weighted contact autocomplete: https://github.com/shortwave/trie
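For illustration, a minimal sketch of what a weighted trie can look like. This is a generic example, not the API of the shortwave/trie library linked above; the weights here are arbitrary scores, which in a contacts app might be derived from recency and frequency:

```python
from collections import defaultdict

class WeightedTrieNode:
    def __init__(self):
        self.children = defaultdict(WeightedTrieNode)
        self.weight = 0.0  # > 0 marks a stored entry; higher ranks earlier

def insert(root, word, weight):
    node = root
    for ch in word:
        node = node.children[ch]
    node.weight = weight

def complete(root, prefix, k=5):
    """Return up to k completions of prefix, heaviest first."""
    node = root
    for ch in prefix:
        if ch not in node.children:  # membership test; avoids creating nodes
            return []
        node = node.children[ch]
    results = []
    def walk(n, suffix):
        if n.weight > 0:
            results.append((n.weight, prefix + suffix))
        for ch, child in n.children.items():
            walk(child, suffix + ch)
    walk(node, "")
    results.sort(reverse=True)  # heaviest completions first
    return [w for _, w in results[:k]]
```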
I first learned about Tries when I implemented a spell checker, almost 20 years ago now, with basic suggestions. It’s amazing how easy it is to get an efficient, 80% spell checker and recommendation engine implemented from scratch once you dig into it. Tries make for an efficient enough in memory lookup structure (space, storage, and compute), and are an obvious part of a suggestion engine as well. Coupled with a basic soundex and some word mangling + edit distance checking and you have a very functional system. Solving the last 20% takes an order of magnitude more work :)
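The edit-distance piece of such a spell checker really is small. A sketch of the classic dynamic-programming Levenshtein distance plus a naive suggester (the soundex and word-mangling parts described above are omitted):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Rank dictionary words by edit distance to the (misspelled) input."""
    scored = [(levenshtein(word, w), w) for w in dictionary]
    return [w for d, w in sorted(scored) if d <= max_dist]
```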
There is another aspect of tries that makes them really useful: you can do fast, fuzzy word searches on a trie.
In Typesense [1], I've implemented fuzzy search based on Damerau-Levenshtein distance, and it's incredibly fast. I've found this approach to be a much better (faster and more flexible) alternative to Peter Norvig's brute-force spell-checker, which is quite a popular post [2].
Does the trie help in memoizing the edit distance between multiple words? Or do you pay the O(n^2) cost per word? I'd love to hear details of how this works!
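One well-known answer (a sketch of the general technique, not necessarily what Typesense does): walk the trie while carrying one Levenshtein DP row per node. All words sharing a prefix share the rows computed for that prefix, and subtrees whose row minimum exceeds the distance budget are pruned, so you never pay the full O(n^2) per dictionary word:

```python
END = "$"  # end-of-word marker key; assumes no stored word contains "$"

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def fuzzy_search(trie, word, max_dist):
    """Return sorted (distance, word) pairs within max_dist edits.
    Each trie node gets one DP row; shared prefixes share rows."""
    results = []
    first_row = list(range(len(word) + 1))

    def walk(node, char, prev_row, prefix):
        row = [prev_row[0] + 1]
        for j in range(1, len(word) + 1):
            row.append(min(row[j - 1] + 1,                          # insertion
                           prev_row[j] + 1,                         # deletion
                           prev_row[j - 1] + (word[j - 1] != char)))  # substitution
        if END in node and row[-1] <= max_dist:
            results.append((row[-1], prefix))
        if min(row) <= max_dist:  # prune: this subtree can't get cheaper
            for ch, child in node.items():
                if ch != END:
                    walk(child, ch, row, prefix + ch)

    for ch, child in trie.items():
        if ch != END:
            walk(child, ch, first_row, ch)
    return sorted(results)
```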
Tries are also used to efficiently store names of functions in macOS / iOS binaries (Mach-O format) and work great for that purpose because a lot of functions have the same prefixes [0].
Indexes in MongoDB use a similar concept (they call it prefix compression [1]), which allows them to store indexes more efficiently.
We use them at Pyroscope, and we wrote a blog post about our storage design [2]. There are even some animations :)
The trie is probably my favourite data structure. I really enjoy this Python implementation I learned while studying LeetCode because it's so succinct. Really useful for interviews.
from collections import defaultdict

END = object()  # sentinel key marking the end of a complete word

def make_trie():
    return defaultdict(make_trie)

def insert(trie, word):
    for c in word:
        trie = trie[c]  # defaultdict creates missing nodes on the way down
    trie[END] = True
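Lookup on that structure needs only a little more care: indexing a defaultdict during reads would silently insert empty nodes, so membership tests must use `in`. A self-contained sketch (repeating the definitions above so it runs standalone), with a completion helper added:

```python
from collections import defaultdict

END = object()  # sentinel key marking the end of a complete word

def make_trie():
    return defaultdict(make_trie)

def insert(trie, word):
    for c in word:
        trie = trie[c]
    trie[END] = True

def contains(trie, word):
    # use `in`, not indexing, so lookups don't create phantom nodes
    node = trie
    for c in word:
        if c not in node:
            return False
        node = node[c]
    return END in node

def complete(trie, prefix):
    """All stored words starting with prefix, sorted."""
    node = trie
    for c in prefix:
        if c not in node:
            return []
        node = node[c]
    out = []
    def walk(n, suffix):
        if END in n:
            out.append(prefix + suffix)
        for c in n:
            if c is not END:
                walk(n[c], suffix + c)
    walk(node, "")
    return sorted(out)
```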
It's such a simple and elegant data structure. If you only care about word lookup, you can implement it in about 35 lines of modern JavaScript:
function trieBuilder(word_list) {
  const root = {};
  for (const word of word_list) {
    let node = root;
    for (const char of word) {
      let nextNode = node[char];
      if (!nextNode) node[char] = nextNode = {};
      node = nextNode;
    }
    node._ = 1; // mark the nodes that are endings of real words
  }

  function findChildren(node, prefix, list, maxLength) {
    if (node._ === 1) list.push(prefix);
    for (const char in node) {
      if (char === "_") continue; // skip the end-of-word marker
      findChildren(node[char], prefix + char, list, maxLength);
      if (list.length >= maxLength) return list;
    }
    return list;
  }

  function findSuffixes(prefix, maxLength) {
    prefix = prefix.toLowerCase();
    let node = root;
    for (const char of prefix) {
      let nextNode = node[char];
      if (!nextNode) return []; // no word starts with this prefix
      node = nextNode;
    }
    return findChildren(node, prefix, [], maxLength);
  }

  return {root, findSuffixes};
}
There is an option to get all suffixes without traversing the subtree, but it comes with extra O(N) memory, where N is the combined length of all stored words. Depending on the case, that might be acceptable, since the memory for storing the words themselves is O(N) anyway.
https://stackoverflow.com/a/29966616/2104560 (update 1 and update 3)
Note that to save memory, instead of using every character in the word, you may want to cap the depth of your trie at the first N chars. Words with a shared prefix are kept in an array at the leaves, and during lookups you scan the leaf array to find matches. You'd be trading a little speed (scanning small arrays is usually negligible) for a lot of space (far fewer nested objects).
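A sketch of that depth-capping idea (hypothetical names; the cap and the linear leaf scan are exactly the trade-off described above):

```python
def build_capped_trie(words, max_depth=3):
    """Trie over only the first max_depth chars; words pool in a leaf
    list under the key "words", scanned linearly at query time."""
    root = {}
    for w in words:
        node = root
        for ch in w[:max_depth]:
            node = node.setdefault(ch, {})
        node.setdefault("words", []).append(w)
    return root

def capped_complete(root, prefix, max_depth=3):
    node = root
    for ch in prefix[:max_depth]:
        if ch not in node:
            return []
        node = node[ch]
    # collect leaf lists under this node, filtering by the full prefix
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        out.extend(w for w in n.get("words", []) if w.startswith(prefix))
        stack.extend(v for k, v in n.items() if k != "words")
    return sorted(out)
```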
Fredkin's contributions were to reversible computation, and since quantum computers are a kind of reversible computer, many ideas and gate notions carried over.
I wrote a trie based offensive language filter for a chat system used by the NHL's original streaming games broadcasts, way back in '99. It was written in C, and used the PerfectHash program to produce the hash table used by the trie to contain a dictionary of every known curse word, slang variant, and common misspellings in about a dozen languages. About 5 years ago I was looking at some open source chat systems, and I saw my old offensive language filter's entire build system in that open source project, pretty much just as I left it so many years ago. Someone else's name was on it, but in the C comments there's my name in the source files.
I used to hate tries, mostly because I refused to take the time to understand them, but it is a pretty cool concept with a narrow but very practical use case. It does what it does really well, nothing more.
However, the idea of implementing it in JS gives me the impression that the author is advocating implementing autocomplete on the client (browser) [1]. I would encourage developers not to do so by default. Depending on the data set to be rendered, that can be a big perf hit or simply infeasible; but if the data set is static and has a small memory footprint (e.g., states in the US), then it's fine to send that data to the client and process it there.
My general rule of thumb - limit business logic to the server and rendering logic to the browser.
You can implement server-side autocomplete: send the partial search string as a request to the server, which responds with the list of autocomplete strings to be rendered in the UI, debouncing [2] to limit the number of requests (instead of requesting on every keystroke). Of course, this has its own trade-offs (API requests introduce their own perf concerns), but you want this route if your search data set is dynamic and sufficiently large.
[1] Not sure if that was the author's intention bc I cannot access the author's blog on my work machine.
I wrote a little Trie in Javascript for practice a while back. It flattens the data into a single array for better locality. I think it ends up being pretty quick for lookups (but not insertions).
If I understand correctly, you're wondering how you might match "a bro do" with "a brown dog" or "a brown door", etc., similar to what Google does. I've done this before. It's not a single data structure, but a process.
First, you normalize your strings to index (convert to lower case, throw out stop words such as "a", "the", "is", etc.) then you generate prefixes (or n-grams, I suppose). Say you want to do musicians and you have "Bob Dylan" as an entry. That would become "bob dylan" which would generate the prefixes "bo", "bob", "bob d", etc. You include spaces but once you hit a space, you start generating new prefixes. So you would also start doing "dy", "dyl", "dyla", and "dylan". The max length of the prefixes depends on how much memory you're willing to use.
Let's say Bob Dylan is mapped to some musician table in your DB and he has id 1000. I've done this with Redis, but you can use a trie as well. You'd store every prefix in the trie, and the data on the node would be an array of musician ids. Now the magic: if you search "bo dyl", you would look up the prefix "bo" and grab all the ids, then look up the prefix "dyl" and grab those ids (you would also try the full "bo dyl" first, but let's assume nothing is there). Once you have those two arrays, you take their intersection, which in our case leaves an array containing id = 1000, along with any others that matched both.
That's, in general, how a simple autocomplete works. But you're better off using Redis or Elasticsearch if your data set is large. And if it's not, well... a simple hashmap is probably competitive and easier than tries.
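The prefix-indexing and intersection steps above can be sketched like this (a simplification: cross-space prefixes like "bob d" are omitted, and the prefix length is capped at 5):

```python
STOP_WORDS = {"a", "an", "the", "is"}

def prefixes(name, min_len=2, max_len=5):
    """Generate the indexable prefixes of each token in a name."""
    out = set()
    for token in name.lower().split():
        if token in STOP_WORDS:
            continue
        for n in range(min_len, min(len(token), max_len) + 1):
            out.add(token[:n])
    return out

def build_index(entries):
    """entries: {id: name}. Returns a prefix -> set-of-ids map
    (each prefix could equally be a node in a trie)."""
    index = {}
    for id_, name in entries.items():
        for p in prefixes(name):
            index.setdefault(p, set()).add(id_)
    return index

def search(index, query):
    """Intersect the id sets matched by each query token."""
    ids = None
    for token in query.lower().split():
        hits = index.get(token[:5], set())  # cap matches max_len above
        ids = hits if ids is None else ids & hits
    return ids or set()
```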
There's a searching concept known as edit (Levenshtein) distance that is close to, but not exactly, what you're talking about. To rephrase your question: is there a data structure that can pre-compute the Levenshtein distance for a set of results?
Yes, and it's crazy complicated. See Levenshtein Automata[1]. In reality, I would let the professionals handle this by either using Open/ElasticSearch or Apache Lucene.
A quick hack for this is to create variations and include them in your main trie (with links to the canonical value/object/etc.). E.g. you might create a variation w/o vowels.
There are heuristics you can apply here rather than implementing full fuzziness or levenshtein comparators etc. Ultimately it depends on your use-case and what common variations users might try to use.
Not exactly what you want, but a second trie keyed on the starting letters, with the complete word as the value, might work for you. Your ABD example would work, but something like ABrD would not.
I remember having a class "competition" my freshman year of college to implement the fastest autocomplete system. It had to be Java, so I used JNI to call into a well-optimized set of C routines for constructing and querying a trie. I ended up winning by a lot - no one else even implemented a trie. I spent a lot of time making sure that I would have the fastest trie implementation, but that wasn't even necessary.
I first encountered a trie in code storing Unix network group mappings. Since IP addresses in a domain mostly vary at the end, many addresses share the same first part. This greatly saved space and lookup time.
What's kind of funny was I was reading something that talked about tries as if they were an exotic data structure and said something along the lines of "most developers will never implement a trie" and I had just finished implementing a trie that day.
Yes, that's also the data structure you want to use when working on a URL router, for instance in an HTTP framework.
Although, unless you have 10,000 different HTTP routes, the performance gains might not be that significant all things considered.
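For illustration, a sketch of that kind of router: path segments become trie edges, and `:name` segments capture parameters (hypothetical names, not any particular framework's API):

```python
class RouteTrie:
    """Path segments as trie edges; ':name' segments capture parameters."""
    def __init__(self):
        self.children = {}
        self.param = None      # (param_name, child) for a ':param' edge
        self.handler = None

    def add(self, path, handler):
        node = self
        for seg in path.strip("/").split("/"):
            if seg.startswith(":"):
                if node.param is None:
                    node.param = (seg[1:], RouteTrie())
                node = node.param[1]
            else:
                node = node.children.setdefault(seg, RouteTrie())
        node.handler = handler

    def match(self, path):
        """Return (handler, params) or None. Literal edges win over params."""
        node, params = self, {}
        for seg in path.strip("/").split("/"):
            if seg in node.children:
                node = node.children[seg]
            elif node.param is not None:
                params[node.param[0]] = seg
                node = node.param[1]
            else:
                return None
        return (node.handler, params) if node.handler else None
```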
I learned about tries by leetcoding. This is exactly why I liked leetcoding. Because tries are awesome, and I was already imagining what one could build with it.
[1] https://github.com/typesense/typesense
[2] https://norvig.com/spell-correct.html
[0] https://www.m4b.io/reverse/engineering/mach/binaries/2015/03...
[1] https://docs.mongodb.com/manual/reference/glossary/#std-term...
[2] https://github.com/pyroscope-io/pyroscope/blob/main/docs/sto...
[2] https://lodash.com/docs/4.17.15#debounce
As in a spreadsheet-based prefix tree?
Can I see it?
It's a hash array mapped trie (https://en.wikipedia.org/wiki/Hash_array_mapped_trie), which is used in Scala's immutable HashMap: https://dotty.epfl.ch/api/scala/collection/immutable/HashMap...
https://github.com/mgraczyk/fast-trie-js/blob/master/index.j...
I used it for this little demo:
https://assets.opentoken.com/demos/search/index.html
[1] http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levensht...