top | item 29078919

Trie in JavaScript: The data structure behind autocomplete

157 points | EntICOnc | 4 years ago | stackfull.dev

56 comments

[+] rockwotj|4 years ago|reply
Tries are fun structures!

However, for autocomplete you often want a weighted Trie because you have extra information you want to weight nodes by. An example with contacts is that you often want recent and frequent contacts.

My company has an open source trie implementation here for a client to do weighted contact auto complete: https://github.com/shortwave/trie
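The core of a weighted trie can be sketched in a few lines of JavaScript (a toy shape for illustration, not the shortwave/trie API):

```javascript
// Sketch of a weighted trie: terminal nodes carry a weight (e.g. contact
// recency/frequency), and completions are ranked by it.
function makeNode() {
  return { children: {}, weight: 0 }; // weight > 0 marks a word end
}

function insert(root, word, weight) {
  let node = root;
  for (const ch of word) {
    node.children[ch] ??= makeNode();
    node = node.children[ch];
  }
  node.weight = weight;
}

function complete(root, prefix, limit = 5) {
  let node = root;
  for (const ch of prefix) {
    node = node.children[ch];
    if (!node) return [];
  }
  const results = [];
  (function walk(n, word) {
    if (n.weight > 0) results.push({ word, weight: n.weight });
    for (const ch in n.children) walk(n.children[ch], word + ch);
  })(node, prefix);
  return results
    .sort((a, b) => b.weight - a.weight) // heaviest (most frequent) first
    .slice(0, limit)
    .map((r) => r.word);
}
```

So after `insert(root, "alice", 10)` and `insert(root, "albert", 3)`, `complete(root, "al")` ranks "alice" ahead of "albert".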

[+] bitexploder|4 years ago|reply
I first learned about Tries when I implemented a spell checker, almost 20 years ago now, with basic suggestions. It’s amazing how easy it is to get an efficient, 80% spell checker and recommendation engine implemented from scratch once you dig into it. Tries make for an efficient enough in memory lookup structure (space, storage, and compute), and are an obvious part of a suggestion engine as well. Coupled with a basic soundex and some word mangling + edit distance checking and you have a very functional system. Solving the last 20% takes an order of magnitude more work :)
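The edit-distance part of such a checker is a short dynamic program; a textbook Levenshtein sketch in JavaScript:

```javascript
// Textbook Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions turning string a into string b.
function editDistance(a, b) {
  // dp[i][j] = distance between the first i chars of a and first j chars of b
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // delete from a
        dp[i][j - 1] + 1, // insert into a
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitute
      );
    }
  }
  return dp[a.length][b.length];
}
```

For suggestions, candidates from the trie within a small distance of the typed word (say, 1 or 2) get offered first.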
[+] thatswrong0|4 years ago|reply
Oh sick we might have to switch to this for our autocomplete / autosuggestion functionality - precisely because it's lacking weighting :)
[+] karterk|4 years ago|reply
There is another aspect of tries that makes them really useful: you can do fast, fuzzy word searches on a trie.

In Typesense[1], I've implemented fuzzy search based on Damerau-Levenshtein distance and it's incredibly fast. I've found this approach to be a much better (faster + more flexible) alternative to Peter Norvig's brute-force spell-checker, which is quite a popular post [2].

[1]: https://github.com/typesense/typesense [2]: https://norvig.com/spell-correct.html

[+] bollu|4 years ago|reply
Does the trie help in memoizing the edit distance between multiple words? Or do you pay the O(n^2) cost per word? I'd love to hear details of how this works!
[+] petethepig|4 years ago|reply
Tries are also used to efficiently store names of functions in macOS / iOS binaries (Mach-O format) and work great for that purpose because a lot of functions have the same prefixes [0].

Indexes in mongodb use a similar concept (they call that prefix compression [1]), and that allows them to store indexes more efficiently.

We use them at Pyroscope and we wrote a blog post about our storage design [2] There are even some animations :)

[0] https://www.m4b.io/reverse/engineering/mach/binaries/2015/03...

[1] https://docs.mongodb.com/manual/reference/glossary/#std-term...

[2] https://github.com/pyroscope-io/pyroscope/blob/main/docs/sto...

[+] aleksiy123|4 years ago|reply
Trie is probably my favourite data structure. I really enjoy this Python implementation I learned while studying leetcode because it's so succinct. Really useful for interviews.

    from collections import defaultdict
    
    END = object()

    def make_trie():
        return defaultdict(make_trie)
    
    def insert(trie, word):
        for c in word:
            trie = trie[c]
        trie[END] = True
[+] dmurray|4 years ago|reply
I can't figure out how this works. Shouldn't insert return a reference to the new trie?
[+] vanderZwan|4 years ago|reply
It's such a simple and elegant data structure. If you only care about word lookup you can implement it in just 35 lines of modern JavaScript:

    function trieBuilder(word_list) {
      const root = {};
      for (const word of word_list) {
        let node = root;
        for (const char of word) {
          let nextNode = node[char];
          if (!nextNode) node[char] = nextNode = {};
          node = nextNode;
        }
        node._ = 1; // mark the nodes that are endings of real words
      }

      function findChildren(node, prefix, list, maxLength) {
        if (node._ === 1) list.push(prefix);
        for (const char in node) {
          findChildren(node[char], prefix + char, list, maxLength);
          if (list.length >= maxLength) return list;
        }
        return list;
      }

      function findSuffixes(prefix, maxLength) {
        prefix = prefix.toLowerCase();
        let node = root;
        for (const char of prefix) {
          let nextNode = node[char];
          if (!nextNode) return [""];
          node = nextNode;
        }
        let words = findChildren(node, prefix, [], maxLength);
        return words;
      }

      return {root, findSuffixes};
    }

demo: https://observablehq.com/@jobleonard/autocomplete
[+] baur|4 years ago|reply
Nice implementation!

There is an option to get all suffixes without traversing the subtree, but it comes with extra O(N) memory, where N is the combined length of all stored words. Depending on the use case this might be acceptable, since storing the words themselves is O(N) anyway. https://stackoverflow.com/a/29966616/2104560 (update 1 and update 3)

[+] psadri|4 years ago|reply
Note that to save memory, instead of using every character in the word, you may want to cap the depth of your trie to the first N chars. The words with shared prefixes are kept in an array at the leaves. During lookups, you'd scan the leaf array to find matches. You'd be trading a little speed (scanning small arrays is negligible) for space (far fewer nested objects).
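A sketch of that trade-off (depth cap and names are illustrative):

```javascript
// Depth-capped trie: only the first MAX_DEPTH characters index the tree;
// all words sharing that capped prefix sit in a plain array at the node.
const MAX_DEPTH = 3;

function insert(root, word) {
  let node = root;
  for (const ch of word.slice(0, MAX_DEPTH)) {
    node = node[ch] ??= {};
  }
  (node.words ??= []).push(word); // bucket of words under this prefix
}

function find(root, prefix) {
  let node = root;
  // Walk down as far as the capped trie goes...
  for (const ch of prefix.slice(0, MAX_DEPTH)) {
    node = node[ch];
    if (!node) return [];
  }
  // ...then scan the leaf buckets for the full prefix.
  const out = [];
  (function walk(n) {
    if (n.words) out.push(...n.words.filter((w) => w.startsWith(prefix)));
    for (const key in n) if (key !== "words") walk(n[key]);
  })(node);
  return out;
}
```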
[+] ufo|4 years ago|reply
Another way to save space is to use a directed acyclic word graph (DAWG), especially if there are common suffixes that show up all the time.
[+] danbmil99|4 years ago|reply
Fun fact: the trie data structure was invented by Ed Fredkin, a pioneer in Quantum computing as well as cellular automata
[+] Cybiote|4 years ago|reply
This is the best kind of correct but it's an interesting phrasing since Fredkin is a kind of Quantum supremacy skeptic (hotkey-find Classicatopia in: https://windowsontheory.org/2017/10/30/the-different-forms-o...).

Fredkin's contributions were to reversible computation, and since quantum computers are a kind of reversible computer, many ideas and gate notions carried over.

[+] bsenftner|4 years ago|reply
I wrote a trie-based offensive language filter for a chat system used by the NHL's original streaming game broadcasts, way back in '99. It was written in C, and used the PerfectHash program to produce the hash table used by the trie to hold a dictionary of every known curse word, slang variant, and common misspelling in about a dozen languages. About 5 years ago I was looking at some open source chat systems, and I saw my old offensive language filter's entire build system in one of them, pretty much just as I left it so many years ago. Someone else's name was on it, but my name is still there in the C comments in the source files.
[+] game_the0ry|4 years ago|reply
I used to hate tries, mostly bc I refused to take the time to understand them, but it is a pretty cool concept with a narrow but very practical use case. It does what it does really well, nothing more.

However, the idea of implementing it in JS gives me the impression that the author is advocating implementing autocomplete on the client (browser) [1]. I would encourage developers not to do so. Depending on the data set to be rendered, that could be a big perf hit or simply not feasible. But if the data set is static or has a small memory footprint (e.g., states in the US), then it should be fine to send that data to the client and process it there.

My general rule of thumb - limit business logic to the server and rendering logic to the browser.

You can implement server-side autocomplete: send the partial search string as a request to the server, which responds with the list of autocomplete strings to be rendered in the UI, and debounce [2] to limit the number of requests (instead of firing one on every change to the partial search string). Of course, this has its trade-offs (API requests introduce their own perf concerns), but you want to go this route if your search data set is dynamic and sufficiently large.

[1] Not sure if that was the author's intention bc I cannot access the author's blog on my work machine.

[2] https://lodash.com/docs/4.17.15#debounce
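The debounced request flow can be sketched in a few lines (the `/api/suggest` endpoint and the render function are hypothetical stand-ins):

```javascript
// Debounced server-side autocomplete: only fire a request once the user
// has paused typing for `delay` ms.
function debounce(fn, delay) {
  let timer;
  return (...args) => {
    clearTimeout(timer); // every call resets the timer
    timer = setTimeout(() => fn(...args), delay);
  };
}

const renderSuggestions = (items) => console.log(items); // stand-in for real UI code

const fetchSuggestions = debounce(async (query) => {
  const res = await fetch(`/api/suggest?q=${encodeURIComponent(query)}`);
  renderSuggestions(await res.json());
}, 250);

// Wire it to the input: requests only go out after a 250 ms pause in typing.
// searchInput.addEventListener("input", (e) => fetchSuggestions(e.target.value));
```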

[+] 5tefan|4 years ago|reply
I implemented them several times in Excel. Beats most lookups in terms of fun and speed.
[+] vanderZwan|4 years ago|reply
I'm sorry, did you just say you implemented prefix trees in Excel?

As in a spreadsheet-based prefix tree?

Can I see it?

[+] mgraczyk|4 years ago|reply
I wrote a little Trie in Javascript for practice a while back. It flattens the data into a single array for better locality. I think it ends up being pretty quick for lookups (but not insertions).

https://github.com/mgraczyk/fast-trie-js/blob/master/index.j...

I used it for this little demo:

https://assets.opentoken.com/demos/search/index.html

[+] trinovantes|4 years ago|reply
The basic trie only matches exact consecutive characters. Does anyone know the data structure/trie variant for matching partial sequential characters?

    e.g. ABD matches A Brown Dog
                     ^ ^     ^
[+] deckard1|4 years ago|reply
If I understand correctly, you're wondering how you might match "a bro do" with "a brown dog" or "a brown door", etc. Similar to what Google does. I've done this before. It's not a single data structure, but a process.

First, you normalize your strings to index (convert to lower case, throw out stop words such as "a", "the", "is", etc.) then you generate prefixes (or n-grams, I suppose). Say you want to do musicians and you have "Bob Dylan" as an entry. That would become "bob dylan" which would generate the prefixes "bo", "bob", "bob d", etc. You include spaces but once you hit a space, you start generating new prefixes. So you would also start doing "dy", "dyl", "dyla", and "dylan". The max length of the prefixes depends on how much memory you're willing to use.

Let's say Bob Dylan is mapped to some musician table in your DB and he has id 1000. I've done this with Redis, but you can use a trie as well. You'd store every prefix in the trie and the data on the node would be an array of musician ids. Now the magic: if you search "bo dyl" you would look up the prefix "bo", grab all the ids and then look up the prefix "dyl" and grab the ids (you would also try the full "bo dyl" first, but let's just assume nothing is there). Once you have those two arrays, you would take the intersection of them. Which, in our case, would leave an array containing id = 1000 along with any others that matched both.

That's, in general, how a simple autocomplete works. But you're better off using Redis or Elasticsearch if your data set is large. And if it's not, well... a simple hashmap is probably competitive and easier than tries.
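That process can be sketched with a plain Map standing in for Redis or a trie (data hypothetical, simplified to per-word prefixes without stop-word removal):

```javascript
// Multi-word prefix index: every prefix of every word maps to a set of ids,
// and a multi-token query intersects the id sets of its tokens.
function buildIndex(entries) {
  const index = new Map(); // prefix -> Set of ids
  for (const { id, name } of entries) {
    for (const word of name.toLowerCase().split(/\s+/)) {
      for (let len = 1; len <= word.length; len++) {
        const prefix = word.slice(0, len);
        if (!index.has(prefix)) index.set(prefix, new Set());
        index.get(prefix).add(id);
      }
    }
  }
  return index;
}

function search(index, query) {
  const tokens = query.toLowerCase().split(/\s+/).filter(Boolean);
  let result = null;
  for (const token of tokens) {
    const ids = index.get(token) ?? new Set();
    // Intersect with the ids matched by the previous tokens.
    result = result === null ? ids : new Set([...result].filter((id) => ids.has(id)));
  }
  return [...(result ?? new Set())];
}
```

With `{ id: 1000, name: "Bob Dylan" }` indexed, searching "bo dyl" intersects the "bo" and "dyl" id sets and leaves just 1000.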

[+] aelzeiny|4 years ago|reply
There's a searching concept known as edit (Levenshtein) distance that is close to, but not exactly, what you're talking about. To rephrase your question: is there a data structure that can pre-compute the Levenshtein distance for a set of results?

Yes, and it's crazy complicated. See Levenshtein Automata[1]. In reality, I would let the professionals handle this by either using Open/ElasticSearch or Apache Lucene.

[1] http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levensht...

[+] padolsey|4 years ago|reply
A quick hack for this is to create variations and include them in your main trie (with links to the canonical value/object/etc.). E.g. you might create a variation w/o vowels.

There are heuristics you can apply here rather than implementing full fuzziness or levenshtein comparators etc. Ultimately it depends on your use-case and what common variations users might try to use.
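A sketch of that variation trick (vowel-stripping only, with a Map standing in for the trie):

```javascript
// Index each entry under its canonical form plus a vowel-stripped variant,
// every variant pointing back at the same canonical value.
function variants(word) {
  const stripped = word.replace(/[aeiou]/g, "");
  return stripped && stripped !== word ? [word, stripped] : [word];
}

function buildLookup(words) {
  const lookup = new Map(); // variant -> canonical word
  for (const word of words) {
    for (const v of variants(word)) lookup.set(v, word);
  }
  return lookup;
}
```

So a lookup for "brwn" resolves to the canonical "brown" without any edit-distance machinery.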

[+] TYMorningCoffee|4 years ago|reply
Not exactly what you want, but a second trie keyed on the starting letters, with the complete word as the value, might work for you. Your ABD example would work but something like ABrD would not.
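A sketch of that initials index (names hypothetical, with a Map in place of the second trie):

```javascript
// Second index keyed on the initials of each phrase, mapping back to the
// full phrase: "abd" finds "A Brown Dog", but "abrd" will not match.
function initialsKey(phrase) {
  return phrase.toLowerCase().split(/\s+/).map((w) => w[0]).join("");
}

function buildInitialsIndex(phrases) {
  const index = new Map(); // initials -> phrases
  for (const phrase of phrases) {
    const key = initialsKey(phrase);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(phrase);
  }
  return index;
}
```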
[+] Rd6n6|4 years ago|reply
Or for common misspellings on mobile?
[+] nomaxx117|4 years ago|reply
I remember having a class "competition" my freshman year of college to implement the fastest autocomplete system. It had to be Java, so I used JNI to call into a well-optimized set of C routines for constructing and querying a trie. I ended up winning by a lot - no one else even implemented a trie. I spent a lot of time making sure that I would have the fastest trie implementation, but that wasn't even necessary.
[+] 0xdky|4 years ago|reply
I first encountered Trie in code storing Unix network group mappings. Since IP addresses in a domain mostly vary at the end, a lot of IP addresses share the same first part. This greatly saved space and lookup time.
[+] dhosek|4 years ago|reply
What's kind of funny was I was reading something that talked about tries as if they were an exotic data structure and said something along the lines of "most developers will never implement a trie" and I had just finished implementing a trie that day.
[+] throw_m239339|4 years ago|reply
Yes, that's also the data structure you want when working on a URL router, for instance in an HTTP framework. Although, unless you have tens of thousands of different HTTP routes, the performance gains might not be that significant all things considered.
[+] mettamage|4 years ago|reply
I learned about tries by leetcoding. This is exactly why I liked leetcoding. Because tries are awesome, and I was already imagining what one could build with it.