Compressing Scrabble Dictionaries

[+] binarymax|11 years ago|reply

I'm not really a data structures guy, but I love anagrams. When i wrote the anagramica API, the simplest way that I could come up with a fast search was this:

    - Take a word and sort its characters.
    - Add it to a dictionary where the key is the sorted characters and the value is the word.
    - If the sorted characters already exist in the dictionary then add it to the list of words for the same key.

This gives O(log n) when you give it a list of letters and you need to find all the possible words.

What benefits do GADDAG offer over the above?

[+] lifthrasiir|11 years ago|reply

Since finding a set of possible moves given the current Scrabble board is not quite equal to finding anagrams. To make a move in Scrabble you need to find words that, for example, have both E and L characters which are separated by two other letters (i.e. /^.{0,}E..L.{0,}$/) when the board contains a row or column with the same pattern (E, followed by two spaces and then L). A simple anagram list cannot efficiently search for such patterns.

[+] nathell|11 years ago|reply

When you are building a Scrabble engine and need to construct the list of potential moves, GADDAGs come in handy as you are able to "cast" your search from subgraphs "anchored" at letters already on board.

A sample implementation (in not very idiomatic Clojure, not touched in years, and using DAWGs instead of GADDAGs): https://github.com/nathell/spleen/blob/master/src/pl/danielj...

[+] emillon|11 years ago|reply

I believe that there is no benefit in using a GADDAG. Finding an anagram is an easier problem as you know that you will use all the letters, so sorting them is the optimal solution. But this method does not work with scrabble since you would have (at least) to find all anagrams of all subsets, which adds an exponential step.

[+] dbaupp|11 years ago|reply

In O(log n), what is n? The number of characters? I would've thought that was O(n log n) (due to the sort).

[+] btn|11 years ago|reply

The node packing format he describes sounds a bit like a LOUDS tree [1], which stores the structure of a tree as a bit array (each node as a '1' for each child, plus a '0'---for a total of 2n-1 bits for a tree of n nodes), and the data in a separate packed array. It can't represent the node-deduplication (nodes with multiple parents), but I think it gives comparable compression: for the full word list of 3,213,156 nodes, the tree structure is 6,426,311 bits (0.76MB), plus 3,213,156 bytes of character data---for 3.83MB total.

The downside is that traversing the tree is a series of linear bit-counting operations---which can be painfully show without a bit of pre-caching.

[1]: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/aladdin/wwwloca...

[+] unknown|11 years ago|reply

[deleted]

[+] SeanDav|11 years ago|reply

What open source dictionaries are being used by these scrabble programs? (I know I could bing (or DDG) the answer but would like to hear from an "insider" if possible.)

[+] baking|11 years ago|reply

Not an "insider" but wikipedia points to:

http://www.freescrabbledictionary.com/twl06.txt

[+] willvarfar|11 years ago|reply

The contest linked at the top of the article has a standard word list. You should enter! :)

[+] recursive|11 years ago|reply

It's free to join. There's no such thing as an insider. Anyway, they are using the Enhanced North American Benchmark Lexicon

[+] rickbradley|11 years ago|reply

This appears to be just reinvention of known algorithms on suffix trees. I recommend (and recommended as a comment on the original article) Dan Gusfield's book "Algorithms on Strings, Trees, and Sequences" which does a pretty thorough job of covering the relevant algorithms and data structures.

[+] willvarfar|11 years ago|reply

The GADDAG is a standard datastructure - the original paper, linked from the article, is from 1994.

The article doesn't pretend to invent the GADDAG, nor claim to compress it better than others, only to try and explain how to simplify and pack a GADDAG.

The steps would work on all DAGs generally. This is nothing new, but hopefully its new to some of us and a nice article.

19 comments