top | item 212221

A Spellchecker Used to Be a Major Feat of Software Engineering

47 points| breily | 18 years ago |prog21.dadgum.com | reply

22 comments

order
[+] nirmal|18 years ago|reply
http://www.norvig.com/spell-correct.html

A simple Python based spell checker along with some discussion of probability and the pitfalls of doing things the naive way.

[+] aston|18 years ago|reply
Norvig's code solves an even more difficult problem, spell correcting. But yeah, good read nonetheless.
[+] pistoriusp|18 years ago|reply
I remember bloom filters (http://en.wikipedia.org/wiki/Bloom_filter) been good for this sort of thing.
[+] jrockway|18 years ago|reply
Thanks for this link. Much more interesting than the article.
[+] aston|18 years ago|reply
The idea would be to read in the entire dictionary and mark the Bloom Filter for each word. Then whenever you get a word you want to test, see whether all of the places it would mark were previously marked. If it misses any, you know it's not a word. Unfortunately, if it doesn't miss, you're only sure it's probably a word (with the worst case probability controlled by the number of hash functions you use).
[+] pfedor|18 years ago|reply
Looking up a word in a dictionary is only part of what a spellchecker does. The other part is offering suggestions. This one is still not easy, especially if you want it to give reasonable suggestions for all languages. For example, the algorithm vim 7.0 uses (based on tries) was first described in a research paper but was not implemented until Bram Moolenaar did it, because the implementation was hard.
[+] dfranke|18 years ago|reply
I don't get what the big deal is. As long as the document you're checking fits in RAM, then you can just sort the words alphabetically and then check them in a single pass through the dictionary. If you really need random access, use a B+ tree.
[+] tom_rath|18 years ago|reply
Where do you plan to store the dictionary for reference?
[+] hugh|18 years ago|reply
Isn't alphabetically sorting a document containing (potentially) tens of thousands of words going to be pretty slow on a machine of this vintage? Isn't it a bit perverse to have a spellcheck which scales worse than O(N)?
[+] falsestprophet|18 years ago|reply
It still is I think. Nothing seems to compare to Google's spell check.
[+] ashleyw|18 years ago|reply
Yes, and in some situations, you can easily tap into googles spell check for you own automated needs.
[+] tptacek|18 years ago|reply
It was just as trivial then, in decent languages, as it is now. Conversely, if you make your data set large enough (say, the whole web as a corpus, not /usr/share/dict/words), the same problem is just as hard today.
[+] jrockway|18 years ago|reply
Executive summary: "programming was harder when there was less abstraction".

I was hoping for a discussion of some algorithms, but instead I just got a long winded "wow, progress is amazing" rant.

This post has inspired me to start a new blog where I state the obvious and then submit it to every social news site. My first article will be "the sky is blue" except I will explain that in 1000 words or more. With ads.

[+] xlnt|18 years ago|reply
now that you've given away the summary, i'm not going to read your post. and i was planning to click on lots of ads.