
A spellchecker used to be a major feat of software engineering (2008)

314 points| rrampage | 3 years ago |prog21.dadgum.com | reply

180 comments

[+] knodi123|3 years ago|reply
I read an article where a software engineer was about to go on a long plane flight, so he downloaded a file of all English words and tried to write a spell checker during his flight, with no internet access, just as an exercise. By the time he landed he had finished, and his algorithm was beautiful, simple, elegant, easy to understand, and not something I'd have come up with in a million years.

It was actually a little depressing - it made me realize that some people just have a "something" that I'm not going to get through hard work or whatnot.

*edit: ah, found it. https://norvig.com/spell-correct.html
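For readers who don't want to click through: the heart of that post is to generate every string within one or two edits of the input and pick the candidate that is most frequent in a training corpus. A condensed sketch of that approach (with a toy corpus standing in for the large text file Norvig trains on):

```python
import re
from collections import Counter

# Toy corpus; Norvig's original counts words from a large text file.
WORDS = Counter(re.findall(r'\w+', "the quick brown fox spelling is the thing".lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only candidates that appear in the corpus."""
    return {w for w in words if w in WORDS}

def correction(word):
    """Most frequent known word within edit distance 0, then 1, then 2."""
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=WORDS.get)
```

With this toy corpus, `correction("teh")` returns "the" and `correction("spellng")` returns "spelling". The "argmax over an okay-ish heuristic" framing a sibling comment mentions is exactly that final `max` call.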

[+] andersource|3 years ago|reply
To be fair, Peter Norvig [0] isn't just another software engineer, he's a hardcore CS researcher, co-authored the most popular (pre-DL) AI textbook, was head of NASA's Computational Sciences Division, etc.

[0] https://en.wikipedia.org/wiki/Peter_Norvig

[+] azurelake|3 years ago|reply
I mean, he definitely knew about the concept of edit distance: https://en.wikipedia.org/wiki/Edit_distance before writing this code, which is the most difficult part. Here's a practice problem: https://leetcode.com/problems/edit-distance/.

So in this case, the special "something" he has is quite literally the amount of hard work he put in to learn this stuff. And if you take this opportunity to learn about edit distance, you'll be one step closer to that special someone you want to be.
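For anyone taking up that suggestion, the standard dynamic-programming solution to Levenshtein edit distance fits in about a dozen lines:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

The textbook example: `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion).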

[+] JoshCole|3 years ago|reply
You know how there is that joke about reasoning about multi-dimensional structures?

"To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it."

– Geoffrey Hinton, A geometrical view of perceptrons

Well, you actually can come up with Norvig's spellchecker very easily with a similar trick. Just tell yourself: to deal with a really hard problem, just argmax with an okayish heuristic. Everyone does it.

Solutions like this seem impossibly elegant because they tolerate a whole lot of error, but they are actually extremely obvious when you tolerate error.

[+] TacticalCoder|3 years ago|reply
It's really nice and I'm a fan of the man but... even back then it was a really simple spellchecker. Back then, algorithms like Metaphone and Double Metaphone already existed (suggesting candidates based on similar-sounding words). Fun algorithms, 'em metaphones, I'd say. Relatively easy to implement/port (been there, done that).

I'm sure there's better stuff now.

Also, it's possible to rank candidate fixes better by taking the layout of the keyboard into account (e.g. on a QWERTY keyboard, if you see "wery" it is, despite the edit distance of 1, highly unlikely that the person meant to type "very"; "weary" is much more likely). So just sorting by edit distance and then suggesting the most common word often doesn't put the best candidate first.
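A minimal sketch of this idea: run the usual Levenshtein DP, but make substitutions between neighboring keys cheaper than substitutions between distant ones. The adjacency table and cost values below are illustrative assumptions, not a complete QWERTY model:

```python
# Partial, hypothetical QWERTY adjacency table; just enough for this example.
QWERTY_NEIGHBORS = {
    'w': set('qase'), 'e': set('wsdr'), 'v': set('cfgb'),
    'r': set('edft'), 'y': set('tghu'),
}

def sub_cost(a: str, b: str) -> float:
    """Substituting an adjacent key is a cheap 'fat finger'; a distant key is not."""
    if a == b:
        return 0.0
    return 0.5 if b in QWERTY_NEIGHBORS.get(a, set()) else 1.5

def typo_distance(typed: str, cand: str) -> float:
    """Levenshtein distance with keyboard-weighted substitutions."""
    m, n = len(typed), len(cand)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,   # typed an extra character
                          d[i][j - 1] + 1,   # missed a character
                          d[i - 1][j - 1] + sub_cost(typed[i - 1], cand[j - 1]))
    return d[m][n]
```

Under these weights, `typo_distance("wery", "weary")` is 1.0 (one missed letter) while `typo_distance("wery", "very")` is 1.5, because 'w' and 'v' are not neighbors; this matches the intuition above even though both words are one plain edit away.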

And then context. Ah, context. I type this and my browser isn't aware I'm talking about algorithms and not only says "metaphone" is incorrect but suggests "megaphone".

So... The matter still isn't settled if you ask me.

[+] GuB-42|3 years ago|reply
This kind of feat is the exact opposite of what the article is about.

This code is very simple, very short, very unoptimized, and done in a few hours, free solo. It's the kind of code the article could have linked as an example of how simple it is now to write a spellchecker; in fact, the article explicitly mentions that you can do it in a few lines of Python, i.e. exactly this.

What the article is about is complex, highly optimized code that took months to do, because computers at the time didn't even have enough space to store the complete word list uncompressed, neither in storage nor in memory, and even with compression, you had to be clever.

[+] stevenspasbo|3 years ago|reply
You're in for a bad time if you're going to be comparing yourself to Peter Norvig.
[+] wheels|3 years ago|reply
The prowess here is greatly overstated. His implementation would be pretty obvious to anyone who had read an information retrieval textbook at any point in the couple of decades between when these methods were devised and when he wrote that post.

That's kind of like watching someone use Red-Black Trees and assuming that they intuited them out of thin air. It's just remembering a basic algorithm that would be familiar to anyone in his particular field.

[+] feoren|3 years ago|reply
These ideas are extremely powerful. I built a spell-checker largely based on this article, by parsing English Wikipedia. At scale it needs a trie and a distance metric with better-scaling performance, but it works really well. These go together: your error metric drives a priority-based multi-headed traversal of your trie, which emulates comparing your word to every word in your dictionary; you can compare against a few million words very quickly.

Because it's based on Wikipedia, it can correct proper names too. It's very extensible: by changing the error model, you can get fuzzy auto-complete that can look at "Shwarz" and suggest "Schwarzenegger" (even though you missed the 'c'). You can extend it to look at multiple words instead of just letters, to correct common word-use issues as well.
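The traversal the parent comment describes can be sketched by carrying one row of the Levenshtein matrix down each trie edge and pruning any branch whose best possible distance already exceeds the budget. This is a simplified depth-first version of that idea (a production version would use a priority queue and a better-scaling metric):

```python
def build_trie(words):
    """Nested-dict trie; '$' marks end-of-word and stores the word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w
    return root

def fuzzy_search(trie, word, max_dist):
    """All dictionary words within max_dist edits of word, via one trie walk."""
    results = []
    first_row = list(range(len(word) + 1))  # Levenshtein row for empty prefix

    def walk(node, ch, prev_row):
        # Extend the DP matrix by one row for this trie edge's character.
        row = [prev_row[0] + 1]
        for j in range(1, len(word) + 1):
            row.append(min(row[j - 1] + 1, prev_row[j] + 1,
                           prev_row[j - 1] + (word[j - 1] != ch)))
        if '$' in node and row[-1] <= max_dist:
            results.append((node['$'], row[-1]))
        if min(row) <= max_dist:  # prune branches that can't recover
            for c, child in node.items():
                if c != '$':
                    walk(child, c, row)

    for c, child in trie.items():
        if c != '$':
            walk(child, c, first_row)
    return sorted(results, key=lambda x: x[1])
```

For example, `fuzzy_search(build_trie(["hello", "help", "world"]), "helo", 1)` finds both "hello" and "help" at distance 1, while the "world" branch is pruned after two characters; one walk stands in for comparing against every word.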

[+] corobo|3 years ago|reply
Do you allow yourself to be bored at all?

I recently realised I've been missing my "something" for a while... yeah, I've not had a dopamine-free moment since doom scrolling became a thing.

Not comparing somethings, I probably couldn't write a spell checker much less a beautiful one, but my creative juices are much less clogged up since allowing myself to get bored.

I believe it's being able to combine a learned skillset with raw creativity (boredom) that gives a person that something you refer to.

[+] rjh29|3 years ago|reply
Worth noting he did not perform the "major feat of software engineering" the OP describes (making it work in 256 KB of RAM or less). He just put all the words into a hash. The key is knowing about edit distance; if you know that, you can probably write the solution.

So yes he's very talented, but if you study enough computer science, you can also do it. Or don't. It's totally fine to be interested in other parts of engineering instead.

[+] alfalfasprout|3 years ago|reply
Don't get me wrong, Peter Norvig is awesome, but if you're already familiar with Levenshtein or even Hamming distances, it's pretty straightforward to hack together a spellchecker.
[+] rzimmerman|3 years ago|reply
Reminds me of a dumb project I did that takes a person's name and returns a name that is close, but not quite right: https://github.com/rzimmerman/shitty-autoreply

The idea was to set up an auto-reply to my email that looked automated, but surely couldn't be because it messed up the sender's name. Not nearly as clever as Peter Norvig's program and even less useful.

I'd say the bias toward common American names is a bug, but given that the goal of the project is to be obnoxious it's probably a feature.

[+] pixel_tracing|3 years ago|reply
To be fair, even Norvig's implementation wasn't great for a mobile device. It could be improved with a more efficient trie structure…
[+] cortesoft|3 years ago|reply
It is a bit strange to me that it would be surprising that there are people way better at programming than you. The chance that you would be at the far right of the bell curve for any skill is pretty small, by definition. Almost everyone is going to not be the best at your thing. I don’t know why it would be depressing to not be the best.
[+] knicholes|3 years ago|reply
I used it to apply for a job at Twitch, as they tasked me with creating a spellchecker. Thanks, Peter! Also, they didn't give me an offer.
[+] kachurovskiy|3 years ago|reply
I've come to accept the fact that there are millions of people better than me at something. It's depressing and hopeless trying to be the best, there isn't even an objective measure. My goal is instead to be useful.
[+] lb1lf|3 years ago|reply
No one posted Martha Snow's epic poem yet?

Eye halve a spelling chequer
It came with my pea sea
It plainly marques four my revue
Miss steaks eye kin knot sea.

Eye strike a quay and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.

As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
It's rare lea ever wrong.

Eye have run this poem threw it
I am shore your pleased two no
It's letter perfect awl the weigh
My chequer tolled me sew.
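The poem really does defeat a purely word-level checker: every token is a legitimate English word, so vocabulary lookup alone flags nothing. A toy demonstration, with a small hand-built vocabulary standing in for a real word list:

```python
# Tiny stand-in vocabulary; every token of the first stanza is a real word,
# so a context-free, lookup-only spell checker flags none of them.
VOCAB = {"eye", "halve", "a", "spelling", "chequer", "it", "came",
         "with", "my", "pea", "sea", "plainly", "marques", "four",
         "revue", "miss", "steaks", "kin", "knot"}

stanza = ("eye halve a spelling chequer it came with my pea sea "
          "it plainly marques four my revue miss steaks eye kin knot sea")

# Words the checker would underline: none.
flagged = [w for w in stanza.split() if w not in VOCAB]
```

Here `flagged` comes back empty; catching "pea sea" for "PC" needs context, which is exactly where grammar checkers and language models come in.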

[+] pornel|3 years ago|reply
For the record, ChatGPT can actually correct spelling of this (or it has seen spoilers somewhere…)
[+] jandrese|3 years ago|reply
It still blows my mind that when growing up I had a Commodore 64 word processor with spell check. To run the spell checker you had to hit a button and then turn the floppy upside down while it ran through the document.

This means you had to fit in a measly 64 KB not only the code for the spell checker, but also the entire contents of the document, KERNAL overhead, and enough working data for the spell checker to run. Remember that the Commodore 64 had an incredibly slow disk drive (around 300 bytes per second) and the disks only held 160 KB of data. Any scheme that required a lot of disk access would be terribly slow. Yet somehow it worked and wasn't outrageously slow. I suspect the program was unloading big chunks of its own code to aggressively reuse the memory space, but even then it seems like magic.

[+] Xeoncross|3 years ago|reply
If you think this is great and/or immediately thought of using a trie to reduce storage costs, then let me introduce you to https://blog.burntsushi.net/transducers/ which pushes storage reuse to the point of almost competing with gzip-style compression, while still letting you search by edit distance when finding possible alternatives.

Typeahead/autocomplete and spellchecking both need this, so it's worth a read.

[+] polytely|3 years ago|reply
Great stuff, I never deal with stuff like this in my day job and was pleasantly surprised at how easy it was to follow what was going on, which means the author did a great job.
[+] denvaar|3 years ago|reply
Also recommend "Compact Data Structures" by Gonzalo Navarro if you're into compression and data structures.
[+] ccooffee|3 years ago|reply
Great article, thanks for sharing!
[+] dang|3 years ago|reply
It used to be a popular submission topic too (still is, and used to as well):

A spellchecker used to be a major feat of software engineering (2008) - https://news.ycombinator.com/item?id=25296900 - Dec 2020 (143 comments)

A Spellchecker Used to Be a Major Feat of Software Engineering (2008) - https://news.ycombinator.com/item?id=10789019 - Dec 2015 (29 comments)

A Spellchecker Used to Be a Major Feat of Software Engineering - https://news.ycombinator.com/item?id=4640658 - Oct 2012 (70 comments)

A Spellchecker Used To Be A Major Feat of Software Engineering - https://news.ycombinator.com/item?id=3466388 - Jan 2012 (61 comments)

A Spellchecker Used to Be a Major Feat of Software Engineering - https://news.ycombinator.com/item?id=212221 - June 2008 (22 comments)

[+] VLM|3 years ago|reply
Three other things to think about:

1980s memory models, especially on DOS-era PCs, were a nightmare, but you could work around them with great effort. The C/Unix memory model is actually quite luxurious compared to programming on DOS 3.3.

In the old days, non-English support was pretty complicated; it was a much bigger problem back then. Now you rely on the OS to support UTF-8 and call it good, more or less.

In the old days you'd run a spell checker as a batch process, possibly not even part of the editor itself. The point I'm making is about speed: nowadays speed doesn't matter, not because processors are faster (although they are; software always expands faster than hardware) but because we have integrated apps with threading. Instead of saving this into a file, exiting my browser, running a separate spell-check app, then restarting the browser and loading the file back in, the browser simply has a thread doing spell checking on the fly. So in the old days it was VERY important, in terms of labor costs, to spell check a 10,000-word essay as fast as possible, and on an 8 MHz XT that was a trick indeed. But today my multi-core, multi-GHz desktop merely needs to spell check on the fly as I'm typing, and I can't type faster than 100 wpm or so. My desktop has "a second or so" to decide that "ghz" and "mhz" are not words and put red squigglies under them, whereas in the old days you'd have to check something like 1000 words per second to keep users happy.

So simple, inefficient code survives in 2023 because it can lean on the OS; it's not just that hardware is faster or larger.

[+] jrumbut|3 years ago|reply
A good one still is. When I got my most recent phone (Galaxy 22, but I jumped several versions in the upgrade so I don't know when the change happened), the spellcheck engine regressed massively from previous editions.

I'd be so curious to hear what the change was. It's absolutely awful to use!

[+] benj111|3 years ago|reply
Is it not based on your history? I know the suggestions seem to be. If so, and it's stored locally, it could just be that it's having to start again from scratch.
[+] bediger4000|3 years ago|reply
He's got a point: increase in minimum memory and increase in CPU performance means that superior programming isn't always necessary.

People regularly produce for fun what used to be a masters thesis level of work these days. I'll cite that YouTube video of someone making a reaction wheel system that can get an inverted Lego pendulum to stand up, and stay up under perturbations. The Lego part isn't the main work. The main work is software, a PID controller of the reaction wheel.

Something else has happened. There's enough people who know enough that the knowledge is just sloshing around, waiting for others to do cool things with it.

[+] simcop2387|3 years ago|reply
> He's got a point: increase in minimum memory and increase in CPU performance means that superior programming isn't always necessary.

The fun part is that being clever might actually make it slower! Between cache misses, prefetchers, and branch mispredictions, the more naive search can be so much more predictable for the out-of-order execution on modern processors that the result is actually faster than the clever version.

You still have to benchmark and not do completely wacky things, but it's kind of interesting how things have changed because of the scale of the pieces.

[+] taeric|3 years ago|reply
The really mind-blowing thing about that 2 MB dictionary is that it can fit entirely in cache on many computers nowadays. An optimized linear scan today is competitive with the faster methods of yesteryear. (Heck, the unoptimized scan is probably more than competitive.)

This does have me curious how big the ZDD for the dictionary would be. https://en.wikipedia.org/wiki/Zero-suppressed_decision_diagr...

[+] diceduckmonk|3 years ago|reply
It will even fit within the 5 MB of browser local storage.
[+] kens|3 years ago|reply
It's kind of amazing how many hard computer problems from the olden days were solved by having more memory. When I got started in computer graphics, there was a lot of research work on how to render images a scanline at a time, since that's all the memory you had. Once you could get enough memory to hold the entire image, you could use a Z-buffer and rendering became kind of trivial.

Another example is the post on HN yesterday about old videogames and how they used a bunch of different hacks such as sprites and tiles because there wasn't enough memory for the whole screen. https://nicole.express/2023/yes-popeye-the-sailor-man.html

[+] armchairhacker|3 years ago|reply
...but a really good grammar checker is still, to this day, a major feat.

Even Grammarly can't write sentences for you as well as a native speaker. ChatGPT and language models can, but ironically, they still require way more memory than fits in a desktop. Simple cases like conjugation can certainly be handled, but there is a lot of nuance to making a sentence "flow".

[+] jiggawatts|3 years ago|reply
The “current hotness” for spellcheck would be to use a large language model (LLM) like GPT-3 that can use several paragraphs of context to disambiguate between alternatives.

However, usefully accurate LLMs are still too big to run locally on a laptop or PC, so once again we're back to liberally applying black magic to squeeze the models into something an integrated GPU can run in its "tiny" VRAM that's actually gigabytes in size.

The more things change, the more they stay the same.

PS: Spell check is a trivially solved problem for most languages, but not all! Hungarian is notoriously difficult, and there are ongoing projects to improve this: https://en.m.wikipedia.org/wiki/Hunspell

[+] aflag|3 years ago|reply
While it sounds like LLMs could in theory work for spell checking, has that ever been attempted in practice?
[+] thangalin|3 years ago|reply
In some ways, I think computational linguistics has missed a mark. We have dictionaries, lexicons, grammar engines, spell checkers, pluralization rules, tries, quotation disambiguation, sentiment analyzers, word segmentation, and more. You'd think we could roll all these into a unified, standard definition format.

My KeenWrite editor, for instance, uses:

* https://github.com/DaveJarvis/KeenSpell (spell check, lexicon)

* https://github.com/DaveJarvis/KeenQuotes (curls straight quotes)

* https://github.com/DaveJarvis/KeenWrite/blob/main/R/pluraliz... (pluralization)

* https://lucene.apache.org/ (word segmentation)

I was looking at integrating LanguageTool[0] for grammar and realized that it partially overlaps KeenQuotes (lexing and tokenization), duplicates the SymSpell algorithm used by KeenSpell, and, because it offers grammar corrections, can likely pluralize words as well.

Unifying those for English alone would be a massive undertaking. At present, KeenWrite parses prose in multiple, redundant passes. While this works, I think it'd be great to start with a basic lexer/tokenizer that can emit tokens at blazing speeds[1], which feeds into higher-level abstractions.

[0]: https://github.com/languagetool-org/languagetool

[1]: https://github.com/DaveJarvis/KeenQuotes/blob/main/src/main/...
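For the curious, the core of the SymSpell idea mentioned above is that deletes alone are enough: index every dictionary word under all of its delete-variants, then look up the delete-variants of the typed word. A minimal sketch of that delete-only indexing (function names are mine, not SymSpell's API):

```python
from collections import defaultdict

def deletes(word, max_d):
    """All strings reachable from word by up to max_d single-character deletions."""
    out, frontier = {word}, {word}
    for _ in range(max_d):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        out |= frontier
    return out

def build_index(dictionary, max_d=1):
    """Map each delete-variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for w in dictionary:
        for d in deletes(w, max_d):
            index[d].add(w)
    return index

def lookup(index, typed, max_d=1):
    """Candidate corrections: dictionary words sharing a delete-variant with the input."""
    cands = set()
    for d in deletes(typed, max_d):
        cands |= index.get(d, set())
    return cands
```

For example, `lookup(build_index(["spell", "spill", "spells"]), "spel")` returns `{"spell"}`, because "spel" is itself a delete-variant of "spell". The win is that the index is built once, and a lookup only generates a handful of deletes instead of the full 26-letter replace/insert fan-out.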

[+] BashiBazouk|3 years ago|reply
And it still is for most. I'm a horrible speller, but fairly consistent in my mistakes: usually I use the wrong vowel with a similar sound in the middle of a word. It's amazing how few spell checkers can handle that. The Google search bar is the undisputed king of spell checking in my experience. Microsoft Word, even from way back, has been good. Thunderbird from about a decade ago was the most infuriatingly bad spellchecker I have ever encountered. I always recognize the correct spelling when I see it, yet Thunderbird would commonly refuse to suggest the word I was looking for and would often give me the "un" version of that word instead. I mean, come on. I get the first letter right 9,999 times out of 10,000, and that is probably a gross underestimation; the few I get wrong are a handful of edge cases...
[+] eastbound|3 years ago|reply
Now be French like me, with words spelled with an E in English and an A in French (I recommend / je recommande), and try telling Google that you really, really want the English spelling so that you never make en/an mistakes again. Despite the browser being set to English, along with the Google settings and the OS, and despite a VPN ensuring all ads and search results arrive in English, Google still forces itself to guess that you are French…
[+] earlyam|3 years ago|reply
Random story. I was a young kid in the 90's and my dad took me by the biggest house I'd seen at that point in my life for some work party or something. Really nice landscaping, etc. It was clear the owner had A Lot Of Money.

I asked him afterwards how the guy got so rich and he told me that he wrote Microsoft's spellchecker.

[+] pcj-github|3 years ago|reply
Actually there's way more to it than loading /usr/share/dict/words into a Perl hashtable and calling it done. A good spellchecker built from scratch is still a massive engineering effort in 2023.
[+] twodave|3 years ago|reply
Now it’s just a major pain in my butt. At least the ones on phones. Half the time they replace a legitimate word with the one they thought I actually meant. They’re almost always wrong.
[+] PaulHoule|3 years ago|reply
I dunno, there was an article in Creative Computing magazine circa 1984 about how to code one up in BASIC; it required a little creativity but made the job look pretty easy. If you sort the words in the document, it's quick to check them against a dictionary on disk.

Turbo Lightning, though, took advantage of the comparably large memory of the IBM PC to do something that really blew people away:

https://books.google.com/books?id=MC8EAAAAMBAJ&pg=PA7&lpg=PA...

[+] tbensky|3 years ago|reply
I remember using Turbo Lightning. It was incredible. You type a word in some DOS program, and when you hit the spacebar, a beep would sound if the word was spelled wrong. Back then Peter Norton was all the rage, so if you typed "Norton" TL would beep and suggest spellings like "notion, nothing, etc." to correct Norton. The suggestions felt so AIish and futuristic to me. Wow a "suggestion" for how to spell something? If you selected a replacement, it would stuff the keyboard buffer with backspaces, to erase the word, and retype the correction. Borland made incredible software back then. Great times.
[+] vbezhenar|3 years ago|reply
Yet in 2022 spell checking is still not good, at least in ordinary software. It checks against dictionaries; it seems Google Chrome recently became a little bit better at it, but still I can completely write gibberish random with order words of and it wud no car as long as the separate words are present in the dictionary. And that's while there's AI around generating awesome texts.

I don't know English well, but the one thing that immensely helps me memorize English is instant feedback when I'm writing text with errors. I bet a good spell checker in Chrome would do more for the world than many English courses (it would help with any other language too, of course, but English is special as the world's lingua franca nowadays).

[+] LAC-Tech|3 years ago|reply
Dadgum/Prog21 must be one of my favourite blogs of all time. Hugely influential on me. It's a shame he doesn't post anymore.