AlphaAndOmega0 | 6 months ago

I'd have liked more explanation of the actual solutions that programmers used at the time.

tgv|6 months ago

For checking? Just a lookup on disk (no db, just a large list with a custom index, then a binary search in the retrieved block). Decoding anything was slow, and keeping the list in core was basically out of the question [1]. Caching was important, though, since just a handful of words make up 50% of the text.

I once built a spell checker plus corrector which had to run in 32kB under a DOS hotkey, interacting with some word processor. On top of that, it had to run from CD-ROM and respond within a second. I could do 4 lookups, in blocks of 8kB, which gave me the option to look up the word in normal order, in reverse order, and as a phonetic transcription in both directions. Each 8kB block contained quite a few words; I can't remember how many. I then counted the similarities and returned the candidates as a sorted list. It wasn't perfect, but it worked reasonably well.

[1] I should add that professional spell checking needs at least 100k lemmata plus all their inflections, plus per-word information if you have to accept compounds/agglutination.
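The "large list with a custom index, then binary search in the retrieved block" lookup described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the blocks live in memory here instead of on disk, and the names `build_blocks`, `lookup`, and the tiny `block_size` are made up for the example, not taken from the original program.

```python
from bisect import bisect_left, bisect_right


def build_blocks(words, block_size=4):
    """Sort the words, cut them into fixed-size blocks, and keep a sparse
    index holding only the first word of each block.
    (block_size=4 is an illustrative value, not the 8kB from the comment.)"""
    ordered = sorted(set(words))
    blocks = [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
    index = [block[0] for block in blocks]
    return index, blocks


def lookup(word, index, blocks):
    """Return True if word is in the list.

    Step 1: binary search the small index to pick the one block that
            could contain the word.
    Step 2: binary search inside that block only."""
    pos = bisect_right(index, word) - 1
    if pos < 0:
        return False  # word sorts before the first block
    block = blocks[pos]  # in a real checker this would be the single disk read
    i = bisect_left(block, word)
    return i < len(block) and block[i] == word


index, blocks = build_blocks(
    ["apple", "banana", "cherry", "date", "elder", "fig", "grape", "kiwi", "lemon"]
)
print(lookup("fig", index, blocks))    # found in the second block
print(lookup("figs", index, blocks))   # not in the list
```

Only the index has to stay resident; each check then costs one block read plus two short binary searches.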

Someone|6 months ago

Aurornis|6 months ago

The article is about fitting large dictionaries into small memory footprints: writing a 200K-word spell checker on a machine with only 256K of memory.

When you need to store your dictionary in under 1 byte per word, a trie won't cut it.

bazzargh|6 months ago

The limit given in the article is 360KB (on floppy). At that size you can't use tries; you need lossy compression. With the word list size given, a Bloom filter can get you a false-positive rate of 1 in 359: https://hur.st/bloomfilter/?n=234936&p=&m=360KB&k=

The error rate goes up to 1 in 66 for 256KB (in memory only).
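For anyone who wants to check those figures without the calculator, here is a small sketch using the standard Bloom filter approximation p ≈ (1 − e^(−kn/m))^k. The assumptions are mine: "KB" is taken as 1000 bytes, and the number of hash functions k is the rounded optimum (m/n)·ln 2. Under those assumptions the results land close to the rates quoted above. The function name is made up for the example.

```python
import math


def bloom_false_positive_rate(n_items, size_bytes):
    """Approximate false-positive rate of a Bloom filter holding n_items
    in size_bytes of storage, using the rounded-optimal hash count.

    Returns (k, p) where k is the number of hash functions and
    p is the estimated false-positive probability."""
    m_bits = size_bytes * 8
    # Optimal k is (m/n) * ln 2; a real filter needs a whole number of hashes.
    k = max(1, round(m_bits / n_items * math.log(2)))
    p = (1.0 - math.exp(-k * n_items / m_bits)) ** k
    return k, p


N_WORDS = 234936  # word count taken from the calculator link above

for label, size in (("360KB", 360_000), ("256KB", 256_000)):  # assumes KB = 1000 bytes
    k, p = bloom_false_positive_rate(N_WORDS, size)
    print(f"{label}: k={k}, about 1 in {1 / p:.0f} false positives")
```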

tetraodonpuffer|6 months ago

According to https://en.wikipedia.org/wiki/Ispell, ispell (1971) already used Levenshtein distance, although the article does not say whether this was already in the original version or was added in later years.
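For reference, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal sketch of the textbook dynamic-programming version plus a naive suggestion helper. It is not a claim about how ispell implements it, and `suggest` with its `max_distance` parameter is a hypothetical helper for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=1):
    """Return dictionary words within max_distance edits, closest first.
    (Hypothetical helper: scans the whole list, so only fine for small inputs.)"""
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]


print(levenshtein("kitten", "sitting"))  # 3
print(suggest("speling", ["spelling", "spewing", "sailing"]))
```

Scanning the whole dictionary like this would have been far too slow on the machines discussed in this thread, which is part of why the lookup tricks mentioned elsewhere here mattered.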

ksherlock|6 months ago

canucker2016|6 months ago

The first article I read about the techniques used in the spell program was Jon Bentley's Programming Pearls column in the May 1985 issue of Communications of the ACM (CACM for those who know), https://dl.acm.org/toc/cacm/1985/28/5.

It doesn't have as much detail as the blog.codingconfessions.com article mentioned above; maybe some of the other techniques were added later on?

Link to the online version of the 1985 May Programming Pearls column: https://dl.acm.org/doi/10.1145/3532.315102

The PDF version of that article: https://dl.acm.org/doi/pdf/10.1145/3532.315102