wolfgarbe | 24 days ago

Peter Norvig shows that a maximum edit distance of 2 covers 98.9% of spelling errors. https://impythonist.wordpress.com/2014/03/18/peter-norvigs-2...

That's the reason why the default maximum edit distance of SymSpell is 2.

Now, all 6 of your 6 examples are drawn from the 1.1% margin that edit distance 2 does not cover, each presenting an unusually high number of errors within a single word.

The third-party SymSpell port from Justin Wilaby, which you used for benchmarking, clearly states that you need to set both maxEditDistance and dictionaryEditDistance to a higher number if you want to correct higher edit distances. You neither did that nor mentioned it. This has nothing to do with accuracy; it is a tradeoff between performance and maximum edit distance that one can make according to the use case at hand.

https://github.com/justinwilaby/spellchecker-wasm?tab=readme...
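To give a rough feel for that tradeoff, here is a minimal sketch of delete-only candidate generation, the general idea behind symmetric-delete lookup. It is not SymSpell's or the port's actual code; the function name `deletes_within` and the sample word are made up for illustration. It only shows how quickly the number of delete variants per word grows as the maximum edit distance goes up.

```python
def deletes_within(word: str, max_distance: int) -> set[str]:
    """Return the word plus every string reachable by deleting
    up to max_distance characters from it (hypothetical helper)."""
    seen = {word}
    frontier = {word}
    for _ in range(max_distance):
        # Delete one character from every string found in the previous round.
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))} - seen
        seen |= frontier
    return seen


if __name__ == "__main__":
    for distance in range(4):
        count = len(deletes_within("abcdefgh", distance))
        print(f"max distance {distance}: {count} variants")
```

Each step up in maximum distance multiplies the variants that have to be stored or probed per word, which is why the setting is a cost choice rather than an accuracy claim.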

pronnouncaition IS within edit distance 3, according to the Damerau-Levenshtein edit distance used by SymSpell. The reason is that adjacent transpositions are counted as a single edit. https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di...
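A minimal sketch of that counting, assuming the intended word is "pronunciation" and using the restricted (optimal string alignment) variant of Damerau-Levenshtein. This is an illustration only, not SymSpell's implementation, and the function name `edit_distance` is made up here.

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Edit distance via dynamic programming.

    With transpositions=True, swapping two adjacent characters costs 1
    (optimal string alignment); with False this is plain Levenshtein.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # adjacent transposition counted as one edit
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    # Assumed target word; the misspelling is the one from the benchmark.
    print(edit_distance("pronnouncaition", "pronunciation"))
    print(edit_distance("pronnouncaition", "pronunciation", transpositions=False))
```

Under that assumption the edits are two deletions (the extra "n" and the "o") plus one "ai" to "ia" transposition, so the distance is 3; without the transposition rule the swap costs two edits and the total becomes 4.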

lexiathan | 24 days ago

The examples that I chose for my benchmark demonstrate that Lexiathan maintains accuracy and performance even on severely degraded input. On less corrupted input, Lexiathan runs significantly faster and is even more accurate.

Lexiathan also doesn't have any edit distance parameters that need to be configured, so there is no "tuning" required. In particular, it's worth mentioning that using a very large dictionary, e.g. 500,000 words, often degrades accuracy rather than improves it, and likely increases memory usage and latency as well.

Regarding Norvig's 98.9% figure: this seems to be from Norvig's own made-up data. In the real world, users often generate misspellings that exceed an edit distance of 2 in many use cases (OCR, non-native speakers, medical/technical terminology, etc.), and published text (often already spell-checked) doesn't reflect the same level of errors. And in any case, Norvig's spell-checker apparently only achieves an accuracy of 67% on its own chosen benchmarks, so clearly the 98.9% figure is not a realistic reflection of actual spell-checker performance, even for an edit distance of 2. Lexiathan is extremely accurate and retains high performance even on heavily degraded input, and the benchmark data (and demo) that I presented reflect that.