(no title)
lexiathan | 25 days ago
I don't believe my benchmark of SymSpell is misleading. I used the WebAssembly repository that is listed on your GitHub: https://github.com/justinwilaby/spellchecker-wasm
Here is the code I used for my benchmark: https://gist.github.com/Eratosthenes/bf8a6d1463d2dfb907fa13c...
I reported the results faithfully and I believe these results reflect the performance that users would typically see running SymSpell in the browser, using the default configuration. If I had increased the edit distance, then the latency performance gap between Lexiathan and SymSpell would have been even larger, and then arguably I would have been gaming my metrics by not using SymSpell as it is configured.
Regarding dictionary size: The dictionary size (as you can verify from the gist) was 82k words. I didn't specify the dictionary size I used for Lexiathan, but it was 106k words.
Lastly, three of the words in the benchmark have edit distances greater than three:
distance("pronnouncaition", "pronunciation") = 4
distance("maggnificntally", "magnificently") = 4
distance("annnesteasialgist", "anesthesiologist") = 6
So I do not believe SymSpell would correct these even with the edit distance increased to 3.
wolfgarbe|25 days ago
That's the reason why the default maximum edit distance of SymSpell is 2.
Now, all 6 of your 6 examples are chosen from that 1.1% margin that is not covered by edit distance 2, presenting an improbably high number of errors within a single word.
The third-party SymSpell port from Justin Wilaby, which you were using for benchmarking, clearly states that you need to set both maxEditDistance and dictionaryEditDistance to a higher number if you want to correct higher edit distances. You neither did that nor mentioned it. This has nothing to do with accuracy; it is a choice regarding a performance vs. maximum edit distance tradeoff one can make according to the use case at hand.
https://github.com/justinwilaby/spellchecker-wasm?tab=readme...
pronnouncaition IS within edit distance 3, according to the Damerau-Levenshtein edit distance used by SymSpell. The reason is that adjacent transpositions are counted as a single edit. https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_di...
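To make the difference concrete, here is a minimal Python sketch that computes both metrics for the word pair in question. It is not SymSpell's own code: the function names are made up for illustration, and the Damerau-Levenshtein variant shown is the restricted "optimal string alignment" form, which is an assumption about which variant applies here.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment):
    like Levenshtein, but swapping two adjacent characters costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition, e.g. "ai" <-> "ia"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    pairs = [
        ("pronnouncaition", "pronunciation"),
        ("maggnificntally", "magnificently"),
        ("annnesteasialgist", "anesthesiologist"),
    ]
    for typo, word in pairs:
        print(typo, word, levenshtein(typo, word), osa_distance(typo, word))
```

For "pronnouncaition" the plain Levenshtein distance is 4, while the transposition-aware distance is 3, because the swapped "ai"/"ia" counts as one edit instead of two.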
lexiathan|25 days ago
Lexiathan also doesn't have any edit distance parameters that need to be configured, so there is no "tuning" required. In particular, it's worth mentioning that using a very large dictionary, e.g. 500,000 words, often degrades accuracy rather than improves it, and likely increases memory usage and latency as well.
Regarding Norvig's 98.9% figure--this seems to be from Norvig's own made-up data. In the real world, users often generate misspellings that exceed an edit distance of 2 in many use cases (OCR, non-native speakers, medical/technical terminology, etc.), and published text (often already spell-checked) doesn't reflect the same level of errors. And in any case, Norvig's spell-checker apparently only achieves an accuracy of 67% on its own chosen benchmarks, so clearly the 98.9% figure is not a realistic reflection of actual spell-checker performance, even for an edit distance of 2. Lexiathan is extremely accurate and retains high performance even on heavily degraded input, and the benchmark data (and demo) that I presented reflect that.