Show HN: A spell-checker 380x faster than Hunspell, 5x faster than SymSpell

3 points| lexiathan | 25 days ago |lexiathan.com

Benchmarks and technical details: https://lexiathan.com

4 comments

wolfgarbe|24 days ago

Author of SymSpell here. Congrats on the launch of Lexiathan.

Unfortunately, the accuracy comparison of Lexiathan vs. SymSpell on your website is misleading.

1. SymSpell has two parameters that control the maximum edit distance. With both set to 3, terms at an edit distance of 3 are also corrected accurately:

  pronnouncaition -> pronunciation

  inndappendent -> independent

  unegspeccted -> unexpected

  soggtwaee -> software

2. SymSpell comes with dictionaries in several sizes. With the 500_000-term dictionary loaded, the two remaining terms are corrected as well:

  maggnificntally -> magnificently

  annnesteasialgist -> anesthesiologist

https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell.B...

SymSpell accurately corrects all of your examples if used properly with the correct parameters and dictionary.

Apart from that, measuring correction accuracy on a handful of cherry-picked terms where your product seemingly performs better, with no statistical significance, is a questionable methodology.

The usual approach is to use large public corpora and measure both the percentage of accurately corrected terms and the percentage of false positives.
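A corpus-based measurement of this kind can be sketched in a few lines. This is only an illustration, not code from SymSpell or Lexiathan: the `evaluate` function, the `correct` callable, and the shape of the inputs are all assumptions. Accuracy is taken here as the share of misspellings mapped to the expected word, and the false-positive rate as the share of correctly spelled words that the corrector changes.

```python
from typing import Callable, Iterable, Tuple


def evaluate(
    correct: Callable[[str], str],
    misspellings: Iterable[Tuple[str, str]],
    valid_words: Iterable[str],
) -> Tuple[float, float]:
    """Return (accuracy, false_positive_rate) for a corrector.

    `correct` is a hypothetical callable mapping an input term to the
    corrector's top suggestion. `misspellings` holds (wrong, expected)
    pairs and `valid_words` holds correctly spelled terms.
    """
    pairs = list(misspellings)
    words = list(valid_words)

    # A hit is a misspelling whose top suggestion equals the expected word.
    hits = sum(1 for wrong, expected in pairs if correct(wrong) == expected)
    # A false positive is a valid word that the corrector alters.
    false_positives = sum(1 for word in words if correct(word) != word)

    accuracy = hits / len(pairs) if pairs else 0.0
    fp_rate = false_positives / len(words) if words else 0.0
    return accuracy, fp_rate
```

Any corrector can be plugged in by wrapping its lookup call in a function that returns a single string; the pairs would come from a public misspelling corpus rather than a hand-picked list.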

Because SymSpell is open source, anyone can integrate it into their applications for free, modify the code, use different dictionaries in various languages, or add terms to existing ones.

https://github.com/wolfgarbe/SymSpell

https://github.com/wolfgarbe/symspell_rs

lexiathan|24 days ago

Hi wolfgarbe,

I don't believe my benchmark of SymSpell is misleading. I used the WebAssembly repository that is listed on your GitHub: https://github.com/justinwilaby/spellchecker-wasm

Here is the code I used for my benchmark: https://gist.github.com/Eratosthenes/bf8a6d1463d2dfb907fa13c...

I reported the results faithfully, and I believe they reflect the performance that users would typically see running SymSpell in the browser with the default configuration. If I had increased the edit distance, the latency gap between Lexiathan and SymSpell would have been even larger, and arguably I would then have been gaming my metrics by not using SymSpell as it is configured by default.

Regarding dictionary size: The dictionary size (as you can verify from the gist) was 82k words. I didn't specify the dictionary size I used for Lexiathan, but it was 106k words.

Lastly, three of the words in the benchmark have edit distances greater than three:

distance("pronnouncaition", "pronunciation") = 4

distance("maggnificntally", "magnificently") = 4

distance("annnesteasialgist", "anesthesiologist") = 6

So I do not believe SymSpell would correct these even with the edit distance increased to 3.
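The distances quoted above can be checked with a short script. This is a minimal sketch, not the code from the linked gist, and the function names `levenshtein` and `osa_distance` are illustrative. It computes plain Levenshtein distance and, for comparison, the optimal string alignment variant of Damerau-Levenshtein, which counts a swap of two adjacent letters as one edit instead of two, so the two metrics can disagree on the same pair.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent swaps."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # An adjacent transposition counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


pairs = [
    ("pronnouncaition", "pronunciation"),
    ("maggnificntally", "magnificently"),
    ("annnesteasialgist", "anesthesiologist"),
]
for wrong, right in pairs:
    print(wrong, right, levenshtein(wrong, right), osa_distance(wrong, right))
```

Which of the two numbers matters for a given spell-checker depends on the distance metric that the library actually implements.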