(no title)
heikkilevanto | 2 months ago
Folding diacritics makes "vähä" (little) into "vaha" (wax).
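To make the collision concrete, here is a minimal sketch of diacritic folding using only Python's standard `unicodedata` module. The helper name `fold_diacritics` is made up for illustration, and real search engines may fold characters differently.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Strip combining marks after canonical decomposition (NFD)."""
    # NFD splits "ä" into "a" followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Both words end up with the same index key after folding.
print(fold_diacritics("vähä"))
print(fold_diacritics("vaha"))
```

After folding, a query for either word matches documents containing the other.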
Dropping stop words like "The" misses the word for "tea" (in rather old-fashioned Finnish, but also in current Danish).
Stemming Finnish words is also much more complex, as we tend to append suffixes to the words instead of putting small words in front of them. "talo" is "house", "talosta" is "from the house", "talostani" is "from my house", and "talostaniko" makes it a question: "from my house?"
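The suffix stacking in that example can be sketched with a toy peeler. This is not a real Finnish stemmer: the suffix table and the function name `peel_suffixes` are assumptions for illustration, covering only the three endings quoted above. Actual Finnish also has vowel harmony and consonant gradation, which this ignores.

```python
# Hypothetical, hand-written table: outermost suffix first.
SUFFIXES = [
    ("ko", "question particle"),
    ("ni", "my"),
    ("sta", "from"),
]

def peel_suffixes(word: str):
    """Peel known suffixes off the end of a word, outermost first."""
    found = []
    for suffix, gloss in SUFFIXES:
        # Require something to be left over so we never strip a whole word.
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[: -len(suffix)]
            found.append((suffix, gloss))
    return word, found

stem, parts = peel_suffixes("talostaniko")
print(stem, parts)
```

A fixed table like this breaks down quickly on real text, which is why libraries ship language-specific stemmers instead.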
If that sounds too easy, consider Japanese. From what little I know they don't use whitespace to separate words, mix two phonetic alphabets with Chinese ideograms, etc.
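A quick sketch of why whitespace tokenization does not carry over to Japanese. The sample sentence is my own illustration, not from any particular corpus, and real systems use a dictionary-based or statistical segmenter instead of `str.split`.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Naive tokenizer: split on runs of whitespace."""
    return text.split()

english = "I live in Tokyo"
japanese = "私は東京に住んでいます"  # roughly "I live in Tokyo", written without spaces

# English yields one token per word; the Japanese sentence comes back as one blob.
print(len(whitespace_tokens(english)))
print(len(whitespace_tokens(japanese)))
```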
philippemnoel|2 months ago
We (ParadeDB) use a search library called Tantivy under the hood, which supports stemming in Finnish, Danish and many other languages: https://docs.paradedb.com/documentation/token-filters/stemmi...
ashirviskas|2 months ago
I actually just started working on a data formatter that applies principles like these to drastically reduce the number of tokens without decreasing performance the way other formats do (looking at you, tson).