pax | 1 year ago

What would be a programmatic approach to finding a list of the most rarely used words (in any language)? I'm thinking: loop over a list of words from a dictionary and see how many results a search engine returns for each one (filtering out dictionary results). It would take a while, since most languages have some hundred thousand words.

wolverine876|1 year ago

Professionals use various tools to analyze this kind of question. The Google Books Ngram data seems to be popular.

Here's a paper whose methodology addresses part of your question:

Jean-Baptiste Michel et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331, 176-182 (2011). DOI:10.1126/science.1199644

https://www.science.org/doi/10.1126/science.1199644
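If you go the Ngram route, a rough sketch of ranking words by rarity might look like the following. It assumes the unigram files use the tab-separated layout `ngram, year, match_count, volume_count` (one row per word per year), which is how I recall the 2012 release; check the dataset's own documentation before relying on it. The function names `total_counts` and `least_common` are made up for illustration.

```python
from collections import Counter


def total_counts(rows):
    """Sum match counts over all years for each word.

    Assumes each row is 'ngram<TAB>year<TAB>match_count<TAB>volume_count'
    (an assumption about the file layout, not verified here).
    """
    totals = Counter()
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not match the assumed layout
        ngram, _year, match_count, _volumes = fields
        totals[ngram.lower()] += int(match_count)
    return totals


def least_common(totals, n=10):
    """Return the n (word, count) pairs with the smallest counts."""
    return sorted(totals.items(), key=lambda kv: (kv[1], kv[0]))[:n]
```

You would feed it an open file object, e.g. `total_counts(open(path, encoding="utf-8"))`, where `path` points at a decompressed unigram file. Expect the very bottom of the list to be dominated by OCR errors and typos, so some minimum-count cutoff is probably needed.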

nilamo|1 year ago

Maybe download an archive of Wikipedia articles and build a word-occurrence dictionary out of that to compare against? It would be much faster than a ton of separate search queries.
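Something like this minimal sketch, assuming you have already extracted the dump to plain text (the dump itself is wiki markup, so that step is on you). `count_words` and `rarest` are just illustrative names, and the tokenizer is a crude "runs of letters" regex, not a proper one.

```python
import re
from collections import Counter

# Runs of Unicode letters only (no digits or underscores); a crude tokenizer.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(lines):
    """Build a word -> occurrence count mapping from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(w.lower() for w in WORD_RE.findall(line))
    return counts


def rarest(counts, vocabulary, n=10):
    """Return the n dictionary words seen least often in the corpus.

    Words that never appear count as 0; ties are broken alphabetically.
    """
    return sorted(vocabulary, key=lambda w: (counts[w.lower()], w))[:n]
```

Usage would be roughly `rarest(count_words(open("wiki.txt", encoding="utf-8")), dictionary_words)`, where `wiki.txt` and `dictionary_words` are placeholders for your extracted text and your word list. Comparing against a dictionary list, as the question suggests, keeps typos and markup leftovers out of the results.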