Now I understand somewhat better why I often end up ceasing midway through discourse, as my colloquists often end up interjecting requesting definition of whatever neologism has just emanated from lexicon into conversation.... a random sampling of words that I have just today been asked to define either do not feature in this list at all, or are 90,000+ in terms of use. Brobdingnagian, deleterious, autodidact, loquaciousness... are these really such strange vernacular?
I suppose this is what happens when you spend your formative years buried in literature - it probably doesn't help that I mispronounce all sorts, as books make poor elocutionists.
I do worry that as we further and further consolidate our vocabulary, we lose the breadth and depth of thought that nuanced words provide... so did Orwell...
Although a brief examination of your comment history suggests a satirical intent in what you have written, it has (at face value) some errors. If you truly intend to build your defense for this posturing polysyllabic puffery on the foundation of subtle distinctions in meaning, you would be well served by properly understanding the customary meaning of the words in question, subtle or otherwise.
Taking, as an example, the final sentence of your first paragraph: "Brobdingnagian, deleterious, autodidact, loquaciousness... are these really such strange vernacular?" we find a sentence that is grammatically incorrect, with the singular form of vernacular. Assuming a typographically damaged plural does no good, as these words are not vernaculars. We do no better with the assumption of an elided indefinite article ("are these really such [a] strange vernacular?") as the answer is trivially yes: attempting conversation using only these words (Brobdingnagian, deleterious, autodidact, and loquaciousness) would be an exercise in Cnutian futility. As the recurring problem in finding meaning in this sentence is this definition, may I suggest that perhaps the primary issue is in the choice of words? I would suggest the substitution of "words" in place of "vernacular", as this recovers a perfectly sensible rhetorical question. Perhaps, in keeping with the theme of subtle distinction in word choice, "obscure" could be substituted for "strange", signifying that the strangeness is rooted in rarity of use, rather than e.g. etymology.
Ah, another person who talks like a book. I also get that quite regularly. I have a bit of a personal rule, though, where I never use a more complicated word when a simpler one would work. After all, conversation is at heart a means of communication.
Although we might get a bit of a thrill from talking over people's heads, it doesn't really mean much of anything. And all it takes is one misuse -- for example, calling Brobdingnagian a neologism -- and we tear ourselves down more than we could hope to build ourselves up.
I wonder how other languages compare to English. I know English is far from pure.
"The problem with defending the purity of the English language is that English is about as pure as a cribhouse whore. We don't just borrow words; on occasion, English has pursued other languages down alleyways to beat them unconscious and rifle their pockets for new vocabulary."
-- James Nicoll
What do you mean by "pure"? Lack of loanwords? Lack of any linguistic changes (e.g., sound, meaning) for existing words? Lack of any innovation, i.e. no new words for new concepts/objects? You could try to find some of these things in languages spoken by rather isolated groups. Yet I don't think one should call such a language "pure", implying some kind of (moral) superiority.
The source material for this frequency count comes from Open Subtitles [http://www.opensubtitles.org/en/search]. Hence the frequencies here apply to spoken English, not written English. In written English the three most common words are "the", "of" and "and", whereas here they are "you", "I", and "the".
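For anyone curious how a ranking like this falls out of subtitle text, here is a minimal sketch. The sample lines and the `rank_words` helper are invented for illustration; this is not the site's actual pipeline.

```python
import re
from collections import Counter

def rank_words(lines):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        # Lowercase only for counting; keep apostrophes so "don't" stays one token.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts.most_common()

# Made-up dialogue lines standing in for a subtitle file.
sample = [
    "You know I love you.",
    "I think you should go.",
    "The door is open, you know.",
]
ranking = rank_words(sample)
print(ranking[:3])
```

Even in a toy sample of dialogue, "you" and "I" float to the top, which fits the spoken-versus-written difference described above.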
If you have the right kind of friends you can play "Who knows the most obscure word?". Everyone picks a word from memory, then you check each word's position (we used to use Google Ngrams, but this would work well); whoever gets the least common word wins that round. Repeat until it's not fun anymore. That's how I learned defenestrate, obsequious, and a few other words.
For a corpus much more complete on the long tail, try the Google Books Ngram Viewer: https://books.google.com/ngrams. That uses full data from Google's book-scanning endeavors: millions of books, compared to millions of words in the originally linked article.
I love to play with this, and see how thought waxes and wanes - a good one is to stick in every Soviet premier since Lenin, or 20th c. US presidents - really tells you rather a bit about the mindshare that these individuals had.
EDIT: This looks like a case of a common programming antipattern: you don't care about the casing for comparison purposes, so instead of implementing a case-insensitive compare, you downcase the strings and call it a day. But that's inherently a loss of data, and not having that data will eventually come back to bite you.
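A minimal sketch of the alternative, assuming a simple token list: compare on a case-folded key but keep every original spelling around. The `count_preserving_case` helper is hypothetical, not anything from the API being discussed.

```python
from collections import Counter, defaultdict

def count_preserving_case(tokens):
    """Group tokens case-insensitively but remember every original spelling."""
    totals = Counter()
    spellings = defaultdict(Counter)
    for token in tokens:
        key = token.casefold()  # comparison key only; the token itself is untouched
        totals[key] += 1
        spellings[key][token] += 1
    return totals, spellings

tokens = ["Polish", "polish", "polish", "US", "us", "us", "us"]
totals, spellings = count_preserving_case(tokens)
print(totals["polish"], dict(spellings["polish"]))
```

The totals behave exactly as if the strings had been downcased, but "Polish" versus "polish" and "US" versus "us" can still be told apart later.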
200? No. Try 2,000. That should be representative of a barely usable level of the language. If you are what might be described as fluent, you're probably at >20,000. Take this headline:
JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS
"EFFECTIVE" and "IN" are the only two words found in the first 2,000 words sorted by frequency, although "FOUND" is close. So with only 200 words you'd understand:
WNPHMMVF SBHAQ RSSRPGVIR IN GERNGVAT CUYROVGVF
2,000 words would give you:
WNPHMMVF FOUND EFFECTIVE IN GERNGVAT CUYROVGVF
And some grammar knowledge would tell you
WNPHMMVF(N) FOUND EFFECTIVE IN GERNGVAT(V) CUYROVGVF(N)
Which is enough to know that you only really need to look up "TREATING" in the dictionary to understand the gist of the sentence. But it'd hardly pass as a fluent understanding...
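The scrambled words above are just ROT13, so the experiment is easy to repeat on other sentences. A rough sketch follows; the two word sets are stand-ins chosen to reproduce the lines quoted above, not real top-200 or top-2,000 lists, and `mask_unknown` is a made-up helper name.

```python
import codecs

def mask_unknown(sentence, known_words):
    """Replace every word outside known_words with its ROT13 form."""
    masked = []
    for word in sentence.split():
        if word.lower() in known_words:
            masked.append(word)
        else:
            masked.append(codecs.encode(word, "rot13"))
    return " ".join(masked)

headline = "JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS"
top_200 = {"in"}                          # assumed stand-in vocabulary
top_2000 = {"found", "effective", "in"}   # assumed stand-in vocabulary

masked_200 = mask_unknown(headline, top_200)
masked_2000 = mask_unknown(headline, top_2000)
print(masked_200)
print(masked_2000)
```

Swapping in a real frequency list for the two sets would show how much of an arbitrary sentence survives at each vocabulary size.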
I took exactly this approach to learning passable German when I lived there a couple months. Unfortunately, most Germans would much rather practice their English on you than let you practice German on them. So I gave up and let them.
While the frequency of words drops off precipitously, there's also the fact that the set of all thoughts one might want to convey is incredibly vast, so within any given conversation there will probably be at least a few words which otherwise rarely appear.
Very nice product! As a sidenote: does anyone know a service that provides human associations? E.g., what do you associate with "Hacker" as a person: "Computer", "Night", "Internet"?
It seems like a good idea, but to automate it you'd need to maybe scrape websites and create a count of words that appear in the same sentence. Or you could get really crazy and start comparing subject/object relationships, etc.
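The same-sentence counting part is simple enough to sketch. This assumes naive sentence splitting on punctuation and a made-up snippet of text; `cooccurrences` and `associations` are hypothetical helpers, and a real version would want stop-word filtering and far more data.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrences(text):
    """Count unordered word pairs that share a sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text):
        # Sort the unique words so each pair has one canonical ordering.
        words = sorted(set(re.findall(r"[a-z']+", sentence.lower())))
        pairs.update(combinations(words, 2))
    return pairs

def associations(pairs, word, n=3):
    """Return the n words seen most often alongside `word`."""
    related = Counter()
    for (a, b), count in pairs.items():
        if a == word:
            related[b] += count
        elif b == word:
            related[a] += count
    return [w for w, _ in related.most_common(n)]

text = "The hacker used a computer. The hacker worked at night. A hacker needs a computer."
pairs = cooccurrences(text)
print(associations(pairs, "hacker"))
```

Without a stop-word list, "the" and "a" rank right alongside "computer", which is a good illustration of why raw co-occurrence counts need cleaning before they look like human associations.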
I poked around in the neighborhood of "Tremulous" and saw a lot of obvious data errors: "ladiesand","manklnd", "confldentlal", "howdare", "monment". Other words don't seem particularly rare: "productively", "areolas", "combusts", "lazier".
It seems like their long-tail data is full of misspellings. Try typing in "vacillate" and looking at the other words on the graph in relative frequency, for example.
No corpus is ever "clean". Depending on the type of corpus there might be many misspellings, so obviously they will occur in the graph alongside other low-frequency words.
madaxe_again | 11 years ago
aristus | 11 years ago
1. Never use a metaphor, simile, or other figure of speech which you are used to seeing in print.
2. Never use a long word where a short one will do.
3. If it is possible to cut a word out, always cut it out.
4. Never use the passive where you can use the active.
5. Never use a foreign phrase, a scientific word, or a jargon word if you can think of an everyday English equivalent.
6. Break any of these rules sooner than say anything outright barbarous.
-- Orwell, "Politics and the English Language"
sigilion | 11 years ago
delluminatus | 11 years ago
qsymmachus | 11 years ago
Terr_ | 11 years ago
Nah. I'm proud to say I got everything except Brobdingnagian, which on further investigation is more of a literary reference than a "real" word.
nzealand | 11 years ago
(Sorry, I tried to resist, but I could not.)
CapitalistCartr | 11 years ago
impostervt | 11 years ago
Dutch - http://crr.ugent.be/programs-data/subtitle-frequencies/subtl...
Chinese - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880003/
Greek - http://www.bcbl.eu/subtlex-gr/
There's a few others (Polish, French, etc) but I can't find the links for some reason.
eginhard | 11 years ago
[1] Zipf’s word frequency law in natural language: http://colala.bcs.rochester.edu/papers/piantadosi2014zipfs.p...
[2] https://en.wikipedia.org/wiki/Zipf%27s_law
hellrich | 11 years ago
sp332 | 11 years ago
madaxe_again | 11 years ago
It's nirvana.
japaget | 11 years ago
te_platt | 11 years ago
Terr_ | 11 years ago
pdpi | 11 years ago
sdsykes | 11 years ago
dzdt | 11 years ago
madaxe_again | 11 years ago
jasode | 11 years ago
EDIT ADD: Zipf's law was not mentioned directly in the blog post, but the reply clarified that the API returns a zipf score. However, a word's Zipf ranking is dependent on the corpus used. The WordsAPI "About" page[1] says most data came from Princeton WordNet, but a sibling comment says it came from a subtitles compilation. If the project could clarify the data sources, it would be helpful.
[1] https://www.wordsapi.com/about
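To make the corpus dependence concrete, here is a small sketch. It assumes the score follows the common Zipf scale (log10 of occurrences per billion words, equivalently log10 of occurrences per million plus 3); whether the API computes it exactly this way is an assumption, and the counts below are invented.

```python
import math

def zipf_score(count, corpus_size):
    """Zipf scale value: log10 of occurrences per billion words."""
    return math.log10(count / corpus_size * 1e9)

# Hypothetical counts for the same word in two corpora of equal size.
subtitle_score = zipf_score(count=5_000, corpus_size=50_000_000)
book_score = zipf_score(count=500, corpus_size=50_000_000)
print(round(subtitle_score, 2), round(book_score, 2))
```

With these made-up numbers the same word lands a full point apart on the scale, which is why knowing the source corpus matters when reading the score.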
impostervt | 11 years ago
Regarding your update - I'll update the About page.
uberalex | 11 years ago
copsarebastards | 11 years ago
impostervt | 11 years ago
mrfusion | 11 years ago
maaku | 11 years ago
mode80 | 11 years ago
impostervt | 11 years ago
You could probably at least get around.
thinkpad20 | 11 years ago
davedriesmans | 11 years ago
rspeer | 11 years ago
http://conceptnet5.media.mit.edu
impostervt | 11 years ago
jloughry | 11 years ago
dullcrisp | 11 years ago
yellowstuff | 11 years ago
jboggan | 11 years ago
eginhard | 11 years ago
justaman | 11 years ago
drpgq | 11 years ago
cpwright | 11 years ago