The Long Tail of the English Language

[+] madaxe_again|11 years ago|reply

Now I understand somewhat better why I often end up ceasing midway through discourse, as my colloquists often end up interjecting requesting definition of whatever neologism has just emanated from lexicon into conversation.... a random sampling of words that I have just today been asked to define either do not feature in this list at all, or are 90,000+ in terms of use. Brobdingnagian, deleterious, autodidact, loquaciousness... are these really such strange vernacular?

I suppose this is what happens when you spend your formative years buried in literature - it probably doesn't help that I mispronounce all sorts, as books make poor elocutionists.

I do worry that as we further and further consolidate our vocabulary that we lose the breadth and depth of thought that nuanced words provide... so did Orwell...

[+] aristus|11 years ago|reply

1. Never use a metaphor, simile, or other figure of speech which you are used to seeing in print.

2. Never use a long word where a short one will do.

3. If it is possible to cut a word out, always cut it out.

4. Never use the passive where you can use the active.

5. Never use a foreign phrase, a scientific word, or a jargon word if you can think of an everyday English equivalent.

6. Break any of these rules sooner than say anything outright barbarous.

-- Orwell, "Politics and the English Language"

[+] sigilion|11 years ago|reply

Although a brief examination of your comment history suggests a satirical intent in what you have written, it has (at face value) some errors. If you truly intend to build your defense for this posturing polysyllabic puffery on the foundation of subtle distinctions in meaning, you would be well served by properly understanding the customary meaning of the words in question, subtle or otherwise.

Taking, as an example, the final sentence of your first paragraph: "Brobdingnagian, deleterious, autodidact, loquaciousness... are these really such strange vernacular?" we find a sentence that is grammatically incorrect, with the singular form of vernacular. Assuming a typographically damaged plural does no good, as these words are not vernaculars. We do no better with the assumption of an elided indefinite article ("are these really such [a] strange vernacular?") as the answer is trivially yes: attempting conversation using only these words (Brobdingnagian, deleterious, autodidact, and loquaciousness) would be an exercise of Cnutian futility. As the recurring problem in finding meaning in this sentence is this definition, may I suggest that perhaps the primary issue is in the choice of words? I would suggest the substitution of "words" in place of "vernacular", as this recovers a perfectly sensible rhetorical question. Perhaps, in keeping with the theme of subtle distinction in word choice, "obscure" could be substituted for "strange", signifying the strangeness is rooted in rarity of use, rather than e.g. etymology.

[+] delluminatus|11 years ago|reply

Ah, another person who talks like a book. I also get that quite regularly. I have a bit of a personal rule, though, where I never use a more complicated word when a simpler one would work. After all, conversation is at heart a means of communication.

Although we might get a bit of a thrill from talking over people's heads, it doesn't really mean much of anything. And all it takes is one misuse -- for example, calling Brobdingnagian a neologism -- and we tear ourselves down more than we could hope to build ourselves up.

[+] qsymmachus|11 years ago|reply

Poe's law is making it difficult to come up with an appropriate response to this.

[+] Terr_|11 years ago|reply

> are these really such strange vernacular?

Nah. I'm proud to say I got everything except Brobdingnagian, which on further investigation is more of a literary reference than a "real" word.

[+] nzealand|11 years ago|reply

How very antidisestablishmentarianist of you.

(Sorry, I tried to resist, but I could not.)

[+] CapitalistCartr|11 years ago|reply

I wonder how other languages compare to English. I know English is far from pure.

"The problem with defending the purity of the English language is that English is about as pure as a cribhouse whore. We don't just borrow words; on occasion, English has pursued other languages down alleyways to beat them unconscious and rifle their pockets for new vocabulary." -- James Nicoll

[+] impostervt|11 years ago|reply

Subtlex has done word frequency counts in a number of languages:

Dutch - http://crr.ugent.be/programs-data/subtitle-frequencies/subtl...

Chinese - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880003/

Greek - http://www.bcbl.eu/subtlex-gr/

There are a few others (Polish, French, etc.) but I can't find the links for some reason.

[+] eginhard|11 years ago|reply

Zipf's law [1,2] generally holds for a corpus in any natural language and can be applied to a lot of other things outside linguistics as well.

[1] Zipf’s word frequency law in natural language: http://colala.bcs.rochester.edu/papers/piantadosi2014zipfs.p...

[2] https://en.wikipedia.org/wiki/Zipf%27s_law
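As a rough illustration of the rank-frequency relationship Zipf's law describes, here is a minimal Python sketch. The function name `rank_frequency` and the toy sentence are made up for illustration. On a real corpus the product of rank and count would stay roughly constant near the top of the list; a sample this small can only show the mechanics.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Lower-casing is a simplification for this sketch only
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

sample = "the cat and the dog and the bird"
for rank, word, count in rank_frequency(sample):
    # Under Zipf's law, rank * count is roughly constant on large corpora
    print(rank, word, count, rank * count)
```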

[+] hellrich|11 years ago|reply

What do you mean by "pure"? Lack of loanwords? Lack of any linguistic changes (e.g., sound, meaning) for existing words? Lack of any innovation, i.e. no new words for new concepts/objects? You could try to find some of these things in languages spoken by rather isolated groups. Yet I don't think one should call such a language "pure", implying some kind of (moral) superiority.

[+] sp332|11 years ago|reply

An interesting and large list of "English" words from other languages: https://en.wikipedia.org/wiki/Lists_of_English_words_by_coun...

[+] madaxe_again|11 years ago|reply

I like to shampoo my hair on my veranda in the jungle, then put on my cushy khaki pyjamas, while smoking a cheroot.

It's nirvana.

[+] japaget|11 years ago|reply

The source material for this frequency count comes from Open Subtitles [http://www.opensubtitles.org/en/search]. Hence the frequencies here apply to spoken English, not written English. In written English the three most common words are "the", "of" and "and", whereas here they are "you", "I", and "the".

[+] te_platt|11 years ago|reply

If you have the right kind of friends you can play "Who knows the most obscure word?". Everyone picks a word from memory, then you check each word's position (we used to use Google Ngrams, but this would work well). Whoever gets the least common word wins that round; repeat until it's not fun anymore. That's how I learned defenestrate, obsequious, and a few other words.

[+] Terr_|11 years ago|reply

I love "sesquipedalian", because it's so self-referential.

[+] pdpi|11 years ago|reply

First word I tried wasn't even present in the list (equivocate). How do you score that?

[+] sdsykes|11 years ago|reply

"Kajagoogoo" is the 96,714th most common word in the English language.

[+] dzdt|11 years ago|reply

For a corpus much more complete on the long tail, try the Google Books Ngram Viewer: https://books.google.com/ngrams. That uses full data from Google's book-scanning endeavors, millions of books compared to millions of words in the originally linked article.

[+] madaxe_again|11 years ago|reply

I love to play with this, and see how thought waxes and wanes - a good one is to stick in every Soviet premier since Lenin, or 20th c. US presidents - really tells you rather a bit about the mindshare that these individuals had.

[+] jasode|11 years ago|reply

http://en.wikipedia.org/wiki/Zipf%27s_law

EDIT ADD: Zipf's Law was not mentioned directly in the blog post but the reply clarified that the API returns a Zipf score. However, a word's Zipf ranking is dependent on the corpus used. The Words API "About" page[1] says most data came from Princeton WordNet but a sibling comment says it came from a subtitles compilation. If the project could clarify the data sources, it would be helpful.

[1] https://www.wordsapi.com/about

[+] impostervt|11 years ago|reply

The "frequency" score returned by Words API is a Zipf score for the word. It ranges from ~1.6 to ~7.6.
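For context, the Zipf scale as it is usually defined in the word-frequency literature is the base-10 logarithm of a word's frequency per billion words (equivalently, log10 of the frequency per million words, plus 3). Whether Words API computes its score exactly this way is an assumption here, and `zipf_score` and the counts below are hypothetical.

```python
import math

def zipf_score(count, corpus_size):
    """Zipf scale: log10 of occurrences per billion words.

    Assumes the conventional definition; not confirmed to be what
    Words API does internally.
    """
    if count <= 0 or corpus_size <= 0:
        raise ValueError("count and corpus_size must be positive")
    per_billion = count / corpus_size * 1_000_000_000
    return math.log10(per_billion)

# Hypothetical counts: 50 occurrences in a 50-million-word corpus,
# i.e. one occurrence per million words
print(round(zipf_score(50, 50_000_000), 2))
```

On this scale a word seen once per million words lands at about 3, and each additional point means ten times more frequent.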

Regarding your update - I'll update the About page.

[+] uberalex|11 years ago|reply

'I' doesn't seem to work. It is very common. https://books.google.com/ngrams/graph?content=I%2C+you&year_...

[+] copsarebastards|11 years ago|reply

Try lower-casing it.

EDIT: This looks like a case of a common programming antipattern: you don't care about the casing for comparison purposes, so instead of implementing a case-insensitive compare, you downcase the strings and call it a day. But that's inherently a loss of data, and not having that data will eventually come back to bite you.
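A minimal sketch of the alternative, in Python: compare case-insensitively but keep the original spellings around. The class `CaseInsensitiveCounts` is hypothetical and is not how the site in question stores its data.

```python
from collections import defaultdict

class CaseInsensitiveCounts:
    """Word counts looked up case-insensitively, original spellings kept."""

    def __init__(self):
        # folded key -> {original spelling -> count}
        self._counts = defaultdict(lambda: defaultdict(int))

    def add(self, word, n=1):
        # casefold() is only used for the lookup key; the word itself is stored as-is
        self._counts[word.casefold()][word] += n

    def total(self, word):
        """Count for a word, ignoring case."""
        return sum(self._counts.get(word.casefold(), {}).values())

    def spellings(self, word):
        """Original casings seen for a word, with their counts."""
        return dict(self._counts.get(word.casefold(), {}))

counts = CaseInsensitiveCounts()
counts.add("I", 5)
counts.add("i", 2)
print(counts.total("i"))
print(counts.spellings("I"))
```

Lookups ignore case, but nothing is thrown away, so "I" and "i" can still be told apart later if that turns out to matter.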

[+] impostervt|11 years ago|reply

Should be fixed now.

[+] mrfusion|11 years ago|reply

So I'm wondering, if you just learned the 200 most popular words, you might get pretty far in learning a new language, no?

[+] maaku|11 years ago|reply

200? No. Try 2,000. That should be representative of a barely usable level of the language. If you are what might be described as fluent, you're probably at >20,000. Take this headline:

JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS

"EFFECTIVE" and "IN" are the only two words found in the first 2,000 words sorted by frequency, although "FOUND" is close. So with only 200 words you'd understand:

WNPHMMVF SBHAQ RSSRPGVIR IN GERNGVAT CUYROVGVF

2,000 words would give you:

WNPHMMVF FOUND EFFECTIVE IN GERNGVAT CUYROVGVF

And some grammar knowledge would tell you

WNPHMMVF(N) FOUND EFFECTIVE IN GERNGVAT(V) CUYROVGVF(N)

Which is enough to know that you only really need to look up "TREATING" in the dictionary to understand the gist of the sentence. But it'd hardly pass as a fluent understanding...
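The scrambled words in the headline above are ROT13 encodings of the originals, so the effect is easy to reproduce. A minimal Python sketch follows; `mask_unknown` is a made-up helper, and the small vocabulary sets are stand-ins for real top-200 and top-2,000 frequency lists.

```python
import codecs

def mask_unknown(sentence, known_words):
    """ROT13 every word the reader does not know, leave the rest readable."""
    known = {w.upper() for w in known_words}
    return " ".join(
        word if word.upper() in known else codecs.encode(word, "rot13")
        for word in sentence.split()
    )

headline = "JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS"

# Stand-in vocabularies, not actual frequency lists
top_200 = {"in"}
top_2000 = {"in", "found", "effective"}

print(mask_unknown(headline, top_200))
print(mask_unknown(headline, top_2000))
```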

[+] mode80|11 years ago|reply

I took exactly this approach to learning passable German when I lived there a couple months. Unfortunately, most Germans would much rather practice their English on you than let you practice German on them. So I gave up and let them.

[+] impostervt|11 years ago|reply

Probably depends on the language. A three-year-old supposedly knows about 1,000 English words.

You could probably at least get around.

[+] thinkpad20|11 years ago|reply

While the frequency of words drops off precipitously, there's also the fact that the set of all thoughts one might want to convey is incredibly vast, so within any given conversation there will probably be at least a few words which otherwise rarely appear.

[+] davedriesmans|11 years ago|reply

Very nice product! As a side note: does anyone know of a service that provides human associations? E.g., what do you associate with "Hacker" as a person: "Computer", "Night", "Internet"?

[+] rspeer|11 years ago|reply

I work on ConceptNet, which does this.

http://conceptnet5.media.mit.edu

[+] impostervt|11 years ago|reply

It seems like a good idea, but to automate it you'd maybe need to scrape websites and create a count of words that appear in the same sentence. Or you could get really crazy and start comparing subject/object relationships, etc.
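The same-sentence counting idea could be sketched roughly like this in Python. The function names and the sample text are invented for illustration, the scraping step is left out entirely, and the sentence splitting is deliberately naive.

```python
from collections import Counter
from itertools import combinations
import re

def cooccurrence_counts(text):
    """Count unordered word pairs that appear in the same sentence."""
    pairs = Counter()
    # Naive split on ., ! and ?; real text would need a proper tokenizer
    for sentence in re.split(r"[.!?]+", text):
        # Sorted unique words, so each pair is stored once as (a, b) with a < b
        words = sorted(set(re.findall(r"[a-z']+", sentence.lower())))
        pairs.update(combinations(words, 2))
    return pairs

def associations(pairs, word, top=3):
    """Words most often seen in a sentence together with `word`."""
    related = Counter()
    for (a, b), n in pairs.items():
        if a == word:
            related[b] += n
        elif b == word:
            related[a] += n
    return related.most_common(top)

text = ("The hacker used a computer at night. "
        "The hacker loves the internet. "
        "A computer needs power.")
pairs = cooccurrence_counts(text)
print(associations(pairs, "hacker"))
```

With raw counts like these, very common words such as "the" tend to dominate the results, so a stop-word filter or some frequency weighting would probably be needed before the output looks like human associations.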

[+] jloughry|11 years ago|reply

The most striking thing about Randall Munroe's "Up Goer Five" comic [1] was that the word "computer" is on the list, but "thousand" isn't.

    "(Explained using only the ten hundred words people use the most often)"

[1] http://www.xkcd.com/1133/

[+] dullcrisp|11 years ago|reply

Well to be fair, thousand is a number, so analysis of written text will find it a lot rarer than it is actually used/spoken.

[+] yellowstuff|11 years ago|reply

I poked around in the neighborhood of "Tremulous" and saw a lot of obvious data errors: "ladiesand","manklnd", "confldentlal", "howdare", "monment". Other words don't seem particularly rare: "productively", "areolas", "combusts", "lazier".

[+] jboggan|11 years ago|reply

It seems like their long-tail data is full of misspellings. Try typing in "vacillate" and looking at the other words on the graph in relative frequency, for example.

[+] eginhard|11 years ago|reply

No corpus is ever "clean". Depending on the type of corpus there might be many misspellings, so obviously they will occur in the graph alongside other low-frequency words.

[+] justaman|11 years ago|reply

TIL "fucking" is the 214th most popular word in the English language.

[+] drpgq|11 years ago|reply

I'm guessing "icing" is more popular in Canada.

[+] cpwright|11 years ago|reply

Only if you're thinking of icing on roads (or in hockey), but not if you're thinking of icing on top of a cake.

54 comments