Someone | 7 years ago:
The Oxford Advanced Learner’s Dictionary has a “Defining vocabulary” that they claim is used to write almost all definitions (I used the fifth edition, where it is Appendix 10). It’s about 8½ pages, with 5 columns of about 63 lines each, so about 2,700 words.
It doesn’t list inflections, proper names, adjectives for colors such as yellowish, or words used in an entry that derive from that entry (the dictionary mentions blearily and bleary-eyed being used in the definition of bleary).
They also say they occasionally had to use a word not in the list, but don’t say how often. Those words _are_ defined in the dictionary, so it is possible that the reference graph does not have any cycles.
So, I guess 3,000 is a good first guess.
kbenson | 7 years ago:
Considering the list seems to contain both "big" and "large", my guess is that there's quite a bit of overlap in the words used, because they expect the 3,000 to be generally known and reliable. This means that if we were going to optimize for size, we could probably get to a much smaller number of words, and use those to define the others.
I didn't go searching; "big" was literally the first word on the list I read after going down a few pages, and I wondered about "large", so I searched for it. I just looked a bit more, and there's "child", "childhood", and "grandchild", which, while not the same problem, does illustrate that they are fairly liberal with their inclusions: they appear to want the minimum vocabulary needed to define something idiomatically, which is a slightly different question from the minimum required.
This problem actually seems to share a lot in common with database normalization.[1]
1: https://en.wikipedia.org/wiki/Database_normalization
Animats | 7 years ago:
That's useful. There's Basic English, with about 1,000 words. Using Basic English well is hard. During WWII, the BBC broadcast news to the British Empire countries in Basic English. George Orwell did some of the translations. He found translating into Basic English to be a political act: ambiguity did not translate, so he had to make political statements unambiguous.
That's where 1984's "Newspeak" came from. See "Orwell, the Lost Writings".
EdwardCoffin | 7 years ago:
[1] https://www.smartcom.vn/the_oxford_3000.pdf
escherplex | 7 years ago:
Yeah, but there's always Popper's observation hovering in the background concerning definitions: 'all definitions involve the use of words which themselves remain undefined'. Now if particular constituents of language (nouns, verbs, qualifiers) have empirical referents (e.g., oak tree), then something other than words can be supplied to buttress and shape consensus for any formulated definitions, using words which themselves have empirical referents. But with conceptual referents (e.g., democracy), definitions become subjective and lack clear capacity for unambiguous validation. So a definition of a concept which resonates with one individual, based on their understanding of its verbiage, may be dissonant for another, based on that individual's understanding of the content of the definition.
vorg | 7 years ago:
It makes sense that it doesn't list joined words like bleary-eyed, whose definition is obvious from the constituents bleary and eye, or derived words, because suffixes like -ish and -ly have the same meaning whenever they modify other words. But what about phrasal and prepositional verbs such as put off, put up, and put up with, whose meaning usually can't be deduced from the constituents put, off, up, and with?
aasasd | 7 years ago:
(Though again I'm unsure whether the endless English phrasal verbs are counted as distinct in these estimates; not counting them would probably be cheating.)
mrb | 7 years ago:
It's impossible. An English dictionary defined using English words has to have cycles.
mjgeddes | 7 years ago:
Anna Wierzbicka and Cliff Goddard studied 'semantic primes': 'the set of semantic concepts that are innately understood but cannot be expressed in simpler terms'.
https://en.wikipedia.org/wiki/Semantic_primes
The combination of a set of semantic primes and the rules of combining them forms a 'Natural Semantic Metalanguage' , which is the core from which all the words in a given language would be built up.
https://en.wikipedia.org/wiki/Natural_semantic_metalanguage
The current agreed-upon number of semantic primes is 65 (see list at wikipedia links above).
That means that any English word can be defined using a lexicon of about 65 concepts in the English natural semantic metalanguage.
superice | 7 years ago:
I am a little surprised that toki pona ("language of good", https://en.m.wikipedia.org/wiki/Toki_Pona) is not mentioned. It is a language of about 125 words, which aims to make you think about describing complicated subjects. To give an example: the concept "friend" could be described as either "good man" or "man good to me", depending on whether you think your friend is intrinsically good.
Admittedly, the original question is specifically about the English language, but toki pona is a nice experiment related to this.
gojomo | 7 years ago:
An interesting related talk, touching on the minimality and expressiveness of both natural and computer languages, is Guy Steele's 1998 "Growing a Language":
Video: https://www.youtube.com/watch?v=_ahvzDzKdB0
PDF: https://www.cs.virginia.edu/~evans/cs655/readings/steele.pdf
Prior HN discussion: https://news.ycombinator.com/item?id=16847691, https://news.ycombinator.com/item?id=2359174, & others
fginionio | 7 years ago:
0. Get a dictionary.
1. Form a directed graph, with an edge from each word to every word that uses that word in its definition.
2. Remove all words that have no outgoing edges.
3. If you removed some words, go to step 1. Otherwise, all words left in the dictionary are minimal.
EDIT: If anyone knows of a machine-readable dictionary, I'd love to actually do this.
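The steps above can be sketched in Python, assuming the dictionary has already been parsed into a dict mapping each word to the set of words appearing in its definition (the toy entries below are made up for illustration):

```python
def prune(defs):
    """Iteratively drop words that no surviving definition uses.

    defs: dict mapping a word to the set of words in its definition.
    Returns the words that remain, i.e. those that some surviving
    word's definition relies on (the cyclic core of the graph).
    """
    words = set(defs)
    while True:
        used = set()
        for w in words:
            used |= defs[w] & words  # definition words still in play
        if used == words:            # fixpoint: step 3's stopping condition
            return words
        words = used

# Hypothetical toy entries standing in for a real machine-readable dictionary.
toy = {
    "big": {"large"},
    "large": {"big"},
    "huge": {"big"},           # nobody's definition uses "huge"
    "duck": {"water", "bird"},
    "water": {"liquid"},
    "liquid": {"water"},
    "bird": {"animal"},
    "animal": {"bird"},
}
```

Running `prune(toy)` drops "huge" and "duck" (no definition uses them) and leaves the mutually-defining cycles, matching the observation in step 3 that the surviving words are exactly the ones that keep each other in the dictionary.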
hairtuq | 7 years ago:
This will not yield a minimal set; within a cycle, it is only necessary to remove at least one word. The problem is thus to delete the minimum number of vertices needed to remove all cycles, which is the NP-hard Feedback Vertex Set problem. Here's a paper that solves it for a dictionary (and more besides): https://arxiv.org/abs/0911.5703
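The feedback-vertex-set objective can be sketched by brute force, though only at toy scale, since the problem is NP-hard; the three-word graph below is invented for illustration:

```python
from itertools import combinations

def has_cycle(nodes, edges):
    """Detect a directed cycle in the subgraph induced by `nodes`."""
    adj = {n: [v for (u, v) in edges if u == n and v in nodes] for n in nodes}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GRAY
        for v in adj[n]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True  # back edge found: cycle
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in nodes)

def min_fvs(nodes, edges):
    """Smallest vertex set whose removal leaves the graph acyclic.

    Exhaustive search, exponential in the worst case: fine for toys,
    hopeless for a full dictionary (hence the paper's specialized methods).
    """
    for k in range(len(nodes) + 1):
        for cand in combinations(sorted(nodes), k):
            if not has_cycle(set(nodes) - set(cand), edges):
                return set(cand)

# Invented toy graph: "big" and "large" define each other,
# so removing either one of them breaks every cycle.
edges = [("big", "large"), ("large", "big"), ("huge", "big")]
nodes = {"big", "large", "huge"}
```

Here `min_fvs(nodes, edges)` returns a single word from the big/large cycle: removing one member of each cycle is all that's needed, which is exactly why the pruning algorithm above overcounts.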
heyitsguay | 7 years ago:
Looks like somebody made txt and json versions of the Oxford Unabridged English Dictionary here: https://github.com/adambom/dictionary. The json version should let you build up the graph structures you're talking about pretty easily.
excalibur | 7 years ago:
But you will come across a lot of words used in definitions that could easily be replaced with more common words. In some cases the change to the definition would be tiny; in others it might be more significant.
xiler | 7 years ago:
https://wordnet.princeton.edu/
doxos | 7 years ago:
visarga | 7 years ago:
Definitions are not enough to fully capture the meaning of a word. To do that you need full language modelling, grounding words in other sensory modalities, and relating words to the actions taken in the situations where they were used.
GPT-2 (of recent OpenAI fame) uses 1.5 billion parameters and, though capable of interesting results, is far from human level. It also uses just text so it's incomplete.
https://blog.openai.com/better-language-models/
Another interesting metric is Bits Per Character (BPC). The state of the art is around 1.06 on English Wikipedia. This measures the average achievable compression on character sequences; it doesn't include the size of the model, just the size of the compressed sequence.
https://arxiv.org/pdf/1808.04444.pdf
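The BPC metric itself is easy to compute for any compressor. A minimal sketch using zlib (a far weaker "model" than the neural networks behind the ~1.06 figure, so expect a very different number; the sample text is just the duck sentence quoted elsewhere in the thread, repeated):

```python
import zlib

def bits_per_character(text: str) -> float:
    """Compressed size in bits divided by the number of characters."""
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return 8 * len(compressed) / len(text)

# Highly repetitive sample: zlib exploits the repeats, so its BPC here
# comes out low; on varied English prose it does much worse than 1.06.
sample = ("Ducks are mostly aquatic birds, mostly smaller than the swans "
          "and geese, and may be found in both fresh water and sea water. ") * 50
```

The same formula applies to a language model: replace the zlib output length with the model's total cross-entropy over the text, in bits.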
Emma_Goldman | 7 years ago:
That's true, but it's almost inherent in what a dictionary is, i.e. a catalogue of the canonical semantic meaning of words, not a complete model of language and its contextual variables.
arooaroo | 7 years ago:
I used to work for Pearson Longman, and one of their USPs was that their defining vocabulary was significantly smaller than the main competitors', namely OUP's and CUP's. Longman's was just over 2,000 (about 2,100 IIRC), whereas OUP's was approx 3,000.
Even then, one is rather constrained and definitions frequently cross-referenced other words to bootstrap the definition.
chasing | 7 years ago:
Words in the English language are not the same as computer code. I'm not sure you can fully define most words in terms of other words -- hence the variety. Dictionaries generally only provide rough sketches of the meaning of a word. Even synonyms can have slightly different subtexts, connotations, and histories. Hell, individual words have wildly different meanings depending on context.
akozak | 7 years ago:
abecedarius | 7 years ago:
It sticks to a basic vocabulary, has an entry for every word it uses, and goes heavy on examples and pictures in preference to formal definitions. (And it's monolingual even though written mainly for learners in North America.)
I don't have it to check, but estimating from memory: around 2000 to 4000 words. I found it useful while bootstrapping up from Duolingo.
vanderZwan | 7 years ago:
That is actually a really interesting challenge: to have a completely self-contained dictionary. Especially in 1963, before modern automation, the proofreading required must have been a Herculean task.
Perhaps this could be some kind of measure for answering this question in and of itself: what is the smallest useful self-contained natural language dictionary that one can write?
EDIT: Oh, fginionio came up with an intuitive approach to do this automatically below: https://news.ycombinator.com/item?id=19332041
degenerate | 7 years ago:
If it goes heavy on examples and pictures, then it can probably give a more relaxed definition for words, knowing the context will be picked up from the pics and examples. Do you find that true?
herogreen | 7 years ago:
YeGoblynQueenne | 7 years ago:
It depends on what is meant by "define". If we are allowed to use existing words in a language, L, to create a new language, L', then use expressions in L' to define each word in L, a single word w, originally in L, suffices.
The idea is to first index each word v in the lexicon of L (including w), starting at 1 and ending at n, whatever is the number of distinct words in the language. Alternatively, you can index _meanings_. Then (should be obvious where I'm going with this by this point) you map a sequence S_k of repetitions of w of length k in [1,n] to each k'th word, v_k, in L. So now L' is the language of n sequences S_1,...,S_n of w each of which maps to a word (or meaning) in L. And you have "defined" L in terms of a single word, the word w.
But that's probably not at all what the reddit poster had in mind.
However, it should be noted that natural language is such that there's really no reason that we have many words- it's just convenient and helps us create new utterances without having to create long sequences of one word, as above. The important ability in human language is that we can combine words to create new utterances, forever- which we can do with one word just as well as with a few thousand.
Finally, I suspect that if there was a minimal set of (more than one!) words sufficient to define all other words (meanings) in a language, all natural languages would converge to about that number of words- which I really don't think is the case.
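The single-word construction above can be sketched directly: index every word, then "define" the k-th word as k repetitions of one chosen word w. The five-word lexicon and the choice of "buffalo" for w are both made up for illustration:

```python
def build_codebook(lexicon, w="buffalo"):
    """Map the k-th word (1-indexed) to a sequence of k repetitions of w."""
    encode = {word: " ".join([w] * (i + 1)) for i, word in enumerate(lexicon)}
    decode = {seq: word for word, seq in encode.items()}
    return encode, decode

# Made-up lexicon fragment standing in for all n words of language L.
lexicon = ["big", "large", "duck", "water", "bird"]
enc, dec = build_codebook(lexicon)
```

So `enc["duck"]` is three repetitions of w, and `dec` inverts the mapping, which is the whole trick: L' expresses every word (or meaning) of L, at the cost of definitions whose length grows linearly with the index.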
Veedrac | 7 years ago:
I once found (plausibly from another HN commenter) a text-based adventure where (almost?) all the words used were replaced with alternative English-sounding nonsense words, but I have never rediscovered the link.
I feel this would be of interest to the thread, if anyone knows what I'm talking about or knows how to successfully Google for such a thing.
AnIdiotOnTheNet | 7 years ago:
Finally, here you are. At the delcot of tondam, where doshes deave. But the doshery lutt is crenned with glauds.
Glauds! How rorm it would be to pell back to the bewl and distunk them, distunk the whole delcot, let the drokes uncren them.
But you are the gostak. The gostak distims the doshes. And no glaud will vorl them from you.
It has been on my to-play list for some time but I haven't got around to it yet.
https://ifdb.tads.org/viewgame?id=w5s3sv43s3p98v45
kybernetikos | 7 years ago:
I took Webster’s dictionary from the Project Gutenberg site. I started with 95,712 words. After initially throwing away words that weren’t in any definitions, I was down to 4,489 words. After expanding them, and throwing away words that weren’t in the expanded definitions, I was down to 3,601 words. Setting recursive definitions as atoms and continuing got me down to 2,565 words.
feyman_r | 7 years ago:
"In Thing Explainer: Complicated Stuff in Simple Words, things are explained in the style of Up Goer Five, using only drawings and a vocabulary of the 1,000 (or "ten hundred") most common words."
https://xkcd.com/thing-explainer/
aaron695 | 7 years ago:
To understand "duck" you must see a duck (eat a duck, pet a duck, smell a duck, hear a duck).
Perhaps you could cheat and use pixels and coordinates, using English to draw photos and videos that explain ducks.
drewrv | 7 years ago:
It depends on how well you want to "define" something. Wikipedia describing a duck:
Duck is the common name for a large number of species in the waterfowl family Anatidae which also includes swans and geese. Ducks are divided among several subfamilies in the family Anatidae; they do not represent a monophyletic group (the group of all descendants of a single common ancestral species) but a form taxon, since swans and geese are not considered ducks. Ducks are mostly aquatic birds, mostly smaller than the swans and geese, and may be found in both fresh water and sea water. Ducks are sometimes confused with several types of unrelated water birds with similar forms, such as loons or divers, grebes, gallinules, and coots.
But you could also describe a duck in two simple words: "water bird". Apparently that's a real term: https://en.wikipedia.org/wiki/Water_bird
kylek | 7 years ago:
See Genesis 2:19-20 (and its placement/context). God shows Adam forms to be named.
MrOxiMoron | 7 years ago:
singularity2001 | 7 years ago:
https://en.wikipedia.org/wiki/Functional_completeness
Hope you are one of the 10000 lucky ones whose mind is blown for the first time.
Or another one: "1"
https://en.wikipedia.org/wiki/Unary_coding
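The functional completeness link gives the logic-gate analogue of "one word suffices": every Boolean connective can be built from NAND alone. A small self-contained sketch:

```python
# The single primitive, the analogue of the one defining "word".
def nand(a: bool, b: bool) -> bool:
    return not (a and b)

# Everything else is defined purely in terms of nand.
def not_(a: bool) -> bool:
    return nand(a, a)

def and_(a: bool, b: bool) -> bool:
    return not_(nand(a, b))

def or_(a: bool, b: bool) -> bool:
    # De Morgan: a OR b == NOT(NOT a AND NOT b)
    return nand(not_(a), not_(b))
```

Just as with the unary encoding of a lexicon, the price of a one-primitive basis is length: each derived connective costs several NANDs, and real circuits pay that cost gladly for uniformity.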
adrianN | 7 years ago:
hyperpallium | 7 years ago:
The words needed to define a universal Turing machine (and a program to simulate a human brain, but that doesn't require additional words).
We could extend it to cover words not conceivable by humans, and any universe, by using a program to simulate those, but (1) I assume the question implicitly assumes human words, though (2) it wouldn't require more words anyway.
ggggtez | 7 years ago:
The baby learns the words via example, not by definitions.
catach | 7 years ago:
WhitneyLand | 7 years ago:
Oh, you absolutely wouldn't simplify anything by doing this - ideas that used to be encompassed by a single word would have paragraph-long descriptions.
You could have 100 synonyms with the same "definition" but 100 different shades of meaning, implied degree of strength, or connotations.
You don't necessarily simplify anything by making people add extra words to get across those subtleties.
Of course some are useless equivalents, but many aren't.
magneticnorth | 7 years ago:
It's just a thought experiment about how much you could optimize one dimension (number of words) if you didn't care at all about optimization anywhere else in language.
pbhjpbhj | 7 years ago:
Not all synonyms amount to useless equivalents.
Occasionally it's useful to use a different word simply because one can; sometimes the felicitous utility of an alternate mot juste serves its own purpose.
Swizec | 7 years ago:
No such thing as a synonym. On the face of it, yes, many words share meanings. But a mutt is not the same as a dog, despite what thesaurus.com says.
panarky | 7 years ago:
What is the minimum number of words needed to define everything else?
bonoboTP | 7 years ago:
taternuts | 7 years ago:
Wow, I had no idea there was such a thing as simple.wikipedia.org! It apparently tries to follow Basic English[0], which comprises only 850 words. The simple version[1] of the artificial neural network article is a lot more approachable than the normal version[2]!
0: https://simple.wikipedia.org/wiki/Basic_English
1: https://simple.wikipedia.org/wiki/Artificial_neural_network
2: https://en.wikipedia.org/wiki/Artificial_neural_network