> “Boiling water” isn’t “water that happens to be boiling.” It’s a hazard, a cooking stage, a state of matter
I guess we'll have to disagree then, because "boiling water" is "water that's boiling" to me. It's not a different state of matter to "water", that would be "steam". It being a hazard doesn't mean it's a singular concept, same as "wet floor"
Yeah, if "boiling water" is one word, what about boiling sugar? Boiling milk? Boiling volcano? Boiling soup?
Adding two words together creates a new and different concept. The permutations necessary to represent every concept ever formed by combining two or more different words would be endless.
Some of them on the list, like black hole, do make sense. That's a very distinct thing. It's not a hole in the conventional sense and it's not really black. Boiling water, though, is water. And it's boiling.
Your "to me" is actually problematic, because it legitimizes this nonsensical idea and turns words and their meaning into something purely individualistic, which cannot end well for the current, but even more so for the next generation.
I can confirm that "boiling water" definitively is "water that's boiling" and that two words, which are supposedly one word, definitely are not one word.
> Traditional dictionaries skip almost all such phrases, because they contain spaces.
Yes, because they're phrases, not words. I don't even understand what's surprising about this. Sure, the entire article talks about how dictionaries contain _some_ phrases; but it's clear it's not many of them. Dictionaries are for words, not phrases.
Boiling water is mostly the same as boiling anything. So I would just have "boiling". No need for "boiling water". I see no reason why boiling water could not just be covered by the general entry for "boiling".
Agree. You can of course treat "Boiling water" in its gerund form where it functions as a noun:
"Boiling water should be performed in a metal pot".
> It’s a hazard, a cooking stage, a state of matter
All of these are ancillary and depend on context, but in every one of these downstream cases the same underlying process is happening: the water is boiling.
I would have agreed with you before they pointed out that "frozen water" gets a word: ice. Honestly, I think it's reasonable: people deal with frozen water far more than they do boiling water, but it changes it from a case of "what are they talking about?" to "okay, where do we draw the line?" for me.
I’m so glad I’m not going insane. I don’t see any examples on that site that I agree are ‘one word’. Sure they’re singular concepts but so what? Are we going to have singular words to describe all adjective noun pairs now?
Surprised that no comment mentioned that there is a standard term (not a word :P) for the set of words that denominates a particular concept: nominal syntagm. Such as "boiling water" and also "that green parrot we saw yesterday over the left branch".
Also the slider examples are abysmal. "I love you", "Go home" and "How are you" are not words by any stretch of imagination. For someone who makes word games, I don't see a particularly deep love of words here.
Added a note: "'I love you' isn't opaque, but it's tight enough to put on a tile." The familiar end of the spectrum picks up collocations that are transparent but loaded — I'm not claiming they're words in the traditional sense, but they're useful vocabulary for word games, which is where I'm coming from.
Funnily enough, "nominal syntagm" is, itself, not in the OED or Wiktionary. But Wiktionary has "syntagme nominal" as the French translation for "noun phrase".
You really have to love the human messiness of language!
A nominal syntagm is a somewhat overlapping concept, but deviates slightly from the direct discussion taking place. The more appropriate standard term here is: open compound word. Or, as one might say casually: word.
A compound word isn't just a phrase. The latter is a group of words that indicate a single concept. The former is a new word that has a distinct meaning from the subwords that compose it. "I love you" is an example of a clausal phrase. The meaning is entirely evident from the words that compose it. In contrast, a "hot dog" is not a particularly warm canine, and has its own OED entry [0] as a compound word.
And some of the entries on this list are wrong. "Good night" exists in OED as "goodnight" [1] because there are multiple ways it's used. One is the clausal phrase "I hope you have a good night", which can be modified by changing the adjective, e.g. "great night" or "terrible night". "Goodnight" the bedtime ritual can't be modified the same way, so OED chooses to write it as a compound word without spaces.
There are nearly half a million compound phrases that aren’t in any dictionary—simply because they contain spaces. “Boiling water.” “Saturday night.” “Help me.”
I would hope that none of those examples were taking up space in a dictionary.
It's quite interesting that "boiling water" in many Slavic languages is actually a separate word (and not derived from "water", but from "boiling"; similar to how the author mentions "ice" being used instead of "frozen water").
The first two I kind of understand what the author means. But "help me" and "severe pain" made me think that I'm just not the right public for this text.
While 'this analysis would not have been possible without LLM', I am not sure the LLM analysis was well reviewed after it has been done. From the obscure/familiar word list, some of the n-grams, e.g. "is resource", "seq size", "db xref" surely happen in the wild (we well know), but I would doubt that we can argue they are missing from the dictionary. Knowing the realm, I would argue none of them are words, not even collocations. If "is resource" is, why not, "has resource"?
So while the path is surely interesting, this analysis does miss scrutiny, which you would expect from a high-level LLM analysis.
The very bottom of the slider is there to illustrate where LLM artifacts and Wiktionary noise live — it's not presented as legitimate vocabulary. The slider lets you see the full quality gradient, including where it breaks down.
In addition to what others have pointed out, many of these aren't actually missing from traditional dictionaries: they're just inflected differently. So your example lists phrases like "operating systems", "immune systems" and "solar systems" as missing from traditional dictionaries, but at least the online OED and M-W have "operating system", "immune system" and "solar system" in them. It's just that your script is apparently listing the plural as a separate phrase.
On languages other than English: in general, different languages do word division very differently. At least in German and Dutch, many of those phrasal verbs are separable, meaning that they are one word in the infinitive but are multiple words in the present tense. So for example, where in English you would say "I log in to the website", in Dutch it would be "Ik log in op de website". "Log in" is two words in both cases, but in Dutch it's the separated form of the single-word separable verb inloggen ("I must log in now" = "Ik moet nu inloggen"). The verb is indeed separable in that the two words often don't end up next to each other: "I log in quickly" = "Ik log snel in".
Dutch, like German, has lots of compounds. But there are also agglutinative languages, which have even more complex compound words, perhaps comprising a whole sentence in another language. Eg (from Wikipedia) Turkish "evlerinizdenmiş" = "(he/she/it) was (apparently/said to be) from your houses" or Plains Cree "paehtāwāēwesew" = "he is heard by higher powers"; and these aren't corner cases, that's how the language works.
The author of this article just hasn’t been taught how to use a dictionary. The words aren’t “missing”, they’re just indexed under one of their parts. For example “wait upon” would be located within the entry for “wait”.
Collocation dictionaries are lists of collocations. The reason they're absent from single word dictionaries is because there's about 25x more collocations than single words.
Yeah — added a note below the slider that the obscure end is noise (LLM artifacts, jargon fragments, Wiktionary debris) and that where to draw the line is up to the reader. It was always intended to show the full gradient including where it breaks down, but that wasn't stated
> But roughly 15% are plausible: “wooden chair,” “morning coffee.” That’s still 30 billion sensible pairs.
(1) Who counted those? Whence those numbers?
(2) The examples are normal two-word phrases with one word modifying the other, often categorised as an adjective. The examples are counter-examples to the very claim made in that article.
(3) Using Clause to brainstorm s.t. is a weird thing to say...
(4) I would say the use of 'lexicalized' is wrong or at least uncommon. It usually refers to specialised semantics of something that could be interpreted generically, too. Like 'sleeping bag'. Or indeed 'cold feet'. Lexicalisation may involve deleting spaces, like 'hotdog'. And I am pretty sure lexicalised phrasal words are usually intensionally listed in dictionaries. And so 'ice' is not lexicalised 'frozen water', but it is not overtly a phrase but is a separate atomic word.
My bad. There's a little sidebar about it, but I put it lower after the chart because there wasn't room. You might still not find my logic on the 15% satisfying, but it's there.
Off the top of my head, peanut butter, black hole, and amusement park are concepts that can't be easily intuited by just combining the two singular terms, but I also wouldn't consider them as phrases.
"Peanut butter" would be dealt with by including a reference under the "butter" entry. Something like:
'N, culinary. A paste made of ground up nuts, sometimes with additional oils and other ingredients. E.g. "peanut butter", "almond butter".'
"Amusement park", same. Falls very much under the "place of recreation" definition of "park".
"Black hole" is maybe a bit different, because it's a scientific term - and certainly in a science dictionary would be included as a two-word item - but, for consistency, in a regular dictionary should be handled identically to the above, with a note on the word "hole".
While including noun phrases as singular entities in a word game is entirely appropriate, I don't think the OP has formed a rigorous definition of the concept that they are trying to describe. I agree with the other comment which suggests that they need some instruction / practice using a dictionary.
We act as if some languages have "compound words" that can encompass entire sentences (subject & object attaching to the verb as prefixes or suffixes) while others don't form compounds, and most are somewhere in between. But these are all statements about lexicographic conventions and say nothing about the languages. In reality all languages are muddles sprawling across a multidimensional continuum, and they abso-frigging-lutely don't sit neatly in such pigeonholes.
This is a great comparison. We're arguing about the definition of "word", and attempting to expand it to include edge cases where two words with separate meanings have a different atomic meaning when combined.
We could have a similar debate about whether common suffixes and prefixes should be regarded as individual words.
Much like "planets" don't really exist as a separate natural object, words don't really exist in natural languages. They are artificial concepts, and therefore we will always have edge cases.
I would argue that it is still a useful discussion, as it sheds light on the nature of language (or of celestial bodies), even if the definitions defy the same rigour as mathematical concepts.
Ha — you're probably right that it would have been less controversial. But I kept it precisely because it's arguable. Added a parenthetical acknowledging the HN debate and framing it as on-the-fence by design
"Monkey wrench" is a word already found in the dictionary, so it wouldn't be a useful example. It already met the bar.
The article is questioning why some words don't meet the bar for inclusion in the dictionary. The word "boiling water" is one such word that it sees as being on the fence. The comments here demonstrate exactly why it is on the fence, but it remains unclear exactly what would be necessary for it to tip towards inclusion.
Oh, geeze. The progressive transparency effect on the words towards the "obscure" end of their spectrum made the later pages impossible for me to read.
I suspect the entire list was produced by an AI entity which had not been prompted to avoid giving offense. I predict a range of (tedious) opinions about whether a prohibition on that particular word is an appropriate inclusion in a system prompt.
That's also not a term I've - thankfully! - ever heard, so I've no idea if it's hallucinated. This is not an invitation, HN, to define or explain it to me.
I'm currently reading Cormac McCarthy's Suttree (my first of his novels) — just an exceptional polymath capable of painting complicated scenery with words dozenly scattered throughout paragraphs [0].
My favorite adjective he's coördinated is "burntwing", used to describe moths spiraling downwards after passing through candleflames. If I had crafted such a descriptive contraction, my former styling would've been "burnt-wing", had I even been capable of generating such concise imagery [1].
McCarthy's stylings have helped me to reduce hyphenations in my own writings — reducing their usage mainly to contractedwords which might be all-too-confusing without them.
[0] pg104 has ten words whose definitions I do not know, yet through context they work to advance the storyline of character racists (book is set in 1950s).
[1] decades ago, during college burnout, I was searching for the essence of "burntwing" — reduced to writing a professor about "feeling like a burning airplane in tailspin." My trajectory back then was definitely burntwing.
Wait til you read Blood Meridian. The imagery he created with words, some of them his own creations, is just ... beyond compare. I'm reading The Road now, which comes from the same place. I can only read either in small doses. It's very intense, and the passages deserve to be read carefully.
Another contemporary writer who worked with new words in a very creative way was Gene Wolfe in The Book of the New Sun. Some were inventions using Greek, French, or Latin roots. Others were forgotten terms which he resurrected. Someone compiled a dictionary, Lexicon Urthus, which discusses the origins of certain terms and their placement within the series.
One of the axes this analysis seems to be missing is the subtle spectrum from "multi-word expressions" to "idioms". Traditional lexicographers have long published separate idioms books, such as the Merriam-Webster New World American Idioms Handbook and the Oxford Dictionary of Idioms.
Wiktionary doesn't need to make that distinction between MWEs and Idioms and tends to conflate MWEs and Idioms as there is no separate "Wikidiom". Arguably, that multi-book confusion runs deep on the internet because Urban Dictionary should probably be fully titled the Urban Dictionary of Idioms and Slang.
It's not just page limits but also categorical limits and classic lexicographers would build multiple books/volumes, not just settle on one "dictionary". Classic scholars would often have a "reference shelf" with multiple dictionaries, books of idioms, thesauri, and more. The CD-ROM and then the internet has kind of tunnel visioned that this entire shelf can be merely "one app".
It’s probably a thing, especially with loan-words (eg.: “avant garde”), and there are probably much better examples… But the examples in the article make no sense to me.
The difference between phrases and "words with spaces" is addressed.
The confusion might be that this seems to be a spectrum rather than a binary phenomenon.
We have single words at one extreme, ordinary sentences at the other, and in the middle we have idiomatic assemblies of words that span a range of substitutability.
"Hot dog" and "Saturday night" are arguably great examples, because they exist at the opposite extremes of the spectrum. Saturday night can retain some of the original meaning following substitution, whereas hot dog almost deserves a hyphen.
But can't you basically make anything a composite noun in German? That it's a single word doesn't really help you decide if it has enough presence unto itself to be defined in the dictionary.
Seems like they would have just as much of a problem since the issue is delineating when a "phrase" becomes a "word"
In Dutch we indeed happily do this even for English loanwords like "creditcard" or something more obscure like "lockpick". When in doubt, remove the space.
As far as my limited knowledge of linguistics goes, the technical term is actually "collocations."
To me, any discussion of this topic that doesn't mention collocations signals an amateurish approach.
I also disagree with the premise that "this was not possible before LLM." That's nonsense. Linguists created many dictionaries of collocations for different languages, so that work is precisely what they did!
(Before any LLM zealots attack me, yes, it is now possible to have a more exhaustive list of collocations thanks to LLMs. This doesn't contradict my point.)
AIUI, collocations are just "words that often go together". It doesn't signal any unconventional meaning to the construction, that would make it a proper idiom.
It appears to me that the author is trying too hard to make a point: "merry-go-round" is a single compound word that several dictionaries contain; "canned goods" is not commonly used[1] (more of a bureaucratic jargon), and people would just say "cans"(US) or "preserves" (UK); "household chores" is simply "chores", as the word is no longer commonly used outside the house context; "coffee break ritual" is not a concept in English-speaking countries so it would make no sense to have it in a dictionary, and so many of the examples are exactly that.
[1] I wonder how many here have ever been told something like "Prithee, husband, bring back a dozen canned goods from the market, for in the meanwhile I shall do my household chores".
I didn't take it that the author was trying to make a point, rather pondering about when a word becomes worthy of inclusion in the dictionary. "Boiling water" is a word that hasn't (yet) been deemed worthy. On the other hand, "hot water" is a word that is included in the dictionary, just as "still water", "warm water", etc. are. Somewhere between those included and "boiling water" is the line where a word becomes worthy of inclusion. But where, exactly, is that line?
Hospital bills feels like a pretty ordinary compound to me - not like "good morning" or "ginger ale" where you can't just use what you know about the two words to figure out what the compound must mean.
Some cases are basically impossible: with "crash blossoms" you don't stand any chance without knowing why we call them that.
Some are middling difficult, "Home Secretary" requires that you know every meaning for the two words and then you happen to pick the correct obscure meaning, a "Secretary" could be in charge, and "Home" could mean the entire country as distinct from everywhere else.
But "Hospital bills" doesn't seem even marginally difficult
Two related compound words from a Norwegian dialect, both mean "fish food":
Fiskemat
Fiskmat
The latter means food made from fish, the former means food for fish. Standard varieties of Norwegian only use the former to mean both, to the annoyance of many old fishermen.
This maybe illustrates why the author's examples such as boiling water aren't so weird. Yes, in English it means water that's boiling, but you have to know that. It could for instance have meant water for boiling, like "cooling water" means water for cooling say in a nuclear reactor, not water which is in the process of getting cool.
>Spanish carves up time with precision English lacks: madrugada for the pre-dawn hours, atardecer for late afternoon waning into evening. The mid-day nap was so compelling we adopted the siesta into English.
"I used to smoke marijuana. But I’ll tell you something: I would only smoke it in the late evening. Oh, occasionally the early evening, but usually the late evening -- or the mid evening. Just the early evening, mid evening and late evening. Occasionally, early afternoon, early midafternoon, or perhaps the late-midafternoon. Oh, sometimes the early-mid-late-early morning... But never at dusk." -Steve Martin
I disagree these belong in a traditional dictionary.
I could, however be convinced these could be documented/defined in a separate document, especially from the perspective you are coming from (word games).
Hah, I wonder how thick a German, Dutch or Afrikaans dictionary would be if it included all possible spaceless compound words. Literally any concept can be compounded together to make a new word.
Rooivleisslaghuisinspekteur =
Rooi = red
Vleis = meat
Slag = butcher
Huis = house
Inspekteur = inspector
"Inspector who controls the quality of red meat in butcheries"
Dictionaries containing spaced compounds were not scalable with print media. The printed OED was encyclopedic in scale. Compound dictionaries are more than feasible now. Arguing whether a collection of commonly used words are expressions or concepts or even single "spaced words" is beside the point. Simply identify these differences and classify them in the compendium.
English has words with spaces. Boiling water isn't one of them, but in general, if you can't insert another adjective between an adjective-noun pair, it's linguistically a compound word that we happen to write with a space. "Fast food" is a good example. It's not simply an adjective-noun pair, as demonstrated by the fact that you sound like a crazy person if you try to insert literally anything between "fast" and "food" in "I eat too much fast food". The "fast food" can be modified all you like, as in "I eat too much lukewarm fast food", "I eat too much depressing fast food", but you can't treat "fast" as merely an adjective of "food", else "I eat too much fast, filling food" wouldn't strip the sentence of the implication I eat at McDonalds or whatever.
Dictionaries are a mixed bag at best. If you apply David Kaplan’s character/content distinction from Demonstratives, you have to ask: should pure indexicals, which are essentially 'contentless' pointers be treated the same way as standard words? Let alone the thousands of rigid designators in this dataset that map directly to specific objects in the real world. At a certain point, is there no room left for encyclopedias?
A mixture of melting ice and water suitable for drinking has a word: ice water. It's not an adjective-noun phrase. It has a more specific meaning than just the two words together. You can order an ice water at a restaurant.
I got into solving the NYT crossword during Covid. I couldn’t solve a Monday when I started; now I do Mondays downs-only and look forward to Saturdays. Along the way, I developed a sixth sense for when an answer will be more than one word. I’ve thought a lot about it and can’t really describe how I do it. (Some other puzzles clarify if an answer spans multiple words, but I find the ambiguity adds to the fun.)
Do you think this comes from a gradual internalization of a real linguistic concept? Or is it more a familiarity with common (if unspoken) conventions of the puzzle makers?
I suspect the answer isn't binary, but it's interesting to think about.
This "sixth sense" phenomenon seems to pop up a lot. Crosswords are a great example. The sense some people are getting for detecting LLM output might be another.
There are an infinite number of describable concepts that don't get a specific word. That doesn't mean the whole description is a "word with spaces."
It's just part of how language works that when there isn't a single word carrying the meaning you want, you put multiple words together and they can mean the thing together.
Even though there isn't a specific word for that, I wouldn't say "It's just part of how language works that when there isn't a single word carrying the meaning you want, you put multiple words together and they can mean the thing together" actually is one big word with spaces in it.
It's a bunch of words together that carry a more specific meaning when put together in that order.
I was afraid that no one would bring this up. I’m developing a strange relationship with Wikipedia over the commonplace role it serves as an online resource. But I appreciate how it normalized the practice of looking things up and to get a general overview of a thing, leading to internal and external references. Credit is due to search engines also in this case I reckon.
If the compound words all have single word entries in the dictionary that when combined mean the same thing what is the point?
Water: transparent, odorless, tasteless liquid
Boiling: having reached the boiling point
Boiling Water: transparent, odorless, tasteless liquid which has reached the boiling point
If Boiling Water had some other completely different meaning that has nothing to do with the individual words then sure, maybe, otherwise this is completely redundant and opinionated.
As an English native, I'd rather see proper nouns in a dictionary before seeing "compound words".
Personally, I don't agree that "boiling water" is a word (with a space) - I would refer to it as a phrase if it had specific meaning, but it just seems like an ordinary pairing of adjective and noun. Also, if a word can contain a space, then what is the meaning of "words" as there doesn't seem an easy way to distinguish between a "compound word" and a common phrase. Is "barking dog" a pair of words, a compound word or a phrase? (It's a pair of words in my mind)
I'm the author. Updated the article based on this thread — thanks to everyone who pushed back. Changes: reframed "boiling water" as deliberately on-the-fence rather than asserting it's a word; added a note that the obscure end of the slider is noise; acknowledged collocations as the established term; added a German/Dutch/Norwegian section on how other languages handle the space problem; softened "wasn't possible before LLMs" to "wasn't practical"; and threaded the concept of "loaded" throughout as the key distinction. Many of the specific examples came directly from commenters here — credited below.
I'd really love to see the prompt(s) you used with Claude. The way the article was written I mistakenly thought you would expand upon that in a footnote or sidebar.
Very cool project! Reminds me of Chiang's great short story 'The Truth of Fact, the Truth of Feeling':
> “If you speak slowly, you pause very briefly after each word. Thatʼs why we leave a space in those places when we write. Like this: How. Many. Years. Old. Are. You?” He wrote on his paper as he spoke, leaving a space every time he paused: Anyom a ou kuma a me?
> “But you speak slowly because youʼre a foreigner. Iʼm Tiv, so I donʼt pause when I speak. Shouldnʼt my writing be the same?”
voidUpdate|5 days ago
kdheiwns|5 days ago
5o1ecist|5 days ago
RHSeeger|5 days ago
globular-toast|5 days ago
- Don't put your hand in water that's boiling,
- Add the pasta to water that's boiling,
- That saucepan is full of water that's boiling.
If "boiling water" were a distinct word, all of these sentences would change meaning compared to their idiomatic counterparts.
Ekaros|5 days ago
m-schuetz|5 days ago
vunderba|5 days ago
gcanyon|5 days ago
RiverCrochet|5 days ago
adamauckland|4 days ago
The chef was out the back, boiling water.
The chef was out the back. Boiling water had spilled everywhere.
The seas had turned to boiling water.
I dunno, could be down to interpretation.
jonplackett|5 days ago
unknown|5 days ago
[deleted]
Beijinger|5 days ago
harperlee|5 days ago
Edit: Obligatory reference to Borges's Tlön: https://en.wikipedia.org/wiki/Tl%C3%B6n,_Uqbar,_Orbis_Tertiu...
michaeld123|5 days ago
georgefrowny|5 days ago
win311fwg|5 days ago
AlotOfReading|7 days ago
[0] https://www.oed.com/dictionary/hot-dog_n
[1] https://www.oed.com/dictionary/goodnight_n
BergAndCo|6 days ago
[deleted]
dec0dedab0de|7 days ago
jakub_g|7 days ago
gligierko|7 days ago
unknown|7 days ago
[deleted]
simlevesque|7 days ago
DonHopkins|5 days ago
Yeah, I agree! Fuck ICE!
thmpp|7 days ago
michaeld123|7 days ago
less_less|5 days ago
urbandw311er|5 days ago
kelseyfrog|7 days ago
unknown|7 days ago
[deleted]
georgefrowny|5 days ago
Presumably if the word thesaurus was actually "synonym dictionary" it would likewise be absent.
michaeld123|5 days ago
WesolyKubeczek|5 days ago
Well I can't even.
michaeld123|5 days ago
csmantle|5 days ago
floxy|5 days ago
beeforpork|5 days ago
(1) Who counted those? Whence those numbers?
(2) The examples are normal two-word phrases with one word modifying the other, often categorised as an adjective. The examples are counter-examples to the very claim made in that article.
(3) Using Claude to brainstorm something is a weird thing to say...
(4) I would say the use of 'lexicalized' is wrong or at least uncommon. It usually refers to specialised semantics of something that could be interpreted generically, too. Like 'sleeping bag'. Or indeed 'cold feet'. Lexicalisation may involve deleting spaces, like 'hotdog'. And I am pretty sure lexicalised phrasal words are usually intensionally listed in dictionaries. And 'ice' is not a lexicalised 'frozen water': it is not overtly a phrase at all, but a separate atomic word.
=> I don't get the point.
michaeld123|5 days ago
yifanl|5 days ago
eszed|5 days ago
'N, culinary. A paste made of ground up nuts, sometimes with additional oils and other ingredients. E.g. "peanut butter", "almond butter".'
"Amusement park", same. Falls very much under the "place of recreation" definition of "park".
"Black hole" is maybe a bit different, because it's a scientific term - and certainly in a science dictionary would be included as a two-word item - but, for consistency, in a regular dictionary should be handled identically to the above, with a note on the word "hole".
While including noun phrases as singular entities in a word game is entirely appropriate, I don't think the OP has formed a rigorous definition of the concept that they are trying to describe. I agree with the other comment which suggests that they need some instruction / practice using a dictionary.
fuzzy_biscuit|5 days ago
MarkusQ|7 days ago
We act as if some languages have "compound words" that can encompass entire sentences (subject & object attaching to the verb as prefixes or suffixes) while others don't form compounds, and most are somewhere in between. But these are all statements about lexicographic conventions and say nothing about the languages. In reality all languages are muddles sprawling across a multidimensional continuum, and they abso-frigging-lutely don't sit neatly in such pigeonholes.
Wobbles42|7 days ago
We could have a similar debate about whether common suffixes and prefixes should be regarded as individual words.
Much like "planets" don't really exist as a separate natural object, words don't really exist in natural languages. They are artificial concepts, and therefore we will always have edge cases.
I would argue that it is still a useful discussion, as it sheds light on the nature of language (or of celestial bodies), even if the definitions defy the same rigour as mathematical concepts.
oh_my_goodness|5 days ago
michaeld123|5 days ago
9rx|5 days ago
The article is questioning why some words don't meet the bar for inclusion in the dictionary. The word "boiling water" is one such word that it sees as being on the fence. The comments here demonstrate exactly why it is on the fence, but it remains unclear exactly what would be necessary for it to tip towards inclusion.
globular-toast|5 days ago
Dwedit|5 days ago
eszed|5 days ago
I suspect the entire list was produced by an AI entity which had not been prompted to avoid giving offense. I predict a range of (tedious) opinions about whether a prohibition on that particular word is an appropriate inclusion in a system prompt.
That's also not a term I've - thankfully! - ever heard, so I've no idea if it's hallucinated. This is not an invitation, HN, to define or explain it to me.
ProllyInfamous|5 days ago
My favorite adjective he's coördinated is "burntwing", used to describe moths spiraling downwards after passing through candleflames. If I had crafted such a descriptive contraction, my former styling would've been "burnt-wing", had I even been capable of generating such concise imagery [1].
McCarthy's stylings have helped me to reduce hyphenations in my own writings — reducing their usage mainly to contractedwords which might be all-too-confusing without them.
[0] pg104 has ten words whose definitions I do not know, yet through context they work to advance the storyline of character racists (book is set in 1950s).
[1] decades ago, during college burnout, I was searching for the essence of "burntwing" — reduced to writing a professor about "feeling like a burning airplane in tailspin." My trajectory back then was definitely burntwing.
tolerance|5 days ago
ilamont|5 days ago
Another contemporary writer who worked with new words in a very creative way was Gene Wolfe in The Book of the New Sun. Some were inventions using Greek, French, or Latin roots. Others were forgotten terms which he resurrected. Someone compiled a dictionary, Lexicon Urthus, which discusses the origins of certain terms and their placement within the series.
WorldMaker|5 days ago
Wiktionary doesn't need to make that distinction between MWEs and Idioms and tends to conflate MWEs and Idioms as there is no separate "Wikidiom". Arguably, that multi-book confusion runs deep on the internet because Urban Dictionary should probably be fully titled the Urban Dictionary of Idioms and Slang.
It's not just page limits but also categorical limits, and classic lexicographers would build multiple books/volumes, not just settle on one "dictionary". Classic scholars would often have a "reference shelf" with multiple dictionaries, books of idioms, thesauri, and more. The CD-ROM and then the internet have kind of tunnel-visioned this entire shelf into merely "one app".
danesparza|7 days ago
I think maybe the word the author is looking for is 'phrase'
epgui|7 days ago
Wobbles42|7 days ago
The confusion might be that this seems to be a spectrum rather than a binary phenomenon.
We have single words at one extreme, ordinary sentences at the other, and in the middle we have idiomatic assemblies of words that span a range of substitutability.
"Hot dog" and "Saturday night" are arguably great examples, because they exist at the opposite extremes of the spectrum. Saturday night can retain some of the original meaning following substitution, whereas hot dog almost deserves a hyphen.
alecbz|7 days ago
dominicrose|5 days ago
ndr42|7 days ago
English: cream of mushroom soup
Spanish: sopa cremosa de champiñones
German: Champignoncremesuppe
looperhacks|7 days ago
It has some compound words. But including too many of them would quickly get out of hand
ticulatedspline|7 days ago
Seems like they would have just as much of a problem since the issue is delineating when a "phrase" becomes a "word"
agmater|7 days ago
Shorel|5 days ago
To me, any discussion of this topic that doesn't mention collocations signals an amateurish approach.
I also disagree with the premise that "this was not possible before LLM." That's nonsense. Linguists created many dictionaries of collocations for different languages, so that work is precisely what they did!
(Before any LLM zealots attack me, yes, it is now possible to have a more exhaustive list of collocations thanks to LLMs. This doesn't contradict my point.)
Examples of collocation dictionaries:
https://www.freecollocation.com/
https://ozdic.com/
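For anyone curious how collocation lists can be built without an LLM, here is a minimal sketch using pointwise mutual information (PMI), one common corpus statistic for this. The function name `pmi_scores`, the `min_count` threshold, and the toy corpus are all made up for illustration; real collocation dictionaries are compiled from far larger corpora and with more careful methods.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI compares how often a pair occurs with how often it would occur
    if the two words were independent. High PMI hints at a collocation.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # very rare pairs give unreliable PMI values
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

# Toy corpus (hypothetical); sentence boundaries are ignored for brevity.
corpus = (
    "the hot dog was hot and the dog was big "
    "she ate a hot dog and he ate a hot dog "
    "the water was hot and the dog drank the water"
).split()

scores = pmi_scores(corpus)
# Pairs sorted from most to least strongly associated.
ranked = sorted(scores, key=scores.get, reverse=True)
```

On a corpus this small the ranking is mostly noise; the point is only that pairs occurring together more often than chance get a positive score, which is the raw signal a lexicographer would then review by hand.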
zozbot234|5 days ago
michaeld123|5 days ago
ragall|5 days ago
[1] I wonder how many here have ever been told something like "Prithee, husband, bring back a dozen canned goods from the market, for in the meanwhile I shall do my household chores".
9rx|5 days ago
inkcapmushroom|5 days ago
below43|7 days ago
tialaramex|7 days ago
Some cases are basically impossible: with "Crash blossoms" you don't stand any chance without knowing why we call them that.
Some are middling difficult: "Home Secretary" requires that you know every meaning of the two words and then happen to pick the correct obscure ones. A "Secretary" could be in charge, and "Home" could mean the entire country as distinct from everywhere else.
But "Hospital bills" doesn't seem even marginally difficult.
soperj|7 days ago
vintermann|5 days ago
Fiskemat Fiskmat
The latter means food made from fish, the former means food for fish. Standard varieties of Norwegian only use the former to mean both, to the annoyance of many old fishermen.
This maybe illustrates why the author's examples such as boiling water aren't so weird. Yes, in English it means water that's boiling, but you have to know that. It could for instance have meant water for boiling, like "cooling water" means water for cooling say in a nuclear reactor, not water which is in the process of getting cool.
DonHopkins|5 days ago
"I used to smoke marijuana. But I’ll tell you something: I would only smoke it in the late evening. Oh, occasionally the early evening, but usually the late evening -- or the mid evening. Just the early evening, mid evening and late evening. Occasionally, early afternoon, early midafternoon, or perhaps the late-midafternoon. Oh, sometimes the early-mid-late-early morning... But never at dusk." -Steve Martin
HardwareLust|5 days ago
I could, however, be convinced these could be documented/defined in a separate document, especially from the perspective you are coming from (word games).
beAbU|7 days ago
Rooivleisslaghuisinspekteur =
Rooi = red
Vleis = meat
Slag = butcher
Huis = house
Inspekteur = inspector
"Inspector who controls the quality of red meat in butcheries"
waffletower|5 days ago
dirtikiti|5 days ago
"Words" don't have "spaces."
Phrases are made of words separated by spaces.
"Boiling Water" is not a word.
"Water" is a word. A noun, the subject.
"Boiling" is a word. An adjective, in this case. Which modifies the subject.
I don't know if you're trying to be clever, but you're not.
OkayPhysicist|5 days ago
NoPicklez|5 days ago
It's pretty obvious "boiling water" shouldn't be in the dictionary to begin with, but "boiling" and "water" should be.
Unless I'm just not understanding it
DaedalusII|5 days ago
Entschädigungsleistungen - compensation benefits
Wiederbeschaffungskosten - replacement value
Kraftfahrzeughaftpflichtversicherung - motor vehicle liability insurance
Donaudampfschifffahrtsgesellschaftskapitän - Danube steamboat captain
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz - beef labeling regulation law
Retr0id|5 days ago
speak_plainly|7 days ago
happycat5000|7 days ago
grantpitt|7 days ago
kgwgk|7 days ago
pvillano|7 days ago
hagbard_c|7 days ago
ice - water - steam
languagehacker|5 days ago
johnhamlin|7 days ago
Wobbles42|7 days ago
I suspect the answer isn't binary, but it's interesting to think about.
This "sixth sense" phenomenon seems to pop up a lot. Crosswords are a great example. The sense some people are getting for detecting LLM output might be another.
alecbz|7 days ago
wlesieutre|5 days ago
It's just part of how language works that when there isn't a single word carrying the meaning you want, you put multiple words together and they can mean the thing together.
Even though there isn't a specific word for that, I wouldn't say "It's just part of how language works that when there isn't a single word carrying the meaning you want, you put multiple words together and they can mean the thing together" actually is one big word with spaces in it.
It's a bunch of words together that carry a more specific meaning when put together in that order.
wanda|4 days ago
robertclaus|5 days ago
tolerance|5 days ago
invalidusernam3|5 days ago
Water: transparent, odorless, tasteless liquid
Boiling: having reached the boiling point
Boiling Water: transparent, odorless, tasteless liquid which has reached the boiling point
If Boiling Water had some other completely different meaning that has nothing to do with the individual words then sure, maybe, otherwise this is completely redundant and opinionated.
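To put the same argument in code, here is a rough sketch, assuming a toy lexicon: a phrase only gets its own entry when its meaning can't be assembled from the entries for its parts. `LEXICON`, `PHRASE_ENTRIES`, and `define` are hypothetical names, and the "hot dog" gloss is just an illustrative placeholder.

```python
# Single-word entries, using the definitions given above.
LEXICON = {
    "water": "transparent, odorless, tasteless liquid",
    "boiling": "having reached the boiling point",
}

# Hypothetical phrase-level entries: only meanings that cannot be
# derived from the individual words belong here.
PHRASE_ENTRIES = {
    ("hot", "dog"): "a sausage served in a bun",
}

def define(modifier, noun):
    """Return a stored phrase entry if one exists, else compose one."""
    if (modifier, noun) in PHRASE_ENTRIES:
        return PHRASE_ENTRIES[(modifier, noun)]
    # Compositional case: the noun's definition, qualified by the modifier's.
    return f"{LEXICON[noun]} ({LEXICON[modifier]})"

boiling_water = define("boiling", "water")
hot_dog = define("hot", "dog")
```

"Boiling water" falls through to the compositional branch, so a separate entry would add nothing; "hot dog" needs the stored entry because no composition of "hot" and "dog" gets you there.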
ndsipa_pomu|5 days ago
Personally, I don't agree that "boiling water" is a word (with a space) - I would refer to it as a phrase if it had specific meaning, but it just seems like an ordinary pairing of adjective and noun. Also, if a word can contain a space, then what is the meaning of "words", as there doesn't seem to be an easy way to distinguish between a "compound word" and a common phrase? Is "barking dog" a pair of words, a compound word or a phrase? (It's a pair of words in my mind)
aaroninsf|7 days ago
pvillano|7 days ago
gcr|5 days ago
bombcar|5 days ago
johnhamlin|7 days ago
michaeld123|5 days ago
s1mon|5 days ago
anotherhue|7 days ago
grantpitt|7 days ago
> “If you speak slowly, you pause very briefly after each word. Thatʼs why we leave a space in those places when we write. Like this: How. Many. Years. Old. Are. You?” He wrote on his paper as he spoke, leaving a space every time he paused: Anyom a ou kuma a me?
> “But you speak slowly because youʼre a foreigner. Iʼm Tiv, so I donʼt pause when I speak. Shouldnʼt my writing be the same?”
bbstats|5 days ago
riffraff|5 days ago
hmokiguess|7 days ago
pvillano|7 days ago
"Eachother" feels as natural as "somebody", "nobody", "anybody" to me
JackFr|7 days ago
retr0rocket|7 days ago
[deleted]